7th Multimodal Learning and Applications Workshop





In conjunction with CVPR 2024.

Seattle, WA

June 18th 2024 (Full day)

7th Multimodal Learning and Applications Workshop (MULA 2024)

In recent years, the use of big data has greatly advanced Computer Vision and Machine Learning applications. However, the majority of these tasks have focused on a single modality, typically the visual one, with only a few incorporating additional modalities such as audio or thermal imaging. Moreover, handling multimodal datasets remains challenging, particularly in data acquisition, synchronization, and annotation. As a result, many research investigations have been limited to a single modality, and even when multiple modalities are available, they are often processed independently, with performance suffering compared to an integrated multimodal learning approach.

Recently, there has been a growing focus on leveraging the synchronization of multimodal streams to enhance the transfer of semantic information. Various works have successfully exploited combinations such as audio/video, RGB/depth, RGB/Lidar, visual/text, and text/audio, achieving remarkable results. Intriguing applications have also emerged that employ self-supervised methodologies, enabling multiple modalities to learn associations without manual labeling and yielding richer feature representations than processing each modality in isolation. Moreover, researchers have explored training paradigms that allow neural networks to perform well even when one modality is absent due to sensor failure, impaired functioning, or unfavorable environmental conditions. These topics have garnered significant interest in the computer vision community, particularly in the field of autonomous driving. Furthermore, recent attention has been directed towards the fusion of language (including Large Language Models) and vision, such as the generation of images/videos from text (e.g., DALL-E, text2video) or audio (wav2clip), and vice versa (image2speech). Diffusion models, which naturally exploit multimodal scenarios, have also emerged as a fascinating framework to explore.

Information fusion from multiple sensors is also a topic of major interest in industry, as witnessed by the exponential growth of companies working on automotive, drone vision, surveillance, and robotics. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.

We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers presenting feasibility studies and cross-modality issues with a highly applicative flair are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.

Potential topics include, but are not limited to:

  • Multimodal learning
  • Cross-modal learning
  • Self-supervised learning for multimodal data
  • Multimodal data generation and sensors
  • Unsupervised learning on multimodal data
  • Cross-modal adaptation
  • Multimodal data fusion and data representation
  • Multimodal transfer learning and Domain Adaptation
  • Multimodal scene understanding
  • Image and video synthesis by multimodal data
  • Multimodal diffusion models
  • LLMs in multimodal tasks
  • Vision and Language
  • Vision and Sound
  • Vision + X
  • Multimodal biomedical analysis
  • Multimodal applications (e.g. drone vision, autonomous driving, industrial inspection, etc.)
  • Fairness and privacy in multimodal learning and applications

Submission

Papers are limited to 8 pages in the CVPR format (cf. the main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Accepted papers will be published in the CVPR 2024 workshop proceedings.

All papers should be submitted through the CMT website: https://cmt3.research.microsoft.com/MULA2024.

Important Dates

  • Deadline for submission: March 2nd, 2024 - 23:59 Pacific Standard Time
  • Extended deadline for submission: March 11th, 2024 - 23:59 Pacific Standard Time
  • Notification of acceptance: April 8th, 2024
  • Camera-ready submission deadline (strict): April 14th, 2024
  • Workshop date: June 18th, 2024 (Full day)

Camera Ready Submission instructions

  • CVPR Workshops 2024 Camera-Ready Submission Instructions Link
  • Submission Site Link

Program

Room: Summit 320

09:00-09:15 - Welcome from organizers and opening remarks

09:15-10:00 - Keynote 1 - Massimiliano Mancini (University of Trento)

TITLE: Opening multimodal doors with language: from recognition to bias detection

10:00-10:25 - Coffee break

10:25-11:00 - Oral Session 1 - Vision & Depth

(ID 09) (10-min presentation + 5-min Q&A) - RGB-D Cube R-CNN: 3D Object Detection with Selective Modality Dropout, Jens Piekenbrinck et al.
(ID 21) (5-min presentation + 5-min Q&A) - LAformer: Trajectory Prediction for Autonomous Driving with Lane-Aware Scene Constraints, Mengmeng Liu et al.
(ID 29) (5-min presentation + 5-min Q&A) - Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture, Anusha Devulapally et al.

11:00-11:45 - Keynote 2 - Dima Damen (University of Bristol and Google DeepMind)

TITLE: On Video, Audio and Language Multi-Modality in Egocentric Vision

11:45-12:30 - Oral Session 2 - Vision & Audio

(ID 04) (10-min presentation + 5-min Q&A) - Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection, Ayush Ghadiya et al.
(ID 14) (5-min presentation + 5-min Q&A) - Listen Then See: Video Alignment with Speaker Attention, Aviral Agrawal et al.
(ID 24) (5-min presentation + 5-min Q&A) - VMCML: Video and Music Matching via Cross-Modality Lifting, Yi-Shan Lee et al.
(ID 30) (5-min presentation + 5-min Q&A) - Exploring the Role of Audio in Video Captioning, Yuhan Shen et al.

12:30-14:00 - Lunch

14:00-14:45 - Keynote 3 - Cees Snoek (University of Amsterdam)

TITLE: Multimodal Learning Under Visually Challenging Conditions

14:45-15:20 - Oral Session 3 - Vision & Language (1)

(ID 11) (10-min presentation + 5-min Q&A) - Multimodal Understanding of Memes with Fair Explanations, Yang Zhong et al.
(ID 06) (5-min presentation + 5-min Q&A) - De-noised Vision-language Fusion Guided by Visual Cues for E-commerce Product Search, Zhizhang Hu et al.
(ID 23) (5-min presentation + 5-min Q&A) - ZInD-Tell: Towards Translating Indoor Panoramas into Descriptions, Tonmoay Deb et al.

15:20-15:40 - Coffee break

15:40-16:25 - Keynote 4 - Laura Leal-Taixé (NVIDIA and Technical University of Munich)

TITLE: Open-World Segmentation and Tracking in 3D

16:25-17:00 - Oral Session 4 - Vision & Language (2)

(ID 05) (10-min presentation + 5-min Q&A) - Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning, Zaber Ibn Abdul Hakim et al.
(ID 18) (5-min presentation + 5-min Q&A) - InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report Generation, Ankan Deria et al.
(ID 28) (5-min presentation + 5-min Q&A) - AIGeN: An Adversarial Approach for Instruction Generation in VLN, Niyati Rawal et al.

17:00-17:45 - Keynote 5 - Gül Varol (École des Ponts ParisTech)

TITLE: AutoAD Trilogy: Audio Description Generation for Movies

17:45-18:00 - Closing Remarks

Invited Speakers

Dima Damen is a Professor of Computer Vision at the University of Bristol and a Senior Research Scientist at Google DeepMind. Dima is currently an EPSRC Fellow (2020-2025), focusing her research on the automatic understanding of object interactions, actions, and activities using wearable visual (and depth) sensors. She is best known for her leading work in Egocentric Vision, and has also contributed to novel research questions including mono-to-3D, video object segmentation, assessing action completion, domain adaptation, skill/expertise determination from video sequences, discovering task-relevant objects, and dual-domain and dual-time learning, as well as multi-modal fusion using vision, audio, and language.

Laura Leal-Taixé is a Senior Research Manager at NVIDIA and also an Adjunct Professor at the Technical University of Munich (TUM), leading the Dynamic Vision and Learning group. From 2018 until 2022, she was a tenure-track professor at TUM. Before that, she spent two years as a postdoctoral researcher at ETH Zurich, Switzerland, and a year as a senior postdoctoral researcher in the Computer Vision Group at the Technical University of Munich. She obtained her PhD from the Leibniz University of Hannover in Germany, spending a year as a visiting scholar at the University of Michigan, Ann Arbor, USA. She pursued her B.Sc. and M.Sc. in Telecommunications Engineering at the Technical University of Catalonia (UPC) in her native city of Barcelona, and went to Boston, USA, to complete her Master's thesis at Northeastern University with a fellowship from the Vodafone foundation. She is a recipient of the Sofja Kovalevskaja Award of 1.65 million euros in 2017, the Google Faculty Award in 2021, and the ERC Starting Grant in 2022.

Cees G.M. Snoek is a full professor in computer science at the University of Amsterdam, where he heads the Video & Image Sense Lab. He is also a director of three public-private AI research labs: QUVA Lab with Qualcomm, Atlas Lab with TomTom, and AIM Lab with the Inception Institute of Artificial Intelligence. At the university spin-off Kepler Vision Technologies, he acts as Chief Scientific Officer. Professor Snoek is also the director of the ELLIS Amsterdam Unit and scientific director of Amsterdam AI, a collaboration between government, academic, medical and other organisations in Amsterdam to help the city develop and deploy responsible AI. He received the M.Sc. degree in business information systems (2000) and the Ph.D. degree in computer science (2005), both from the University of Amsterdam, The Netherlands.

Gül Varol is a permanent researcher (~Assist. Prof.) in the IMAGINE team at École des Ponts ParisTech. Previously, she was a postdoctoral researcher at the University of Oxford (VGG), working with Andrew Zisserman. She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis, co-advised by Ivan Laptev and Cordelia Schmid, received the PhD awards from ELLIS and AFRIF. During her PhD, she spent time at MPI, Adobe, and Google. Prior to that, she received her BS and MS degrees from Boğaziçi University. She regularly serves as an Area Chair at major computer vision conferences, and will serve as a Program Chair at ECCV'24. She has co-organized a number of workshops at CVPR, ICCV, ECCV, and NeurIPS. Her research interests cover vision and language applications, including video representation learning, human motion synthesis, and sign languages.

Massimiliano Mancini is an ELLIS member and an Assistant Professor at the University of Trento. He completed his Ph.D. at the Sapienza University of Rome, co-advised by Barbara Caputo and Elisa Ricci. During his Ph.D., he was part of the TeV lab at Fondazione Bruno Kessler, the VANDAL lab at the Italian Institute of Technology, and a visiting student at the KTH Royal Institute of Technology. After his Ph.D., he joined the University of Tübingen as a postdoc in the Explainable Machine Learning group led by Zeynep Akata. He serves as area chair for major conferences in the field (CVPR, ECCV, NeurIPS, ICRA) and as an associate/area editor for CVIU and TMLR. His research focuses on efficient transfer learning, cross-domain generalization, continual learning, automatic bias identification, and compositional reasoning.

Organizers

Paolo Rota

Università di Trento, Italy

Pietro Morerio

Istituto Italiano di Tecnologia, Italy

Michael Ying Yang

University of Bath, UK

Bodo Rosenhahn

Institut für Informationsverarbeitung, Leibniz-Universität Hannover, Germany

Vittorio Murino

Istituto Italiano di Tecnologia & Università di Verona, Italy

Acknowledgments

We gratefully acknowledge our reviewers.

Old Editions

  • 1st edition @ ECCV 2018 - Munich, Germany, Link
  • 2nd edition @ CVPR 2019 - Long Beach, Link
  • 3rd edition @ CVPR 2020 - VIRTUAL, Link
  • 4th edition @ CVPR 2021 - VIRTUAL, Link
  • 5th edition @ CVPR 2022 - New Orleans, Link
  • 6th edition @ CVPR 2023 - Vancouver, Link

Contacts

For additional info please contact us here