In conjunction with CVPR 2026.
Denver, CO
June 4th (Full day)
In recent years, the use of big data has greatly advanced Computer Vision and Machine Learning applications. However, most of this work has focused on a single modality, typically the visual one, with only a few approaches incorporating additional modalities such as audio or thermal. Additionally, the handling of multimodal datasets remains a challenge, particularly in the areas of data acquisition, synchronization, and annotation. As a result, many research investigations have been limited to a single modality, and even when multiple modalities are considered independently, performance tends to suffer when compared to an integrated multimodal learning approach.
Recently, there has been a growing focus on leveraging the synchronization of multimodal streams to enhance the transfer of semantic information. Various works have successfully utilized combinations such as audio/video, RGB/depth, RGB/LiDAR, visual/text, text/audio, and more, achieving exceptional outcomes. Additionally, intriguing applications have emerged, employing self-supervised methodologies that enable multiple modalities to learn associations without manual labeling. This approach yields richer feature representations than processing each modality individually. Moreover, researchers have explored training paradigms that allow neural networks to perform well even when one modality is absent due to sensor failure, impaired functioning, or unfavorable environmental conditions. These topics have garnered significant interest in the computer vision community, particularly in the field of autonomous driving. Furthermore, recent attention has been directed towards the fusion of language (including Large Language Models) and vision, such as in the generation of images/videos from text (e.g., DALL-E, text2video), audio (wav2clip), or vice versa (image2speech). Diffusion models have also emerged as a fascinating framework for exploiting multimodal scenarios.
Information fusion from multiple sensors is also a topic of major interest in industry: the exponential growth of companies working on automotive, drone vision, surveillance, or robotics applications is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities. It will serve as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers with feasibility studies and cross-modality issues with a strong applied focus are also encouraged. Multimodal data analysis is a very important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.
Potential topics include, but are not limited to:
Papers will be limited to 8 pages according to the CVPR format (cf. the main conference author guidelines, which also apply to dual and double submissions). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Papers will be published in the CVPR 2026 workshop proceedings.
All papers should be submitted through the CMT website: https://cmt3.research.microsoft.com/MULA2026.
Room: 111
08:45-09:00 - Welcome from organizers and opening remarks
09:00-09:45 - Keynote 1 - Hedvig Kjellström
TITLE: Multiple Modalities and Multiple Factors - Two Perspectives on Multimodal Estimation and Synthesis
ABSTRACT: In my research group we focus on computer modeling of human and animal behavior and communication in video, with applications in a range of topics such as medical diagnostics, equine biomechanics, music aesthetics, and human-robot interaction.
A central aspect in a majority of projects is multimodality - either in the form of multiple input modalities such as images and audio, or in the form of multiple underlying factors that are partly correlated and together give rise to observations.
In my talk I will give two examples: first, a project on co-speech gesture generation, and second, a 3D estimation method with disentanglement of underlying factors.
09:45-10:00 - Oral Session
10:00-10:30 - Coffee Break
10:30-11:15 - Keynote 2 - Georgia Gkioxari
11:15-12:00 - Keynote 3 - Andrei Bursuc
12:00-12:30 - Oral Session
12:30-14:00 - Lunch Break
14:00-14:45 - Keynote 4 - Yuki Asano
14:45-15:30 - Keynote 5 - Ranjay Krishna
15:30-16:15 - Keynote 6 - Lorenzo Baraldi
TITLE: From Retrieval to Reflection to Reasoning: Rethinking Knowledge in Multimodal Foundation Models
ABSTRACT: Multimodal Large Language Models have made impressive progress in connecting perception to language, yet they remain fragile when factual precision and verifiability matter: hallucinations, knowledge gaps, and staleness are intrinsic to systems that store everything inside their parameters.
In this talk, I will trace a research trajectory that argues for a different design principle, that knowledge should not only be stored, but accessed, verified, and used on demand. I will walk through three complementary stages that we have explored in our lab.
First, retrieval, with hierarchical multimodal RAG pipelines and recurrence-enhanced multi-level retrievers that ground generation in external evidence. Second, reflection, with self-reflective tokens that allow the model to decide when retrieval is needed and to assess the relevance of what it retrieves.
Third, reasoning, with an approach which trains the generator to produce explicit reasoning trajectories over filtered evidence, optimized via a critic-guided RL objective. Along the way, I will also discuss why evaluation has to evolve in parallel.
Finally, I will close with open questions on the next generation of foundation models: how to control retrieval and reasoning, how we keep these systems sustainable, and what role abstraction-based, natively multimodal architectures may play.
16:15-17:00 - Oral Session
17:00-17:20 - Spotlight Session
17:20-17:25 - Closing Remarks
17:25-18:30 - Poster Session
Hedvig Kjellström is a Professor in the Department of Robotics, Perception and Learning at KTH Royal Institute of Technology, Sweden, and is also affiliated with the Swedish e-Science Research Centre and the Max Planck Institute for Intelligent Systems, Germany. She received an MSc in Engineering Physics and a PhD in Computer Science from KTH in 1997 and 2001, respectively, and thereafter worked at the Swedish Defence Research Agency before returning to a faculty position at KTH. Her present research focuses on methods for enabling artificial agents to interpret human and animal behavior. These ideas are applied in the study of human aesthetic bodily expressions such as in music and dance, modeling and interpreting human communicative behavior, and the understanding of animal behavior and experiences. To accomplish this, methods are developed for agents to perceive the world and build representations of it through vision. Hedvig has received several prizes for her research, including the 2010 Koenderink Prize for fundamental contributions in computer vision. She has written around 150 papers in the fields of computer vision, machine learning, robotics, information fusion, cognitive science, speech, and human-computer interaction. She is mostly active within computer vision, where she is an Editor-in-Chief for CVIU, a Program Chair for CVPR 2025, and regularly serves as Area Chair for the major conferences.
Georgia Gkioxari is an Assistant Professor of Computing + Mathematical Sciences at Caltech and a William H. Hurt scholar. She is also a visiting researcher at Meta AI in the Embodied AI team. From 2016 to 2022, she was a research scientist at Meta's FAIR team. She received her PhD from UC Berkeley, where she was advised by Jitendra Malik. She earned her bachelor's degree in ECE at NTUA in Athens, Greece, where she worked with Petros Maragos. She is the recipient of the PAMI Young Researcher Award (2021).
Andrei Bursuc is a Senior Research Scientist and Deputy Scientific Director at valeo.ai and Research Associate at the Astra Inria project team in Paris working on advancing autonomous driving. His research spans reliable multi-sensor perception, uncertainty estimation, self-supervised and foundation-model learning, and video/world-modeling methods for autonomous systems. Previously he was a research scientist at Safran Tech in the aerospace industry. Prior to that he was a postdoctoral researcher at Inria Paris, within the Willow project team, and Inria Rennes within the LinkMedia team. He earned his PhD from Ecole des Mines Paris and Alcatel-Lucent Bell Labs France, focusing on visual content indexing and retrieval. Andrei is a member of the ELLIS society and teaches at Ecole Polytechnique and Ecole Normale Supérieure in Paris.
Yuki Asano is a Full Professor at the University of Technology Nuremberg, where he leads the Fundamental AI (FunAI) Lab. His research interests are in computer vision and machine learning, with a specialized focus on self-supervised and multimodal learning. Prior to his current role, he led the QUVA Lab at the University of Amsterdam in close collaboration with Qualcomm AI Research. He earned his PhD from the renowned Visual Geometry Group (VGG) at the University of Oxford.
Ranjay Krishna is an Assistant Professor at the Allen School of Computer Science & Engineering. He co-directs the RAIVN lab at UW and directs the PRIOR team at Ai2. His research lies at the intersection of computer vision, natural language processing, robotics, and human-computer interaction. This research has received best paper, outstanding paper, and orals at CVPR, ACL, CSCW, NeurIPS, UIST, and ECCV, and has been reported by Science, Forbes, the Wall Street Journal, and PBS NOVA. His research has been supported by Google, Apple, Ai2, Amazon, Cisco, Toyota Motor Inc, Toyota Research Institute, NSF, ONR, and Yahoo. He holds a bachelor's degree in Electrical & Computer Engineering and in Computer Science from Cornell University, and a master's degree and a Ph.D. in Computer Science from Stanford University.
Lorenzo Baraldi is a Tenure Track Assistant Professor at the University of Modena and Reggio Emilia. His research interests include Egocentric Vision and Gesture Recognition, Temporal Video Segmentation and Retrieval, Saliency, Video Captioning, Visual-Semantic Alignment, and Embodied AI. He is the author of more than 80 publications in international journals and conferences, and an Associate Editor of Pattern Recognition Letters. He has been elected as a Scholar in the ELLIS society, the European Laboratory for Learning and Intelligent Systems. Since 2021, he has served as deputy director of the Interdepartmental Centre on Digital Humanities of the University of Modena and Reggio Emilia.
We gratefully acknowledge our reviewers
For additional info please contact us here