9th Multimodal Learning and Applications Workshop





In conjunction with CVPR 2026.

Denver, CO

TBD (Full day)

9th Multimodal Learning and Applications Workshop (MULA 2026)

In recent years, the availability of big data has greatly advanced Computer Vision and Machine Learning applications. However, the majority of these tasks focus on a single modality, typically the visual one, with only a few incorporating additional modalities such as audio or thermal imaging. Moreover, the handling of multimodal datasets remains a challenge, particularly in the areas of data acquisition, synchronization, and annotation. As a result, many research investigations have been limited to a single modality, and even when multiple modalities are available, processing them independently tends to yield worse performance than an integrated multimodal learning approach.

Recently, there has been a growing focus on leveraging the synchronization of multimodal streams to enhance the transfer of semantic information. Various works have successfully utilized combinations such as audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio, and more, achieving exceptional outcomes. Additionally, intriguing applications have emerged that employ self-supervised methodologies, enabling multiple modalities to learn associations without manual labeling. This approach yields richer feature representations than processing each modality individually. Moreover, researchers have explored training paradigms that allow neural networks to perform well even when one modality is absent due to sensor failure, impaired functioning, or unfavorable environmental conditions. These topics have garnered significant interest in the computer vision community, particularly in the field of autonomous driving. Furthermore, recent attention has been directed towards the fusion of language (including Large Language Models) and vision, such as the generation of images/videos from text (e.g., DALL-E, text2video) or audio (wav2clip), and vice versa (image2speech). Diffusion models, which naturally exploit multimodal scenarios, have also emerged as a fascinating framework to explore.

Information fusion from multiple sensors is also a topic of major interest in industry: the exponential growth of companies working on automotive, drone vision, surveillance, or robotics is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration among the computer vision, multimedia, remote sensing, and robotics communities. The workshop will serve as a forum for research groups from both academia and industry.

We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, synthetic data, etc. Position papers presenting feasibility studies and cross-modality issues with a strong applicative focus are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics, and we therefore expect a positive response from these communities.

Potential topics include, but are not limited to:

  • Multimodal learning
  • Cross-modal learning
  • Self-supervised learning for multimodal data
  • Multimodal data generation and sensors
  • Unsupervised learning on multimodal data
  • Cross-modal adaptation
  • Multimodal data fusion and data representation
  • Multimodal transfer learning and Domain Adaptation
  • Multimodal scene understanding
  • Image and video synthesis by multimodal data
  • Multimodal diffusion models 
  • LLMs in multimodal tasks
  • Vision and Language
  • Vision and Sound
  • Vision + X
  • Multimodal biomedical analysis
  • Multimodal applications (e.g. drone vision, autonomous driving, industrial inspection, etc.)
  • Fairness and privacy in multimodal learning and applications

Submission

Papers are limited to 8 pages in the CVPR format (see the main conference author guidelines, which also cover the dual and double submission policy). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Accepted papers will be published in the CVPR 2026 workshop proceedings.

All papers should be submitted through the CMT website: https://cmt3.research.microsoft.com/MULA2026.

Important Dates

  • Deadline for paper submission: March 9th, 2026 – Anywhere on Earth (AoE)
  • Notification of acceptance: March 20th, 2026
  • Camera Ready submission (strict!) deadline: April 5th, 2026

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.

Program

Coming very soon!

Invited Speakers

Hedvig Kjellström is Professor in the Division of Robotics, Perception and Learning, KTH. She is also affiliated with Swedish University of Agricultural Sciences, Swedish e-Science Research Centre, and Max Planck Institute for Intelligent Systems, Germany. Her research focuses on enabling artificial agents to interpret human and animal behavior, by developing methods to build representations of the world through computer vision.

Georgia Gkioxari is an Assistant Professor of Computing + Mathematical Sciences at Caltech and a William H. Hurt scholar. She is also a visiting researcher at Meta AI in the Embodied AI team. From 2016 to 2022, she was a research scientist at Meta's FAIR team. She received her PhD from UC Berkeley, where she was advised by Jitendra Malik. She completed her bachelor's degree in ECE at NTUA in Athens, Greece, where she worked with Petros Maragos. She is the recipient of the PAMI Young Researcher Award (2021).

Serena Yeung-Levy is an Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering at Stanford University. Her research interests are in the areas of computer vision, machine learning, and deep learning, focusing on applications to healthcare. She leads the Medical AI and Computer Vision Lab (MARVL) at Stanford, and serves as Associate Director of Data Science for the Stanford Center for Artificial Intelligence in Medicine & Imaging (AIMI). She is also affiliated with the Stanford Clinical Excellence Research Center (CERC). Prior to that she was a Technology for Equitable and Accessible Medicine (TEAM) postdoctoral fellow at Harvard University, received her PhD from Stanford University, and spent time at Facebook AI Research and Google Cloud AI.

Yuki Asano is a Full Professor at the University of Technology Nuremberg, where he leads the Fundamental AI (FunAI) Lab. His research interests are in computer vision and machine learning, with a specialized focus on self-supervised and multimodal learning. Prior to his current role, he led the QUVA Lab at the University of Amsterdam in close collaboration with Qualcomm AI Research. He earned his PhD from the renowned Visual Geometry Group (VGG) at the University of Oxford.

Ranjay Krishna is an Assistant Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. He co-directs the RAIVN lab at UW and directs the PRIOR team at Ai2. His research lies at the intersection of computer vision, natural language processing, robotics, and human-computer interaction. This research has received best paper and outstanding paper awards, as well as orals, at CVPR, ACL, CSCW, NeurIPS, UIST, and ECCV, and has been covered by Science, Forbes, the Wall Street Journal, and PBS NOVA. His research has been supported by Google, Apple, Ai2, Amazon, Cisco, Toyota Motor Inc, Toyota Research Institute, NSF, ONR, and Yahoo. He holds a bachelor's degree in Electrical & Computer Engineering and in Computer Science from Cornell University, and a master's degree and Ph.D. in Computer Science from Stanford University.

Marcella Cornia received her Ph.D. from the University of Modena and Reggio Emilia. In 2020 she received the Young Researcher Award in the category "Artificial Intelligence and Big Data", and in 2021 and 2022 she was awarded, respectively, by CVPL, the Italian Association for Computer Vision, Pattern Recognition and Machine Learning, and by ECVA, the European Computer Vision Association, for the best Italian and European Ph.D. thesis in the Computer Vision field. She is currently an Associate Professor at the Department of Education and Humanities of the University of Modena and Reggio Emilia. She has authored or co-authored more than 80 publications in scientific journals and international conference proceedings. Her research interests include vision-and-language, multimodal learning, and saliency and attentive models. She is an ELLIS member and serves or has served as Associate Editor for IEEE Transactions on Image Processing and the European Journal on Artificial Intelligence.

Organizers

Pietro Morerio

Istituto Italiano di Tecnologia, Italy

Paolo Rota

Università di Trento, Italy

Michael Ying Yang

University of Bath, UK

Bodo Rosenhahn

Institut für Informationsverarbeitung, Leibniz-Universität Hannover, Germany

Vittorio Murino

Istituto Italiano di Tecnologia & Università di Verona, Italy

Hao Cheng

University of Twente, The Netherlands

Benedetta Liberatori

University of Trento, Italy

Acknowledgments

We gratefully acknowledge our reviewers.

Previous Editions

  • 1st edition @ ECCV 2018 - Munich, Germany, Link
  • 2nd edition @ CVPR 2019 - Long Beach, Link
  • 3rd edition @ CVPR 2020 - VIRTUAL, Link
  • 4th edition @ CVPR 2021 - VIRTUAL, Link
  • 5th edition @ CVPR 2022 - New Orleans, Link
  • 6th edition @ CVPR 2023 - Vancouver, Link
  • 7th edition @ CVPR 2024 - Seattle, Link
  • 8th edition @ CVPR 2025 - Nashville, Link

Contacts

For additional information, please contact us here.