In conjunction with CVPR 2025.
Nashville, TN
June 11th 2025 (Morning)
In recent years, the use of big data has greatly advanced Computer Vision and Machine Learning applications. However, the majority of these tasks have focused on a single modality, typically the visual one, with only a few incorporating additional modalities such as audio or thermal imaging. Moreover, handling multimodal datasets remains a challenge, particularly in data acquisition, synchronization, and annotation. As a result, many research investigations have been limited to a single modality, and even when multiple modalities are considered independently, performance tends to suffer compared to an integrated multimodal learning approach.
Recently, there has been a growing focus on leveraging the synchronization of multimodal streams to enhance the transfer of semantic information. Various works have successfully exploited combinations such as audio/video, RGB/depth, RGB/Lidar, visual/text, and text/audio, achieving exceptional outcomes. Intriguing applications have also emerged that employ self-supervised methodologies, enabling multiple modalities to learn associations without manual labeling and yielding richer feature representations than individual modality processing. Moreover, researchers have explored training paradigms that allow neural networks to perform well even when one modality is absent due to sensor failure, impaired functioning, or unfavorable environmental conditions. These topics have garnered significant interest in the computer vision community, particularly in the field of autonomous driving. Recent attention has also been directed towards the fusion of language (including Large Language Models) and vision, such as the generation of images/videos from text (e.g., DALL-E, text2video), from audio (wav2clip), or vice versa (image2speech). Diffusion models have likewise emerged as a fascinating framework for exploiting such multimodal scenarios.
Information fusion from multiple sensors is also a topic of major interest in industry; the rapid growth of companies working on automotive, drone vision, surveillance, or robotics is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, synthetic data, etc. Position papers presenting feasibility studies and cross-modality issues with a highly applicative flair are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.
Potential topics include, but are not limited to:
Papers will be limited to 8 pages according to the CVPR format (cf. the main conference author guidelines, also regarding dual and double submission). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Papers will be published in the CVPR 2025 workshop proceedings.
All papers should be submitted via the CMT website: https://cmt3.research.microsoft.com/MULA2025.
Room: 106 B
08:30-08:40 - Welcome from organizers and opening remarks
08:40-09:20 - Keynote 1 - Elisa Ricci
TITLE: Toward generalizable Vision-Language Models: Improving fine-grained understanding from limited image samples and synthetic videos
ABSTRACT: Vision-Language Models have shown impressive performance on a wide range of tasks, yet their generalization capabilities remain a key challenge, especially in fine-grained image and video understanding. In this talk, I will present two recent works that explore novel strategies to address this limitation. First, I will consider the problem of few-shot adaptation for image recognition and introduce Two-Stage Few-Shot Adaptation (2SFS), a novel and simple strategy that explicitly separates task-level feature extraction from concept specialization. 2SFS yields improved generalization capabilities over baselines and consistent gains across multiple datasets, backbones, and settings. Second, I will present SynViTA, a novel framework for improving video-language alignment using synthetic videos. SynViTA mitigates the noise and the distribution shift in generated video content by weighting samples based on semantic similarity and enforcing fine-grained caption consistency, leading to consistent gains on multiple video benchmarks and downstream tasks.
09:20-10:00 - Keynote 2 - Shaogang Gong
TITLE: From Test-Time Inference to Small Data Generative Learning
ABSTRACT: Vision Large Language Models (VLLM) have revolutionised machine learning in computer vision in recent years, largely due to their capacity for semantic reasoning in supporting visual interpretation in context. Computer vision fundamentally requires answering two questions: 'what' and 'where/when'. However, VLLM multimodal foundation models are poor at solving the 'where/when' localisation problem underpinning object detection, segmentation, video understanding, and generative synthesis of details, owing to a lack of fine-grained domain-specific knowledge in the absence of sufficient fine-grained target-domain training data. Moreover, increasing privacy concerns from data protection and environmental concerns about energy consumption, together with a need for incremental model expansion to support decentralised and distributed user target domains of small data, pose fundamental challenges to the established wisdom of learning a centralised single model from exhaustive labelling. In this talk, I will present progress on exploring VLLM for test-time inference without learning and for small-data generative learning, using examples in automatic prompt control for image segmentation that leverages (rather than removes) VLLM hallucination for more reliable and trustworthy semantic segmentation, and in diffusion-based few-shot image generation with artifact detection to overcome the limitations of LLM.
10:00-10:30 - Coffee Break
10:30-11:10 - Keynote 3 - Katerina Fragkiadaki
TITLE: Unified Vision-Language Generation and 2D/3D Understanding
ABSTRACT: Recent advances in large-scale language modeling have demonstrated significant success across various tasks, prompting efforts to extend these capabilities to other modalities, including 2D and 3D vision. However, this effort has been met with a variety of challenges due to fundamental differences in data representations, task-specific requirements, and the relative scarcity of large, high-quality annotated datasets for modalities beyond text. In this work, we present two approaches to address these challenges. First, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain and demonstrate their advantages over autoregressive models, including improved control over quality versus diversity, joint multimodal inpainting, and greater controllability in generation through guidance. Second, we develop a method to jointly train 2D and 3D vision-language models, allowing knowledge transfer from abundant 2D datasets to comparatively limited 3D tasks. By employing a shared architecture, this approach significantly improves performance on various 3D vision-language tasks.
11:10-11:50 - Keynote 4 - Georgia Gkioxari
11:50-12:20 - Oral Session
(ID 01) - Missing Modality in Multimodal Egocentric Datasets, Merey Ramazanova et al. (8-min presentation + 2-min Q&A)
(ID 06) - SplatTouch: Explicit 3D Representation Binding Vision and Touch, Antonio Luigi Stefani et al. (8-min presentation + 2-min Q&A)
(ID 39) - LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool, Yue Ma et al. (8-min presentation + 2-min Q&A)
12:20-12:30 - Closing Remarks
14:00-16:00 - Poster Session - ExHall D - poster boards #372 - #388
Elisa Ricci is a Professor at the Department of Information Engineering and Computer Science (DISI) at the University of Trento and the Head of the Deep Visual Learning Research Unit at Fondazione Bruno Kessler. Elisa is also the Coordinator of the Doctoral Program in Information Engineering and Computer Science at the University of Trento. She is an ELLIS and an IAPR Fellow. Her research lies at the intersection of computer vision, deep learning, and robotics perception. She is interested in developing novel approaches for learning from visual and multi-modal data in an open world, with particular emphasis on methods for domain adaptation, continual learning, and self-supervised learning.
Georgia Gkioxari is an Assistant Professor of Computing + Mathematical Sciences at Caltech and a William H. Hurt scholar. She is also a visiting researcher at Meta AI on the Embodied AI team. From 2016 to 2022, she was a research scientist on Meta's FAIR team. She received her PhD from UC Berkeley, where she was advised by Jitendra Malik. She did her bachelor's in ECE at NTUA in Athens, Greece, where she worked with Petros Maragos. She is the recipient of the PAMI Young Researcher Award (2021).
Sean Gong pioneered person re-identification and video behaviour analysis for law enforcement. Prof Gong is an elected Fellow of the Royal Academy of Engineering and served on the steering panel of the UK government Chief Scientific Adviser's Science Review on Security. He has made unique contributions to the engineering of AI video analytics for law enforcement and the security industry and was awarded an Institution of Engineering and Technology Achievement Medal for Vision Engineering for outstanding achievement and superior performance in contributing to public safety. A commercial system built on his research won an Aerospace Defence Security Innovation Award and a Global Frost & Sullivan Award for Technical Innovation for Law Enforcement Video Forensics Technology. Gong is Professor of Visual Computation and Director of the Computer Vision Laboratory at Queen Mary University of London, a Turing Fellow of the Alan Turing Institute, and a member of the UK Computing Research Committee. He founded Vision Semantics and has served as Chief Scientist of three start-ups. He is a Distinguished Scientist of Veritone. He received his DPhil from Oxford.
Katerina Fragkiadaki is a JPMorgan Chase Associate Professor of Computer Science in the Machine Learning Department at Carnegie Mellon University. She works in Artificial Intelligence at the intersection of Computer Vision, Machine Learning, Language Understanding, and Robotics. Prior to joining MLD's faculty, she spent three years as a postdoctoral researcher, first at UC Berkeley working with Jitendra Malik and then at Google Research in Mountain View working with the video group. She completed her Ph.D. at GRASP, UPenn, with Jianbo Shi. She did her undergraduate studies at the National Technical University of Athens, and before that she was in Crete.
We gratefully acknowledge our reviewers
For additional information, please contact us here