In conjunction with CVPR 2023.
Vancouver, Canada
June 18th, 2023 (Full day)
Room: West 223-224
The exploitation of big data over the last few years has led to a big step forward in many applications of Computer Vision. However, most of the tasks tackled so far involve the visual modality only, mainly due to the unbalanced number of labelled samples available across modalities (e.g., there are many large labelled datasets for images, but far fewer for audio- or IMU-based classification), resulting in a large performance gap when algorithms are trained on each modality separately.
Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio) to transfer semantic information from one modality to another, achieving surprising results. Interesting applications have also been proposed in a self-supervised fashion, where multiple modalities learn correspondences without the need for manual labelling, yielding a more powerful set of features than those learned by processing the two modalities separately. Other works have shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavourable environmental conditions. These topics have been gaining considerable interest in the computer vision community in recent years.
The fusion of information from multiple sensors is also a topic of major interest in industry; the exponential growth of companies working on automotive, drone vision, surveillance, or robotics is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers presenting feasibility studies and cross-modality issues with a strongly applicative flavour are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.
Potential topics include, but are not limited to:
Papers are limited to 8 pages in the CVPR format (cf. the main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Accepted papers will be published in the CVPR 2023 workshop proceedings.
All papers should be submitted via the CMT website: https://cmt3.research.microsoft.com/MULA2023.
Room: West 223-224
09:15-09:30 - Welcome from organizers and opening remarks
09:30-10:15 - Keynote 1 - Nicu Sebe (University of Trento)
10:15-10:45 - Coffee break
10:45-11:30 - Oral Session 1 (10-min presentations + 5-min Q&A)
(ID 2) - TFRGAN: Leveraging Text Information for Blind Face Restoration with Extreme Degradation - Chengxing Xie (Xidian); Qian Ning (Xidian University); Weisheng Dong (Xidian University)*; Guangming Shi (Xidian University)
(ID 3) - The MONET dataset: Multimodal drone thermal dataset recorded in rural scenarios - Luigi Riz (Fondazione Bruno Kessler); Andrea Caraffa (Fondazione Bruno Kessler); Matteo Bortolon (Fondazione Bruno Kessler; Istituto Italiano di Tecnologia (IIT); University of Trento); Mohamed Lamine Mekhalfi (Fondazione Bruno Kessler); Davide Boscaini (Fondazione Bruno Kessler); André F. Moura (INESC TEC); José Filipe Antunes (INESC TEC); André M. Dias (INESC TEC); Hugo M Silva (INESC TEC); Andreas Leonidou (The Cyprus Institute); Christos Constantinides (CARE-C); Christos M Keleshis (The Cyprus Institute); Dante Abate (The Cyprus Institute); Fabio Poiesi (Fondazione Bruno Kessler)*
(ID 4) - SSGVS: Semantic Scene Graph-to-Video Synthesis - Yuren Cong (Leibniz University Hannover); Jinhui Yi (University of Bonn); Bodo Rosenhahn (Leibniz University Hannover); Michael Ying Yang (University of Twente)*
11:30-12:15 - Keynote 2 - Helge Rhodin (University of British Columbia)
TITLE: "Unpaired Multi-modal Learning"
ABSTRACT: Multi-modal learning becomes easy when paired data is available. However, capturing multiple modalities simultaneously can be tricky or even impossible. For example, capturing 3D human motion trajectories is best done in a studio, while the corresponding video should be captured outdoors to achieve realism. In this presentation, I will showcase various new and established scenarios I encountered, ranging from the creation of a new sign language (video + audio) to neuroscience research on mice (video + brain neural activation), as well as human motion analysis (video + pose). All of these scenarios can be reduced to a form of correspondence finding. It is not the classical image correspondence, but rather finding correspondences across domains---a challenging problem with immense untapped potential.
12:15-13:15 - Lunch
13:15-14:00 - Keynote 3 - Alireza Fathi (Google Research)
TITLE: "Retrieval and Tool Augmented Visual Language Models"
14:00-14:45 - Oral Session 2 (10-min presentations + 5-min Q&A)
(ID 5) - Multi Event Localization by Audio-Visual Fusion with Omnidirectional Camera and Microphone Array - Wenru Zheng (Tokyo Institute of Technology)*; Ryota Yoshihashi (Tokyo Institute of Technology); Rei Kawakami (Tokyo Institute of Technology); Ikuro Sato (Tokyo Institute of Technology / Denso IT Laboratory); Asako Kanezaki (Tokyo Institute of Technology)
(ID 8) - Dynamic Multimodal Fusion - Zihui Xue (The University of Texas at Austin)*; Radu Marculescu (The University of Texas at Austin)
(ID 9) - Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval - Jae Myung Kim (University of Tuebingen)*; A. Sophia Koepke (University of Tübingen); Cordelia Schmid (Inria/Google); Zeynep Akata (University of Tübingen)
14:45-15:15 - Coffee break
15:15-16:00 - Oral Session 3 (10-min presentations + 5-min Q&A)
(ID 10) - Adapting Grounded Visual Question Answering Models to Low Resource Languages - Ying Wang (New York University)*; Jonas Pfeiffer (Google Research ); Nicolas Carion (NYU); Yann LeCun (New York University); Aishwarya Kamath (New York University)
(ID 11) - SEM-POS: Grammatically and Semantically Correct Video Captioning - Asmar Nadeem (University of Surrey)*; Adrian Hilton (University of Surrey); Robert Dawes (BBC Research); Graham Thomas (BBC); Armin Mustafa (University of Surrey)
(ID 15) - Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention - Yiming Ma (University of Warwick)*; Victor Sanchez (University of Warwick); Soodeh SN Nikan (Ford Motor Company); Devesh Upadhyay (Ford Motor Co.); Bhushan S Atote (University of Warwick); Tanaya Guha (University of Glasgow)
16:00-16:45 - Keynote 4 - Aleksander Hołyński (UC Berkeley (BAIR) and Google Research)
16:45-16:50 - Closing Remarks
16:50-18:00 - Poster Session (all papers) West Exhibit Hall - poster slots #64 - #83
Nicu Sebe is a professor of Computer Science at the University of Trento, Italy, where he is the director of the Department of Information Engineering and Computer Science. He leads research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He has been involved in the organization of major conferences and workshops addressing the computer vision and human-centered aspects of multimedia information retrieval. He is a Fellow of IAPR and a Senior Member of ACM and IEEE.
Helge Rhodin is an Assistant Professor at UBC, affiliated with the computer vision and graphics labs. Before that, he was a lecturer and postdoc at EPFL and did his PhD at the MPI for Informatics at Saarland University. Rhodin's research interests span applications in sports, medicine, neuroscience, and augmented reality, which he pursues through fundamental contributions in 3D computer vision and self-supervised machine learning.
Alireza Fathi is currently a staff research scientist on the Machine Perception team at Google Research. Before joining Google, he spent a couple of years at Apple working on 3D computer vision. Before that, he was a Postdoctoral Fellow in Fei-Fei Li's lab at Stanford. He received his Ph.D. degree from the Georgia Institute of Technology and his B.Sc. degree from Sharif University of Technology.
Aleksander Hołyński is a research scientist at Google Research and a postdoctoral scholar at Berkeley AI Research, working with Alyosha Efros and Angjoo Kanazawa. Previously, he was a PhD student at the University of Washington, advised by Steve Seitz, Brian Curless, and Rick Szeliski. Before his PhD, he received his B.S. from the University of Illinois at Urbana-Champaign.
We gratefully acknowledge our reviewers.
For additional info, please contact us here.