In conjunction with CVPR 2022.
June 20th 2022 (Morning)
The exploitation of the power of big data in the last few years has led to major advances in many Computer Vision applications. However, most of the tasks tackled so far involve the visual modality only, mainly due to the unbalanced number of labelled samples available across modalities (e.g., many large labelled datasets exist for images, but far fewer for audio or IMU-based classification), resulting in a large performance gap when algorithms are trained on each modality separately.
Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio) to transfer semantic information from one modality to another, achieving surprising results. Interesting applications have also been proposed in a self-supervised fashion, where correspondences between modalities are learned without the need for manual labelling, resulting in a more powerful set of features than those learned by processing the two modalities separately. Other works have shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavorable environmental conditions. These topics have been gaining considerable interest in the computer vision community in recent years.
Information fusion from multiple sensors is also a topic of major interest in industry; the exponential growth of companies working on automotive, drone vision, surveillance, or robotics is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration among the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers with feasibility studies and cross-modality issues with a highly applicative flair are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.
Potential topics include, but are not limited to:
Papers are limited to 8 pages in the CVPR format (cf. the main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of the results, technical merit, and clarity of presentation. Accepted papers will be published in the CVPR 2022 proceedings.
All papers should be submitted via the CMT website: https://cmt3.research.microsoft.com/MULA2022.
N.B. All times are CDT (Central Daylight Time).
08:30-08:40 - Welcome from the organizers and opening remarks
08:40-09:10 - Keynote 1 - Cordelia Schmid - "Large-scale learning from multimodal videos"
09:10-10:00 - Oral Session 1 (5-min presentations + 2-min Q&A - form link for asynchronous Q&A)
(ID 03) - Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval - Mustafa Shukor, Guillaume Couairon, Asya Grechka, Matthieu Cord. link
(ID 05) - Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations - Dan Oneață, Horia Cucu. link
(ID 19) - Coupling Vision and Proprioception for Navigation of Legged Robots - Zipeng Fu, Ashish Kumar, Ananye Agarwal, Haozhi Qi, Jitendra Malik, Deepak Pathak. link
(ID 27) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation - Vishal M. Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh J. Shah, Pankaj Wasnik, Naoyuki Onoe. link
(ID 34) - Cascaded Siamese Self-Supervised Audio to Video GAN - Nuha N Aldausari, Arcot Sowmya, Nadine Marcus, Gelareh Mohammadi. link
(ID 38) - Multi-view Multi-label Canonical Correlation Analysis for Cross-modal Matching and Retrieval - Rushil Kaushal Sanghavi, Yashaswi Verma. link
10:00-10:30 - Keynote 2 - Kate Saenko - "More Language, Less Labeling: Vision and Language Pretraining for Visual Tasks"
10:30-11:00 - Coffee break
11:00-12:15 - Oral Session 2 (5-min presentations + 2-min Q&A - form link for asynchronous Q&A)
(ID 01) - Probabilistic Compositional Embeddings for Multimodal Image Retrieval - Andrei Neculai, Yanbei Chen, Zeynep Akata. link
(ID 02) - Coarse-to-Fine Reasoning for Visual Question Answering - Binh Xuan Nguyen, Tuong Khanh Long Do, Huy Tran, Erman Tjiputra, Quang Duy Tran, Anh Nguyen. link
(ID 06) - Semantically Grounded Visual Embeddings for Zero-Shot Learning - Shah Nawaz, Jacopo Cavazza, Alessio Del Bue. link
(ID 11) - Reasoning with Multi-structure Commonsense Knowledge in Visual Dialog - Shunyu Zhang, Xiaoze Jiang, Zequn Yang, Tao Wan, Zengchang Qin. link
(ID 16) - Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters - Ilker Kesen, Ozan Can, Erkut Erdem, Aykut Erdem, Deniz Yüret. link
(ID 20) - Emphasizing Complementary Samples for Non-literal Cross-modal Retrieval - Christopher L. Thomas, Adriana Kovashka. link
(ID 21) - Doubling down: sparse grounding with an additional, almost-matching caption for detection-oriented multimodal pretraining - Giacomo Nebbia, Adriana Kovashka. link
(ID 28) - The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis - Manuele Barraco, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara. link
(ID 30) - Guiding Attention using Partial-Order Relationships for Image Captioning - Murad Popattia, Muhammad Rafi, Rizwan Qureshi, Shah Nawaz. link
(ID 33) - Learning to Ask Informative Sub-Questions for Visual Question Answering - Kohei Uehara, Nan Duan, Tatsuya Harada. link
12:15-12:45 - Keynote 3 - James Rehg - "Learning to Navigate from Vision and Language"
12:45-13:00 - Closing Remarks and Best Paper Award ceremony
Kate Saenko is an Associate Professor in the Department of Computer Science at Boston University, the director of the Computer Vision and Learning Group, and a member of the IVC Group. She received her PhD from MIT. Previously, she was an Assistant Professor in the Department of Computer Science at UMass Lowell, a Postdoctoral Researcher at the International Computer Science Institute, a Visiting Scholar at UC Berkeley EECS, and a Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests are in the broad area of Artificial Intelligence, with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.
James M. Rehg (pronounced "ray") is a Professor in the School of Interactive Computing at the Georgia Institute of Technology, where he co-directs the Center for Health Analytics and Informatics (CHAI). He received his Ph.D. from CMU in 1995 and worked at the Cambridge Research Lab of DEC (and then Compaq) from 1995 to 2001, where he managed the computer vision research group. He received an NSF CAREER award in 2001 and a Raytheon Faculty Fellowship from Georgia Tech in 2005. He and his students have received best student paper awards at ICML 2005, BMVC 2010, Mobihealth 2014, and Face and Gesture 2015, as well as a Method of the Year Award from the journal Nature Methods. Dr. Rehg served as Program co-Chair for ACCV 2012 and CVPR 2017 and General co-Chair for CVPR 2009. He has authored more than 200 peer-reviewed scientific papers and holds 30 issued US patents. His research interests include computer vision, machine learning, and mobile and computational health. Dr. Rehg was the lead PI on an NSF Expedition to develop the science and technology of Behavioral Imaging, the measurement and analysis of social and communicative behavior using multimodal sensing, with applications to developmental conditions such as autism. He is currently the Deputy Director and TR&D1 Lead of the mHealth Center for Discovery, Optimization, and Translation of Temporally-Precise Interventions (mDOT), which is developing novel on-body sensing and predictive analytics for improving health outcomes. He is also currently a visiting research scientist at Meta Reality Labs Research.
Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis received the best thesis award from INPG in 1996. Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at Inria Grenoble Rhône-Alpes, where she is a research director and directs an Inria team. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), editor-in-chief for IJCV (since 2013), a program chair of IEEE CVPR 2005 and ECCV 2012, as well as a general chair of IEEE CVPR 2015 and ECCV 2020. In 2006, 2014 and 2016, she was awarded the Longuet-Higgins prize for fundamental contributions in computer vision that have withstood the test of time. She is a fellow of IEEE. She was awarded an ERC advanced grant in 2013, the Humboldt research award in 2015, and the Inria & French Academy of Science Grand Prix in 2016. She was elected to the German National Academy of Sciences, Leopoldina, in 2017. Since February 2018 she has also been working part-time (50%) for Google France.
We gratefully acknowledge our reviewers
For additional information, please contact us here