In conjunction with CVPR 2021.
VIRTUAL - June 19th 2021 (Morning)
NEWS! Full recording of the event is available at https://www.youtube.com/watch?v=pHuFMcaoLio&ab_channel=MichaelYang
The exploitation of big data over the last few years has led to a big step forward in many applications of Computer Vision. However, most of the tasks tackled so far involve the visual modality only, mainly because of the unbalanced number of labelled samples available across modalities (e.g., there are many large labelled datasets for images, but far fewer for audio or IMU-based classification), resulting in a large gap in performance when algorithms are trained separately on each modality.
Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio) to transfer semantic information from one modality to another, achieving surprising results. Interesting applications have also been proposed in a self-supervised fashion, where correspondences between modalities are learned without the need for manual labelling, resulting in a more powerful set of features than those learned by processing the two modalities separately. Other works have shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavorable environmental conditions. These topics have been gaining considerable interest in the computer vision community in recent years.
Information fusion from multiple sensors is also a topic of major interest in industry; the exponential growth of companies working on automotive, drone vision, surveillance, or robotics is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers presenting feasibility studies and cross-modality issues with a highly applicative flavour are also encouraged. Multimodal data analysis is an important bridge among vision, multimedia, remote sensing, and robotics; therefore, we expect a positive response from these communities.
Potential topics include, but are not limited to:
Papers are limited to 8 pages in the CVPR format (cf. the main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Accepted papers will be published in the CVPR 2021 proceedings.
All papers should be submitted through the CMT website: https://cmt3.research.microsoft.com/MULA2021.
N.B. Times are North America West Coast Time; [times in brackets are Central European Summer Time]. More details here
Zoom and Gatherly links will be available on June 19th
08:00-08:10 - Welcome from the organizers and opening remarks
[17:00-17:10]
08:10-08:40 - Keynote - Rogerio Schmidt Feris - "Adaptive Multimodal Learning for Efficient Video Understanding"
[17:10-17:40]
Abstract: The tremendous growth of multimodal video data in recent years has increased the demand for efficient multimodal deep neural network models, particularly in domains where real-time inference is essential. While significant progress has been made on model compression and acceleration for video understanding, most existing methods rely on one-size-fits-all models, which apply the same amount of computation for all video segments across all modalities. In this talk, I will instead cover methods that adaptively change computation depending on the content of the input. In particular, in the context of audio-visual action recognition, I will describe a method that adaptively decides which modality to use for each video segment (deciding where to look at and listen to in the video), with the goal of improving both accuracy and efficiency. Finally, I will conclude my talk by describing ongoing work that integrates this technology into a system for auto-curation of sports highlights based on multimodal video understanding.
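For readers unfamiliar with this line of work, the sketch below illustrates one way per-segment modality selection can be set up: a lightweight gating network looks at inexpensive audio and visual features for each video segment and makes a nearly discrete choice of which branch to run. This is a minimal illustrative sketch in PyTorch, not the speaker's implementation; the module names, feature dimensions, and the Gumbel-softmax gating mechanism are all assumptions made for the example.

```python
# Minimal sketch (assumed, illustrative only): per-segment gating between
# an audio head and a video head for action recognition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGate(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, num_classes=50):
        super().__init__()
        # A lightweight gate sees cheap features from both modalities and
        # outputs a per-segment decision: use the audio or the video branch.
        self.gate = nn.Linear(audio_dim + video_dim, 2)
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.video_head = nn.Linear(video_dim, num_classes)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, segments, audio_dim)
        # video_feats: (batch, segments, video_dim)
        gate_logits = self.gate(torch.cat([audio_feats, video_feats], dim=-1))
        # Gumbel-softmax gives a differentiable, nearly discrete choice,
        # so at inference only the selected branch needs to be evaluated.
        choice = F.gumbel_softmax(gate_logits, tau=1.0, hard=True)  # (B, S, 2)
        logits = (choice[..., 0:1] * self.audio_head(audio_feats)
                  + choice[..., 1:2] * self.video_head(video_feats))
        # Average segment-level predictions into a clip-level score.
        return logits.mean(dim=1)

# Example usage with random features.
model = ModalityGate()
audio = torch.randn(2, 8, 128)
video = torch.randn(2, 8, 512)
print(model(audio, video).shape)  # torch.Size([2, 50])
```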
08:40-09:35 - Oral Session I (5-min presentations)
[17:40-18:35]
(ID 02 - Poster slot 1) - Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation - Jae Won Cho, Dong-Jin Kim, Jinsoo Choi, Yunjae Jung, In So Kweon
(ID 11 - Poster slot 2) - Beyond VQA: Generating Multi-word Answers and Rationales to Visual Questions - Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian
(ID 14 - Poster slot 3) - Using Text to Teach Image Retrieval - Haoyu Dong, Ze Wang, Qiang Qiu, Guillermo Sapiro
(ID 17 - Poster slot 4) - An Improved Attention for Visual Question Answering - Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini
(ID 19 - Poster slot 5) - Target-Tailored Source-Transformation for Scene Graph Generation - Wentong Liao, Cuiling Lan, Michael Ying Yang, Wenjun Zeng, Bodo Rosenhahn
(ID 25 - Poster slot 6) - Private-Shared Disentangled Multimodal VAE for Learning of Latent Representations - Mihee Lee, Vladimir Pavlovic
(ID 26 - Poster slot 7) - Editing like Humans: A Contextual, Multimodal Framework for Automated Video Editing - Sharath Koorathota, Patrick J Adelman, Kelly Cotton, Paul Sajda
(ID 30 - Poster slot 8) - Exploring the Limits of Zero-Shot Learning - How Low Can You Go? - Hemanth Dandu, Karan Sharma, Suchendra M. Bhandarkar
09:35-10:05 - Keynote - Lorenzo Torresani - "Vision using Sight...but also Sound and Speech" (link)
[18:35-19:05]
10:05-11:00 - Oral Session II (5-min presentations)
[19:05-20:00]
(ID 01 - Poster slot 9) - Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences - Longlong Jing, Ling Zhang, YingLi Tian
(ID 07 - Poster slot 10) - Adaptive Intermediate Representations for Video Understanding - Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael S Ryoo, Anelia Angelova
(ID 10 - Poster slot 11) - Practical Cross-modal Manifold Alignment for Robotic Grounded Language Learning - Andre T Nguyen, Luke Richards, Gaoussou Y Kebe, Edward Raff, Kasra Darvish, Francis Ferraro, Cynthia Matuszek
(ID 12 - Poster slot 12) - Progressive Knowledge-Embedded Unified Perceptual Parsing for Scene Understanding - Wenbo Zheng, Lan Yan, Chao Gou, Fei-Yue Wang
(ID 21 - Poster slot 13) - Radar Camera Fusion via Representation Learning in Autonomous Driving - Xu Dong, Binnan Zhuang, Yunxiang Mao, Langechuan Liu
(ID 24 - Poster slot 14) - Cross-modal Speaker Verification and Recognition: A Multilingual Perspective - Shah Nawaz, Muhammad Saad Saeed, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Mohammad Haroon Yousaf, Alessio Del Bue
(ID 31 - Poster slot 15) - APES: Audiovisual Person Search in Untrimmed Video - Juan C Leon, Fabian Caba, Federico Perazzi, Long T Mai, Joon-Young Lee, Bernard Ghanem, Pablo Arbelaez
(ID 32 - Poster slot 16) - 3D Hand Pose Estimation via aligned latent space injection and kinematic losses - Andreas Stergioulas, Theocharis Chatzis, Dimitrios Konstantinidis, Kosmas Dimitropoulos, Petros Daras
11:00-11:30 - Keynote - Kostas Daniilidis - "Event- vs frame-based vision"
[20:00-20:30]
11:30-11:35 - Closing Remarks
[20:30-20:35]
11:35-12:00 - Poster Session (all papers)
[20:35-21:00]
Kostas Daniilidis is Professor of Computer Vision in the Computer and Information Science Department at the University of Pennsylvania. Kostas' research interests are in computer vision and robotic perception. His research addresses challenges in the perception of motion and space, such as the geometric design of cameras and the interplay of geometry and appearance in perception tasks. Kostas' research provides solutions to perceptual tasks such as panoramic vision, localization, perception of self-motion, large-scale mapping, visual location recognition, 3-D object recognition, and vision-based flocking. Applications of his research include robot navigation, tele-immersion, and image and shape retrieval.
Lorenzo Torresani is a Professor in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.
Rogerio Schmidt Feris is a principal scientist and manager at the MIT-IBM Watson AI Lab. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. He has authored over 140 technical papers and holds over 40 issued patents in the areas of computer vision, multimedia, and machine learning. His current work is particularly focused on deep learning methods that are label-efficient (learning with limited labels), sample-efficient (learning with less data), and computationally efficient. He is also interested in multimodal perception methods that combine vision, sound/speech, and language.
We gratefully acknowledge our reviewers
For additional info, please contact us here