4th Multimodal Learning and Applications Workshop

In conjunction with CVPR 2021.

VIRTUAL - June 19th 2021 (Morning)

4th Multimodal Learning and Applications Workshop (MULA 2021)

NEWS! Full recording of the event is available at https://www.youtube.com/watch?v=pHuFMcaoLio&ab_channel=MichaelYang

The exploitation of big data in the last few years has led to a big step forward in many applications of Computer Vision. However, most of the tasks tackled so far involve the visual modality only, mainly due to the unbalanced number of labelled samples available among modalities (e.g., there are many huge labelled datasets for images but not as many for audio or IMU-based classification), resulting in a huge gap in performance when algorithms are trained separately.

Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio) to transfer semantic information from one modality to another, reaching surprising results. Interesting applications have also been proposed in a self-supervised fashion, where correspondences between multiple modalities are learned without the need for manual labelling, resulting in a more powerful set of features than those learned by processing the modalities separately. Other works have also shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavorable environmental conditions. These topics have gained a lot of interest in the computer vision community in recent years.

Information fusion from multiple sensors is also a topic of major interest in industry; the exponential growth of companies working on automotive, drone vision, surveillance or robotics applications is just one example. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities. The workshop will serve as a forum for research groups from academia and industry.

We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, and synthetic data. Position papers with feasibility studies and cross-modality issues with a strongly applied focus are also encouraged. Multimodal data analysis is a very important bridge among vision, multimedia, remote sensing, and robotics; therefore, we expect a positive response from these communities.

Potential topics include, but are not limited to:

  • Multimodal learning
  • Cross-modal learning
  • Self-supervised learning for multimodal data
  • Multimodal data generation and sensors
  • Unsupervised learning on multimodal data
  • Cross-modal adaptation
  • Multimodal data fusion and data representation
  • Multimodal transfer learning
  • Multimodal scene understanding
  • Vision and Language
  • Vision and Sound
  • Multimodal applications (e.g. drone vision, autonomous driving, industrial inspection, etc.)


Papers will be limited to 8 pages according to the CVPR format (cf. main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Papers will be published in the CVPR 2021 proceedings.

All papers should be submitted through the CMT website https://cmt3.research.microsoft.com/MULA2021.

Important Dates

  • Deadline for submission: March 10th, 2021 - 23:59 Pacific Standard Time (EXTENDED)
  • Firm deadline for submission: March 14th, 2021 - 23:59 Pacific Standard Time
  • Notification of acceptance: April 8th, 2021
  • Camera Ready submission deadline: April 19th, 2021
  • Workshop date: June 19th, 2021 (Morning)


N.B. Times are in North America West Coast time; [times in brackets are Central European daylight saving time]. More details here
Zoom and Gatherly links will be available on June 19th

08:00-08:10 - Welcome from organizers and opening remarks

08:10-08:40 - Keynote - Rogerio Schmidt Feris - "Adaptive Multimodal Learning for Efficient Video Understanding"

Abstract: The tremendous growth of multimodal video data in recent years has increased the demand for efficient multimodal deep neural network models, particularly in domains where real-time inference is essential. While significant progress has been made on model compression and acceleration for video understanding, most existing methods rely on one-size-fits-all models, which apply the same amount of computation for all video segments across all modalities. In this talk, I will instead cover methods that adaptively change computation depending on the content of the input. In particular, in the context of audio-visual action recognition, I will describe a method that adaptively decides which modality to use for each video segment (deciding where to look at and listen to in the video), with the goal of improving both accuracy and efficiency. Finally, I will conclude my talk by describing ongoing work that integrates this technology into a system for auto-curation of sports highlights based on multimodal video understanding.

08:40-09:35 - Oral Session I (5-min presentations)

(ID 02 - Poster slot 1) - Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation - Jae Won Cho, Dong-Jin Kim, Jinsoo Choi, Yunjae Jung, In So Kweon
(ID 11 - Poster slot 2) - Beyond VQA: Generating Multi-word Answers and Rationales to Visual Questions - Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian
(ID 14 - Poster slot 3) - Using Text to Teach Image Retrieval - Haoyu Dong, Ze Wang, Qiang Qiu, Guillermo Sapiro
(ID 17 - Poster slot 4) - An Improved Attention for Visual Question Answering - Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini
(ID 19 - Poster slot 5) - Target-Tailored Source-Transformation for Scene Graph Generation - Wentong Liao, Cuiling Lan, Michael Ying Yang, Wenjun Zeng, Bodo Rosenhahn
(ID 25 - Poster slot 6) - Private-Shared Disentangled Multimodal VAE for Learning of Latent Representations - Mihee Lee, Vladimir Pavlovic
(ID 26 - Poster slot 7) - Editing like Humans: A Contextual, Multimodal Framework for Automated Video Editing - Sharath Koorathota, Patrick J Adelman, Kelly Cotton, Paul Sajda
(ID 30 - Poster slot 8) - Exploring the Limits of Zero-Shot Learning - How Low Can You Go? - Hemanth Dandu, Karan Sharma, Suchendra M. Bhandarkar

09:35-10:05 - Keynote - Lorenzo Torresani - "Vision using Sight...but also Sound and Speech"

10:05-11:00 - Oral Session II (5-min presentations)

(ID 01 - Poster slot 9) - Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences - Longlong Jing, Ling Zhang, YingLi Tian
(ID 07 - Poster slot 10) - Adaptive Intermediate Representations for Video Understanding - Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael S Ryoo, Anelia Angelova
(ID 10 - Poster slot 11) - Practical Cross-modal Manifold Alignment for Robotic Grounded Language Learning - Andre T Nguyen, Luke Richards, Gaoussou Y Kebe, Edward Raff, Kasra Darvish, Francis Ferraro, Cynthia Matuszek
(ID 12 - Poster slot 12) - Progressive Knowledge-Embedded Unified Perceptual Parsing for Scene Understanding - Wenbo Zheng, Lan Yan, Chao Gou, Fei-Yue Wang
(ID 21 - Poster slot 13) - Radar Camera Fusion via Representation Learning in Autonomous Driving - Xu Dong, Binnan Zhuang, Yunxiang Mao, Langechuan Liu
(ID 24 - Poster slot 14) - Cross-modal Speaker Verification and Recognition: A Multilingual Perspective - Shah Nawaz, Muhammad Saad Saeed, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Mohammad Haroon Yousaf, Alessio Del Bue
(ID 31 - Poster slot 15) - APES: Audiovisual Person Search in Untrimmed Video - Juan C Leon, Fabian Caba, Federico Perazzi, Long T Mai, Joon-Young Lee, Bernard Ghanem, Pablo Arbelaez
(ID 32 - Poster slot 16) - 3D Hand Pose Estimation via aligned latent space injection and kinematic losses - Andreas Stergioulas, Theocharis Chatzis, Dimitrios Konstantinidis, Kosmas Dimitropoulos, Petros Daras

11:00-11:30 - Keynote - Kostas Daniilidis - "Event- vs frame-based vision"

11:30-11:35 - Closing Remarks

11:35-12:00 - Poster Session (all papers)

Invited Speakers

Kostas Daniilidis is Professor of Computer Vision at the Computer and Information Science Department at the University of Pennsylvania. Kostas’s research interests are in computer vision and robotic perception. His research addresses challenges in the perception of motion and space, such as the geometric design of cameras, and the interplay of geometry and appearance in perception tasks. Kostas’s research gives solutions to perceptual tasks such as panoramic vision, localization, perception of self-motion, large-scale mapping, visual location recognition, 3-D object recognition, and vision-based flocking. Applications of his research involve robot navigation, tele-immersion, and image and shape retrieval.

Lorenzo Torresani is a Professor in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.

Rogerio Schmidt Feris is a principal scientist and manager at the MIT-IBM Watson AI lab. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. He has authored over 140 technical papers and has over 40 issued patents in the areas of computer vision, multimedia, and machine learning. His current work is particularly focused on deep learning methods that are label-efficient (learning with limited labels), sample-efficient (learning with less data), and computationally efficient. He is also interested in multimodal perception methods that combine vision, sound/speech, and language.


Organizers

Michael Ying Yang

University of Twente, Netherlands

Pietro Morerio

Istituto Italiano di Tecnologia, Italy

Paolo Rota

Università di Trento, Italy

Bodo Rosenhahn

Institut für Informationsverarbeitung, Leibniz-Universität Hannover, Germany

Vittorio Murino

Istituto Italiano di Tecnologia & Università di Verona, Italy & Huawei Technologies, Ireland


We gratefully acknowledge our reviewers

    Alina Roitberg
    Andrea Pilzer
    Andrea Zunino
    Christoph Reinders
    Dayan Guan
    Giacomo Zara
    Gianluca Scarpellini
    Guanglei Yang
    Haidong Zhu
    Han Zou
    Hanno Ackermann
    Hari Prasanna Das
    Jianfei Yang
    Jiguo Li
    Kohei Uehara
    Kosmas Dimitropoulos
    Letitia E Parcalabescu
    Limin Wang
    Marco Godi
    Mengyi Zhao
    Praneet Dutta
    Ramakrishnan Kannan
    Riccardo Volpi
    Tal Hakim
    Thomas Theodoridis
    Victor G. Turrisi da Costa
    Vladimir Iashin
    Vladimir V Kniaz
    Wenbo Zheng
    Willi Menapace
    Xin Chen
    Yanpeng Cao


For additional info please contact us here