Multimodal Day is a research-focused mini-conference dedicated to advancing the integration of diverse data modalities in machine learning. In recent years, foundation models have propelled breakthroughs in computer vision and natural language processing. Yet, much of this progress has been confined to single-modality tasks, with limited exploration of richer, more complex modalities such as audio, video, 3D data, and beyond.
Building truly multimodal systems poses unique challenges: synchronizing heterogeneous data streams, managing complex annotations, and developing architectures capable of fusing information across domains. Many existing approaches simply extend single-modality solutions, often resulting in suboptimal performance compared to integrated multimodal strategies.
Join us to explore how multimodal learning can unlock new capabilities in AI, and to be part of a growing community bridging natural language processing, computer vision, and audio analysis.