vff — the signal in the noise

New Benchmark Exposes Audiovisual Fusion Gap in Open AI Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
Researchers introduced AV-SpeakerBench, a new benchmark with 3,212 multiple-choice questions designed to evaluate how well multimodal large language models understand audiovisual human speech in real-world videos. The benchmark focuses on speaker-centric reasoning, requiring models to align who speaks, what is said, and when it occurs, rather than relying on visual context alone. Evaluations show Gemini 2.5 Pro leads significantly, while open-source models like Qwen3-Omni-30B lag primarily due to weaker audiovisual fusion capabilities rather than visual perception limitations.

TL;DR

  • New benchmark AV-SpeakerBench contains 3,212 expert-curated questions focused on speaker-centric audiovisual reasoning in videos
  • Benchmark design embeds audiovisual dependencies directly into question semantics to prevent visual-only solving
  • Gemini 2.5 Pro achieves the best results, with the Gemini family consistently outperforming open-source systems
  • Open models struggle primarily with audiovisual fusion rather than visual perception, revealing a specific technical gap
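Benchmarks like this are typically scored as per-category multiple-choice accuracy, so fusion-specific weaknesses show up as low scores on speaker-centric categories. A minimal sketch of that scoring, with hypothetical field names and categories (AV-SpeakerBench's actual schema is not published in this summary):

```python
# Hypothetical MCQ scoring sketch. Field names ("category", "answer")
# and category labels are assumptions for illustration only.
from collections import defaultdict

def score(questions, predict):
    """Return per-category accuracy for a list of multiple-choice items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        cat = q["category"]
        total[cat] += 1
        if predict(q) == q["answer"]:
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy example: a degenerate model that always picks option "A".
items = [
    {"category": "who-spoke", "answer": "A"},
    {"category": "who-spoke", "answer": "C"},
    {"category": "when-said", "answer": "A"},
]
acc = score(items, lambda q: "A")
# acc["who-spoke"] == 0.5, acc["when-said"] == 1.0
```

Breaking accuracy out by category, rather than reporting a single number, is what lets an evaluation distinguish a fusion gap from a general perception gap.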

Why it matters

Most existing video benchmarks fail to rigorously test whether MLLMs can actually fuse audio and visual information to understand human speech in context. This benchmark fills that gap by forcing models to demonstrate genuine multimodal reasoning rather than solving tasks through visual cues alone, providing clearer signal on where current systems actually fall short in real-world audiovisual understanding.

Business relevance

For companies building multimodal AI products, this benchmark shows that audiovisual fusion is a distinct technical challenge, separate from visual perception. Teams developing video understanding features, meeting transcription tools, or video analysis platforms can use these results to determine whether their models need targeted improvements in cross-modal alignment rather than further investment in visual perception.

Key implications

  • Audiovisual fusion is a distinct bottleneck in open-source models that requires targeted architectural or training improvements, not just more visual data
  • Proprietary models from Google maintain substantial advantages in multimodal reasoning, suggesting superior training data, fusion mechanisms, or both
  • Speaker-centric framing of audiovisual tasks may become a standard evaluation approach, shifting how future benchmarks assess video understanding

What to watch

Monitor whether open-source model developers address the audiovisual fusion gap identified here, and track whether future MLLM releases show improvement on speaker-centric tasks. Also watch for adoption of AV-SpeakerBench in model evaluation pipelines and whether similar speaker-centric benchmarks emerge for other modalities.
