NVIDIA Releases Multilingual ASR Model Supporting 40 Languages

Jun 4, 2026

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter multilingual speech-to-text model that transcribes 40 language-locales from a single checkpoint in real time with native punctuation and capitalization. The model uses a Cache-Aware FastConformer-RNNT architecture to achieve low latency (0.07 seconds to final transcript) without sacrificing accuracy, and is available as open weights on Hugging Face for fine-tuning and deployment without API dependencies.

TL;DR

Nemotron 3.5 ASR supports 40 language-locales in a single 600M-parameter model, eliminating the need for separate language-specific deployments
Real-time streaming reaches a final transcript in 0.07 seconds by caching encoder state instead of reprocessing overlapping audio chunks
The model produces punctuation and capitalization natively, removing the need for separate post-processing pipelines
Available as open weights on Hugging Face with fine-tuning capability for custom languages, domains, and accents
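
The caching idea in the second point can be illustrated with a toy sketch. This is not NVIDIA's implementation or the FastConformer-RNNT architecture: the two-layer causal moving-average "encoder", the KERNEL value, and the function names are all hypothetical stand-ins. The sketch only shows why keeping each layer's recent inputs in a cache avoids re-encoding overlapping audio while producing the same output.

```python
from typing import List, Tuple

KERNEL = 3  # hypothetical causal kernel width of each toy layer


def causal_layer(frames: List[float], history: List[float]) -> List[float]:
    """Causal moving average; history holds the KERNEL-1 frames before frames[0]."""
    padded = history + frames
    return [sum(padded[i:i + KERNEL]) / KERNEL for i in range(len(frames))]


def zeros() -> List[float]:
    """Zero history used at the very start of the audio."""
    return [0.0] * (KERNEL - 1)


def stream_buffered(audio: List[float], chunk: int) -> Tuple[List[float], int]:
    """Re-encode each chunk together with enough past audio to rebuild context."""
    overlap = 2 * (KERNEL - 1)  # receptive field of the two stacked layers
    out: List[float] = []
    work = 0  # frames pushed through a layer, summed over both layers
    for start in range(0, len(audio), chunk):
        window = audio[max(0, start - overlap):start + chunk]
        h2 = causal_layer(causal_layer(window, zeros()), zeros())
        n_new = len(audio[start:start + chunk])
        out.extend(h2[-n_new:])  # keep only the outputs for the new frames
        work += 2 * len(window)
    return out, work


def stream_cached(audio: List[float], chunk: int) -> Tuple[List[float], int]:
    """Encode only new frames; each layer keeps its last KERNEL-1 inputs."""
    cache1, cache2 = zeros(), zeros()
    out: List[float] = []
    work = 0
    for start in range(0, len(audio), chunk):
        new = audio[start:start + chunk]
        h1 = causal_layer(new, cache1)
        h2 = causal_layer(h1, cache2)
        cache1 = (cache1 + new)[-(KERNEL - 1):]  # update layer-1 input cache
        cache2 = (cache2 + h1)[-(KERNEL - 1):]   # update layer-2 input cache
        out.extend(h2)
        work += 2 * len(new)
    return out, work


if __name__ == "__main__":
    audio = [float(i) for i in range(12)]
    _, buffered_work = stream_buffered(audio, chunk=4)
    _, cached_work = stream_cached(audio, chunk=4)
    print(buffered_work, cached_work)  # prints "40 24"
```

In this toy setup both functions return identical outputs, but the buffered version pushes 40 frames through the layers while the cached version pushes 24. The gap widens as the encoder's receptive field grows relative to the chunk size.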

Why It Matters

Multilingual speech recognition has historically required stitching together multiple models or APIs, each with different latency profiles and billing structures. Nemotron 3.5 ASR consolidates this complexity into a single model that handles language switching mid-sentence and delivers production-ready output without additional post-processing, reducing infrastructure overhead for speech-enabled applications.

Business Impact

Organizations building multilingual products can reduce operational complexity and cost by deploying a single model instead of managing a separate integration for each supported language-locale. The open-weights approach eliminates per-call API billing and allows companies to fine-tune the model for domain-specific vocabulary or accents, improving accuracy for specialized use cases like customer support or medical transcription.

Key Implications

Enterprises can consolidate multilingual ASR infrastructure, reducing vendor lock-in and per-call costs associated with API-based solutions
Native punctuation and capitalization eliminate the need for secondary NLP models, simplifying deployment pipelines and reducing latency
Fine-tuning capability enables customization for industry-specific terminology and regional accents without retraining from scratch
Real-time streaming with low latency opens use cases in live captioning and conversational AI that were previously impractical with traditional buffered ASR

What to Watch

Monitor adoption rates across enterprise speech applications and whether fine-tuning results meet accuracy targets for specialized domains. Track whether the model's multilingual capability reduces fragmentation among ASR vendors, and whether competing models adopt similar caching architectures to match its latency.
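
The article does not say which metric defines an accuracy target. One common choice for ASR is word error rate (WER), so the sketch below assumes it; the function name and the example sentences are illustrative only. The comparison is case- and punctuation-sensitive, which matters for a model that emits punctuation and capitalization itself: normalize both strings first if those should not count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Running the same held-out, domain-specific test set through the base model and a fine-tuned model and comparing the two WER values is one way to check whether fine-tuning met its target.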
