
NVIDIA Claims 5x Token Cost Cuts on Blackwell via Software Stack

Amr Elmeleegy · Jul 1, 2026

NVIDIA claims its inference software stack has reduced token costs by up to 5x on the DeepSeek V4 model within one month on its Blackwell platform. The company argues that as AI moves from pilots to production, software optimization across serving, acceleration, and infrastructure layers becomes critical to cost efficiency. Leading inference providers including Baseten, Cognition, Deep Infra, and Together AI are already deploying these tools to improve throughput and reduce latency on Blackwell GPUs.

TL;DR

NVIDIA reports up to a 5x token cost reduction on DeepSeek V4 in one month using the Blackwell platform and its inference software stack
Baseten achieved 50% more tokens per second using TensorRT-LLM on Blackwell for DeepSeek V4 Pro
DigitalOcean and Hippocratic AI increased inference throughput 30% while maintaining sub-half-second response time across 10 million patient calls
NVIDIA's three-layer software approach (production operation, application acceleration, infrastructure access) aims to lower the cost of serving distributed agentic AI workloads

Why It Matters

As AI workloads shift from experimental pilots to production factories, cost per token has become the primary infrastructure metric, replacing peak performance specs. NVIDIA's software stack claims to compound optimizations across hardware, networking, and serving layers, directly impacting the economics of running large language models at scale. Early adopters report significant throughput gains and cost reductions, suggesting software optimization may become as important as hardware selection.
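To make the cost-per-token metric concrete, the following is a minimal sketch of the arithmetic linking throughput to serving cost. It assumes a fixed hourly GPU price and a steady tokens-per-second rate; the function name and all dollar and throughput figures are hypothetical illustrations, not numbers reported by NVIDIA or its partners.

```python
def cost_per_million_tokens(gpu_hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Return the serving cost in USD per one million generated tokens.

    Assumes the GPU is billed at a flat hourly rate and sustains the
    given throughput for the whole hour (a simplification).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures chosen only for illustration.
HOURLY_COST_USD = 4.0
BASELINE_TPS = 1000.0

baseline = cost_per_million_tokens(HOURLY_COST_USD, BASELINE_TPS)

# A 50% throughput gain at the same hourly price.
faster = cost_per_million_tokens(HOURLY_COST_USD, BASELINE_TPS * 1.5)
reduction = 1 - faster / baseline  # fraction by which cost per token falls

# A 5x throughput gain at the same hourly price.
five_x = cost_per_million_tokens(HOURLY_COST_USD, BASELINE_TPS * 5)

print(f"baseline: ${baseline:.4f} per 1M tokens")
print(f"+50% throughput: ${faster:.4f} per 1M tokens ({reduction:.1%} lower)")
print(f"5x throughput: ${five_x:.4f} per 1M tokens")
```

Under these assumptions, cost per token is inversely proportional to throughput: 50% more tokens per second at an unchanged hourly price lowers cost per token by about a third, and a 5x cost reduction corresponds to 5x the throughput per dollar of hardware time.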

Business Impact

Organizations deploying AI in production face pressure to reduce inference costs while meeting latency requirements. NVIDIA's software stack enables inference providers to serve models more efficiently without building custom infrastructure from scratch, lowering barriers to competitive deployment. Companies like Cognition and Cursor can accelerate time-to-production by using ready-made frameworks rather than building serving infrastructure independently.

Key Implications

Software optimization on existing hardware may deliver cost improvements comparable to hardware upgrades, shifting investment priorities for infrastructure teams
Agentic AI workloads spanning multiple models, tools, and distributed tasks require orchestration layers that traditional web infrastructure cannot provide, creating demand for specialized serving frameworks
Early-mover advantage in inference optimization may compound as software improvements stack across layers, potentially widening cost gaps between optimized and unoptimized deployments

What to Watch

Monitor whether the reported 5x token cost reduction on DeepSeek V4 holds across other models and use cases, or whether the gains are model-specific. Track adoption rates of NVIDIA's TensorRT-LLM and Dynamo frameworks among competing inference providers to assess whether NVIDIA's software stack becomes an industry standard. Watch for similar optimization claims from other hardware vendors and whether they achieve comparable cost reductions.


