
Apple's Flash-Based Model Architecture Breaks On-Device Memory Ceiling

Jun 10, 2026

Apple announced AFM 3, a new foundation model family developed with Google. It includes a 20-billion-parameter on-device model that stores its weights in NAND flash rather than DRAM. The architecture selects experts once per prompt instead of once per token, which lets a larger model run locally while staying within the memory limits of consumer devices. This addresses a fundamental limitation that has kept on-device AI models significantly smaller than cloud alternatives.

TL;DR

AFM 3 Core Advanced stores 20B parameters in NAND flash, not DRAM, bypassing the memory ceiling that has limited on-device models
Expert routing happens once per prompt, not per token, because NAND-to-DRAM bandwidth cannot support continuous weight swapping
Active parameter count scales from 1B to 4B based on task complexity, drawn from the full 20B pool in flash storage
Apple developed the architecture with Google and runs server-side models on Nvidia GPUs in Google Cloud within Apple's Private Cloud Compute boundary
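
The routing behavior summarized above can be illustrated with a toy sketch. Apple has not published an API or implementation details in this announcement, so everything structural here is hypothetical: the 40 equally sized experts, the complexity score in [0, 1], the linear budget mapping, and the names `route_once` and `FlashBackedModel`. Only the 20B pool and the 1B to 4B active range come from the announcement.

```python
from __future__ import annotations

TOTAL_PARAMS_B = 20.0   # full pool held in flash (from the announcement)
MIN_ACTIVE_B = 1.0      # smallest active set (from the announcement)
MAX_ACTIVE_B = 4.0      # largest active set (from the announcement)
NUM_EXPERTS = 40        # assumption: 40 equally sized experts
EXPERT_SIZE_B = TOTAL_PARAMS_B / NUM_EXPERTS  # 0.5B parameters each


def active_budget_b(complexity: float) -> float:
    """Map a hypothetical complexity score in [0, 1] to an active budget."""
    c = min(max(complexity, 0.0), 1.0)
    return MIN_ACTIVE_B + c * (MAX_ACTIVE_B - MIN_ACTIVE_B)


def route_once(scores: list[float], complexity: float) -> list[int]:
    """Choose experts a single time for the whole prompt, within the budget."""
    k = max(1, int(active_budget_b(complexity) / EXPERT_SIZE_B))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


class FlashBackedModel:
    """Toy model that only counts expert loads from flash into DRAM."""

    def __init__(self) -> None:
        self.resident = set()   # expert ids currently held in DRAM
        self.flash_loads = 0    # number of expert reads from flash

    def load_experts(self, expert_ids: list[int]) -> None:
        for expert_id in expert_ids:
            if expert_id not in self.resident:
                self.resident.add(expert_id)
                self.flash_loads += 1

    def generate(self, prompt_scores: list[float], complexity: float,
                 num_tokens: int) -> list[int]:
        chosen = route_once(prompt_scores, complexity)
        self.load_experts(chosen)          # one burst of flash reads per prompt
        loads_after_routing = self.flash_loads
        for _ in range(num_tokens):
            # A real decode step would run here using only resident experts.
            pass
        # Decoding never touched flash again.
        assert self.flash_loads == loads_after_routing
        return chosen


if __name__ == "__main__":
    scores = [i / NUM_EXPERTS for i in range(NUM_EXPERTS)]
    model = FlashBackedModel()
    experts = model.generate(scores, complexity=1.0, num_tokens=16)
    print(len(experts), "experts loaded once;", model.flash_loads, "flash reads")
```

The point of the sketch is the shape of the loop: flash is read when the prompt arrives, and the decode loop that follows works only from what is already resident. A per-token router would instead call `load_experts` inside the loop.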

Why It Matters

On-device AI has been constrained by DRAM capacity, forcing developers to choose between capable cloud models and limited local ones. Apple's flash-based weight storage and per-prompt routing loosen this constraint, so that a substantially larger model can run locally. This expands what on-device AI agents can accomplish without depending on the cloud.
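
A rough footprint calculation shows why DRAM capacity is the binding limit. The 4-bit weight format and the 8 GB DRAM budget below are illustrative assumptions, not figures Apple has disclosed; only the 20B, 4B, and 1B parameter counts come from the announcement.

```python
DRAM_BUDGET_GB = 8.0  # assumption: DRAM available to the model on a device


def weight_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and anything else sharing DRAM.
    The 4-bit default is an assumed quantization level.
    """
    return params_billion * bits_per_weight / 8


def fits_in_dram(params_billion: float, bits_per_weight: int = 4,
                 budget_gb: float = DRAM_BUDGET_GB) -> bool:
    """True if the weights alone fit inside the assumed DRAM budget."""
    return weight_gb(params_billion, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    for label, params in [("full 20B pool", 20), ("4B active", 4), ("1B active", 1)]:
        print(f"{label}: {weight_gb(params):.1f} GB, fits: {fits_in_dram(params)}")
```

Under these assumptions the full pool does not fit in DRAM while either active set does, which is the gap that flash storage is meant to cover.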

Business Impact

Enterprise architects evaluating agentic workloads now have a third option between cloud-dependent models and small on-device ones. Larger local models can reduce latency, improve privacy, and lower cloud compute costs. Deployment viability, however, depends on details Apple has not yet published: energy consumption, thermal behavior, and the policy for offloading work to the cloud.

Key Implications

On-device model capacity can now scale to 20B parameters, narrowing the gap with server-side deployments and enabling more complex local reasoning
Per-prompt routing trades token-level flexibility for memory efficiency, which may hurt performance on tasks that benefit from switching experts partway through a sequence
Apple's undisclosed offloading behavior and lack of energy or thermal profiling data create uncertainty for enterprises planning production deployments
The architecture depends on NAND flash speed and DRAM bandwidth characteristics specific to Apple silicon, limiting portability to other platforms
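
The bandwidth trade-off behind per-prompt routing can be made concrete with back-of-envelope arithmetic. Every number here is a placeholder chosen for illustration, since Apple has not published bandwidth figures: a 2 GB active set (4B parameters at an assumed 4 bits each), 3 GB/s of sustained flash reads, 20 tokens per second, and half of the active weights changing on each token under per-token routing.

```python
ACTIVE_SET_GB = 2.0    # assumption: 4B active parameters at 4 bits per weight
FLASH_READ_GBPS = 3.0  # assumption: sustained NAND read bandwidth in GB/s
TOKENS_PER_S = 20.0    # assumption: target decode speed
SWAP_FRACTION = 0.5    # assumption: share of active weights replaced per token


def per_token_bandwidth_gbps(active_gb: float, swap_fraction: float,
                             tokens_per_s: float) -> float:
    """Flash read bandwidth needed if experts are re-selected on every token."""
    return active_gb * swap_fraction * tokens_per_s


def per_prompt_load_s(active_gb: float, flash_gbps: float) -> float:
    """One-time delay to load the active set when routing happens per prompt."""
    return active_gb / flash_gbps


if __name__ == "__main__":
    needed = per_token_bandwidth_gbps(ACTIVE_SET_GB, SWAP_FRACTION, TOKENS_PER_S)
    load = per_prompt_load_s(ACTIVE_SET_GB, FLASH_READ_GBPS)
    print(f"per-token routing needs {needed:.1f} GB/s of flash reads")
    print(f"per-prompt routing pays {load:.2f} s once, then reads nothing")
```

With these placeholder values, per-token swapping would demand several times the assumed flash bandwidth, while per-prompt routing pays a one-time load cost of under a second. Real figures would depend on the storage and memory characteristics of the specific device.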

What to Watch

Monitor whether Apple publishes the energy, thermal, and bandwidth profiling data needed for production deployment decisions. Watch for third-party benchmarks on real-world agentic workloads, and for whether offloading to the cloud is made visible to developers and users rather than happening silently. Track adoption patterns among enterprise customers weighing on-device against hybrid inference strategies.


