Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
Summary
Researchers ran a very large AI model (Qwen 397B, a Mixture-of-Experts model in which each response activates only a subset of the total weights) on a MacBook Pro using Apple's "LLM in a Flash" technique, which keeps the model weights on fast SSD storage and pulls them into RAM on demand rather than holding everything in memory at once. They used Claude to run 90 experiments and generate optimized code that reached 5.5+ tokens per second (response speed) by quantizing (reducing the precision of) the expert weights to 2-bit while keeping the other components at full precision. The final setup held only 5.5GB resident in memory while streaming the remaining 120GB of compressed model weights from disk on demand.
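The core idea (stream 2-bit-quantized expert weights from disk, decompress only the experts the router selects for each token) can be sketched in a few lines. This is a minimal illustrative sketch, not the researchers' actual code: the expert count, weight shapes, symmetric dequantization scheme, and all function names here are assumptions made for the example, and a real implementation would use per-group scales and much larger tensors.

```python
import numpy as np
import tempfile, os

EXPERTS = 8          # hypothetical MoE layer with 8 experts (toy number)
ROWS, COLS = 4, 16   # toy expert weight shape; COLS divisible by 4 for 2-bit packing
PACKED_BYTES = ROWS * COLS // 4

def pack_2bit(codes):
    """Pack 2-bit codes (values 0..3) into bytes, 4 codes per byte."""
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2)
            | (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, n):
    """Inverse of pack_2bit: recover n 2-bit codes from packed bytes."""
    out = np.empty((len(packed), 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (packed >> (2 * i)) & 0b11
    return out.reshape(-1)[:n]

def load_expert(mm, idx, scale=1.0):
    """Read one expert's packed weights from the memory map and dequantize."""
    packed = np.asarray(mm[idx * PACKED_BYTES:(idx + 1) * PACKED_BYTES])
    codes = unpack_2bit(packed, ROWS * COLS).astype(np.float32)
    # Toy symmetric dequantization: codes {0,1,2,3} -> {-1.5,-0.5,0.5,1.5} * scale
    return ((codes - 1.5) * scale).reshape(ROWS, COLS)

# Build a toy on-disk weight file: EXPERTS experts, each 2-bit packed.
rng = np.random.default_rng(0)
all_codes = rng.integers(0, 4, size=(EXPERTS, ROWS * COLS), dtype=np.uint8)
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(pack_2bit(all_codes[e]).tobytes())

# Memory-map the file: bytes stay on disk until a slice is actually touched,
# which is the "stream from SSD on demand" part of the technique.
mm = np.memmap(path, dtype=np.uint8, mode="r")

# A router picks a small subset of experts per token; only those are paged in.
active = [2, 5]  # e.g. top-2 routing for one token
weights = [load_expert(mm, e) for e in active]
```

The point of the sketch is that per-token memory cost is bounded by the few active experts, not by the full model: the `np.memmap` keeps the packed file on disk and the OS page cache faults in only the slices `load_expert` reads.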
Classification
Affected Vendors
Related Issues
Original source: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-everything
First tracked: March 18, 2026 at 08:00 PM
Classified by LLM (prompt v3) · confidence: 92%