Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
Summary
Researchers ran a very large AI model (Qwen 397B, a Mixture-of-Experts model in which each response activates only a subset of the total weights) on a MacBook Pro using Apple's "LLM in a Flash" technique, which keeps the model weights on fast SSD storage and pulls them into RAM on demand rather than holding everything in memory at once. They used Claude to run 90 experiments and generate optimized code that reached 5.5+ tokens per second (response speed) by quantizing (reducing the precision of) the expert weights to 2-bit while keeping the other components at full precision. The final setup held only 5.5GB resident in memory while streaming the remaining 120GB of compressed model weights from disk on demand.
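The core idea (stream 2-bit-quantized expert weights from disk, decompress only the experts the router selects for each token) can be sketched in a few lines. This is a minimal illustrative sketch, not the researchers' actual code: the expert count, weight shapes, symmetric dequantization scheme, and all function names here are assumptions made for the example, and a real implementation would use per-group scales and much larger tensors.

```python
import numpy as np
import tempfile, os

EXPERTS = 8          # hypothetical MoE layer with 8 experts (toy number)
ROWS, COLS = 4, 16   # toy expert weight shape; COLS divisible by 4 for 2-bit packing
PACKED_BYTES = ROWS * COLS // 4

def pack_2bit(codes):
    """Pack 2-bit codes (values 0..3) into bytes, 4 codes per byte."""
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2)
            | (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, n):
    """Inverse of pack_2bit: recover n 2-bit codes from packed bytes."""
    out = np.empty((len(packed), 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (packed >> (2 * i)) & 0b11
    return out.reshape(-1)[:n]

def load_expert(mm, idx, scale=1.0):
    """Read one expert's packed weights from the memory map and dequantize."""
    packed = np.asarray(mm[idx * PACKED_BYTES:(idx + 1) * PACKED_BYTES])
    codes = unpack_2bit(packed, ROWS * COLS).astype(np.float32)
    # Toy symmetric dequantization: codes {0,1,2,3} -> {-1.5,-0.5,0.5,1.5} * scale
    return ((codes - 1.5) * scale).reshape(ROWS, COLS)

# Build a toy on-disk weight file: EXPERTS experts, each 2-bit packed.
rng = np.random.default_rng(0)
all_codes = rng.integers(0, 4, size=(EXPERTS, ROWS * COLS), dtype=np.uint8)
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(pack_2bit(all_codes[e]).tobytes())

# Memory-map the file: bytes stay on disk until a slice is actually touched,
# which is the "stream from SSD on demand" part of the technique.
mm = np.memmap(path, dtype=np.uint8, mode="r")

# A router picks a small subset of experts per token; only those are paged in.
active = [2, 5]  # e.g. top-2 routing for one token
weights = [load_expert(mm, e) for e in active]
```

The point of the sketch is that per-token memory cost is bounded by the few active experts, not by the full model: the `np.memmap` keeps the packed file on disk and the OS page cache faults in only the slices `load_expert` reads.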
Classification
Affected Vendors
Related Issues
Original source: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/#atom-everything
First tracked: March 18, 2026 at 08:00 PM
Classified by LLM (prompt v3) · confidence: 92%