Summary
Walter Grace announced the "mac-code" project, demonstrating that a 35-billion-parameter AI model (Qwen3.5-35B-A3B) can run as a coding agent on a $600 Mac mini M4 with only 16GB of unified memory at 30 tokens per second, completely free, with no cloud dependencies or API keys required. The breakthrough leverages Apple Silicon's unified memory architecture combined with SSD paging to work around RAM limitations, achieving 18.6x the throughput of NVIDIA GPUs attempting the same SSD paging approach (30 tok/s vs 1.6 tok/s). This demonstrates a fundamental architectural advantage: Apple's unified memory allows seamless data movement between GPU, RAM, and NVMe storage, making large-model inference on consumer hardware practical and cost-effective for local development and agentic work.
The mac-code project extends beyond simple model serving by introducing a sophisticated routing architecture where the LLM classifies its own intent as "search," "shell," or "chat" using plain text classification rather than JSON function calling—enabling tool use at only 2.6 bits per weight, unusually efficient for such aggressive quantization (IQ2_M). The project includes two deployment options: a llama.cpp backend offering 30 tok/s with a 35B Mixture-of-Experts model, and an MLX backend supporting 9B models with 64K context via KV cache quantization and persistent context storage across sessions. Additional breakthroughs include TurboQuant achieving 4x KV cache compression (26.6 MB → 6.7 MB at 0.993 cosine similarity) and context persistence that reloads from SSD in 0.0003 seconds or syncs across devices via Cloudflare R2.
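The plain-text routing described above can be sketched in a few lines. Everything here (the prompt wording, the fallback label, and the stub `fake_generate` standing in for a real llama.cpp/MLX call) is an illustrative assumption, not mac-code's actual implementation:

```python
# Sketch of plain-text intent routing: the model emits one label
# ("search", "shell", or "chat") instead of a JSON function call,
# which survives aggressive quantization far better than structured output.

ROUTE_PROMPT = (
    "Classify the user's request as exactly one word: "
    "search, shell, or chat.\n\nRequest: {query}\nLabel:"
)

VALID_ROUTES = {"search", "shell", "chat"}

def route(query: str, generate) -> str:
    """Ask the model for a one-word route; fall back to 'chat' on noise."""
    raw = generate(ROUTE_PROMPT.format(query=query), max_tokens=3)
    label = raw.strip().lower().split()[0] if raw.strip() else ""
    return label if label in VALID_ROUTES else "chat"

# Stub "model" for demonstration only; a real backend would run inference.
def fake_generate(prompt: str, max_tokens: int = 3) -> str:
    q = prompt.lower()
    if "ls " in q or "run" in q:
        return "shell"
    if "latest" in q or "find" in q:
        return "search"
    return "chat"

print(route("run ls -la in the repo", fake_generate))        # shell
print(route("find the latest Qwen release", fake_generate))  # search
```

Because the model only has to reproduce one of three known words, a simple membership check validates the output, whereas malformed JSON at 2.6 bpw would require repair or retries.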
The work positions Apple Silicon as a serious alternative for AI developers seeking control, privacy, and zero operational costs, challenging the assumption that large model development requires cloud GPU access. The project has gained attention in the LocalLLaMA community, where users have independently verified similar performance metrics with the Qwen3.5-35B-A3B model and discussed its viability as a game-changer for local agentic coding systems.
Key Takeaways
A 35-billion-parameter Qwen3.5-35B-A3B MoE model can run locally on a 16GB Mac mini M4 at 30 tokens/second by leveraging Apple Silicon's unified memory to page model weights from SSD, achieving 18.6x faster throughput than NVIDIA GPUs running the same paging strategy (30 tok/s vs 1.6 tok/s).
The 10.6GB model (compressed to 2.6 bits per weight using IQ2_M quantization) doesn't fit in RAM but functions as a practical coding agent with web search, shell command execution, and file operations—all zero-cost on consumer hardware.
Tool routing via LLM self-classification (predicting 'search', 'shell', or 'chat' as plain text) achieves perfect accuracy (8/8 correct) despite 2.6 bpw quantization, eliminating the need for JSON function calling that typically breaks at extreme compression levels.
MLX backend variant supports 64K context on 9B models through per-group 4-bit quantization of KV cache states, with TurboQuant compression reducing storage from 26.6 MB to 6.7 MB (4x compression at 0.993 cosine similarity) and context loading in 0.0003 seconds.
Mac-code project enables persistent context storage with instant resume from SSD and cross-device sync via Cloudflare R2, allowing developers to process codebases once and resume work from different machines or recover context in milliseconds.
The $600 hardware cost ($0/month operational cost) contrasts sharply with cloud API pricing for equivalent capability, positioning local Apple Silicon development as cost-effective for intensive agentic work despite being 18-21% slower than MLX for some reasoning tasks.
Apple Silicon's unified memory architecture proves more efficient than discrete GPU approaches for SSD-paged inference due to transparent memory management and the 200 GB/s bandwidth between GPU, CPU, and unified memory pools.
Qwen3.5-35B-A3B's MoE (Mixture of Experts) design activates only a small fraction of its 35B total parameters per token (roughly 3B, per the "A3B" suffix), enabling reasonable performance on consumer hardware and differentiating it from dense alternatives like Llama that must run every parameter through compute.
The project ships as open-source with CLI tools (/search, /bench, /stats, /model switching) and a retro Mac web UI, allowing developers to benchmark, switch between 9B and 35B variants, and track cost savings versus cloud alternatives.
Community validation shows Qwen3.5-35B-A3B achieving 60+ tok/s on higher-end Apple Silicon (Mac Studio M1 Ultra) and even 3.5 tok/s on Raspberry Pi 5, confirming the model's efficiency across the hardware spectrum from servers to edge devices.
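The per-group 4-bit KV cache quantization described in the takeaways can be illustrated with a generic quantizer. This is not TurboQuant itself; the group size, tensor shape, and asymmetric min/max scheme are assumptions chosen for demonstration:

```python
import numpy as np

def quantize_groups(x: np.ndarray, group: int = 64):
    """Per-group asymmetric 4-bit quantization: each group of values
    gets its own scale and zero-point, limiting error from outliers."""
    flat = x.astype(np.float32).reshape(-1, group)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0  # 4 bits -> 16 levels
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Fake fp16 KV cache tensor: (layers, heads, seq_len, head_dim)
rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 8, 256, 64)).astype(np.float16)

q, scale, lo = quantize_groups(kv)
kv_hat = dequantize_groups(q, scale, lo, kv.shape)

# Cosine similarity between original and reconstructed cache
a, b = kv.astype(np.float32).ravel(), kv_hat.ravel()
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

fp16_bytes = kv.size * 2
packed_bytes = kv.size // 2 + scale.size * 4 + lo.size * 4  # payload + params
print(f"cosine={cos:.4f}, compression={fp16_bytes / packed_bytes:.1f}x")
```

The 4-bit payload alone is exactly 4x smaller than fp16; the per-group scale and zero-point add overhead (about 3.2x net at this group size), which is why practical schemes also pack those parameters in low precision to approach the quoted 4x figure.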
About
Author: Walter Grace (@thestreamingdev)
Publication: X (Twitter)
Published: 2025 (recent)
Sentiment / Tone
Enthusiastic and technically celebratory. The author presents the achievement as a significant breakthrough in consumer-level AI capability, emphasizing practical outcomes over hype. The tone is confident but not hyperbolic—the 18.6x faster claim is framed within the specific technical context of SSD paging comparisons. There's an undertone of liberation from cloud dependency ("no cloud, no API keys, $0/month"), positioned as both cost-effective and privacy-preserving. The author demonstrates deep technical credibility by discussing implementation specifics (quantization levels, bits per weight, KV cache compression) rather than making vague claims, which reinforces the authority of the benchmark claims.
Related Links
mac-code GitHub Repository: The complete open-source implementation of the project announced in the tweet, including llama.cpp and MLX backends, CLI agent, KV cache compression, and deployment guides
Qwen3.5-35B-A3B Model on Hugging Face: The base model being run locally; shows model architecture, quantization options, and community discussion of the 35B MoE variant's capabilities
llama.cpp GitHub Repository: The inference engine powering the 35B MoE backend in mac-code; essential for SSD paging and Apple Metal GPU optimization
LocalLLaMA Benchmarks on Different Macs: Independent community validation of Qwen and other model performance across Mac hardware, providing broader context for mac-code's 30 tok/s claim
Apple MLX Research, Exploring LLMs with M5 GPU: Official Apple research on local LLM performance with the MLX framework and neural accelerators, showing the M5 reaching time-to-first-token under 3 seconds for a 30B MoE model
Research Notes
Walter Grace is a prolific GitHub developer (65+ repositories) who presents himself as a "Utopian coder" with an apparent focus on infrastructure and inference optimization. The mac-code project builds on established techniques (Apple's "LLM in a Flash" research on unified-memory paging, Google's TurboQuant for KV cache compression, and MLX, Apple's native ML framework) but combines them into a cohesive, user-friendly package. The project has generated legitimate interest in the LocalLLaMA community on Reddit, where independent users have validated similar performance metrics with the Qwen3.5-35B-A3B model and discussed its viability for local coding assistance.
The 30 tok/s benchmark for a 35B model with paging is genuinely impressive given the hardware constraints; a typical M4 Mac mini achieves 18-35 tok/s for much smaller 7-8B models held entirely in RAM. The 18.6x claim requires context: it specifically compares SSD paging performance on Apple Silicon vs NVIDIA, a technically valid but somewhat narrow comparison. The real advantage is that Apple Silicon makes paging viable at reasonable speeds where NVIDIA's architecture makes it prohibitively slow. Community discussions highlight that Qwen3.5 models (especially the IQ2_M quantization variants) have earned respect for their quality-to-size ratios, with users noting "really, really good" performance. The cost comparison ($0/month local vs cloud API pricing) is legitimate but doesn't account for development time, since local inference requires setup and troubleshooting.
This work is part of a broader 2025 trend validating Apple Silicon for AI workloads, with Apple's own MLX research and M5 benchmarks showing under-3-second time-to-first-token for 30B MoE models. The project's potential impact is moderate but real: it significantly lowers the barrier for developers wanting to prototype agentic AI systems without API costs or privacy concerns, though performance remains behind high-end GPU clusters for production inference.
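The "LLM in a Flash"-style paging mentioned above reduces, at its simplest, to memory-mapping the weight file so the OS faults pages in from SSD only when they are touched. A toy sketch follows; the file layout, sizes, and expert selection are illustrative, not mac-code's actual code:

```python
import os
import tempfile
import numpy as np

# Toy illustration of SSD paging: weights live in a file, np.memmap maps
# them into the address space, and the OS faults in only the pages that a
# given forward pass actually touches (e.g. the active experts of an MoE).

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
n_experts, d = 8, 1024

# Write fake expert weight matrices to "SSD".
np.random.default_rng(1).standard_normal(
    (n_experts, d, d)
).astype(np.float32).tofile(path)

# Map the file without loading it; nothing is read from disk yet.
weights = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n_experts, d, d))

# A forward pass touching only 2 of 8 experts pages in ~1/4 of the file.
x = np.ones(d, dtype=np.float32)
active = [1, 5]                                  # router-selected experts
y = sum(weights[e] @ x for e in active) / len(active)
print(y.shape)  # (1024,)
```

With an MoE model, a token's forward pass touches only the router-selected experts, so most of the file never needs to leave the SSD; on Apple Silicon, unified memory makes those faulted pages directly visible to the GPU rather than requiring an explicit PCIe copy.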
Topics
Apple Silicon LLM inference, Local AI models, SSD paging and unified memory, Quantization and compression, MoE (Mixture of Experts), Agentic AI systems