Summary
Ollama announced the preview release of version 0.19 on March 31, 2026, featuring deep integration with Apple's MLX machine-learning framework. This marks a major architectural shift for running large language models on Apple silicon Macs, moving away from the previous llama.cpp-based implementation to leverage MLX's unified memory architecture. The update delivers substantial performance improvements: prefill (prompt processing) speeds increase by approximately 1.6x (from 1154 to 1810 tokens/second), while decode (response generation) speeds nearly double (from 58 to 112 tokens/second). These benchmarks were conducted on Alibaba's Qwen3.5-35B model; int4 quantization can push performance higher still.
The release introduces three complementary improvements beyond raw speed: intelligent cache management that reuses cache across conversations for tools like Claude Code, improved memory efficiency through smarter cache eviction strategies, and the adoption of NVIDIA's NVFP4 quantization format for maintaining production-level accuracy while reducing memory bandwidth requirements. MLX's unified memory architecture is the key enabler—unlike traditional GPU frameworks that require explicit data movement between CPU and GPU, MLX operations run seamlessly on either processor without memory transfers, a capability uniquely suited to Apple silicon's integrated architecture.
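To make the quantization idea concrete, here is a minimal Python sketch of NVFP4-style block quantization: 4-bit E2M1 values scaled per 16-element block. The helper names are hypothetical, and this is an illustration of the general scheme under stated assumptions, not Ollama's or NVIDIA's actual implementation:

```python
# Simplified sketch of NVFP4-style block quantization (hypothetical helpers,
# not Ollama's or NVIDIA's code): 4-bit E2M1 values with a per-block scale.

# The 8 non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 scales weights in 16-element blocks

def quantize_block(values):
    """Quantize one block: compute a per-block scale, then round each
    value to the nearest representable E2M1 magnitude."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto +/-6
    codes = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.75, -1.5, 3.0, 0.1, -6.0, 0.0, 2.2, 4.8,
           -0.3, 1.1, -2.9, 0.05, 5.5, -4.1, 0.9, -0.6]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The real format additionally stores the per-block scales in a compact FP8 encoding and packs two 4-bit codes per byte; the sketch keeps everything in Python floats so the rounding behavior is easy to inspect.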
The preview targets demanding use cases: personal AI assistants (like OpenClaw) and coding agents (Claude Code, OpenCode, Codex) that previously struggled with responsiveness on local hardware. The update is positioned as a significant milestone in making powerful AI agents practical for individual developers on consumer hardware, though it currently supports only Alibaba's Qwen3.5-35B-A3B model and requires Macs with 32GB or more unified memory. Community reception has been mixed: some developers celebrated the long-awaited MLX support, while others noted that Ollama's adoption of MLX lagged behind competing tools such as LM Studio.
Key Takeaways
Ollama 0.19 replaces its llama.cpp backend with Apple's MLX framework, achieving 1.6x faster prompt processing and nearly 2x faster token generation on Apple silicon Macs—enabling real-time AI agents and assistants that were previously laggy.
MLX uniquely exploits Apple silicon's unified memory architecture (shared CPU/GPU memory), eliminating expensive data copy operations that plague traditional GPU frameworks—a capability that frameworks like PyTorch/CUDA cannot replicate on these chips.
The update introduces intelligent caching optimizations that make coding agents like Claude Code more responsive by reusing cache across conversations and storing cache snapshots at prompt boundaries, reducing redundant computation.
NVFP4 quantization support maintains production-grade model accuracy while reducing memory bandwidth and storage, enabling Ollama users to match results from larger server-side inference deployments and to integrate with NVIDIA's model optimization tools.
Currently limited to Qwen3.5-35B-A3B model and Macs with 32GB+ unified memory in preview, with Ollama committing to expand supported architectures and model formats in future releases.
The M5, M5 Pro, and M5 Max chips gain additional advantages from Apple's new GPU Neural Accelerators, which MLX leverages for both time-to-first-token and generation throughput gains.
Ollama's move to MLX positions it for enterprise/production AI workloads (via NVFP4 parity) while preserving the developer experience of local, privacy-preserving inference, addressing constraints that previously limited local AI deployment.
Community context: Competing projects like LM Studio adopted MLX support earlier, with some users noting Ollama's delayed integration; however, this release represents a complete architectural rebuild rather than an incremental addition.
The announcement specifically targets developers building AI-powered tools (personal assistants, coding agents, agentic workflows) who need responsive local inference without cloud dependencies or latency unpredictability.
This update reflects broader industry momentum toward Apple silicon as a primary platform for local ML—Apple's MLX research team and ecosystem investment are now yielding tangible performance advantages that justify on-device AI for developer workflows.
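The cache-reuse idea in the takeaways above can be sketched as a longest-prefix lookup: a new request that shares a token prefix with an earlier one (for example, a system prompt or tool preamble) reuses the stored state and recomputes only the new suffix. A minimal illustration with hypothetical names (`PrefixCache`, `longest_prefix`), not Ollama's actual code:

```python
# Minimal sketch of prefix-based cache reuse (an illustration, not Ollama's
# implementation): find the longest cached token prefix so that only the
# new suffix of the prompt needs to be recomputed.

class PrefixCache:
    def __init__(self):
        self._snapshots = {}  # token-prefix tuple -> opaque KV-cache state

    def store(self, tokens, state):
        """Snapshot the cache state reached after processing `tokens`."""
        self._snapshots[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix
        of `tokens`, or (0, None) if nothing matches."""
        best_len, best_state = 0, None
        for prefix, state in self._snapshots.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state

cache = PrefixCache()
system_prompt = [101, 102, 103, 104]          # shared system/tool preamble
cache.store(system_prompt, state="kv-after-preamble")

turn = system_prompt + [201, 202]             # a new turn reusing the preamble
matched, state = cache.longest_prefix(turn)
to_recompute = turn[matched:]                 # only the new tokens
```

Real implementations snapshot at chosen prompt boundaries and evict under memory pressure; the sketch only shows why agents that resend a long shared preamble every turn benefit from prefix matching.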
About
Author: Ollama Team
Publication: X (Twitter)
Published: 2026-03-31
Sentiment / Tone
Professional and achievement-focused, with technical precision. The announcement adopts a "milestone achieved" tone, emphasizing concrete performance metrics and specific use cases without hyperbole. The framing centers on unlocking new capabilities (faster responses for coding agents) rather than abstract performance numbers. Ollama positions this as solving real developer pain points—responsiveness lag in local AI tools—through technical excellence rather than claiming market dominance. The tone is confident but measured, acknowledging current limitations (32GB minimum, single model support) while committing to future expansion.
Related Links
Ollama Blog: MLX-Powered Apple Silicon Preview. The official technical deep-dive, with benchmarks, architecture details, setup instructions, performance charts, and code examples from the Ollama team.
MLX Framework on GitHub. Apple's open-source MLX framework, the foundational technology behind Ollama's performance gains; includes documentation on unified memory optimization and array framework design.
MacRumors Coverage: Ollama Now Runs Faster on Macs. Consumer-focused explanation of the update, with context on the Apple silicon advantage and practical use cases (Claude Code, OpenCode); good for understanding market implications.
The New Stack: Ollama Taps Apple's MLX Framework. Technical analysis of how the release addresses developer constraints around local AI inference and agentic workloads, bridging the gap between infrastructure and developer experience.
Research Notes
**Author & Project Context**: Ollama is an open-source project focused on simplifying local LLM deployment, with significant adoption among developers seeking privacy-preserving AI without cloud dependencies. The core team collaborates closely with hardware partners (Apple, NVIDIA) and framework communities (GGML, llama.cpp, MLX).

**Why This Matters**: This represents a strategic architectural shift from a generic llama.cpp foundation to a hardware-optimized implementation. It signals that local AI inference is graduating from hobbyist territory to developer-grade infrastructure.

**Competitive Context**: LM Studio and other local AI tools had already integrated MLX support, and Ollama's delayed adoption was a frequent Reddit/community complaint. This release directly addresses that gap, and the complete rebuild leaves Ollama positioned between pure MLX tools and generic frameworks.

**Community Reactions**: Hacker News and Reddit discussions show genuine enthusiasm tempered by practical concerns: the 32GB memory floor excludes M1/M2 MacBook Air users, and single-model support feels like a beta. However, developers running coding agents (especially with Claude Code integration) immediately recognized the value.

**Industry Implications**: This validates Apple's MLX investment and suggests unified memory architecture will increasingly drive framework optimization. NVIDIA's NVFP4 collaboration hints at standardization efforts across hardware for inference workloads.

**Reliability Note**: Ollama's partnerships with Apple (MLX), NVIDIA (NVFP4, testing), and Alibaba (Qwen models) are genuine technical collaborations, not marketing. The performance benchmarks are reproducible on the specified hardware.
Topics
Apple silicon optimization, Large Language Models, Local AI inference, MLX framework, Apple machine learning, AI agents and coding assistants