Summary
Ollama announced the preview release of version 0.19 on March 31, 2026, featuring deep integration with Apple's MLX machine-learning framework. This marks a major architectural shift for running large language models on Apple silicon Macs, moving away from the previous llama.cpp-based implementation to leverage MLX's unified memory architecture. The update delivers substantial performance improvements: prefill (prompt processing) speeds increase by approximately 1.6x (from 1154 to 1810 tokens/second), while decode (response generation) speeds nearly double (from 58 to 112 tokens/second). These benchmarks were conducted on Alibaba's Qwen3.5-35B model; int4 quantization can push performance higher still.
The release introduces three complementary improvements beyond raw speed: intelligent cache management that reuses cache across conversations for tools like Claude Code, improved memory efficiency through smarter cache eviction strategies, and the adoption of NVIDIA's NVFP4 quantization format for maintaining production-level accuracy while reducing memory bandwidth requirements. MLX's unified memory architecture is the key enabler—unlike traditional GPU frameworks that require explicit data movement between CPU and GPU, MLX operations run seamlessly on either processor without memory transfers, a capability uniquely suited to Apple silicon's integrated architecture.
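To make the quantization idea concrete, here is a minimal Python sketch of NVFP4-style block quantization: 4-bit E2M1 values scaled per 16-element block. The helper names are hypothetical, and this is an illustration of the general scheme under stated assumptions, not Ollama's or NVIDIA's actual implementation:

```python
# Simplified sketch of NVFP4-style block quantization (hypothetical helpers,
# not Ollama's or NVIDIA's code): 4-bit E2M1 values with a per-block scale.

# The 8 non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 scales weights in 16-element blocks

def quantize_block(values):
    """Quantize one block: compute a per-block scale, then round each
    value to the nearest representable E2M1 magnitude."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto +/-6
    codes = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.75, -1.5, 3.0, 0.1, -6.0, 0.0, 2.2, 4.8,
           -0.3, 1.1, -2.9, 0.05, 5.5, -4.1, 0.9, -0.6]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The real format additionally stores the per-block scales in a compact FP8 encoding and packs two 4-bit codes per byte; the sketch keeps everything in Python floats so the rounding behavior is easy to inspect.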
The preview targets demanding use cases: personal AI assistants (like OpenClaw) and coding agents (Claude Code, OpenCode, Codex) that previously struggled with responsiveness on local hardware. The update is positioned as a significant milestone in making powerful AI agents practical for individual developers on consumer hardware, though it currently supports only Alibaba's Qwen3.5-35B-A3B model and requires Macs with 32GB or more unified memory. Community reception has been mixed: some developers celebrated the long-awaited MLX support, while others noted that Ollama's adoption of MLX lagged behind competing tools such as LM Studio.
Key Takeaways
Ollama 0.19 replaces its llama.cpp backend with Apple's MLX framework, achieving 1.6x faster prompt processing and nearly 2x faster token generation on Apple silicon Macs—enabling real-time AI agents and assistants that were previously laggy.
MLX uniquely exploits Apple silicon's unified memory architecture (shared CPU/GPU memory), eliminating expensive data copy operations that plague traditional GPU frameworks—a capability that frameworks like PyTorch/CUDA cannot replicate on these chips.
The update introduces intelligent caching optimizations that make coding agents like Claude Code more responsive by reusing cache across conversations and storing cache snapshots at prompt boundaries, reducing redundant computation.
NVFP4 quantization support maintains production-grade model accuracy while reducing memory bandwidth and storage, enabling Ollama users to match results from larger server-side inference deployments and to integrate with NVIDIA's model optimization tools.
Currently limited to Qwen3.5-35B-A3B model and Macs with 32GB+ unified memory in preview, with Ollama committing to expand supported architectures and model formats in future releases.
The M5, M5 Pro, and M5 Max chips gain additional advantages from Apple's new GPU Neural Accelerators, which MLX leverages for both time-to-first-token and generation throughput gains.
Ollama's move to MLX positions it for enterprise/production AI workloads (via NVFP4 parity) while preserving the developer experience of local, privacy-preserving inference, addressing constraints that previously limited local AI deployment.
Community context: Competing projects like LM Studio adopted MLX support earlier, with some users noting Ollama's delayed integration; however, this release represents a complete architectural rebuild rather than an incremental addition.
The announcement specifically targets developers building AI-powered tools (personal assistants, coding agents, agentic workflows) who need responsive local inference without cloud dependencies or latency unpredictability.
This update reflects broader industry momentum toward Apple silicon as a primary platform for local ML—Apple's MLX research team and ecosystem investment are now yielding tangible performance advantages that justify on-device AI for developer workflows.
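The cache-reuse idea in the takeaways above can be sketched as a longest-prefix lookup: a new request that shares a token prefix with an earlier one (for example, a system prompt or tool preamble) reuses the stored state and recomputes only the new suffix. A minimal illustration with hypothetical names (`PrefixCache`, `longest_prefix`), not Ollama's actual code:

```python
# Minimal sketch of prefix-based cache reuse (an illustration, not Ollama's
# implementation): find the longest cached token prefix so that only the
# new suffix of the prompt needs to be recomputed.

class PrefixCache:
    def __init__(self):
        self._snapshots = {}  # token-prefix tuple -> opaque KV-cache state

    def store(self, tokens, state):
        """Snapshot the cache state reached after processing `tokens`."""
        self._snapshots[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix
        of `tokens`, or (0, None) if nothing matches."""
        best_len, best_state = 0, None
        for prefix, state in self._snapshots.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state

cache = PrefixCache()
system_prompt = [101, 102, 103, 104]          # shared system/tool preamble
cache.store(system_prompt, state="kv-after-preamble")

turn = system_prompt + [201, 202]             # a new turn reusing the preamble
matched, state = cache.longest_prefix(turn)
to_recompute = turn[matched:]                 # only the new tokens
```

Real implementations snapshot at chosen prompt boundaries and evict under memory pressure; the sketch only shows why agents that resend a long shared preamble every turn benefit from prefix matching.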
About
Author: Ollama Team
Publication: X (Twitter)
Published: 2026-03-31
Sentiment / Tone
Professional and achievement-focused, with technical precision. The announcement adopts a "milestone achieved" tone, emphasizing concrete performance metrics and specific use cases without hyperbole. The framing centers on unlocking new capabilities (faster responses for coding agents) rather than abstract performance numbers. Ollama positions this as solving real developer pain points—responsiveness lag in local AI tools—through technical excellence rather than claiming market dominance. The tone is confident but measured, acknowledging current limitations (32GB minimum, single model support) while committing to future expansion.
Related Links
Ollama Blog: MLX-Powered Apple Silicon Preview. The official technical deep-dive, with benchmarks, architecture details, setup instructions, performance charts, and code examples from the Ollama team.
MLX Framework on GitHub. Apple's open-source MLX framework, the foundational technology behind Ollama's performance gains; includes documentation on unified memory optimization and array framework design.
MacRumors Coverage: Ollama Now Runs Faster on Macs. Consumer-focused explanation of the update, with context on the Apple silicon advantage and practical use cases (Claude Code, OpenCode); good for understanding market implications.
The New Stack: Ollama Taps Apple's MLX Framework. Technical analysis of how the release addresses developer constraints around local AI inference and agentic workloads, bridging the gap between infrastructure and developer experience.
Research Notes
**Author & Project Context**: Ollama is an open-source project focused on simplifying local LLM deployment, with significant adoption among developers seeking privacy-preserving AI without cloud dependencies. The core team collaborates closely with hardware partners (Apple, NVIDIA) and framework communities (GGML, llama.cpp, MLX).

**Why This Matters**: This represents a strategic architectural shift from a generic llama.cpp foundation to a hardware-optimized implementation. It signals that local AI inference is graduating from hobbyist territory to developer-grade infrastructure.

**Competitive Context**: LM Studio and other local AI tools had already integrated MLX support, and Ollama's delayed adoption was a frequent Reddit/community complaint. This release directly addresses that gap, and the complete rebuild leaves Ollama positioned between pure MLX tools and generic frameworks.

**Community Reactions**: Hacker News and Reddit discussions show genuine enthusiasm tempered by practical concerns: the 32GB memory floor excludes M1/M2 MacBook Air users, and single-model support feels like a beta. However, developers running coding agents (especially with Claude Code integration) immediately recognized the value.

**Industry Implications**: This validates Apple's MLX investment and suggests unified memory architecture will increasingly drive framework optimization. NVIDIA's NVFP4 collaboration hints at standardization efforts across hardware for inference workloads.

**Reliability Note**: Ollama's partnerships with Apple (MLX), NVIDIA (NVFP4, testing), and Alibaba (Qwen models) are genuine technical collaborations, not marketing. The performance benchmarks are reproducible on the specified hardware.
Topics
Apple silicon optimization, Large Language Models, Local AI inference, MLX framework, Apple machine learning, AI agents and coding assistants