Summary
John T Davies (@jtdavies), a CTO in AI and FinTech with 30+ years in IT, demonstrates a breakthrough approach to local document analysis using Apple's MLX framework combined with Google's TurboQuant KV cache compression algorithm. His post showcases practical real-world results: sub-150ms time-to-first-token (TTFT) responses on a 75-page PDF (approximately 30,000 tokens) running locally on office Mac Minis. Davies describes deploying this solution to enable instant, lossless document queries across sensitive company and client documents with complete privacy—accessible via phones, iPads, and laptops without any data leaving the local network.
The technical approach involves pre-filling a 256k KV cache with documents and system prompts, applying TurboQuant quantization for compression, and running inference on Apple Silicon hardware. Davies provides a compelling example: analyzing the entire Claude Code codebase (17,000 lines of code from 177 source files, indexing all 1,902 files totaling roughly 500,000 lines) with zero quality loss. His comparison to cloud-based alternatives emphasizes the strategic advantage: achieving better response quality than the best public models while maintaining absolute privacy—a particularly pointed critique given Anthropic's recent source code leak.
The post positions local inference as a transformative capability for enterprise use, enabling employees to answer critical questions ("What were our 3Q25 profits?", "When did this person start working for us?", "List all project milestones and team count") within seconds, with voice support also available. Davies argues this represents the democratization of private, fast, accurate document analysis without reliance on cloud services or security compromises.
Key Takeaways
Sub-150ms time-to-first-token (TTFT) on 75-page PDFs with 30k tokens demonstrates 'instant' response times for human interaction, enabling practical enterprise document Q&A on office hardware.
TurboQuant (Google's ICLR 2026 algorithm) enables lossless KV cache compression to 3-bit precision with no training required, reducing memory overhead by 70%+ while maintaining model quality—the key technical enabler for long-context local inference.
Pre-filling the 256k KV cache with full documents and system prompts upfront means inference cost is negligible after initial setup, making document analysis economically viable on consumer Mac hardware.
The approach was tested on the complete Claude Code codebase (500k+ lines, 1,902 files) with zero quality loss, proving viability for large-scale private code and document analysis—directly applicable to enterprise use cases with confidential IP.
Running on office Mac Minis enables confidential document queries across phones/iPads/laptops with zero data transmission to external services, creating an absolute privacy guarantee for sensitive financial, legal, and personnel information.
Voice query support adds an accessibility layer, allowing non-technical staff to instantly access critical company data (profit figures, employment history, project details) without intermediaries or cloud dependencies.
The implicit security comparison to cloud-based alternatives is sharpened by Anthropic's March 31 Claude Code leak (512k lines exposed via npm misconfiguration), suggesting local inference eliminates risk of third-party breaches entirely.
MLX's Metal GPU acceleration on Apple Silicon makes the combined approach practical on consumer hardware rather than server-grade infrastructure, fundamentally changing the economics of private document analysis.
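The memory arithmetic behind these takeaways can be sketched with a generic per-group quantizer. This is a toy stand-in, not TurboQuant's published algorithm: the group size of 32 and the fp16 scale/offset stored per group are assumptions, chosen only to show why 3-bit storage lands in the 70%+ savings range while reconstruction error stays small.

```python
import numpy as np

def quantize_3bit(x, group=32):
    """Per-group asymmetric 3-bit quantization (toy stand-in, not TurboQuant)."""
    xg = x.reshape(-1, group)
    lo = xg.min(axis=1, keepdims=True)
    hi = xg.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 7.0, 1.0)   # 3 bits -> levels 0..7
    codes = np.clip(np.round((xg - lo) / scale), 0, 7).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.normal(size=(8, 1024, 128)).astype(np.float32)  # heads x tokens x head_dim
codes, scale, lo = quantize_3bit(kv)
recon = dequantize(codes, scale, lo, kv.shape)
mean_abs_err = float(np.abs(recon - kv).mean())

# Storage: 3 bits per value, plus one fp16 scale and one fp16 offset per 32-value group
bits_per_value = 3 + 2 * 16 / 32      # 4.0 effective bits vs 16 for fp16
savings = 1 - bits_per_value / 16     # 0.75, i.e. the cache is 75% smaller
```

At 4 effective bits per value the cache shrinks by 75%, consistent with the "70%+" figure above once per-group metadata is counted.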
About
Author: John T Davies
Publication: X/Twitter
Published: 2026-04-01
Sentiment / Tone
Enthusiastically demonstrative with confident precision. Davies employs a technical but accessible tone, using specific benchmarks and real-world examples to substantiate claims. His rhetoric combines practical achievement ("we run this on an office server") with implicit critique of cloud alternatives, positioning local inference as intellectually superior and more ethically sound. The parenthetical "oops Anthropic!" reads as pointed irony—not hostile, but unmistakably highlighting the contrast between Anthropic's public security messaging and the practical reality of their March 31 source code leak. Overall: optimistic about what's technically possible, slightly sardonic about incumbent players, and motivated by privacy as both a technical and moral concern.
**Author Background**: John T Davies is a credible voice on this topic—co-founder of C24 (Century 24, founded 2000) specializing in investment banking integration, with 30+ years across hardware, C/C++, Java, and enterprise architecture. He's actively working on private agentic systems for financial institutions, giving his claims about confidential document analysis immediate practical relevance. His X/Twitter account shows consistent technical depth: testing various quantization approaches, comparing LLM inference frameworks, and experimenting with cutting-edge models (Qwen3, Kimi-Linear, Jan models for tool-calling).
**TurboQuant Context**: The algorithm is legitimately novel. Published as a conference paper at ICLR 2026 (Zandieh et al., "Online Vector Quantization with Near-optimal Distortion Rate"), it appeared on arXiv in April 2025, and Google Research blogged about it on March 24, 2026. The breakthrough is applying Quantized Johnson-Lindenstrauss (QJL) and PolarQuant techniques to KV cache compression, achieving 3-bit quantization with minimal perplexity degradation. Community implementations (in MLX, Triton, vLLM) emerged within 24 hours of Google's announcement, indicating strong validation from practitioners.
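The QJL idea mentioned above can be illustrated in a few lines: project a key with a random Gaussian matrix, keep only the sign bits of the projection plus the key's norm, and recover query-key inner products in expectation. This is a toy sketch of the sign-sketch principle, not TurboQuant's actual quantizer; the rescaling constant comes from the standard identity for jointly Gaussian pairs, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20000                 # key dimension, sketch dimension
q = rng.normal(size=d)           # query vector
k = rng.normal(size=d)           # key vector

S = rng.normal(size=(m, d))      # shared random Gaussian (JL) projection
bits = np.sign(S @ k)            # store only the sign bits of the projected key...
norm_k = np.linalg.norm(k)       # ...plus one scalar: the key's norm

# For jointly Gaussian pairs, E[(Sq)_i * sign((Sk)_i)] = sqrt(2/pi) * <q,k> / ||k||,
# so rescaling the empirical mean estimates the inner product:
est = float(np.sqrt(np.pi / 2) * norm_k * np.mean((S @ q) * bits))
true = float(q @ k)
```

The estimate concentrates around the true inner product as the sketch dimension grows, which is why 1-bit-per-coordinate key storage can preserve attention scores.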
**The Claude Code Leak Reference**: Davies's parenthetical "oops Anthropic!" alludes to a significant security incident. On March 31, 2026 (one day before his post), Anthropic accidentally exposed the complete Claude Code CLI source code via a misconfigured npm source map file—approximately 512,000 lines across ~2,000 TypeScript files. Chaofan Shou first flagged it publicly; the post accumulated 28.8M views. Anthropic confirmed the leak but stated no customer data or credentials were exposed. Davies's juxtaposition—his "zero loss (unlike Anthropic's security)"—is a damning implicit comparison: local inference offers absolute privacy; cloud solutions have demonstrated vulnerability to operational errors.
**Reactions & Reception**: While the specific post hasn't been widely cited yet (as of April 1, 2026), related discussions in r/LocalLLaMA and broader AI circles show intense interest in MLX + TurboQuant for exactly this use case. Practitioners note that on 16GB Macs, TurboQuant dramatically extends the usable context window (allowing larger documents to fit in memory), though the underlying model's capability (3B, 7B, 35B) is unchanged; it solves a memory problem, which is still a legitimate one. Some skeptics in Reddit threads report quality degradation when quantizing key tensors below 4 bits, but the consensus is that Davies's sub-150ms claims are plausible for smaller, context-heavy workloads on M-series hardware.
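The 16GB-Mac point is easy to sanity-check with back-of-envelope arithmetic. The model shape below (32 layers, 8 grouped-query KV heads, head dimension 128) is an illustrative assumption for a 7B-class model, not a figure from the post:

```python
# KV-cache size: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens x bits
layers, kv_heads, head_dim = 32, 8, 128   # assumed 7B-class GQA shape, not from the post
tokens = 256 * 1024                       # the 256k context mentioned in the post

def cache_gib(bits_per_value):
    total_bits = 2 * layers * kv_heads * head_dim * tokens * bits_per_value
    return total_bits / 8 / 2**30

fp16_gib = cache_gib(16)     # 32.0 GiB: the full-precision cache alone exceeds 16GB
q3_gib = cache_gib(3 + 1)    # 8.0 GiB: 3-bit values plus ~1 bit/value of metadata
```

At fp16 the 256k cache by itself overwhelms the machine; at roughly 4 effective bits it fits alongside quantized model weights, which is exactly the headroom practitioners describe.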
**Credibility Caveats**: Davies cherry-picked a 75-page, 30k-token example. Larger documents, adversarial queries, or models requiring fine-grained reasoning may not scale as gracefully. He doesn't discuss latency for speculative decoding, token acceptance rates, or variance across models. The "zero loss" claim should be read as "zero loss in perplexity or benchmark metrics"—subtle semantic/reasoning shifts in extreme quantization are hard to measure but possible. His financial-sector background may bias him toward use cases where document Q&A is the main task; other domains (code generation, reasoning-heavy tasks) may see more visible degradation.
**Broader Significance**: The post exemplifies a quiet shift in AI infrastructure: away from cloud-dependent LLM APIs (OpenAI, Anthropic) toward self-hosted local inference. Combined with TurboQuant's optimization, this makes enterprise-grade document analysis deployable on existing office hardware without capital investment or ongoing cloud bills. For regulated industries (finance, healthcare, law) with strict data governance, local inference eliminates the regulatory risk of third-party breaches entirely.
Topics
Local LLM Inference
KV Cache Compression
Apple Silicon and MLX
Enterprise Privacy and Security
Document Analysis and RAG
Edge AI Computing