Summary
John T Davies (@jtdavies), a CTO in AI and FinTech with 30+ years in IT, demonstrates a breakthrough approach to local document analysis using Apple's MLX framework combined with Google's TurboQuant KV cache compression algorithm. His post showcases practical real-world results: sub-150ms time-to-first-token (TTFT) responses on a 75-page PDF (approximately 30,000 tokens) running locally on office Mac Minis. Davies describes deploying this solution to enable instant, lossless document queries across sensitive company and client documents with complete privacy—accessible via phones, iPads, and laptops without any data leaving the local network.
The technical approach involves pre-filling a 256k KV cache with documents and system prompts, applying TurboQuant quantization for compression, and running inference on Apple Silicon hardware. Davies provides a compelling example: analyzing the entire Claude Code codebase (17,000 lines of code from 177 source files, indexing all 1,902 files totaling roughly 500,000 lines) with zero quality loss. His comparison to cloud-based alternatives emphasizes the strategic advantage: achieving better response quality than the best public models while maintaining absolute privacy—a particularly pointed critique given Anthropic's recent source code leak.
The post positions local inference as a transformative capability for enterprise use, enabling employees to answer critical questions ("What were our 3Q25 profits?", "When did this person start working for us?", "List all project milestones and team count") within seconds, with voice support also available. Davies argues this represents the democratization of private, fast, accurate document analysis without reliance on cloud services or security compromises.
Key Takeaways
Sub-150ms time-to-first-token (TTFT) on 75-page PDFs with 30k tokens demonstrates 'instant' response times for human interaction, enabling practical enterprise document Q&A on office hardware.
TurboQuant (Google's ICLR 2026 algorithm) enables lossless KV cache compression to 3-bit precision with no training required, reducing memory overhead by 70%+ while maintaining model quality—the key technical enabler for long-context local inference.
Pre-filling the 256k KV cache with full documents and system prompts upfront means inference cost is negligible after initial setup, making document analysis economically viable on consumer Mac hardware.
The approach was tested on the complete Claude Code codebase (500k+ lines, 1,902 files) with zero quality loss, proving viability for large-scale private code and document analysis—directly applicable to enterprise use cases with confidential IP.
Running on office Mac Minis enables confidential document queries across phones/iPads/laptops with zero data transmission to external services, creating an absolute privacy guarantee for sensitive financial, legal, and personnel information.
Voice query support adds an accessibility layer, allowing non-technical staff to instantly access critical company data (profit figures, employment history, project details) without intermediaries or cloud dependencies.
The implicit security comparison to cloud-based alternatives is sharpened by Anthropic's March 31 Claude Code leak (512k lines exposed via npm misconfiguration), suggesting local inference eliminates risk of third-party breaches entirely.
MLX's Metal GPU acceleration on Apple Silicon makes the combined approach practical on consumer hardware rather than server-grade infrastructure, fundamentally changing the economics of private document analysis.
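The memory arithmetic behind these takeaways can be sketched with a generic per-group quantizer. This is a toy stand-in, not TurboQuant's published algorithm: the group size of 32 and the fp16 scale/offset stored per group are assumptions, chosen only to show why 3-bit storage lands in the 70%+ savings range while reconstruction error stays small.

```python
import numpy as np

def quantize_3bit(x, group=32):
    """Per-group asymmetric 3-bit quantization (toy stand-in, not TurboQuant)."""
    xg = x.reshape(-1, group)
    lo = xg.min(axis=1, keepdims=True)
    hi = xg.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 7.0, 1.0)   # 3 bits -> levels 0..7
    codes = np.clip(np.round((xg - lo) / scale), 0, 7).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.normal(size=(8, 1024, 128)).astype(np.float32)  # heads x tokens x head_dim
codes, scale, lo = quantize_3bit(kv)
recon = dequantize(codes, scale, lo, kv.shape)
mean_abs_err = float(np.abs(recon - kv).mean())

# Storage: 3 bits per value, plus one fp16 scale and one fp16 offset per 32-value group
bits_per_value = 3 + 2 * 16 / 32      # 4.0 effective bits vs 16 for fp16
savings = 1 - bits_per_value / 16     # 0.75, i.e. the cache is 75% smaller
```

At 4 effective bits per value the cache shrinks by 75%, consistent with the "70%+" figure above once per-group metadata is counted.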
About
Author: John T Davies
Publication: X/Twitter
Published: 2026-04-01
Sentiment / Tone
Enthusiastically demonstrative with confident precision. Davies employs a technical but accessible tone, using specific benchmarks and real-world examples to substantiate claims. His rhetoric combines practical achievement ("we run this on an office server") with implicit critique of cloud alternatives, positioning local inference as intellectually superior and more ethically sound. The parenthetical "oops Anthropic!" reads as pointed irony—not hostile, but unmistakably highlighting the contrast between Anthropic's public security messaging and the practical reality of their March 31 source code leak. Overall: optimistic about what's technically possible, slightly sardonic about incumbent players, and motivated by privacy as both a technical and moral concern.
**Author Background**: John T Davies is a credible voice on this topic—co-founder of C24 (Century 24, founded 2000) specializing in investment banking integration, with 30+ years across hardware, C/C++, Java, and enterprise architecture. He's actively working on private agentic systems for financial institutions, giving his claims about confidential document analysis immediate practical relevance. His X/Twitter account shows consistent technical depth: testing various quantization approaches, comparing LLM inference frameworks, and experimenting with cutting-edge models (Qwen3, Kimi-Linear, Jan models for tool-calling).
**TurboQuant Context**: The algorithm is legitimately novel. Published as a conference paper at ICLR 2026 (Zandieh et al., "Online Vector Quantization with Near-optimal Distortion Rate"), it appeared on arXiv in April 2025, and Google Research blogged about it on March 24, 2026. The breakthrough is applying Quantized Johnson-Lindenstrauss (QJL) and PolarQuant techniques to KV cache compression, achieving 3-bit quantization with minimal perplexity degradation. Community implementations (in MLX, Triton, vLLM) emerged within 24 hours of Google's announcement, indicating strong validation from practitioners.
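The QJL idea mentioned above can be illustrated in a few lines: project a key with a random Gaussian matrix, keep only the sign bits of the projection plus the key's norm, and recover query-key inner products in expectation. This is a toy sketch of the sign-sketch principle, not TurboQuant's actual quantizer; the rescaling constant comes from the standard identity for jointly Gaussian pairs, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20000                 # key dimension, sketch dimension
q = rng.normal(size=d)           # query vector
k = rng.normal(size=d)           # key vector

S = rng.normal(size=(m, d))      # shared random Gaussian (JL) projection
bits = np.sign(S @ k)            # store only the sign bits of the projected key...
norm_k = np.linalg.norm(k)       # ...plus one scalar: the key's norm

# For jointly Gaussian pairs, E[(Sq)_i * sign((Sk)_i)] = sqrt(2/pi) * <q,k> / ||k||,
# so rescaling the empirical mean estimates the inner product:
est = float(np.sqrt(np.pi / 2) * norm_k * np.mean((S @ q) * bits))
true = float(q @ k)
```

The estimate concentrates around the true inner product as the sketch dimension grows, which is why 1-bit-per-coordinate key storage can preserve attention scores.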
**The Claude Code Leak Reference**: Davies's parenthetical "oops Anthropic!" alludes to a significant security incident. On March 31, 2026 (one day before his post), Anthropic accidentally exposed the complete Claude Code CLI source code via a misconfigured npm source map file—approximately 512,000 lines across ~2,000 TypeScript files. Chaofan Shou first flagged it publicly; the post accumulated 28.8M views. Anthropic confirmed the leak but stated no customer data or credentials were exposed. Davies's juxtaposition—his "zero loss (unlike Anthropic's security)"—is a damning implicit comparison: local inference offers absolute privacy; cloud solutions have demonstrated vulnerability to operational errors.
**Reactions & Reception**: While the specific post hasn't been widely cited yet (as of April 1, 2026), related discussions in r/LocalLLaMA and broader AI circles show intense interest in MLX + TurboQuant for exactly this use case. Practitioners note that on 16GB Macs, TurboQuant dramatically extends the usable context window (allowing larger documents to fit in memory), though the underlying model's capability (3B, 7B, 35B) is unchanged; it solves a memory problem, which is still a legitimate one. Some skeptics in Reddit threads report quality degradation when quantizing key tensors below 4 bits, but the consensus is that Davies's sub-150ms claims are plausible for smaller, context-heavy workloads on M-series hardware.
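The 16GB-Mac point is easy to sanity-check with back-of-envelope arithmetic. The model shape below (32 layers, 8 grouped-query KV heads, head dimension 128) is an illustrative assumption for a 7B-class model, not a figure from the post:

```python
# KV-cache size: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens x bits
layers, kv_heads, head_dim = 32, 8, 128   # assumed 7B-class GQA shape, not from the post
tokens = 256 * 1024                       # the 256k context mentioned in the post

def cache_gib(bits_per_value):
    total_bits = 2 * layers * kv_heads * head_dim * tokens * bits_per_value
    return total_bits / 8 / 2**30

fp16_gib = cache_gib(16)     # 32.0 GiB: the full-precision cache alone exceeds 16GB
q3_gib = cache_gib(3 + 1)    # 8.0 GiB: 3-bit values plus ~1 bit/value of metadata
```

At fp16 the 256k cache by itself overwhelms the machine; at roughly 4 effective bits it fits alongside quantized model weights, which is exactly the headroom practitioners describe.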
**Credibility Caveats**: Davies cherry-picked a 75-page, 30k-token example. Larger documents, adversarial queries, or models requiring fine-grained reasoning may not scale as gracefully. He doesn't discuss latency for speculative decoding, token acceptance rates, or variance across models. The "zero loss" claim should be read as "zero loss in perplexity or benchmark metrics"—subtle semantic/reasoning shifts in extreme quantization are hard to measure but possible. His financial-sector background may bias him toward use cases where document Q&A is the main task; other domains (code generation, reasoning-heavy tasks) may see more visible degradation.
**Broader Significance**: The post exemplifies a quiet shift in AI infrastructure: away from cloud-dependent LLM APIs (OpenAI, Anthropic) toward self-hosted local inference. Combined with TurboQuant's optimization, this makes enterprise-grade document analysis deployable on existing office hardware without capital investment or ongoing cloud bills. For regulated industries (finance, healthcare, law) with strict data governance, local inference eliminates the regulatory risk of third-party breaches entirely.
Topics
Local LLM Inference
KV Cache Compression
Apple Silicon and MLX
Enterprise Privacy and Security
Document Analysis and RAG
Edge AI Computing