This paper introduces Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems, which recognize only text and discard graphical regions as unparseable pixels, MOCR treats visual elements such as charts, diagrams, tables, icons, and UI components as first-class parsing targets, enabling structured document reconstruction that preserves the semantic relationships between visual and textual components. The authors train dots.mocr, a compact 3B-parameter model, using a comprehensive data engine built from PDFs, rendered webpages, and native SVG assets, with staged pretraining followed by supervised fine-tuning. The model achieves state-of-the-art performance (83.9) on olmOCR Bench and ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard; on structured graphics parsing, dots.mocr surpasses even Gemini 3 Pro on image-to-SVG benchmarks.
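The data-engine idea of mining native SVG assets for supervision can be sketched as follows. The paper does not publish its pipeline code, so this is a minimal hypothetical illustration: `make_svg_pair` and the injected `render_fn` are assumed names, and the rendering backend (e.g. a cairosvg wrapper) is left to the caller. The point is only that an SVG asset supplies both the input image (via rasterization) and the code-level target label for free.

```python
import hashlib


def make_svg_pair(svg_source: str, render_fn) -> dict:
    """Build one (image, code) supervision example from a native SVG asset.

    render_fn rasterizes SVG source text to PNG bytes; it is injected
    because rendering backends vary across environments.
    """
    png_bytes = render_fn(svg_source)
    # Content-address the rendered image so duplicates collapse to one example.
    image_id = hashlib.sha256(png_bytes).hexdigest()[:16]
    # The original SVG source doubles as the code-level training target.
    return {"image_id": image_id, "target": svg_source}
```

Here the model input would be the rendered raster and the output the SVG source itself, turning existing vector graphics into image-to-code training pairs with no manual annotation.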
Key Takeaways
Jointly parses text and graphics into unified representations, preserving semantic relationships across document elements
Reconstructs both text and graphics as structured outputs, enabling faithful document reconstruction and end-to-end training
Converts previously discarded graphics into reusable code-level supervision, creating multimodal training data from existing documents
Achieves state-of-the-art results: 83.9 on olmOCR Bench, and ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard
Outperforms Gemini 3 Pro on image-to-SVG conversion benchmarks for charts, UI layouts, scientific figures, and chemical diagrams
Compact 3B-parameter model demonstrates that effective multimodal document parsing is achievable at smaller scales
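The "unified textual representation" in the takeaways above can be sketched as follows. The paper's actual serialization format is not specified here, so this is a hypothetical illustration: `Region` and `to_unified_repr` are assumed names, and the `<graphic>` wrapper tags are an invented convention standing in for whatever markers the authors use. The idea is simply that recognized text and graphics-as-SVG-code are interleaved in reading order in one text stream.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str     # "text" or "graphic"
    content: str  # recognized text, or SVG source for a graphic


def to_unified_repr(regions: list[Region]) -> str:
    """Serialize parsed regions, in reading order, into one text stream.

    Graphics are emitted as delimited SVG source so the document can be
    faithfully reconstructed and the whole output used as a training target.
    """
    parts = []
    for r in regions:
        if r.kind == "graphic":
            # Hypothetical delimiter convention for graphic regions.
            parts.append("<graphic>\n" + r.content + "\n</graphic>")
        else:
            parts.append(r.content)
    return "\n\n".join(parts)
```

Because the entire document, graphics included, is now a single text sequence, it can serve directly as an end-to-end training target for a sequence model.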
Positive and informative. The paper presents significant technical advances with strong experimental results and is framed as an important contribution to document parsing.
This is a very recent paper (published March 13, 2026) that addresses a significant gap in OCR systems by treating graphical elements as first-class citizens rather than discarding them. The work is particularly notable for achieving competitive results with a relatively compact 3B-parameter model, suggesting the approach is practical and scalable. The paper demonstrates strong performance on both text recognition and graphics-to-code conversion (image-to-SVG), with open-source code and models available. The 20+ author collaboration suggests this is a substantial effort from a well-resourced team. The paper's emphasis on converting graphics into reusable training data represents an innovative approach to building large-scale multimodal training corpora.
Topics
Optical Character Recognition (OCR), Document Parsing, Multimodal Learning, Computer Vision, Natural Language Processing, Document Understanding, Graphics Recognition, Structured Output Generation