This paper introduces Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems, which recognize only text and discard graphical regions as unparseable pixels, MOCR treats visual elements such as charts, diagrams, tables, icons, and UI components as first-class parsing targets, enabling structured document reconstruction that preserves the semantic relationships between visual and textual components. The authors train dots.mocr, a compact 3B-parameter model, using a comprehensive data engine built from PDFs, rendered webpages, and native SVG assets, with staged pretraining followed by supervised fine-tuning. The model achieves state-of-the-art performance (83.9) on olmOCR Bench and ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard; on structured graphics parsing, dots.mocr surpasses even Gemini 3 Pro on image-to-SVG benchmarks.
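The data-engine idea of mining native SVG assets for supervision can be sketched as follows. The paper does not publish its pipeline code, so this is a minimal hypothetical illustration: `make_svg_pair` and the injected `render_fn` are assumed names, and the rendering backend (e.g. a cairosvg wrapper) is left to the caller. The point is only that an SVG asset supplies both the input image (via rasterization) and the code-level target label for free.

```python
import hashlib


def make_svg_pair(svg_source: str, render_fn) -> dict:
    """Build one (image, code) supervision example from a native SVG asset.

    render_fn rasterizes SVG source text to PNG bytes; it is injected
    because rendering backends vary across environments.
    """
    png_bytes = render_fn(svg_source)
    # Content-address the rendered image so duplicates collapse to one example.
    image_id = hashlib.sha256(png_bytes).hexdigest()[:16]
    # The original SVG source doubles as the code-level training target.
    return {"image_id": image_id, "target": svg_source}
```

Here the model input would be the rendered raster and the output the SVG source itself, turning existing vector graphics into image-to-code training pairs with no manual annotation.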
Key Takeaways
Jointly parses text and graphics into unified representations, preserving semantic relationships across document elements
Reconstructs both text and graphics as structured outputs, enabling faithful document reconstruction and end-to-end training
Converts previously discarded graphics into reusable code-level supervision, creating multimodal training data from existing documents
Achieves state-of-the-art results: 83.9 on olmOCR Bench, and ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard
Outperforms Gemini 3 Pro on image-to-SVG conversion benchmarks for charts, UI layouts, scientific figures, and chemical diagrams
Compact 3B-parameter model demonstrates that effective multimodal document parsing is achievable at smaller scales
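The "unified textual representation" in the takeaways above can be sketched as follows. The paper's actual serialization format is not specified here, so this is a hypothetical illustration: `Region` and `to_unified_repr` are assumed names, and the `<graphic>` wrapper tags are an invented convention standing in for whatever markers the authors use. The idea is simply that recognized text and graphics-as-SVG-code are interleaved in reading order in one text stream.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str     # "text" or "graphic"
    content: str  # recognized text, or SVG source for a graphic


def to_unified_repr(regions: list[Region]) -> str:
    """Serialize parsed regions, in reading order, into one text stream.

    Graphics are emitted as delimited SVG source so the document can be
    faithfully reconstructed and the whole output used as a training target.
    """
    parts = []
    for r in regions:
        if r.kind == "graphic":
            # Hypothetical delimiter convention for graphic regions.
            parts.append("<graphic>\n" + r.content + "\n</graphic>")
        else:
            parts.append(r.content)
    return "\n\n".join(parts)
```

Because the entire document, graphics included, is now a single text sequence, it can serve directly as an end-to-end training target for a sequence model.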
Positive and informative. The paper presents significant technical advances with strong experimental results and is framed as an important contribution to document parsing.
This is a very recent paper (published March 13, 2026) that addresses a significant gap in OCR systems by treating graphical elements as first-class citizens rather than discarding them. The work is particularly notable for achieving competitive results with a relatively compact 3B-parameter model, suggesting the approach is practical and scalable. The paper demonstrates strong performance on both text recognition and graphics-to-code conversion (image-to-SVG), with open-source code and models available. The 20+ author collaboration suggests this is a substantial effort from a well-resourced team. The paper's emphasis on converting graphics into reusable training data represents an innovative approach to building large-scale multimodal training corpora.
Topics
Optical Character Recognition (OCR), Document Parsing, Multimodal Learning, Computer Vision, Natural Language Processing, Document Understanding, Graphics Recognition, Structured Output Generation