IBM Releases New Granite-Docling Model to Deliver End-to-End Document Understanding

IBM is releasing Granite-Docling-258M, an ultra-compact and cutting-edge open-source vision-language model (VLM) for converting documents to machine-readable formats while fully preserving their layout, tables, equations, lists, and more. It’s now available on Hugging Face through a standard Apache 2.0 license.

According to IBM, Granite-Docling is purpose-built for accurate and efficient document conversion, unlike most VLM-based approaches to optical character recognition (OCR) that aim to adapt large, general-purpose models to the task.

Even at an ultra-compact 258M parameters, Granite-Docling’s capabilities rival those of systems several times its size, making it extremely cost-effective. The model goes well beyond mere text extraction: it handles both inline and floating math and code, excels at recognizing table structure and preserves the layout and structure of the original document.

Whereas conventional OCR models convert documents directly to Markdown and lose connection to the source content, Granite-Docling’s unique method of faithfully translating complex structural elements makes its output ideal for downstream RAG applications.

Granite-Docling was developed by the team behind the celebrated open source Docling library. Docling provides tools, models and a command-line interface for document conversion, as well as plug-and-play integration with agentic AI workflows. Whereas the Docling library enables customizable ensemble pipelines, Granite-Docling is a single 258M parameter VLM that parses and processes documents in one shot.

The new Granite-Docling is a product-ready evolution of the experimental SmolDocling-256M-preview model released by IBM Research in partnership with Hugging Face in March 2025. Granite-Docling replaces the SmolLM-2 language backbone used for SmolDocling with a Granite 3-based architecture and replaces the SigLIP visual encoder with the updated SigLIP2, but otherwise retains the general methodology of SmolDocling (while exceeding its performance).

Crucially, Granite-Docling addresses certain instabilities present in SmolDocling-256M-preview, such as the occasional tendency to get stuck in loops of repeating the same token at a certain spot of a page.

While some imperfections are inevitable from any model, reliable enterprise use at scale requires the confidence that no individual errors will derail the workflow itself. IBM Research mitigated these instabilities for Granite-Docling through extensive dataset filtering and cleaning to remove samples with inconsistent or missing annotations, as well as any samples with irregularities that introduced counterproductive ambiguities.
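The failure mode described above, a model repeating the same token indefinitely, can also be guarded against downstream of the model. The following is a minimal sketch of such a guard, assuming the model output is available as a list of tokens. It is not the dataset-level mitigation IBM describes; the function name `has_repetition_loop` and the `max_run` threshold are hypothetical choices for illustration.

```python
def has_repetition_loop(tokens, max_run=20):
    """Return True if any single token repeats max_run or more times in a row.

    `max_run` is an illustrative threshold, not a value taken from IBM's work.
    """
    run = 1  # length of the current run of identical consecutive tokens
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_run:
            return True
    return False


# A page whose output degenerates into one repeated token is flagged,
# while ordinary alternating output is not.
looping_output = ["<text>"] + ["the"] * 30
normal_output = ["the", "cat"] * 30
```

A pipeline could use a check like this to retry or skip a page rather than letting one degenerate output stall a batch job.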

Central to Granite-Docling’s efficacy is DocTags, a universal markup format developed by IBM Research that captures and describes all page elements—charts, tables, forms, code, equations, footnotes, captions and more—as well as their contextual relation to one another and location within a document layout.

DocTags defines a structured vocabulary of unambiguous tags and rules that explicitly separate textual content from document structure, minimizing both confusion and token usage. This enables Granite-Docling to isolate each element, describe its specific location on the page, and then perform OCR within it. It can also concisely describe relationships between different elements, such as proper reading order or hierarchy: for instance, linking a caption to its corresponding figure or table.
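To make the idea of separating structure from content concrete, here is a small sketch that parses a simplified, DocTags-like markup into typed elements with page locations. The tag names, the four `<loc_N>` coordinate tokens, and the sample string are assumptions made for illustration and should not be read as the official DocTags vocabulary or grammar.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed, simplified syntax: <kind><loc_x0><loc_y0><loc_x1><loc_y1>content</kind>
ELEMENT_RE = re.compile(
    r"<(?P<kind>[a-z_]+)>"
    r"<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)><loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>"
    r"(?P<content>.*?)"
    r"</(?P=kind)>",
    re.DOTALL,
)


@dataclass
class Element:
    kind: str                          # structural role, e.g. "text" or "caption"
    bbox: tuple[int, int, int, int]    # location on the page (x0, y0, x1, y1)
    content: str                       # transcribed text inside the element


def parse_elements(markup: str) -> list[Element]:
    """Extract elements in the order they appear, which doubles as reading order."""
    elements = []
    for m in ELEMENT_RE.finditer(markup):
        bbox = (int(m["x0"]), int(m["y0"]), int(m["x1"]), int(m["y1"]))
        elements.append(Element(m["kind"], bbox, m["content"].strip()))
    return elements


# Hypothetical sample page in the simplified syntax above.
SAMPLE = (
    "<section_header><loc_50><loc_40><loc_450><loc_60>Results</section_header>\n"
    "<text><loc_50><loc_70><loc_450><loc_120>Accuracy improved.</text>\n"
    "<caption><loc_50><loc_300><loc_450><loc_320>Figure 1: Accuracy by model.</caption>\n"
)

elements = parse_elements(SAMPLE)
```

Because each element carries its role and bounding box separately from its text, a consumer can filter by role (for example, only captions) or trace any extracted text back to its position on the page.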

DocTags is optimized for LLM readability. Once Granite-Docling has rendered the original document(s) in DocTags, that output can be easily converted directly into Markdown, JSON, or HTML (or fed into a Docling library pipeline), streamlining the process of converting proprietary documents into high-quality datasets for fine-tuning other LLMs or enhancing LLM responses through retrieval augmented generation (RAG).
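The conversion step can be pictured as a simple mapping from structural roles to target syntax. The sketch below renders a list of already-parsed `(kind, text)` pairs as Markdown. The element kinds and the mapping rules are hypothetical and are not the Docling library's actual exporter, which handles far more cases (tables, nesting, images).

```python
def to_markdown(elements):
    """Render (kind, text) pairs as Markdown blocks separated by blank lines.

    The kinds handled here are illustrative assumptions, not the full
    DocTags vocabulary.
    """
    blocks = []
    for kind, text in elements:
        if kind == "section_header":
            blocks.append(f"## {text}")
        elif kind == "list_item":
            blocks.append(f"- {text}")
        elif kind == "formula":
            blocks.append(f"$${text}$$")
        else:
            # Fall back to a plain paragraph for any other kind.
            blocks.append(text)
    return "\n\n".join(blocks)


# Hypothetical parsed page.
page = [
    ("section_header", "Results"),
    ("text", "Accuracy improved."),
    ("list_item", "Fewer repetition loops"),
    ("formula", "E = mc^2"),
]

markdown = to_markdown(page)
```

Swapping the per-kind rules for ones that emit HTML tags or JSON objects would give the other output formats in the same way, since the structural information is already explicit in the input.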

SmolDocling-256M-preview was trained on an English-language corpus, but it can reasonably handle documents authored in any language that uses standard Latin characters. After all, the model only needs to be able to parse and transcribe the document’s text, not (necessarily) understand it. But this obviously omits languages that don’t use Latin script, which limits SmolDocling’s utility in many parts of the world.

IBM’s intent is to make Granite-Docling as universally helpful as possible. To that end, Granite-Docling offers experimental multilingual capabilities across additional target languages that include Arabic, Chinese, and Japanese, with the goal of extending Granite-Docling to more of the world’s most widely used alphabets.

The development of both Granite-Docling and the Docling library has been, and will continue to be, guided by feedback from the vibrant Docling community. As with its SmolDocling predecessor, IBM Research’s goal in releasing the new Granite-Docling model is to gather community feedback that can guide the continuous refinement and expansion of Docling capabilities for future releases.

Granite-Docling-258M is now available through a standard Apache 2.0 license on Hugging Face.

For more information about this news, visit www.ibm.com.
