
Docling Loader

Overview

Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation that includes document layout, tables, etc., making documents ready for generative AI workflows like RAG.

Docling Loader, presented in this notebook, seamlessly integrates Docling into LangChain, enabling you to:

  • use various document types in your LLM applications with ease and speed, and
  • leverage Docling's rich representation for advanced, document-native grounding.

In the sections below, we showcase Docling Loader's usage, covering document-loading specifics and also demonstrating an end-to-end RAG pipeline.

This notebook provides a quick overview for getting started with the Docling document loader. For detailed documentation of all Docling Loader features and configurations, head to the API reference.

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| DoclingLoader | langchain_community | ✅ | ❌ | ❌ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| DoclingLoader | ✅ | ❌ |

Setup

Installation

To use the Docling document loader, you will need to install docling in addition to langchain-community:

%pip install -qU docling langchain-community

Initialization

Now we can instantiate our loader and load documents.

By default, DoclingLoader loads each input document as a LangChain Document with Markdown content (more options can be found in the "Deep Dive" section further below).

from langchain_community.document_loaders import DoclingLoader

FILE_PATH = "https://arxiv.org/pdf/2408.09869"

loader = DoclingLoader(file_path=FILE_PATH)
API Reference: DoclingLoader

Load

docs = loader.load()
print(f"{docs[0].page_content[:200]=}")
docs[0].page_content[:200]='## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla '
print(docs[0].metadata)
{'source': 'https://arxiv.org/pdf/2408.09869'}

Lazy Load

Documents can also be loaded in a lazy fashion:

doc_iter = loader.lazy_load()
for doc in doc_iter:
    pass  # you can operate on `doc` here

Deep Dive

Initialization

The general syntax of DoclingLoader initialization is as follows (also see API reference):

loader = DoclingLoader(
    file_path=FILE_PATH,
    ### OPTIONAL PARAMS: ###
    converter=...,  # any specific Docling converter to use
    convert_kwargs=...,  # any specific kwargs for conversion execution
    export_type=...,  # export mode: Markdown (default) or doc-chunks
    md_export_kwargs=...,  # any specific Markdown export kwargs (for Markdown mode)
    chunker=...,  # any specific Docling chunker to use (for doc-chunks mode)
)

DoclingLoader can be instantiated in two different modes:

  • Markdown mode: for each input doc, outputs a LangChain Document with the Markdown representation of the input doc. This is the default mode, implicitly used in the steps above.
  • Doc-chunks mode: for each input doc, splits it using the chunker (by default, Docling's layout-aware chunker) and outputs each chunk as a LangChain Document.

In the subsections below we explore both modes in more detail.
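As a preview, a doc-chunks configuration that passes a chunker explicitly could look like the sketch below. This is only an illustrative sketch: it assumes HybridChunker is importable from docling.chunking in your installed Docling version; adapt as needed.

from docling.chunking import HybridChunker  # assumed import path; check your Docling version
from langchain_community.document_loaders import DoclingLoader

chunking_loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
    chunker=HybridChunker(),  # explicitly pass Docling's layout-aware chunker
)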

Document preparation using Markdown mode

Following up on the steps above: now that the docs have been loaded, any built-in (or custom) LangChain splitter can be used to split them. For example, below we show a possible split using a MarkdownHeaderTextSplitter:

%pip install -qU langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")],
)
md_splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]

for d in md_splits[:2]:
    print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'Header_2': 'Docling Technical Report'}, d.page_content='Version 1.0  \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar  \nAI4K Group, IBM Research Ruschlikon, Switzerland'
d.metadata={'Header_2': 'Abstract'}, d.page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'

Document preparation using doc-chunks mode

The doc-chunks mode directly returns the document chunks, including rich metadata such as page numbers and bounding box info.

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
)
doc_splits = loader.load()

for d in doc_splits[:2]:
    print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'page_header', 'prov': [{'page_no': 1, 'bbox': {'l': 17.088111877441406, 't': 583.2296752929688, 'r': 36.339778900146484, 'b': 231.99996948242188, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 38]}]}], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='arXiv:2408.09869v3 [cs.CL] 30 Aug 2024'
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 282.772216796875, 't': 512.7218017578125, 'r': 328.8624572753906, 'b': 503.340087890625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='Version 1.0'
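Because each chunk carries the dl_meta structure shown above, page numbers and bounding boxes can be read back from the metadata, e.g. to ground answers in the source pages. A minimal sketch, assuming the metadata layout printed above:

for d in doc_splits[:2]:
    for item in d.metadata["dl_meta"]["doc_items"]:
        for prov in item["prov"]:
            # page number and bounding box of the chunk's source item
            print(f"page {prov['page_no']}, bbox {prov['bbox']}: {d.page_content[:40]!r}")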

RAG example

In this section we put together a demo RAG pipeline and run it using the documents loaded above.

%pip install -qU langchain langchain-huggingface langchain-milvus
import json
import os
from pathlib import Path
from tempfile import mkdtemp

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

embedding = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
llm = HuggingFaceEndpoint(repo_id=HF_LLM_MODEL_ID)


def run_rag(documents, embedding, llm, question, prompt):
    def clip_text(text, threshold=100):
        return f"{text[:threshold]}[...]" if len(text) > threshold else text

    milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
    vectorstore = Milvus.from_documents(
        documents,
        embedding,
        connection_args={"uri": milvus_uri},
        drop_old=True,
    )
    retriever = vectorstore.as_retriever()
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": question})

    answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{json.dumps(answer)}")
    for i, doc in enumerate(resp_dict["context"]):
        print()
        print(f"Source {i+1}:")
        print(f" text: {json.dumps(clip_text(doc.page_content, threshold=200))}")
        for key in doc.metadata:
            if key != "pk":
                val = doc.metadata.get(key)
                clipped_val = clip_text(val) if isinstance(val, str) else val
                print(f" {key}: {clipped_val}")

RAG using Markdown mode

Below we run the RAG pipeline, passing it the output of Markdown mode (after splitting):

run_rag(
    documents=md_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?

Answer:
"The main AI models in Docling are DocLayNet and TableFormer. DocLayNet is a layout analysis model that is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table str[...]"

Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
Header_2: 3.2 AI models

Source 2:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
Header_2: Abstract

Source 3:
text: "Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed e[...]"
Header_2: 5 Applications

Source 4:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
Header_2: 6 Future work and contributions

RAG using doc-chunks mode

Below we run the RAG pipeline, passing it the output of doc-chunks mode.

Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):

run_rag(
    documents=doc_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?

Answer:
"The main AI models in Docling are a layout analysis model, an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model. These models are develo[...]"

Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869

Source 2:
text: "With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869

Source 3:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/60', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 106.92281341552734, 't': 323.5386657714844, 'r': 504.00347900390625, 'b': 258.76641845703125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869

Source 4:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 142.92593383789062, 't': 364.814697265625, 'r': 468.3847351074219, 'b': 300.651123046875, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 431]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869

API reference

For detailed documentation of all DoclingLoader features and configurations, head to the API reference.
