Public sector | 3 min read

From PDFs to structured output: automated patent transcription at scale.

The European Patent Office accelerates invention with Mistral AI.

The European Patent Office leverages Mistral AI to streamline complex application processing.

Stats:

  1. 400,000 pages/day throughput
  2. <1% character recognition error rate
  3. Lead time reduced from up to 5 days to a few minutes

"Standard OCR tools often fall short when confronted with the intricacies of patent documents. What mattered to us was ST36 compliance and reliability on complex elements like tables, formulas, and figures at high volumes. Mistral AI's OCR expertise directly addressed the near-zero fault tolerance our patent grant process requires."
Angel Aledo Lopez, Chief Operating Officer and Chief Technology Officer, European Patent Office.

The European Patent Office (EPO) is at the forefront of Europe’s innovation ecosystem. It processes millions of pages of patent documents annually. To maintain quality and efficiency, document transcription must be near error-free and compliant with the industry-standard ST36 XML format (WIPO Standard ST.36). The EPO partnered with Mistral AI to modernize its capabilities. Over a three-month proof of concept, Mistral AI deployed advanced Optical Character Recognition (OCR) technology that met the EPO's near-zero fault tolerance requirement and unlocked new efficiencies in handling the complex data essential to the patent lifecycle.

Processing technical complexity at scale

The primary hurdle for the EPO is the extreme technical density of patent applications. Patent documents frequently embed mathematical formulas, chemical structures, detailed images, and intricate tables within text across multiple European languages. These elements are notoriously difficult for standard digital tools to interpret, often leading to incomplete or inaccurate data structuring.

For the EPO, the operational reality was clear: it needed a scalable solution that could handle massive volumes of complex data without sacrificing accuracy. To maintain the high quality of the patent grant process, the EPO required an automated transcription system capable of reading scientific and technical elements near-flawlessly while producing ST36-compliant structured output.

Mistral AI partnership: a tailored end-to-end solution

To address these specific challenges, the EPO and Mistral AI embarked on a collaborative proof of concept. Unlike a standard vendor relationship, this project involved dedicated teams working side-by-side to combine the EPO’s deep domain knowledge with Mistral’s cutting-edge AI capabilities. This synergy allowed for rapid iteration, ensuring the technology was adapted to the EPO’s operational reality rather than the other way around.

The core of the solution was a 1B OCR model, fine-tuned on over 150,000 PDFs representing 50,000 unique patents. The team developed a pipeline that takes complex, multifaceted patent documents from Markdown to HTML and finally into the required ST36 XML format. This specialized pipeline ensures the model can handle the diversity of European patent document formats that off-the-shelf tools fail to process.
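The Markdown-to-HTML-to-XML flow described above can be illustrated with a minimal sketch. It is not the EPO or Mistral AI pipeline: the Markdown subset (headings and paragraphs only), the function names, and the XML element names ("description", "heading", "p", "num") are assumptions chosen for illustration and are not claimed to match the real ST36 schema.

```python
import re
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser


def markdown_to_html(markdown: str) -> str:
    """Convert a tiny Markdown subset ('# ' headings, paragraphs) to HTML."""
    parts = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", markdown.strip()):
        # Join hard-wrapped lines of a block into a single line.
        text = " ".join(line.strip() for line in block.splitlines())
        if text.startswith("# "):
            parts.append(f"<h1>{escape(text[2:])}</h1>")
        else:
            parts.append(f"<p>{escape(text)}</p>")
    return "".join(parts)


class HtmlToXml(HTMLParser):
    """Map <h1>/<p> HTML elements onto hypothetical XML elements."""

    # Illustrative names only; the real schema is not reproduced here.
    TAG_MAP = {"h1": "heading", "p": "p"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("description")
        self.current = None
        self.paragraphs = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_MAP:
            self.current = ET.SubElement(self.root, self.TAG_MAP[tag])
            self.current.text = ""
            if tag == "p":
                # Number paragraphs sequentially (hypothetical attribute).
                self.paragraphs += 1
                self.current.set("num", f"{self.paragraphs:04d}")

    def handle_data(self, data):
        # Text outside a mapped element is ignored.
        if self.current is not None:
            self.current.text += data

    def handle_endtag(self, tag):
        if tag in self.TAG_MAP:
            self.current = None


def html_to_xml(html: str) -> str:
    """Parse the intermediate HTML and serialize the XML tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    sample = "# Title\n\nFirst paragraph\nwrapped.\n\nx < y"
    print(html_to_xml(markdown_to_html(sample)))
```

Keeping HTML as an intermediate step means each stage stays small: the first handles text layout, the second handles structure and escaping. A production pipeline would also need to cover tables, formulas, and figures, and validate the result against the target schema.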

Beyond the model itself, the team delivered a scalable, fault-tolerant application built for reliability and full compliance with industry standards, ready for on-premises deployment.

Figure: Patent transcription automation with EPO fine-tuned OCR model

Driving precision and efficiency in patent examination

The deployed solution yielded substantial improvements in data extraction accuracy. The fine-tuned model demonstrated superior ability to handle the most difficult elements of patent applications, particularly complex formulas and technical images. Specifically, the system achieved a throughput of 400,000 pages transcribed per day while maintaining a character recognition error rate of less than 1%. This throughput in turn speeds up the processing of patent applications, cutting lead time from up to 5 days to a few minutes.
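To make the two headline figures concrete, here is a minimal sketch of how a character error rate and a sustained page rate can be computed. The source does not say how the error rate was measured. The sketch assumes a common definition, Levenshtein edit distance divided by the length of the reference text, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def pages_per_second(pages_per_day: int) -> float:
    """Average page rate, assuming the system runs around the clock."""
    return pages_per_day / 86_400  # seconds in a day


if __name__ == "__main__":
    # One wrong character ('1' instead of 'l') in a seven-character string.
    print(character_error_rate("claim 1", "c1aim 1"))
    print(pages_per_second(400_000))
```

Under the round-the-clock assumption, 400,000 pages per day works out to roughly 4.6 pages per second on average. An error rate below 1% under this definition means fewer than one character edit per 100 reference characters.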

The success of the solution validates the use of advanced AI in rigorous administrative environments. With a highly scalable system already in place, the EPO can automate the extraction of structured data to improve overall efficiency. This allows the EPO to focus its expert resources on the high-value task of patent examination while still meeting its strict quality standards and near-zero fault tolerance requirements.