MarkushGrapher-2

End-to-end Multimodal Recognition of Chemical Structures

CVPR 2026

IBM Research · ETH Zurich

Abstract

Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing.

In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure.

To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets will be released publicly.

Motivation

Motivation: From patent documents to searchable databases

The automatic extraction of chemical structures from document images is a challenging task with high value for large-scale patent analysis, prior-art searches, and, ultimately, accelerating the research and development of novel chemical compounds. With MarkushGrapher-2, we introduce a model for automatic end-to-end parsing of complex molecular and Markush structures from images into a graph-based string representation, namely (extended) SMILES. These (extended) SMILES strings can be used to populate large-scale chemical databases, enabling efficient search and analysis of chemical structures in the literature.
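As a brief illustration of the target representation: a CXSMILES (extended SMILES) string appends extension blocks between `|...|` delimiters to a plain SMILES, and atom labels such as R-group markers live in a `$...$` field with one entry per atom, in SMILES atom order. The sketch below is a minimal, hypothetical helper (not part of the released code) that separates the SMILES core from the atom-label field:

```python
def split_cxsmiles(cxsmiles: str):
    """Split an extended (CX)SMILES into its SMILES core and the
    atom-label extension that carries Markush annotations.
    Only the '$...$' atom-label field is handled in this sketch."""
    smiles, sep, ext = cxsmiles.partition(" |")
    if not sep:  # plain SMILES, no extensions
        return smiles, {}
    ext = ext.rstrip("|")
    labels = []
    if ext.startswith("$") and ext.endswith("$"):
        # '$a;b;c$' holds one label per atom, in SMILES atom order
        labels = ext[1:-1].split(";")
    return smiles, {"atom_labels": labels}

# Example: acetyl fragment with one R-group attachment point
core, info = split_cxsmiles("CC(=O)* |$;;;_R1$|")
print(core)                 # 'CC(=O)*'
print(info["atom_labels"])  # ['', '', '', '_R1']
```

A full CXSMILES parser would also handle fields such as coordinates and repeating-group (Sg) blocks; the structure shown here is only the atom-label case.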

Key Contributions

Explanation Video

Model Architecture

MarkushGrapher-2 Architecture

MarkushGrapher-2 employs two complementary encoding pipelines. In the first pipeline, the input image is processed by a vision encoder (Swin-B ViT, pretrained for OCSR) followed by an MLP projector. In the second pipeline, the image is passed through ChemicalOCR to extract textual content and bounding boxes, which are fed into a Vision-Text-Layout (VTL) encoder together with the original image. The combined representation is passed to a text decoder to generate a sequential description of the Markush structure and its substituents in tabular form.
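In rough pseudocode, the data flow described above reads as follows. This is a toy, framework-free sketch of the forward pass only; all shapes, helper names, and the simple concatenation used to stand in for fusion are our assumptions, not the released implementation:

```python
from dataclasses import dataclass

@dataclass
class OCRResult:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in image coordinates

def ocsr_vision_encoder(image):
    # Placeholder for the Swin-B ViT pretrained for OCSR:
    # returns a sequence of patch embeddings (here 4 dummy tokens).
    return [[0.0] * 8 for _ in range(4)]

def mlp_projector(tokens):
    # Projects OCSR vision tokens into the decoder's embedding space
    # (identity here, a learned MLP in the real model).
    return tokens

def chemical_ocr(image):
    # Placeholder for ChemicalOCR: text spans plus bounding boxes.
    return [OCRResult("R1", (10, 20, 30, 40))]

def vtl_encoder(image, ocr_results):
    # Vision-Text-Layout encoder: jointly embeds the image with the
    # extracted text and its layout (one dummy token per OCR span).
    return [[0.0] * 8 for _ in ocr_results]

def forward(image):
    vision_tokens = mlp_projector(ocsr_vision_encoder(image))
    vtl_tokens = vtl_encoder(image, chemical_ocr(image))
    # The fused sequence conditions an autoregressive text decoder
    # that emits the CXSMILES and the substituent table.
    return vision_tokens + vtl_tokens

print(len(forward(object())))  # prints 5: 4 vision + 1 VTL token in this toy setup
```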

Results

Optical Character Recognition on Chemical Images

ChemicalOCR substantially outperforms existing OCR models (PaddleOCR v5, EasyOCR) at text recognition in chemical structure images.

Model               M2S (103)               USPTO-M (74)            IP5-M (1000)
                    P     R     F1    A     P     R     F1    A     P     R     F1    A
PaddleOCR v5        8.9   6.8   7.7   0.0   2.3   1.1   1.2   0.0   2.2   1.7   1.9   0.6
EasyOCR             9.8   10.7  10.2  0.0   24.8  14.2  18.0  0.0   23.5  15.2  18.4  2.7
ChemicalOCR (Ours)  86.9  87.4  87.2  32.0  93.5  92.6  93.0  63.5  85.6  87.4  86.5  69.5

(P: precision, R: recall, A: accuracy, all in %; dataset sizes in parentheses.)
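For context, word-level precision, recall, and F1 of the kind reported above can be computed as below. Treating each transcription as a bag of words is our assumption; the exact matching protocol used in the paper may differ:

```python
from collections import Counter

def word_prf(predicted: list[str], reference: list[str]):
    """Bag-of-words precision, recall, and F1 between a predicted
    and a reference transcription (duplicates counted via multisets)."""
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # multiset intersection size
    p = overlap / max(sum(pred.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One token differs ("or" vs ","), so 4 of 5 words match on each side
p, r, f1 = word_prf(["R1", "=", "H", "or", "CH3"], ["R1", "=", "H", ",", "CH3"])
print(p, r, f1)  # 0.8 0.8 0.8
```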

Markush Structure Recognition

MarkushGrapher-2 substantially outperforms state-of-the-art models on Markush structure recognition across all benchmarks.

Method                   M2S (103)                   USPTO-M (74)  WildMol-M (10k)  IP5-M (1000)
                         CXSMILES  Table    Markush  CXSMILES      CXSMILES         CXSMILES
                         A         A    F1  A        A             A                A

Image only
MolParser-Base           39        --   --  --       30            38.1             47.7
MolScribe                21        --   --  --       7             28.1             22.3

Multimodal
GPT-5                    3         8    24  0        --            --               --
DeepSeek-OCR             0         --   --  --       0             1.9              0.0
MarkushGrapher-1         38        29   65  10       32            --               --
MarkushGrapher-2 (Ours)  56        22   65  13       55            48.0             53.7

(A: accuracy, F1: F1 score, in %; dataset sizes in parentheses.)

Molecular Structure Recognition

MarkushGrapher-2 maintains competitive performance on standard molecular structure recognition (SMILES prediction), achieving state-of-the-art results on UOB.

Method                   WildMol (10k)  JPO (450)  UOB (5740)  USPTO (5719)

Image only
MolParser-Base           76.9           78.9       91.8        93.0
MolScribe                66.4           76.2       87.4        93.1
DECIMER 2.7              56.0           64.0       88.3        59.9
MolGrapher               45.5           67.5       94.9        91.5

Multimodal
GPT-5                    --             19.2       --          --
DeepSeek-OCR             25.8           31.6       78.7        36.9
MarkushGrapher-2 (Ours)  68.4           71.0       96.6        89.8

(Values are accuracy in %; dataset sizes in parentheses.)

Qualitative Examples

Ablation Studies

Effect of ChemicalOCR

ChemicalOCR's text and layout predictions substantially improve MarkushGrapher-2's accuracy, demonstrating the importance of the text and layout modalities.

Method       M2S                USPTO-M            IP5-M
             A    A (InChIKey)  A    A (InChIKey)  A     A (InChIKey)
Without OCR  4    39            3    51            15.4  51.3
With OCR     56   80            55   69            53.7  73.3

Effect of OCR Predictions

Effect of OCR Predictions: Comparison of MarkushGrapher-2 predictions with and without OCR input. The green circle marks the correct prediction of a frequency-variation indicator (i.e., a repeating Sg group).

Effect of Two-Phase Training

Two-phase training improves the model's ability to encode Markush features while preserving performance on standard molecular recognition.

Method               M2S                JPO
                     A    A (InChIKey)  A     A (InChIKey)
Fusion only          44   53            53.0  53.0
Adaptation + Fusion  50   68            61.5  61.5
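The two training phases can be pictured as a standard freeze/unfreeze schedule. The sketch below is generic and hypothetical: which modules are trained in the adaptation phase is our guess from the "Adaptation + Fusion" naming, not a description of the released training code.

```python
def set_trainable(params: dict, trainable_groups: set):
    """Mark parameter groups as trainable or frozen, in place."""
    for name in params:
        params[name]["trainable"] = name in trainable_groups

# Toy parameter groups standing in for the model's modules
model = {
    "ocsr_vision_encoder": {"trainable": False},
    "mlp_projector": {"trainable": False},
    "vtl_encoder": {"trainable": False},
    "text_decoder": {"trainable": False},
}

# Phase 1 (adaptation): align the projected OCSR vision tokens with
# the decoder while keeping the pretrained backbones fixed.
set_trainable(model, {"mlp_projector"})

# Phase 2 (fusion): unfreeze everything and train the modalities jointly.
set_trainable(model, set(model))
```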

Citation

@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}