MarkushGrapher-2

End-to-end Multimodal Recognition of Chemical Structures

CVPR 2026

IBM Research · ETH Zurich

Abstract

Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing.

In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure.

To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets will be released publicly.

Motivation

Motivation: From patent documents to searchable databases

The automatic extraction of chemical structures from document images is a challenging task with high value for large-scale patent analysis, prior-art searches, and, ultimately, accelerating the research and development of novel chemical compounds. With MarkushGrapher-2, we introduce a model for automatic end-to-end parsing of complex molecular and Markush structures from images into a graph-based string representation, namely (extended) SMILES. These (extended) SMILES strings can be used to populate large-scale chemical databases, enabling efficient search and analysis of chemical structures in the literature.
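As a brief illustration of the target representation: a CXSMILES (extended SMILES) string appends extension blocks between `|...|` delimiters to a plain SMILES, and atom labels such as R-group markers live in a `$...$` field with one entry per atom, in SMILES atom order. The sketch below is a minimal, hypothetical helper (not part of the released code) that separates the SMILES core from the atom-label field:

```python
def split_cxsmiles(cxsmiles: str):
    """Split an extended (CX)SMILES into its SMILES core and the
    atom-label extension that carries Markush annotations.
    Only the '$...$' atom-label field is handled in this sketch."""
    smiles, sep, ext = cxsmiles.partition(" |")
    if not sep:  # plain SMILES, no extensions
        return smiles, {}
    ext = ext.rstrip("|")
    labels = []
    if ext.startswith("$") and ext.endswith("$"):
        # '$a;b;c$' holds one label per atom, in SMILES atom order
        labels = ext[1:-1].split(";")
    return smiles, {"atom_labels": labels}

# Example: acetyl fragment with one R-group attachment point
core, info = split_cxsmiles("CC(=O)* |$;;;_R1$|")
print(core)                 # 'CC(=O)*'
print(info["atom_labels"])  # ['', '', '', '_R1']
```

A full CXSMILES parser would also handle fields such as coordinates and repeating-group (Sg) blocks; the structure shown here is only the atom-label case.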

Key Contributions

Explanation Video

Model Architecture

MarkushGrapher-2 Architecture

MarkushGrapher-2 employs two complementary encoding pipelines. In the first pipeline, the input image is processed by a vision encoder (Swin-B ViT, pretrained for OCSR) followed by an MLP projector. In the second pipeline, the image is passed through ChemicalOCR to extract textual content and bounding boxes, which are fed into a Vision-Text-Layout (VTL) encoder together with the original image. The combined representation is passed to a text decoder to generate a sequential description of the Markush structure and its substituents in tabular form.
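In rough pseudocode, the data flow described above reads as follows. This is a toy, framework-free sketch of the forward pass only; all shapes, helper names, and the simple concatenation used to stand in for fusion are our assumptions, not the released implementation:

```python
from dataclasses import dataclass

@dataclass
class OCRResult:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in image coordinates

def ocsr_vision_encoder(image):
    # Placeholder for the Swin-B ViT pretrained for OCSR:
    # returns a sequence of patch embeddings (here 4 dummy tokens).
    return [[0.0] * 8 for _ in range(4)]

def mlp_projector(tokens):
    # Projects OCSR vision tokens into the decoder's embedding space
    # (identity here, a learned MLP in the real model).
    return tokens

def chemical_ocr(image):
    # Placeholder for ChemicalOCR: text spans plus bounding boxes.
    return [OCRResult("R1", (10, 20, 30, 40))]

def vtl_encoder(image, ocr_results):
    # Vision-Text-Layout encoder: jointly embeds the image with the
    # extracted text and its layout (one dummy token per OCR span).
    return [[0.0] * 8 for _ in ocr_results]

def forward(image):
    vision_tokens = mlp_projector(ocsr_vision_encoder(image))
    vtl_tokens = vtl_encoder(image, chemical_ocr(image))
    # The fused sequence conditions an autoregressive text decoder
    # that emits the CXSMILES and the substituent table.
    return vision_tokens + vtl_tokens

print(len(forward(object())))  # prints 5: 4 vision + 1 VTL token in this toy setup
```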

Results

Optical Character Recognition on Chemical Images

ChemicalOCR substantially outperforms existing OCR models (PaddleOCR v5, EasyOCR) at text recognition in chemical structure images.

Model               M2S (103)               USPTO-M (74)            IP5-M (1000)
                    P     R     F1    A     P     R     F1    A     P     R     F1    A
PaddleOCR v5        8.9   6.8   7.7   0.0   2.3   1.1   1.2   0.0   2.2   1.7   1.9   0.6
EasyOCR             9.8   10.7  10.2  0.0   24.8  14.2  18.0  0.0   23.5  15.2  18.4  2.7
ChemicalOCR (Ours)  86.9  87.4  87.2  32.0  93.5  92.6  93.0  63.5  85.6  87.4  86.5  69.5

(P: precision, R: recall, A: accuracy, all in %; dataset sizes in parentheses.)
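For context, word-level precision, recall, and F1 of the kind reported above can be computed as below. Treating each transcription as a bag of words is our assumption; the exact matching protocol used in the paper may differ:

```python
from collections import Counter

def word_prf(predicted: list[str], reference: list[str]):
    """Bag-of-words precision, recall, and F1 between a predicted
    and a reference transcription (duplicates counted via multisets)."""
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # multiset intersection size
    p = overlap / max(sum(pred.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One token differs ("or" vs ","), so 4 of 5 words match on each side
p, r, f1 = word_prf(["R1", "=", "H", "or", "CH3"], ["R1", "=", "H", ",", "CH3"])
print(p, r, f1)  # 0.8 0.8 0.8
```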

Markush Structure Recognition

MarkushGrapher-2 substantially outperforms state-of-the-art models on Markush structure recognition across all benchmarks.

Method                   M2S (103)                   USPTO-M (74)  WildMol-M (10k)  IP5-M (1000)
                         CXSMILES  Table    Markush  CXSMILES      CXSMILES         CXSMILES
                         A         A    F1  A        A             A                A

Image only
MolParser-Base           39        --   --  --       30            38.1             47.7
MolScribe                21        --   --  --       7             28.1             22.3

Multimodal
GPT-5                    3         8    24  0        --            --               --
DeepSeek-OCR             0         --   --  --       0             1.9              0.0
MarkushGrapher-1         38        29   65  10       32            --               --
MarkushGrapher-2 (Ours)  56        22   65  13       55            48.0             53.7

(A: accuracy, F1: F1 score, in %; dataset sizes in parentheses.)

Molecular Structure Recognition

MarkushGrapher-2 maintains competitive performance on standard molecular structure recognition (SMILES prediction), achieving state-of-the-art results on UOB.

Method                   WildMol (10k)  JPO (450)  UOB (5740)  USPTO (5719)

Image only
MolParser-Base           76.9           78.9       91.8        93.0
MolScribe                66.4           76.2       87.4        93.1
DECIMER 2.7              56.0           64.0       88.3        59.9
MolGrapher               45.5           67.5       94.9        91.5

Multimodal
GPT-5                    --             19.2       --          --
DeepSeek-OCR             25.8           31.6       78.7        36.9
MarkushGrapher-2 (Ours)  68.4           71.0       96.6        89.8

(Values are accuracy in %; dataset sizes in parentheses.)

Qualitative Examples

Ablation Studies

Effect of ChemicalOCR

ChemicalOCR's text and layout predictions substantially improve MarkushGrapher-2's accuracy, demonstrating the importance of the text and layout modalities.

Method       M2S                USPTO-M            IP5-M
             A    A (InChIKey)  A    A (InChIKey)  A     A (InChIKey)
Without OCR  4    39            3    51            15.4  51.3
With OCR     56   80            55   69            53.7  73.3

Effect of OCR Predictions

Effect of OCR Predictions: Comparison of MarkushGrapher-2 predictions with and without OCR input. The green circle marks the correct prediction of a frequency-variation indicator (i.e., a repeating Sg group).

Effect of Two-Phase Training

Two-phase training improves the model's ability to encode Markush features while preserving performance on standard molecular recognition.

Method               M2S                JPO
                     A    A (InChIKey)  A     A (InChIKey)
Fusion only          44   53            53.0  53.0
Adaptation + Fusion  50   68            61.5  61.5
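The two training phases can be pictured as a standard freeze/unfreeze schedule. The sketch below is generic and hypothetical: which modules are trained in the adaptation phase is our guess from the "Adaptation + Fusion" naming, not a description of the released training code.

```python
def set_trainable(params: dict, trainable_groups: set):
    """Mark parameter groups as trainable or frozen, in place."""
    for name in params:
        params[name]["trainable"] = name in trainable_groups

# Toy parameter groups standing in for the model's modules
model = {
    "ocsr_vision_encoder": {"trainable": False},
    "mlp_projector": {"trainable": False},
    "vtl_encoder": {"trainable": False},
    "text_decoder": {"trainable": False},
}

# Phase 1 (adaptation): align the projected OCSR vision tokens with
# the decoder while keeping the pretrained backbones fixed.
set_trainable(model, {"mlp_projector"})

# Phase 2 (fusion): unfreeze everything and train the modalities jointly.
set_trainable(model, set(model))
```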

Citation

@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}