Ar-Q-Former: Historical Newspaper Article Separation Based on Multimodal Transformer Structure

Wenjun Sun*, Nancy Girdhar, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings published in a book › peer-review

Abstract

Article separation for historical newspapers is an important but under-researched task in the analysis of historical documents. Because historical newspapers combine two modalities, text and image, extracting information from them requires multimodal processing. To tackle this task while accounting for the unique characteristics of historical newspapers, we propose an article separation model built on a multimodal transformer structure. The model processes the newspaper image along with the bounding boxes and text content of its text blocks, linking the blocks according to a predefined rule to reconstruct the overall page structure. The text and image information of the connected text blocks is then fed into a cross-modal transformer, and a classifier determines whether each connection between text blocks should be removed. Text blocks that remain connected are recognized as forming an article. A mask method is used so that the image reflects the positional relationships of the text blocks. We evaluated our architecture on two NewsEye datasets, NLF and BNF. The results demonstrate that Ar-Q-Former significantly outperforms similar structural modeling methods, achieving up to 19 percentage points higher AR on NLF and 22 percentage points on BNF. It also reaches a performance level comparable to the reading order simulation method. However, there remains a gap of approximately 10 percentage points in mACS and mPPA compared to rule-based methods, which are specifically tailored to these datasets. Nevertheless, Ar-Q-Former exhibits greater generalizability. Additionally, compared to previous studies, this approach introduces multimodal text-image analysis and interaction, incorporating the mask-image method to capture visual and positional information between text blocks.
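The final grouping step described in the abstract (blocks that remain connected form an article) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_articles`, the `edges` list, and the `keep` flags are hypothetical, and the flags simply stand in for whatever the classifier outputs for each rule-based connection.

```python
from collections import defaultdict


def group_articles(num_blocks, edges, keep):
    """Group text blocks into articles from the connections that survive.

    num_blocks: number of text blocks on the page (indexed 0..num_blocks-1)
    edges:      (a, b) pairs linking blocks, assumed to come from a predefined rule
    keep:       one boolean per edge; False means the (assumed) classifier
                decided the connection should be removed
    """
    parent = list(range(num_blocks))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every connection that was kept.
    for (a, b), kept in zip(edges, keep):
        if kept:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Each connected component is treated as one article.
    groups = defaultdict(list)
    for block in range(num_blocks):
        groups[find(block)].append(block)
    return sorted(groups.values())


# Five blocks linked in a chain; the connection between blocks 1 and 2 is removed.
articles = group_articles(
    num_blocks=5,
    edges=[(0, 1), (1, 2), (2, 3), (3, 4)],
    keep=[True, False, True, True],
)
```

Here the chain splits into two articles, one containing blocks 0 and 1 and the other blocks 2, 3 and 4. A union-find structure is only one convenient way to compute connected components; a graph traversal would serve equally well.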
Original language: English
Title of host publication: The 19th International Conference on Document Analysis and Recognition (ICDAR-2025)
Pages: 476–492
Publication status: Published - 17 Sept 2025
