
Multi-rule Graph Construction Network for Entity Linking in Visually Rich Documents
- 1 School of Mechanical and Electronic Engineering, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
- 2 School of Mechanical and Electronic Engineering, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
- 3 Birmingham Institute of Fashion and Creative Art, Wuhan Textile University (NANHU CAMPUS), Fangzhi Road, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Abstract
Entity linking in visually rich documents (VRDs) is critical for industrial automation but faces challenges from complex layouts and the computational inefficiency of existing models. Traditional approaches relying on pre-trained transformers or graph networks struggle with noisy OCR outputs, large parameter counts, and invalid edge predictions in industrial VRDs. To address Vision Information Extraction (VIE) for industrial VRDs, we introduce a lightweight multi-rule graph construction network for entity linking that integrates text and layout embeddings as graph nodes. A multi-rule filtering method reduces invalid edges using node-distance, link-interference, and standardization rules inspired by document production logic and reading habits. A node relation enhancement module based on Graph Attention Networks (GAT) enhances nodes through multi-rule edges and attention scores, enabling robust reasoning over noisy and complex layouts. Evaluated on the FUNSD and SIBR datasets, our model achieves F1 scores of 58.96% and 65.08% with only 18M parameters, outperforming non-pretrained baselines while remaining deployable in resource-constrained environments.
Keywords
Vision Information Extraction, Entity Linking, Graph Construction, Relation Extraction
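To make the pipeline summarized in the abstract concrete, the sketch below illustrates one plausible reading of it: candidate edges between entity boxes are built and pruned with simple geometric rules (a node-distance rule and a link-interference rule), and node features are then refined with a single-head GAT-style attention layer. All thresholds, feature dimensions, and the exact rule formulations here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_edges(centers, max_dist=0.35):
    """Node-distance rule: link two entities only if their (normalized)
    box centers are close enough. max_dist is an assumed threshold."""
    n = centers.size(0)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and torch.dist(centers[i], centers[j]) < max_dist]

def prune_interference(edges, centers, tol=0.02):
    """Link-interference rule: drop an edge if a third entity lies almost on
    the segment between its endpoints, i.e. the link would cross another node."""
    kept = []
    for i, j in edges:
        a, b = centers[i], centers[j]
        blocked = False
        for k in range(centers.size(0)):
            if k in (i, j):
                continue
            ab, ak = b - a, centers[k] - a
            t = torch.clamp((ak @ ab) / (ab @ ab + 1e-8), 0.0, 1.0)
            if torch.dist(centers[k], a + t * ab) < tol:  # point-to-segment distance
                blocked = True
                break
        if not blocked:
            kept.append((i, j))
    return kept

class GraphAttention(torch.nn.Module):
    """Single-head GAT-style layer: each node is re-estimated as an
    attention-weighted sum of its neighbours along the pruned edges."""
    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)
        self.att = torch.nn.Linear(2 * dim, 1)

    def forward(self, x, edges):
        h = self.proj(x)
        rows = []
        for i in range(x.size(0)):
            nbrs = [j for (s, j) in edges if s == i]
            if not nbrs:
                rows.append(h[i])  # isolated node keeps its own feature
                continue
            scores = torch.stack(
                [self.att(torch.cat([h[i], h[j]])).squeeze() for j in nbrs])
            alpha = F.softmax(F.leaky_relu(scores), dim=0)
            rows.append(sum(a * h[j] for a, j in zip(alpha, nbrs)))
        return torch.stack(rows)

# Toy usage: four entities with normalized box centers and 16-d text+layout features.
centers = torch.tensor([[0.1, 0.1], [0.4, 0.1], [0.1, 0.5], [0.9, 0.9]])
feats = torch.randn(4, 16)
edges = prune_interference(build_edges(centers), centers)
enhanced = GraphAttention(16)(feats, edges)
print(len(edges), enhanced.shape)
```

In the paper's actual model the node features combine text and layout embeddings and the rule set also includes a standardization rule; this sketch only shows how rule-based edge pruning and attention-based node enhancement fit together.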
Cite this article
Zheng, Y.; Chen, J.; Zhang, W. (2025). Multi-rule Graph Construction Network for Entity Linking in Visually Rich Documents. Applied and Computational Engineering, 150, 53-62.
Data availability
The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 3rd International Conference on Software Engineering and Machine Learning
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish with this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).