Abstract
Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective that reframes VIE as a relation prediction problem and unifies the labels of the different subtasks into a single label space. This unified formulation allows various relation types to be defined and effectively handles hierarchical relationships in form-like documents. In line with this perspective, we propose UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE operates in a coarse-to-fine manner: it first generates tree proposals with a Tree Proposal Network, which are then refined into hierarchical trees by a Relation Decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the Relation Decoder: a Tree Attention Mask and a Tree Level Embedding. Extensive experiments on both our in-house dataset HierForms and the publicly available dataset SIBR demonstrate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
K. Hu, J. Wang, W. Lin, Z. Zhong, L. Sun: Work done while Kai Hu and Jiawei Wang were interns, and Weihong Lin, Zhuoyao Zhong, and Lei Sun were full-time employees, in the Multi-Modal Interaction Group, Microsoft Research Asia, Beijing, China.
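To make the "unified label space" idea concrete, the sketch below (not the authors' code) casts all VIE subtasks as predicting one relation type per ordered pair of text segments and then decodes the predicted parent links into hierarchical trees. The relation types, the `Node` structure, and the `decode_forest` helper are illustrative assumptions; the paper's Tree Proposal Network and Relation Decoder are not reproduced here.

```python
# A minimal sketch of VIE as relation prediction over a unified label space.
# Every subtask (key-value pairs, choice groups, nested keys) becomes a
# relation type between two text segments; decoding the parent links yields
# hierarchical trees. Names below are assumptions for illustration only.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class RelationType(Enum):
    NONE = 0           # no relation between the two segments
    KEY_VALUE = 1      # key  -> value
    CHOICE_GROUP = 2   # choice-group title -> choice item
    HIERARCHY = 3      # parent key -> child key (nested structure)


@dataclass
class Node:
    idx: int
    text: str
    children: List[Tuple["Node", RelationType]] = field(default_factory=list)


def decode_forest(texts: List[str],
                  relations: Dict[Tuple[int, int], RelationType]) -> List[Node]:
    """Turn pairwise relation predictions into a forest of hierarchical trees.

    `relations[(i, j)]` is the predicted relation from segment i (parent) to
    segment j (child); absent pairs are treated as NONE. Each segment keeps at
    most one parent (the first one seen here; a real system would keep the
    highest-scoring one instead).
    """
    nodes = [Node(i, t) for i, t in enumerate(texts)]
    has_parent = [False] * len(texts)
    for (i, j), rel in relations.items():
        if rel is not RelationType.NONE and not has_parent[j]:
            nodes[i].children.append((nodes[j], rel))
            has_parent[j] = True
    # roots are segments that never appear as a child
    return [n for n, parented in zip(nodes, has_parent) if not parented]


if __name__ == "__main__":
    texts = ["Contact", "Name:", "John Doe", "Gender", "[x] Male", "[ ] Female"]
    relations = {
        (0, 1): RelationType.HIERARCHY,     # "Contact" groups the "Name:" key
        (1, 2): RelationType.KEY_VALUE,     # "Name:" -> "John Doe"
        (3, 4): RelationType.CHOICE_GROUP,  # "Gender" -> "[x] Male"
        (3, 5): RelationType.CHOICE_GROUP,  # "Gender" -> "[ ] Female"
    }
    for root in decode_forest(texts, relations):
        print(root.text, "->", [(c.text, r.name) for c, r in root.children])
```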