Abstract
Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective that reframes VIE as a relation prediction problem and unifies the labels of the different subtasks into a single label space. This unified formulation allows various relation types to be defined and effectively handles hierarchical relationships in form-like documents. In line with this perspective, we propose UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE operates in a coarse-to-fine manner: it first generates tree proposals with a Tree Proposal Network, which are then refined into hierarchical trees by a Relation Decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the Relation Decoder: a Tree Attention Mask and a Tree Level Embedding. Extensive experiments on both our in-house dataset HierForms and the publicly available dataset SIBR demonstrate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
K. Hu, J. Wang, W. Lin, Z. Zhong, L. Sun: Work done while Kai Hu and Jiawei Wang were interns, and Weihong Lin, Zhuoyao Zhong, and Lei Sun were full-time employees, in the Multi-Modal Interaction Group, Microsoft Research Asia, Beijing, China.
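To make the "unified label space" idea concrete, the sketch below (not the authors' code) casts all VIE subtasks as predicting one relation type per ordered pair of text segments and then decodes the predicted parent links into hierarchical trees. The relation types, the `Node` structure, and the `decode_forest` helper are illustrative assumptions; the paper's Tree Proposal Network and Relation Decoder are not reproduced here.

```python
# A minimal sketch of VIE as relation prediction over a unified label space.
# Every subtask (key-value pairs, choice groups, nested keys) becomes a
# relation type between two text segments; decoding the parent links yields
# hierarchical trees. Names below are assumptions for illustration only.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class RelationType(Enum):
    NONE = 0           # no relation between the two segments
    KEY_VALUE = 1      # key  -> value
    CHOICE_GROUP = 2   # choice-group title -> choice item
    HIERARCHY = 3      # parent key -> child key (nested structure)


@dataclass
class Node:
    idx: int
    text: str
    children: List[Tuple["Node", RelationType]] = field(default_factory=list)


def decode_forest(texts: List[str],
                  relations: Dict[Tuple[int, int], RelationType]) -> List[Node]:
    """Turn pairwise relation predictions into a forest of hierarchical trees.

    `relations[(i, j)]` is the predicted relation from segment i (parent) to
    segment j (child); absent pairs are treated as NONE. Each segment keeps at
    most one parent (the first one seen here; a real system would keep the
    highest-scoring one instead).
    """
    nodes = [Node(i, t) for i, t in enumerate(texts)]
    has_parent = [False] * len(texts)
    for (i, j), rel in relations.items():
        if rel is not RelationType.NONE and not has_parent[j]:
            nodes[i].children.append((nodes[j], rel))
            has_parent[j] = True
    # roots are segments that never appear as a child
    return [n for n, parented in zip(nodes, has_parent) if not parented]


if __name__ == "__main__":
    texts = ["Contact", "Name:", "John Doe", "Gender", "[x] Male", "[ ] Female"]
    relations = {
        (0, 1): RelationType.HIERARCHY,     # "Contact" groups the "Name:" key
        (1, 2): RelationType.KEY_VALUE,     # "Name:" -> "John Doe"
        (3, 4): RelationType.CHOICE_GROUP,  # "Gender" -> "[x] Male"
        (3, 5): RelationType.CHOICE_GROUP,  # "Gender" -> "[ ] Female"
    }
    for root in decode_forest(texts, relations):
        print(root.text, "->", [(c.text, r.name) for c, r in root.children])
```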