
UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-Like Documents

  • Conference paper
  • In: Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14809)

Abstract

Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective, reframing VIE as a relation prediction problem and unifying the labels of different tasks into a single label space. This unified approach allows for the definition of various relation types and effectively handles hierarchical relationships in form-like documents. In line with this perspective, we propose UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE operates in a coarse-to-fine manner: it first generates tree proposals through a Tree Proposal Network, which are subsequently refined into hierarchical trees by a Relation Decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the Relation Decoder: a Tree Attention Mask and a Tree Level Embedding. Extensive experimental evaluations on both our in-house dataset HierForms and the publicly available dataset SIBR substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
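
The abstract outlines three core ideas: a single label space over relation types, tree proposals refined by a relation decoder, and two tree constraints (an attention mask and a level embedding). The page carries no code, so the following is a minimal, hypothetical Python sketch of how those pieces might fit together; every name in it (RelationType, build_tree_attention_mask, tree_levels, the toy parent array) is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of a unified relation label space and the two tree
# constraints described in the abstract. Names and data layout are
# assumptions for illustration only; they are not from the UniVIE paper.
from enum import Enum, auto

class RelationType(Enum):
    """One label space covering the relation types of all VIE subtasks."""
    NONE = 0               # no relation between the two entities
    KEY_VALUE = auto()     # key entity -> its value entity
    CHOICE_GROUP = auto()  # choice-group title -> one of its options
    PARENT_CHILD = auto()  # hierarchical nesting (e.g. nested key-value pairs)

def build_tree_attention_mask(parent):
    """Boolean n x n mask derived from a tree proposal.

    parent[i] is node i's parent index, or -1 for the root.
    mask[i][j] is True iff i and j lie on the same root-to-leaf path,
    i.e. one is an ancestor of the other (or i == j).
    """
    def ancestors(i):
        chain = {i}
        while parent[i] != -1:
            i = parent[i]
            chain.add(i)
        return chain

    n = len(parent)
    anc = [ancestors(i) for i in range(n)]
    return [[(j in anc[i]) or (i in anc[j]) for j in range(n)] for i in range(n)]

def tree_levels(parent):
    """Depth of each node (root = 0); depths would index a learned level embedding."""
    def depth(i):
        d = 0
        while parent[i] != -1:
            i, d = parent[i], d + 1
        return d
    return [depth(i) for i in range(len(parent))]

# Toy form tree: root(0) -> key "Name"(1) -> value "Alice"(2),
#                root(0) -> group "Gender"(3) -> option "F"(4)
parent = [-1, 0, 1, 0, 3]

mask = build_tree_attention_mask(parent)
assert mask[2][1] and not mask[2][4]  # a value sees its key, not an unrelated option
assert tree_levels(parent) == [0, 1, 2, 1, 2]

# Relation labels for the same toy tree, all drawn from the single label space:
relations = {
    (1, 2): RelationType.KEY_VALUE,
    (3, 4): RelationType.CHOICE_GROUP,
    (0, 1): RelationType.PARENT_CHILD,
    (0, 3): RelationType.PARENT_CHILD,
}
```

Under these assumed conventions, confining attention to root-to-leaf paths would let a decoder score candidate relations within a proposed hierarchy without interference from unrelated branches, which is one plausible reading of how the Tree Attention Mask constrains the Relation Decoder.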

K. Hu, J. Wang, W. Lin, Z. Zhong, L. Sun: Work done while Kai Hu and Jiawei Wang were interns, and Weihong Lin, Zhuoyao Zhong, and Lei Sun were full-time employees, in the Multi-Modal Interaction Group, Microsoft Research Asia, Beijing, China.


Notes

  1. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-3.0.0.
  2. https://duguang.aliyun.com/.


Author information

Corresponding author

Correspondence to Kai Hu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hu, K., Wang, J., Lin, W., Zhong, Z., Sun, L., Huo, Q. (2024). UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-Like Documents. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14809. Springer, Cham. https://doi.org/10.1007/978-3-031-70552-6_5


  • DOI: https://doi.org/10.1007/978-3-031-70552-6_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70551-9

  • Online ISBN: 978-3-031-70552-6

  • eBook Packages: Computer Science, Computer Science (R0)
