Abstract
We present a new hybrid document layout analysis approach that simultaneously detects graphical page objects, groups text-lines into text regions according to reading order, and recognizes the logical roles of text regions in heterogeneous document images. For graphical page object detection, we use a state-of-the-art Transformer-based object detection model, DINO, as a graphical page object detector that detects tables, figures, and (displayed) formulas in a top-down manner. We also introduce a new bottom-up text region detection model that groups the text-lines located outside graphical page objects into text regions according to reading order and recognizes the logical role of each text region from both visual and textual features. Experimental results show that our bottom-up text region detection model achieves higher localization and logical role classification accuracy than previous top-down methods. In addition to the locations of text regions, our approach directly outputs the reading order of the text-lines in each text region. The state-of-the-art results obtained on DocLayNet and PubLayNet demonstrate the effectiveness of our approach.
J. Wang, H. Sun, K. Hu and E. Zhang—This work was done when Jiawei Wang, Haiqing Sun, Kai Hu and Erhan Zhang were interns in MMI Group, Microsoft Research Asia, Beijing, China.
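The hybrid flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the box format, the 0.5 overlap threshold, the function names, and the `successor` mapping (standing in for whatever reading-order links a learned model would predict) are all assumptions made for illustration. The sketch only shows the hand-off between the two stages: text-lines covered by a detected graphical object are set aside, and the remaining lines are chained into text regions by following successor links.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def inside_fraction(line: Box, obj: Box) -> float:
    """Fraction of the text-line's area that falls inside the object box."""
    x0, y0 = max(line[0], obj[0]), max(line[1], obj[1])
    x1, y1 = min(line[2], obj[2]), min(line[3], obj[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (line[2] - line[0]) * (line[3] - line[1])
    return inter / area if area > 0 else 0.0


def lines_outside_objects(
    lines: List[Box], objects: List[Box], threshold: float = 0.5
) -> List[int]:
    """Indices of text-lines not covered by any graphical page object.

    The threshold is an assumed value, not one taken from the paper.
    """
    return [
        i
        for i, box in enumerate(lines)
        if all(inside_fraction(box, obj) < threshold for obj in objects)
    ]


def group_by_successor(
    line_ids: List[int], successor: Dict[int, Optional[int]]
) -> List[List[int]]:
    """Chain lines into regions by following successor links.

    Assumes each line has at most one predecessor and the links contain
    no cycles. Each returned region lists its lines in reading order.
    """
    kept = set(line_ids)
    has_pred = {s for i, s in successor.items() if i in kept and s in kept}
    regions: List[List[int]] = []
    for start in line_ids:
        if start in has_pred:
            continue  # not the first line of a region
        region: List[int] = []
        cur: Optional[int] = start
        while cur is not None and cur in kept and cur not in region:
            region.append(cur)
            cur = successor.get(cur)
        regions.append(region)
    return regions


# Toy page: two paragraph lines, one label inside a figure, one caption line.
lines = [(0, 0, 100, 10), (0, 12, 100, 22), (10, 40, 90, 50), (0, 80, 100, 90)]
figures = [(5, 35, 95, 60)]
outside = lines_outside_objects(lines, figures)
regions = group_by_successor(outside, {0: 1, 1: None, 3: None})
```

In this toy page the line lying inside the figure box is excluded before grouping, so only the lines outside graphical objects are assigned to text regions.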
References
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bi, H., et al.: Srrv: A novel document object detector based on spatial-related relation and vision. IEEE Transactions on Multimedia (2022)
Binmakhashen, G.M., Mahmoud, S.A.: Document layout analysis: a comprehensive survey. ACM Comput. Surv. (CSUR) 52(6), 1–36 (2019)
Biswas, S., Banerjee, A., Lladós, J., Pal, U.: Docsegtr: an instance-level end-to-end document image segmentation transformer. arXiv preprint arXiv:2201.11438 (2022)
Cai, Z., Vasconcelos, N.: Cascade r-cnn: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43(5), 1483–1498 (2019)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Proceedings of the European Conference on Computer Vision, pp. 213–229 (2020)
Dai, X., et al.: Dynamic head: Unifying object detection heads with attentions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7373–7382 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019)
Doermann, D., Tombre, K. (eds.): Handbook of Document Image Processing and Recognition. Springer, London (2014). https://doi.org/10.1007/978-0-85729-859-1
Girshick, R.: Fast r-cnn. In: Proceedings of the International Conference on Computer Vision, pp. 1440–1448 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Gu, J., et al.: Unified pretraining framework for document understanding. arXiv preprint arXiv:2204.10939 (2022)
He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task fcn for semantic page segmentation and table detection. In: Proceedings of the International Conference on Document Analysis and Recognition. vol. 1, pp. 254–261 (2017)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for document ai with unified text and image masking. In: Proceedings of the ACM International Conference on Multimedia, pp. 4083–4091 (2022)
Jocher, G., et al.: ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations (Apr 2021)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: Dn-detr: Accelerate detr training by introducing query denoising. arXiv preprint arXiv:2203.01305 (2022)
Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: Dit: Self-supervised pre-training for document image transformer. In: Proceedings of the ACM International Conference on Multimedia, pp. 3530–3539 (2022)
Li, X.H., Yin, F., Liu, C.L.: Page object detection from pdf document images by deep structured prediction and supervised clustering. In: Proceedings of the International Conference on Pattern Recognition, pp. 3627–3632 (2018)
Li, X.H., Yin, F., Liu, C.L.: Page segmentation using convolutional neural network and graphical model. In: Proceedings of the International Workshop on Document Analysis Systems, pp. 231–245 (2020)
Li, X.H., et al.: Instance aware document image segmentation using label pyramid networks and deep watershed transformation. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 514–519 (2019)
Li, Y., Zou, Y., Ma, J.: Deeplayout: A semantic segmentation approach to page layout analysis. In: Proceedings of the International Conference on Intelligent Computing Methodologies, pp. 266–277 (2018)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the International Conference on Computer Vision, pp. 2980–2988 (2017)
Liu, S., et al.: Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329 (2022)
Liu, S., Wang, R., Raptis, M., Fujii, Y.: Unified line and paragraph detection by graph convolutional networks. In: Proceedings of the International Workshop on Document Analysis Systems, pp. 33–47 (2022)
Liu, Z., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10012–10022 (2021)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Long, S., Qin, S., Panteleev, D., Bissacco, A., Fujii, Y., Raptis, M.: Towards end-to-end unified scene text detection and layout analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1059 (2022)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Luo, S., Ding, Y., Long, S., Han, S.C., Poon, J.: Doc-gcn: Heterogeneous graph convolutional networks for document layout analysis. arXiv preprint arXiv:2208.10970 (2022)
Minouei, M., Soheili, M.R., Stricker, D.: Document layout analysis with an enhanced object detector. In: Proceedings of the International Conference on Pattern Recognition and Image Analysis, pp. 1–5 (2021)
Naik, S., Hashmi, K.A., Pagani, A., Liwicki, M., Stricker, D., Afzal, M.Z.: Investigating attention mechanism for page object detection in document images. Appl. Sci. 12(15), 7486 (2022)
Oliveira, D.A.B., Viana, M.P.: Fast cnn-based document layout analysis. In: Proceedings of the International Conference on Computer Vision Workshops, pp. 1173–1180 (2017)
Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.W.: Doclaynet: A large human-annotated dataset for document-layout analysis. arXiv preprint arXiv:2206.01062 (2022)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Saha, R., Mondal, A., Jawahar, C.: Graphical object detection in document images. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 51–58 (2019)
Sang, Y., Zeng, Y., Liu, R., Yang, F., Yao, Z., Pan, Y.: Exploiting spatial attention and contextual information for document image segmentation. In: Proceedings of the Advances in Knowledge Discovery and Data Mining, pp. 261–274 (2022)
Shi, C., Xu, C., Bi, H., Cheng, Y., Li, Y., Zhang, H.: Lateral feature enhancement network for page object detection. IEEE Trans. Instrum. Meas. 71, 1–10 (2022)
Sun, P., et al.: Sparse r-cnn: End-to-end object detection with learnable proposals. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 14454–14463 (2021)
Vo, N.D., Nguyen, K., Nguyen, T.V., Nguyen, K.: Ensemble of deep object detectors for page object detection. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, pp. 1–6 (2018)
Wang, R., Fujii, Y., Popat, A.C.: Post-ocr paragraph recognition by graph convolutional networks. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 493–502 (2022)
Wang, X., Zhang, R., Kong, T., Li, L., Shen, C.: Solov2: Dynamic and fast instance segmentation. In: Proceedings of the Advances in Neural Information Processing Systems. vol. 33, pp. 17721–17732 (2020)
Xue, C., Huang, J., Zhang, W., Lu, S., Wang, C., Bai, S.: Contextual text block detection towards scene text understanding. In: Proceedings of the European Conference on Computer Vision, pp. 374–391 (2022)
Yang, H., Hsu, W.: Transformer-based approach for document layout understanding. In: Proceedings of the International Conference on Image Processing, pp. 4043–4047 (2022)
Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Lee Giles, C.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315–5324 (2017)
Yi, X., Gao, L., Liao, Y., Zhang, X., Liu, R., Jiang, Z.: Cnn based page object detection in document images. In: Proceedings of the International Conference on Document Analysis and Recognition. vol. 1, pp. 230–235 (2017)
Zhang, H., et al.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
Zhang, J., Elhoseiny, M., Cohen, S., Chang, W., Elgammal, A.: Relationship proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5678–5686 (2017)
Zhang, P., et al.: Vsr: a unified framework for document layout analysis combining vision, semantics and relations. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 115–130 (2021)
Zhang, Y., Bo, Z., Wang, R., Cao, J., Li, C., Bao, Z.: Entity relation extraction as dependency parsing in visually rich documents. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2759–2768 (2021)
Zhong, X., Tang, J., Yepes, A.J.: Publaynet: largest dataset ever for document layout analysis. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1015–1022 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (2021)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhong, Z. et al. (2023). A Hybrid Approach to Document Layout Analysis for Heterogeneous Document Images. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_12
DOI: https://doi.org/10.1007/978-3-031-41734-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41733-7
Online ISBN: 978-3-031-41734-4
eBook Packages: Computer Science (R0)