Global-SEG: Text Semantic Segmentation Based on Global Semantic Pair Relations

Sun, Wenjun; Tran, Hanh Thi Hong; González-Gallardo, Carlos-Emiliano; Coustaty, Mickaël; Doucet, Antoine

doi:10.1007/978-3-031-70546-5_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14807)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Text semantic segmentation is a crucial task in language understanding, as subsequent natural language processing tasks often require cohesive semantic blocks. This paper introduces a new perspective on this task by utilizing global semantic pair relations from both token- and sentence-level language models. This approach addresses a limitation of prior work, which concentrated solely on individual semantic units such as sentences. Our model processes both local and global levels of sentence semantics via encoders and then combines the semantics obtained at each stage into a semantic embedding matrix. This matrix is fed through a convolutional neural network, and the result is passed as input to another encoder. This process enables the identification of semantic segmentation boundaries by describing the relationships of global semantic pairs. Furthermore, we utilize semantic embeddings from large language models and consider the positional information of text within the document to assess their efficacy in augmenting semantics. We test our model on both contemporary and historical corpora, and the results demonstrate that our approach outperforms the benchmarks on each dataset.
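To make the idea of relating every sentence pair more concrete, the following is a minimal toy sketch and not the authors' model. It assumes hand-made placeholder vectors in place of the token-level and sentence-level encoder outputs, uses plain cosine similarity instead of a learned semantic embedding matrix, and replaces the convolutional network and final encoder with a simple windowed average. All function names, the window size, and the threshold are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_relation_matrix(
    local: List[Vector], global_: List[Vector]
) -> List[List[float]]:
    """Build an n x n matrix with one entry per sentence pair (i, j).

    Assumption: each sentence is represented by concatenating a
    token-level ("local") vector with a sentence-level ("global") vector.
    """
    combined = [list(l) + list(g) for l, g in zip(local, global_)]
    return [[cosine(a, b) for b in combined] for a in combined]


def boundary_scores(matrix: List[List[float]], window: int = 2) -> List[float]:
    """Score the gap after each sentence i (for i in 0..n-2).

    The score is the mean pair relation between up to `window` sentences
    on the left of the gap and up to `window` on the right. A low score
    suggests the two sides are semantically unrelated.
    """
    n = len(matrix)
    scores = []
    for gap in range(n - 1):
        left = range(max(0, gap - window + 1), gap + 1)
        right = range(gap + 1, min(n, gap + 1 + window))
        values = [matrix[i][j] for i in left for j in right]
        scores.append(sum(values) / len(values))
    return scores


def segment(
    matrix: List[List[float]], threshold: float = 0.25, window: int = 2
) -> List[int]:
    """Return every index i such that a boundary is placed after sentence i."""
    scores = boundary_scores(matrix, window)
    return [i for i, score in enumerate(scores) if score < threshold]
```

Because the score at each gap draws on pairs that are not adjacent, a single off-topic sentence affects it less than it would affect a comparison of neighbouring sentences alone, which is the intuition behind looking at pair relations globally. The learned components described in the abstract are what this sketch leaves out.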

Notes

1. https://zenodo.org/record/5654858
2. https://zenodo.org/record/5654841
3. The code of this paper is available at https://github.com/WenjunSUN1997/text_seg

Acknowledgements

This work has been supported by the ANNA (2019-1R40226), TERMITRAD (AAPR2020-2019-8510010), Pypa (AAPR2021-2021-12263410), and Actuadata (AAPR2022-2021-17014610) projects funded by the Nouvelle-Aquitaine Region, France.

Author information

Authors and Affiliations

University of La Rochelle, L3i, La Rochelle, France
Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty & Antoine Doucet
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Hanh Thi Hong Tran
Jožef Stefan Institute, Ljubljana, Slovenia
Hanh Thi Hong Tran

Corresponding author

Correspondence to Wenjun Sun.

Editor information

Editors and Affiliations

Luleå Tekniska Universitet, Luleå, Sweden
Elisa H. Barney Smith
Luleå Tekniska Universitet, Luleå, Sweden
Marcus Liwicki
Tsinghua University, Beijing, China
Liangrui Peng

Cite this paper

Sun, W., Tran, H.T.H., González-Gallardo, CE., Coustaty, M., Doucet, A. (2024). Global-SEG: Text Semantic Segmentation Based on Global Semantic Pair Relations. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14807. Springer, Cham. https://doi.org/10.1007/978-3-031-70546-5_15

DOI: https://doi.org/10.1007/978-3-031-70546-5_15
Published: 11 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70545-8
Online ISBN: 978-3-031-70546-5
eBook Packages: Computer Science, Computer Science (R0)
