Recognition of Concordances for Indexing in Digital Libraries

Marinai, Simone; Capobianco, Samuele; Ziran, Zahra; Giuntini, Andrea; Mansueto, Pierluigi

doi:10.1007/978-3-030-39905-4_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1177)

Included in the following conference series:

Italian Research Conference on Digital Libraries

Abstract

We describe a system for the automatic transcription of books with concordances. Although the recognition of printed text with OCR tools is nearly solved for high-quality documents, the recognition of structured text, where dictionaries and other linguistic tools are of little help, remains a difficult task. In this work, we propose several techniques for correcting the imperfect text produced by the OCR software, taking into account both the physical features of the documents and the redundancy of information implicit in concordances.
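
The abstract does not spell out the correction techniques, so the following is only a minimal sketch of how the redundancy in a concordance might be exploited, not the authors' method. It assumes that every context line of an entry repeats the headword, so a misrecognized headword can be outvoted by its occurrences in the context lines. The function names `similarity` and `correct_headword` and the `min_ratio` threshold are hypothetical and chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def correct_headword(ocr_headword, context_lines, min_ratio=0.6):
    """Pick the most frequent context token that resembles the OCR'd headword.

    Assumption (illustrative only): each context line contains the headword,
    so the token closest to the OCR output in each line casts one vote.
    """
    votes = Counter()
    for line in context_lines:
        # Split on whitespace and strip surrounding punctuation.
        tokens = [t.strip(".,;:!?()\"'") for t in line.split()]
        tokens = [t for t in tokens if t]
        if not tokens:
            continue
        # The token most similar to the OCR'd headword is this line's candidate.
        best = max(tokens, key=lambda t: similarity(t, ocr_headword))
        if similarity(best, ocr_headword) >= min_ratio:
            votes[best.lower()] += 1
    if not votes:
        # No usable evidence: keep the OCR output unchanged.
        return ocr_headword
    return votes.most_common(1)[0][0]


lines = ["per amore di Dio", "il vero amore verso", "con am0re fraterno"]
corrected = correct_headword("am0re", lines)
```

Here two of the three context lines agree on one spelling, so the vote overrides the OCR reading of the headword. A real system would also need the layout cues mentioned in the abstract to tell headwords from context lines in the first place.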

Notes

1. https://github.com/napolux/paroleitaliane/tree/master/paroleitaliane

References

Cagni, G.M.: Concordanze degli scritti di S. Antonio M. Zaccaria. Collana spiritualita barnabitica, 4 (1960)
Capobianco, S., Marinai, S.: Text line extraction in handwritten historical documents. In: Grana, C., Baraldi, L. (eds.) IRCDL 2017. CCIS, vol. 733, pp. 68–79. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68130-6_6
Cesarini, F., Gori, M., Marinai, S., Soda, G.: Structured document segmentation and representation by the modified X-Y tree. In: Fifth International Conference on Document Analysis and Recognition, ICDAR 1999, Bangalore, India, 20–22 September 1999, pp. 563–566 (1999)
Gatos, B.G.: Imaging techniques in document analysis processes. In: Doermann, D., Tombre, K. (eds.) Handbook of Document Image Processing and Recognition, pp. 73–131. Springer, London (2014). https://doi.org/10.1007/978-0-85729-859-1_4
Likforman-Sulem, L., Zahour, A., Taconet, B.: Text line segmentation of historical documents: a survey. Int. J. Doc. Anal. Recognit. 9(2), 123–138 (2007)
Mandal, S., Chowdhury, S.P., Das, A.K., Chanda, B.: Automated detection and segmentation of table of contents page from document images. In: 2003 Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 398–402 (2003)
Marinai, S., Marino, E., Soda, G.: Table of contents recognition for converting PDF documents in e-book formats. In: Proceedings of the 10th ACM Symposium on Document Engineering, DocEng 2010, pp. 73–76. ACM, New York (2010)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Read, A.W.: Dictionary, Encyclopaedia Britannica (2016). https://www.britannica.com/topic/dictionary. Accessed 30 Sept 2019
Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633, September 2007
danvk: Finding blocks of text in an image using Python, OpenCV and numpy (2015)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (1999)

Author information

Authors and Affiliations

University of Florence, Firenze, Italy
Simone Marinai, Samuele Capobianco, Zahra Ziran, Andrea Giuntini & Pierluigi Mansueto

Corresponding author

Correspondence to Simone Marinai.

Editor information

Editors and Affiliations

University of Bari, Bari, Italy
Michelangelo Ceci
University of Bari, Bari, Italy
Stefano Ferilli
Sapienza University of Rome, Rome, Italy
Antonella Poggi

Cite this paper

Marinai, S., Capobianco, S., Ziran, Z., Giuntini, A., Mansueto, P. (2020). Recognition of Concordances for Indexing in Digital Libraries. In: Ceci, M., Ferilli, S., Poggi, A. (eds) Digital Libraries: The Era of Big Data and Data Science. IRCDL 2020. Communications in Computer and Information Science, vol 1177. Springer, Cham. https://doi.org/10.1007/978-3-030-39905-4_14

DOI: https://doi.org/10.1007/978-3-030-39905-4_14
Published: 22 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39904-7
Online ISBN: 978-3-030-39905-4
eBook Packages: Computer Science, Computer Science (R0)
