Multi-font printed Mongolian document recognition system

Peng, Liangrui; Liu, Changsong; Ding, Xiaoqing; Jin, Jianming; Wu, Youshou; Wang, Hua; Bao, Yanhua

doi:10.1007/s10032-009-0106-8

Original Paper
Published: 16 January 2010

Volume 13, pages 93–106, (2010)

International Journal on Document Analysis and Recognition (IJDAR)

Liangrui Peng^1,2,3, Changsong Liu^1,2,3, Xiaoqing Ding^1,2,3, Jianming Jin^4, Youshou Wu^1,2,3, Hua Wang^1,2,3 & Yanhua Bao^5

Abstract

Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents still remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian Optical Character Recognition challenging. As the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on the test samples from real documents with multiple font types and mixed script. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
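
The abstract states that segmentation points are found by analyzing projection profiles and connected components, with parameters adjusted per font group, but it does not give the procedure. The following is only a minimal sketch of the projection-profile half of that idea, under stated assumptions: a word is a binary image (1 = ink) of vertically written text, and candidate cuts are placed in runs of rows where only a thin joining stroke remains. The names `row_profile`, `candidate_cuts`, `max_ink`, and `TOY_WORD` are hypothetical and do not come from the paper; `max_ink` merely stands in for a font-group-dependent parameter.

```python
from typing import List

# Toy binary image of a vertically written word (1 = ink, 0 = background).
# Hypothetical data for illustration only.
TOY_WORD = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels in each row (the horizontal projection profile)."""
    return [sum(row) for row in image]


def candidate_cuts(image: List[List[int]], max_ink: int) -> List[int]:
    """Return the middle row of every run of rows whose ink count is in
    1..max_ink. Such thin rows are treated as joins between components;
    empty rows are gaps rather than joins and are ignored."""
    profile = row_profile(image)
    cuts: List[int] = []
    run_start = None
    # The sentinel value (max_ink + 1) closes a run that reaches the last row.
    for y, ink in enumerate(profile + [max_ink + 1]):
        thin = 0 < ink <= max_ink
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            cuts.append((run_start + y - 1) // 2)
            run_start = None
    return cuts


if __name__ == "__main__":
    print(row_profile(TOY_WORD))
    print(candidate_cuts(TOY_WORD, max_ink=1))
```

A real system would presumably combine such candidates with connected-component analysis and choose `max_ink` separately for the standard and handwritten-style font groups, as the abstract indicates; this sketch does neither.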

Author information

Authors and Affiliations

Department of Electronic Engineering, Tsinghua University, 100084, Beijing, China
Liangrui Peng, Changsong Liu, Xiaoqing Ding, Youshou Wu & Hua Wang
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, 100084, Beijing, China
Liangrui Peng, Changsong Liu, Xiaoqing Ding, Youshou Wu & Hua Wang
State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, 100084, Beijing, China
Liangrui Peng, Changsong Liu, Xiaoqing Ding, Youshou Wu & Hua Wang
HP Labs China, 100084, Beijing, China
Jianming Jin
Mongolian Department, Hulunbeier College, 021008, Hailar, Inner Mongolia, China
Yanhua Bao

Corresponding author

Correspondence to Liangrui Peng.

About this article

Cite this article

Peng, L., Liu, C., Ding, X. et al. Multi-font printed Mongolian document recognition system. IJDAR 13, 93–106 (2010). https://doi.org/10.1007/s10032-009-0106-8

Received: 14 April 2009
Revised: 08 November 2009
Accepted: 01 December 2009
Published: 16 January 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10032-009-0106-8
