Abstract
Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, keyboard input methods are often not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work of De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
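To make the task concrete, the following is a minimal sketch of unicodification as a word-level lookup. It is not the algorithm implemented in the package described here; it only illustrates the input and output of the problem. The function names (`asciify`, `train`, `unicodify`), the `EXTRA` table, and the toy tokens in the usage note are assumptions introduced for illustration.

```python
import unicodedata
from collections import Counter, defaultdict

# Modified letters such as the open vowels have no Unicode decomposition,
# so they need an explicit mapping. Illustrative only, not exhaustive.
EXTRA = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

def asciify(text):
    """Reduce a Unicode string to the plain form a writer might type."""
    # NFD splits a letter such as "ị" into "i" plus a combining dot below.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(EXTRA.get(c, c) for c in stripped)

def train(corpus_words):
    """Map each ASCII form to its most frequent Unicode form in the corpus."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[asciify(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}

def unicodify(text, lexicon):
    """Restore each whitespace-separated token; unknown tokens pass through."""
    return " ".join(lexicon.get(token, token) for token in text.split())
```

For example, after `lexicon = train(["mɔkɔ", "mɔkɔ", "moko"])` (toy tokens, not a real corpus), calling `unicodify("moko na", lexicon)` restores the first token and leaves the unseen one unchanged. A lookup of this kind cannot handle words absent from the training data, which is one reason to consider character-level features as well.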
Notes
Source code and training data are available from http://sourceforge.net/projects/lingala/ (under the GNU GPLv3, as the package charlifter), or directly from the author.
For reasons of space we have not listed the language names in the tables; see http://www.sil.org/ISO639-3/codes.asp for the full list.
See http://accentuate.us/ for more information.
References
Caldwell, M. E. (2009). Development of psychometrically equivalent speech audiometry materials for testing children in Mongolian, M.S. Thesis, Brigham Young University, December.
De Pauw, G., Wagacha, P. W., & de Schryver, G.-M. (2007). Automatic diacritic restoration for resource-scarce languages. In V. Matousek & P. Mautner (Eds.), Proceedings of text, speech and dialogue conference 2007, pp. 170–179.
De Pauw, G., Wagacha, P. W., & de Schryver, G.-M. (2011). Collection and deployment of a parallel corpus English-Swahili, Language resources and evaluation, this volume.
Fairon, C., et al. (Eds.) (2007). Building and Exploring Web Corpora, Proceedings of the 3rd web as corpus Workshop, Louvain-la-Neuve, Belgium.
Haslam, V. N. (2009). Psychometrically equivalent monosyllabic words for word recognition testing in Mongolian, M.S. Thesis, Brigham Young University, August.
Iftene, A., & Trandabăţ, D. (2009). Recovering diacritics using Wikipedia and Google. In Knowledge engineering: Principles and techniques, Proceedings of the international conference on knowledge engineering KEPT2009, pp. 37–40.
Mihalcea, R. (2002). Diacritics restoration: Learning from letters versus learning from words. In Proceedings of the third international conference on intelligent text processing and computational linguistics.
Mihalcea, R., & Nastase, V. (2002). Letter level learning for language independent diacritics restoration. In Proceedings of CoNLL-2002, pp. 105–111.
Moran, S. (2011). An ontology for accessing transcription systems, Language resources and evaluation, this volume.
Scannell, K. P. (2007). The Crúbadán project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora. Proceedings of the 3rd web as corpus workshop, pp. 5–15.
Simard, M. (1998). Automatic insertion of accents in French text. In Ide & Voutilainen (Eds.), Proceedings of the third conference on empirical methods in natural language processing, pp. 27–35.
Simard, M., & Deslauriers, A. (2001). Real-time automatic insertion of accents in French text. Natural Language Engineering, 7(2), 143–165.
Spriet, T., & El-Bèze, M. (1997). Réaccentuation Automatique de Textes. In FRACTAL 97, Besançon.
Streiter, O., & Stuflesser, M. (2006). Design features for the collection and distribution of basic NLP-resources for the world’s writing systems. In Proceedings of LREC 2006, Genova, Italy.
Tufiş, D., & Chiţu, A. (1999). Automatic diacritics insertion in Romanian texts. In Proceedings of the 5th international workshop on computational lexicography COMPLEX ’99, pp. 185–194.
Tufiş, D., & Ceauşu, A. (2008). DIAC+: A professional diacritics recovering system. In Proceedings of the sixth international language resources and evaluation (LREC’08).
Wagacha, P. W., De Pauw, G., & Githinji, P. W. (2006). A grapheme-based approach for accent restoration in Gĩkũyũ. In Proceedings of LREC’06, pp. 1937–1940.
Yarowsky, D. (1994). A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Proceedings of the 2nd annual workshop on very large text corpora, pp. 99–120.
Acknowledgments
We are grateful to Nuance Communications, and especially Ann Aoki Becker, for their support and for their ongoing commitment to developing input technology for under-resourced languages around the world. Thanks also to my student Michael Schade for making this work much more accessible to language communities through his Firefox add-on, and to my many collaborators on the Crúbadán project for their help preparing the web corpora which were used to train the language models, especially Tunde Adegbola (Yoruba), Denis Jacquerye (Lingala), Chinedu Uchechukwa (Igbo), Thapelo Otlogetswe (Setswana), Abdoul Cisse and Mohomodou Houssouba (Songhay), and Outi Sané (Diola). Alexandru Szasz gave helpful feedback on Romanian, as did Jean Came Poulard on Haitian Creole. Finally, thanks to Guy De Pauw, Peter Wagacha, and Gilles-Maurice de Schryver for their encouragement of this work. This paper is dedicated to the memory of my friend and collaborator on Frisian, Eeltje de Vries (1938–2008).
Cite this article
Scannell, K.P. Statistical unicodification of African languages. Lang Resources & Evaluation 45, 375–386 (2011). https://doi.org/10.1007/s10579-011-9150-3
DOI: https://doi.org/10.1007/s10579-011-9150-3