Abstract
Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models has attracted a lot of research in recent years. Experiments convincingly show that parallel sentences extracted from comparable corpora can improve statistical machine translation (SMT). Yet the existing body of research takes into account neither the degree of comparability of the corpus being processed nor the computation time needed to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor crucially depends on the degree of comparability: it is more difficult to mine for parallel data in a weakly comparable corpus than in a strongly comparable one.
Most of the research in parallel data mining from comparable corpora focusses on parallel sentence mining, but parallel phrase mining (i.e. mining sub-sentential fragments) is of equal importance, because it can be more robust on weakly comparable corpora, which usually do not contain whole translated sentences. We will present different approaches to parallel sentence and phrase mining from comparable corpora developed in the ACCURAT project, and we will evaluate them both in terms of absolute measures (e.g. precision, recall and F1) and with respect to their ability to generate significant improvements in the BLEU scores of a statistical translation system. Comprehensive testing of these algorithms in the context of statistical machine translation will be undertaken in Chap. 6.
Notes
- 1.
With the possible exception of parallelising the computations.
- 2.
Or ‘alignments’ or ‘pairs.’ These terms will be used with the same meaning throughout this section.
- 3.
We did not attempt to find the mathematical maximum of the expression from Eq. (5.7), and we realise that the consequence of this choice and of the greedy search procedure is not finding the true optimum.
- 4.
- 5.
We keep functional words lists for all languages.
- 6.
- 7.
We experimented with different power values for the cohesion score. We had the best results with ½ (the square root).
- 8.
But we acknowledge the fact that the probability of a sentence pair being parallel as computed by the classifier of Munteanu and Marcu is a proper model of parallelism.
- 9.
To obtain the dictionaries mentioned throughout this subsection, we have applied GIZA++ on the JRC Acquis corpus (Steinberger et al. 2006).
- 10.
For two source and target words, if the pair is not in the dictionary, we use a 0 to 1 normalised version of the Levenshtein distance in order to assign a ‘translation probability’ based on string similarity alone. If the source and target words are similar above a certain threshold (experimentally set to 0.7), we consider them to be translations.
- 11.
Mostly from the News domain for all language pairs.
- 12.
When an example occurs multiple times with both labels, we retain all the occurrences of the example with the most frequent label and remove all the conflicting occurrences.
- 13.
- 14.
For each parallel sentence, 2 noise sentences were added.
- 15.
- 16.
- 17.
- 18.
These phrases are extracted with the SVM margin that maximises the F-measure, see the ‘Classifier evaluation’ subsection for details.
- 19.
Koehn (2004) reports that an increase of 1% in BLEU score is a significant improvement.
- 20.
And, if it is a set, no source phrase is repeated.
- 21.
The probability threshold above which all generated parallel pairs are correct depends on the type of document pairs. For the English-Romanian pair of parallel documents on which we tested, a probability of at least 0.5 is guaranteed to indicate perfect parallelism (we determined this by manually inspecting the output).
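Note 10 describes a string-similarity fallback for word pairs that are missing from the dictionary. The following is a minimal Python sketch of that idea, not the chapter's implementation. It assumes that the normalisation divides the Levenshtein distance by the length of the longer word and subtracts the result from 1; the note does not state the exact formula. The function names and the example dictionary entry are hypothetical; only the 0.7 threshold comes from the note.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(source: str, target: str) -> float:
    """Similarity in [0, 1]; the normalisation used here is an assumption."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, target) / longest


def translation_score(source: str, target: str, dictionary: dict) -> float:
    """Dictionary probability if the pair is listed, else string similarity."""
    if (source, target) in dictionary:
        return dictionary[(source, target)]
    return string_similarity(source, target)


def similar_enough(source: str, target: str, threshold: float = 0.7) -> bool:
    """Treat an out-of-dictionary pair as translations above the threshold."""
    return string_similarity(source, target) > threshold


# Hypothetical dictionary entry, for illustration only.
toy_dictionary = {("house", "casă"): 0.8}
print(translation_score("house", "casă", toy_dictionary))
print(similar_enough("parliament", "parlament"))
```

Under this normalisation, cognates and near-identical named entities score high, while unrelated words of similar length score low, which is the behaviour the threshold relies on.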
References
Aker, A., Kanoulas, E., & Gaizauskas, R. (2012a). A light way to collect comparable corpora from the Web. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 21–27), Istanbul, Turkey.
Aker, A., Feng, Y., & Gaizauskas, R. (2012b). Automatic bilingual phrase extraction from comparable corpora. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), IIT Bombay, Mumbai, India.
Aswani, N., & Gaizauskas, R. (2010). English-Hindi transliteration using multiple similarity metrics. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.
Borman, S. (2009). The expectation maximization algorithm. A short tutorial. http://www.seanborman.com/publications/EM_algorithm.pdf
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Ceauşu, A. (2009). Statistical machine translation for Romanian. PhD Thesis, Romanian Academy (in Romanian).
Chen, S. F. (1993). Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics (pp. 9–16), Columbus, OH.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, June 2005 (pp. 263–270), Ann Arbor, MI.
Fellbaum, C. (Ed.) (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 57–63), Barcelona, Spain.
Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.
Gao, Q., & Vogel, S. (2008). Parallel implementations of a word alignment tool. In Proceedings of ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, June 20, 2008 (pp. 49–57), Ohio State University, Columbus, OH.
Hewavitharana, S., & Vogel, S. (2011). Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web (BUCC 2011) (pp. 61–68), Portland, OR.
Ion, R. (2012). PEXACC: A parallel sentence mining algorithm from comparable corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 2181–2188), May 21–27, 2012, Istanbul, Turkey.
Ion, R., Ceauşu, A., & Irimia, E. (2011a). An expectation maximization algorithm for textual unit alignment. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011) (pp. 128–135), June 24th, 2011, Portland, OR.
Ion, R., Zhang, X., Su, F., Paramita, M., & Ștefănescu, D. (2011b). Report on Multi-Level Alignment of Comparable Corpora. Technical report no. D2.2 of the ACCURAT Project (http://www.accurat-project.eu/).
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 388–395), Barcelona, Spain.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, September 12–16, 2005 (pp. 79–86), Phuket, Thailand.
Koehn, P., Och, F., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 48–54), May 27–June 1, 2003, Edmonton, Canada.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Cowan, B., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Prague, Czech Republic.
Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1). Cambridge: Cambridge University Press.
Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002) (pp. 289–295), July 6–7, 2002, University of Pennsylvania, Philadelphia, PA.
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (pp. 160–167), July 7–12, 2003, Sapporo, Japan.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Och, F. J., & Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4), 417–449.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 7–12 2002 (pp. 311–318), University of Pennsylvania, Philadelphia, PA.
Quirk, C., Udupa, R., & Menezes, A. (2007). Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the MT Summit XI (pp. 321–327), September, 2007, Copenhagen, Denmark.
Rauf, S. A., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.
Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., et al. (2010). A collection of comparable corpora for under-resourced languages. In Proceedings of the Fourth International Conference Baltic HLT 2010. Frontiers in Artificial Intelligence and Applications (Vol. 219, pp. 161–168), IOS Press.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA 2006): Visions for the Future of Machine Translation (pp. 223–231), Cambridge, MA.
Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 259–268). Association for Computational Linguistics, Athens, Greece.
Ștefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012) (pp. 137–144), May 28–30, 2012, Trento, Italy.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiș, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), May 24–26, 2006, Genoa, Italy.
Steinberger, R., Eisele, A., Klocek, A., Pilos, S., & Schlüter, P. (2012). DGT-TM: A freely available translation memory in 22 languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’2012), May 21–27, 2012, Istanbul, Turkey.
Stolcke, A. (2002). SRILM – An extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing (ICSLP 2002) (pp. 901–904), September 2002, Denver, CO.
Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.
Thi Ngoc Diep, D., Besacier, L., & Castelli, E. (2010). A fully unsupervised approach for mining parallel data from comparable corpora. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT 2010), May 27–28, 2010, Saint-Raphaël, France.
Tillmann, C. (2009). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Suntec, Singapore, August 4th, 2009.
Tsvetkov, Y., & Wintner, S. (2010). Automatic acquisition of parallel corpora from websites with dynamic content. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10) (pp. 3389–3392), Valletta, Malta, May 2010.
Tufiș, D., Ion, R., Ceaușu, A., & Ștefănescu, D. (2006). Improved lexical alignment by combining multiple reified alignments. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006) (pp. 153–160), Trento, Italy, April 3–7 2006.
Tufiș, D., Ion, R., Bozianu, L., Ceaușu, A., & Ștefănescu, D. (2008). Romanian wordnet: Current state, new applications and prospects. In A. Tanacs, D. Csendes, V. Vincze, C. Fellbaum, & P. Vossen (Eds.), Proceedings of 4th Global WordNet Conference, GWC-2008, January 2008 (pp. 441–452). Hungary: University of Szeged.
Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.
Chapter editors: Radu Ion and Dan Tufiș
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Aker, A. et al. (2019). Mapping and Aligning Units from Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_5
DOI: https://doi.org/10.1007/978-3-319-99004-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)