Abstract
This paper presents ParaDiom, a parallel corpus with 2000 Slovene and English text segments. The text segments are rich in manually annotated idiomatic expressions, which pose a challenge for machine translation systems. We describe the definition of idiomatic expressions, the sampling of the corpus sentences, the annotation scheme, and the general characteristics of the finished corpus. The corpus is intended as a test set for evaluating how machine translation systems perform on figurative language. In the last part of the paper, we demonstrate an example use of the corpus in a machine translation experiment.
Notes
- 1. Kočevje is a region in Slovenia.
Acknowledgements
This work was supported by CLARIN.SI and the Slovenian Research Agency (research core funding No. P2-0069, Advanced Methods of Interaction in Telecommunications).
The authors thank the creators of the ParaCrawl project (paracrawl.eu) and OpenSubtitles (www.opensubtitles.org) for their corpora and OPUS (opus.nlpl.eu) for its service. The authors also thank the HPC RIVR consortium (www.hpc-rivr.si) for the use of the HPC system VEGA at the Institute of Information Science (IZUM).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Donaj, G., Antloga, Š. (2023). ParaDiom – A Parallel Corpus of Idiomatic Texts. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_7
DOI: https://doi.org/10.1007/978-3-031-40498-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science (R0)