Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

García-Ferrero, Iker; Agerri, Rodrigo; Salazar, Aitziber Atutxa; Cabrio, Elena; de la Iglesia, Iker; Lavelli, Alberto; Magnini, Bernardo; Molinet, Benjamin; Ramirez-Romero, Johana; Rigau, German; Villa-Gonzalez, Jose Maria; Villata, Serena; Zaninello, Andrea

arXiv:2404.07613 (cs)

[Submitted on 11 Apr 2024]

Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.

Comments:	LREC-COLING 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2404.07613 [cs.CL]
	(or arXiv:2404.07613v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.07613

Submission history

From: Iker García-Ferrero
[v1] Thu, 11 Apr 2024 10:01:32 UTC (162 KB)
