M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Li, Lei; Yin, Yuwei; Li, Shicheng; Chen, Liang; Wang, Peiyi; Ren, Shuhuai; Li, Mukai; Yang, Yazheng; Xu, Jingjing; Sun, Xu; Kong, Lingpeng; Liu, Qi

arXiv:2306.04387 (cs)

[Submitted on 7 Jun 2023 (v1), last revised 8 Jun 2023 (this version, v2)]

Abstract: Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited by the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M$^3$IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a VLM trained on our M$^3$IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. We have open-sourced the dataset to encourage further research.

Comments:	Fix dataset url: this https URL Project: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2306.04387 [cs.CV]
	(or arXiv:2306.04387v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.04387

Submission history

From: Lei Li
[v1] Wed, 7 Jun 2023 12:35:37 UTC (7,285 KB)
[v2] Thu, 8 Jun 2023 13:44:24 UTC (3,659 KB)
