DOCCI: Descriptions of Connected and Contrasting Images

Onoe, Yasumasa; Rane, Sunayana; Berger, Zachary; Bitton, Yonatan; Cho, Jaemin; Garg, Roopal; Ku, Alexander; Parekh, Zarana; Pont-Tuset, Jordi; Tanzer, Garrett; Wang, Su; Baldridge, Jason

arXiv:2404.19753 (cs)

[Submitted on 30 Apr 2024]

Abstract: Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow models to learn richer associations. To fill this gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation: a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2404.19753 [cs.CV]
	(or arXiv:2404.19753v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.19753

Submission history

From: Yasumasa Onoe
[v1] Tue, 30 Apr 2024 17:56:24 UTC (15,944 KB)
