Unveiling the Tapestry of Consistency in Large Vision-Language Models

Zhang, Yuan; Xiao, Fei; Huang, Tao; Fan, Chun-Kai; Dong, Hongyuan; Li, Jiawen; Wang, Jiacong; Cheng, Kuan; Zhang, Shanghang; Guo, Haoyuan

arXiv:2405.14156 (cs)

[Submitted on 23 May 2024 (v1), last revised 6 Oct 2024 (this version, v4)]

Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting strong perception and reasoning abilities concerning visual information. However, when faced with prompts whose solution spaces differ in size, LVLMs do not always give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on ConBench, we are the first to reveal the tapestry, with the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) The discriminative and generative realms are related: the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Finally, we improve the consistency of LVLMs by trigger-based diagnostic refinement, indirectly improving the performance of their captions. We hope this paper will help the research community better evaluate their models and encourage future advancements in the consistency domain. The project is available at this https URL.

Comments:	Accepted by NeurIPS 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.14156 [cs.CV]
	(or arXiv:2405.14156v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.14156

Submission history

From: Yuan Zhang
[v1] Thu, 23 May 2024 04:08:23 UTC (10,076 KB)
[v2] Thu, 6 Jun 2024 03:58:29 UTC (10,076 KB)
[v3] Fri, 7 Jun 2024 12:21:57 UTC (10,076 KB)
[v4] Sun, 6 Oct 2024 09:51:25 UTC (10,076 KB)
