Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Zhang, Yichi; Dong, Yinpeng; Zhang, Siyuan; Min, Tianzan; Su, Hang; Zhu, Jun

arXiv:2404.11207 (cs)

[Submitted on 17 Apr 2024]

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to that of specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption in existing visual prompting methods and to enhance the transferability of the learned prompts: 1) Feature Consistency Alignment, which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; and 2) Task Semantics Enrichment, which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.

Comments:	Accepted in CVPR 2024 as Poster (Highlight)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2404.11207 [cs.CV]
	(or arXiv:2404.11207v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.11207

Submission history

From: Yichi Zhang
[v1] Wed, 17 Apr 2024 09:39:07 UTC (11,154 KB)
