Abstract
Knowledge-based visual question answering (KB-VQA) requires answering questions about a given image with the assistance of external knowledge. Recently, researchers have generally designed various multimodal networks to extract visual and textual semantic features for KB-VQA. Despite significant progress, caption information, a textual form of image semantics that can provide visually non-obvious cues for the reasoning process, is often ignored. In this paper, we introduce a novel framework, the Knowledge Based Caption Enhanced Net (KBCEN), designed to integrate caption information into the KB-VQA process. Specifically, for better knowledge reasoning, we exploit caption information comprehensively from both explicit and implicit perspectives. For the former, we explicitly link caption entities, together with object tags and question entities, to a knowledge graph. For the latter, a pre-trained multimodal BERT with natural implicit knowledge is leveraged to co-represent caption tokens, object regions, and question tokens. Moreover, we develop a mutual correlation module to discern intricate correlations between explicit and implicit representations, thereby facilitating knowledge integration and final prediction. We conduct extensive experiments on three publicly available datasets (i.e., OK-VQA v1.0, OK-VQA v1.1 and A-OKVQA). Both quantitative and qualitative results demonstrate the superiority and rationality of the proposed KBCEN.
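To make the two-branch design concrete, the sketch below illustrates one way the mutual correlation idea described above could be realized: bidirectional cross-attention between an "explicit" knowledge-graph representation and an "implicit" multimodal-BERT representation, followed by answer classification. This is a minimal, hypothetical sketch rather than the authors' implementation; the module names, feature dimensions, use of a graph encoder such as an R-GCN, and the answer-vocabulary size are all assumptions.

```python
# Minimal sketch (not the authors' code) of fusing explicit graph features with
# implicit multimodal-BERT features via bidirectional cross-attention.
import torch
import torch.nn as nn

class MutualCorrelation(nn.Module):
    """Cross-attend explicit (graph) and implicit (BERT) features in both directions."""
    def __init__(self, dim: int = 768, heads: int = 8, num_answers: int = 2250):
        super().__init__()
        self.exp_to_imp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.imp_to_exp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, explicit_feats, implicit_feats):
        # explicit_feats: (B, N_nodes, dim), e.g. graph-encoder outputs over linked
        #                 caption/question entities and object tags
        # implicit_feats: (B, N_tokens, dim), e.g. multimodal-BERT outputs over
        #                 question tokens, caption tokens and object regions
        exp_ctx, _ = self.exp_to_imp(explicit_feats, implicit_feats, implicit_feats)
        imp_ctx, _ = self.imp_to_exp(implicit_feats, explicit_feats, explicit_feats)
        # Pool each stream and predict an answer over a fixed vocabulary.
        fused = torch.cat([exp_ctx.mean(dim=1), imp_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = MutualCorrelation()
    graph_nodes = torch.randn(2, 40, 768)   # explicit branch: entity-node embeddings
    bert_tokens = torch.randn(2, 120, 768)  # implicit branch: co-represented tokens/regions
    print(model(graph_nodes, bert_tokens).shape)  # torch.Size([2, 2250])
```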
Data availability
No datasets were generated or analyzed during the current study.
Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2021YFF0901000), the National Natural Science Foundation of China (No. 62337001, No. 62376086), Joint Funds of the National Natural Science Foundation of China (Grant No. U22A2094) and the Fundamental Research Funds for the Central Universities.
Author information
Contributions
Bin Feng was involved in conceptualization, investigation, methodology, formal analysis, writing—original draft. Shulan Ruan helped in software, validation, writing—review and editing. Likang Wu contributed to data curation, investigation. Huijie Liu assisted in visualization, formal analysis. Kai Zhang was involved in investigation, supervision. Kun Zhang helped in writing—review and editing. Qi Liu assisted in writing—review and editing, resources, supervision. Enhong Chen performed project administration, supervision.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, B., Ruan, S., Wu, L. et al. Caption matters: a new perspective for knowledge-based visual question answering. Knowl Inf Syst 66, 6975–7003 (2024). https://doi.org/10.1007/s10115-024-02166-8
DOI: https://doi.org/10.1007/s10115-024-02166-8