Caption matters: a new perspective for knowledge-based visual question answering

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Knowledge-based visual question answering (KB-VQA) requires answering questions about a given image with the assistance of external knowledge. Recently, researchers have generally designed different multimodal networks to extract visual and textual semantic features for KB-VQA. Despite significant progress, caption information, a textual form of image semantics that can provide visually non-obvious cues for the reasoning process, is often ignored. In this paper, we introduce a novel framework, the Knowledge Based Caption Enhanced Net (KBCEN), designed to integrate caption information into the KB-VQA process. Specifically, for better knowledge reasoning, we utilize caption information comprehensively from both explicit and implicit perspectives. For the former, we explicitly link caption entities, together with object tags and question entities, to a knowledge graph; for the latter, a pre-trained multimodal BERT with natural implicit knowledge is leveraged to co-represent caption tokens, object regions, and question tokens. Moreover, we develop a mutual correlation module to discern intricate correlations between explicit and implicit representations, thereby facilitating knowledge integration and final prediction. We conduct extensive experiments on three publicly available datasets (i.e., OK-VQA v1.0, OK-VQA v1.1, and A-OKVQA). Both quantitative and qualitative results demonstrate the superiority and rationality of our proposed KBCEN.
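To make the two-branch design in the abstract concrete, the following is a minimal sketch, not the authors' released code, of a KBCEN-style pipeline: an explicit branch over knowledge-graph entities linked from the caption, object tags, and question; an implicit branch that co-represents caption, region, and question tokens with a multimodal encoder; and a mutual correlation module that fuses the two for answer prediction. All class names (KBCENSketch, MutualCorrelation), dimensions, and the stand-in encoders (an embedding table plus a linear layer in place of graph reasoning, a small Transformer encoder in place of the pre-trained multimodal BERT) are illustrative assumptions.

```python
# Minimal sketch of a KBCEN-style two-branch pipeline (illustrative only).
import torch
import torch.nn as nn


class MutualCorrelation(nn.Module):
    """Hypothetical fusion: cross-attend the explicit (graph) and implicit
    (transformer) representations and concatenate their attended summaries."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn_e2i = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn_i2e = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, explicit: torch.Tensor, implicit: torch.Tensor) -> torch.Tensor:
        # explicit: (B, Ne, D) graph-entity features; implicit: (B, Ni, D) token features
        e_ctx, _ = self.attn_e2i(explicit, implicit, implicit)  # explicit attends to implicit
        i_ctx, _ = self.attn_i2e(implicit, explicit, explicit)  # implicit attends to explicit
        return torch.cat([e_ctx.mean(dim=1), i_ctx.mean(dim=1)], dim=-1)  # (B, 2D)


class KBCENSketch(nn.Module):
    def __init__(self, dim: int = 768, num_entities: int = 10_000, num_answers: int = 3_000):
        super().__init__()
        # Explicit branch: stand-in embedding table plus one linear "graph reasoning"
        # step for entities linked from the caption, object tags, and question.
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.graph_reason = nn.Linear(dim, dim)
        # Implicit branch: stand-in for a pre-trained multimodal BERT over caption
        # tokens, object regions, and question tokens.
        self.implicit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.fuse = MutualCorrelation(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, linked_entity_ids: torch.Tensor, multimodal_tokens: torch.Tensor) -> torch.Tensor:
        explicit = torch.relu(self.graph_reason(self.entity_emb(linked_entity_ids)))
        implicit = self.implicit_encoder(multimodal_tokens)
        return self.classifier(self.fuse(explicit, implicit))  # answer logits


if __name__ == "__main__":
    model = KBCENSketch()
    entity_ids = torch.randint(0, 10_000, (2, 12))  # entities linked from caption/tags/question
    tokens = torch.randn(2, 64, 768)                # caption + region + question features
    print(model(entity_ids, tokens).shape)          # torch.Size([2, 3000])
```

A full implementation would presumably substitute a graph neural network over the linked knowledge-graph entities and an actual pre-trained multimodal BERT for the stand-in encoders used here.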


Data availability

No datasets were generated or analyzed during the current study.


Acknowledgements

This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2021YFF0901000), the National Natural Science Foundation of China (Grant Nos. 62337001 and 62376086), the Joint Funds of the National Natural Science Foundation of China (Grant No. U22A2094), and the Fundamental Research Funds for the Central Universities.

Author information

Contributions

Bin Feng was involved in conceptualization, investigation, methodology, formal analysis, writing—original draft. Shulan Ruan helped in software, validation, writing—review and editing. Likang Wu contributed to data curation, investigation. Huijie Liu assisted in visualization, formal analysis. Kai Zhang was involved in investigation, supervision. Kun Zhang helped in writing—review and editing. Qi Liu assisted in writing—review and editing, resources, supervision. Enhong Chen performed project administration, supervision.

Corresponding author

Correspondence to Bin Feng.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Feng, B., Ruan, S., Wu, L. et al. Caption matters: a new perspective for knowledge-based visual question answering. Knowl Inf Syst 66, 6975–7003 (2024). https://doi.org/10.1007/s10115-024-02166-8
