Abstract
An image caption is a sentence summarizing the semantic content of an image, and image captioning is a blended application of computer vision and natural language processing. Earlier research addressed this problem with machine learning approaches, building image captioning frameworks on hand-engineered feature extraction techniques. With the resurgence of deep learning, the development of more accurate and efficient image captioning frameworks is on the rise. Image captioning is witnessing tremendous growth in domains such as medical imaging, remote sensing, security, visual assistance, and multimodal search engines. In this survey, we comprehensively study image captioning frameworks based on our proposed domain-specific taxonomy. We explore the benchmark datasets and metrics used for training and evaluating image captioning models across application domains, and we perform a comparative analysis of the reviewed models. Natural image captioning, medical image captioning, and remote sensing image captioning are currently among the most prominent application domains. Achieving efficient real-time image captioning remains a challenging obstacle that limits deployment in sensitive areas such as visual aid, remote security, and healthcare. Further challenges include the scarcity of rich domain-specific datasets, training complexity, evaluation difficulty, and a lack of cross-domain knowledge transfer techniques. Despite the significant contributions made so far, additional effort is needed to develop robust and reliable image captioning models.
Data availability
This research is based on a comprehensive survey of existing literature and methodologies in the field of image captioning; there are no original datasets associated with this study. The findings and conclusions presented in this manuscript are derived from a thorough analysis of publicly available research papers, articles, and related sources.
Acknowledgements
The authors extend sincere gratitude to the Editor and Reviewers for their insightful comments and helpful suggestions, which improved the quality of this work.
Author information
Contributions
H. Sharma conceptualized the survey, collected and analyzed the data, and drafted the manuscript. Both authors read and approved the final manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Sharma, H., Padha, D. Domain-specific image captioning: a comprehensive review. Int J Multimed Info Retr 13, 20 (2024). https://doi.org/10.1007/s13735-024-00328-6