Abstract
An image caption is a sentence summarizing the semantic content of an image, and image captioning is a blended application of computer vision and natural language processing. Earlier research addressed this problem with machine learning approaches, building image captioning frameworks on hand-engineered feature extraction techniques. With the resurgence of deep learning, the development of more accurate and efficient image captioning frameworks is on the rise. Image captioning is witnessing tremendous growth in domains such as medical imaging, remote sensing, security, visual assistance, and multimodal search engines. In this survey, we comprehensively study image captioning frameworks based on our proposed domain-specific taxonomy. We explore the benchmark datasets and metrics used for training and evaluating image captioning models across application domains, and we perform a comparative analysis of the reviewed models. Natural image captioning, medical image captioning, and remote sensing image captioning are currently among the most prominent application domains. Achieving efficient real-time image captioning remains a challenging obstacle that limits deployment in sensitive areas such as visual aid, remote security, and healthcare. Further challenges include the scarcity of rich domain-specific datasets, training complexity, evaluation difficulty, and a lack of cross-domain knowledge transfer techniques. Despite the significant contributions made so far, additional effort is needed to develop robust and reliable image captioning models.
Data availability
This research is based on a comprehensive survey of existing literature and methodologies in the field of image captioning; there are no original datasets associated with this study. The findings and conclusions presented in this manuscript are derived from a thorough analysis of publicly available research papers, articles, and related sources.
Acknowledgements
The authors extend sincere gratitude to the Editor and Reviewers for their insightful comments and helpful suggestions, which improved the quality of this work.
Author information
Contributions
H. Sharma conceptualized the survey, collected and analyzed the data, and drafted the manuscript. Both authors read and approved the final manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Sharma, H., Padha, D. Domain-specific image captioning: a comprehensive review. Int J Multimed Info Retr 13, 20 (2024). https://doi.org/10.1007/s13735-024-00328-6