Domain-specific image captioning: a comprehensive review

  • Trends and Surveys
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

An image caption is a sentence summarizing the semantic details of an image, and generating one automatically is a blended application of computer vision and natural language processing. Earlier research addressed this task with machine learning approaches, modeling image captioning frameworks on hand-engineered feature extraction techniques. With the resurgence of deep learning, the development of improved and more efficient image captioning frameworks is on the rise. Image captioning is witnessing tremendous growth in domains such as medicine, remote sensing, security, visual assistance, and multimodal search engines. In this survey, we comprehensively study image captioning frameworks according to our proposed domain-specific taxonomy. We explore the benchmark datasets and metrics leveraged for training and evaluating image captioning models in various application domains, and we perform a comparative analysis of the reviewed models. Natural image captioning, medical image captioning, and remote sensing image captioning are currently the most prominent application domains. Achieving efficient real-time captioning remains a challenging obstacle that limits deployment in sensitive areas such as visual aid, remote security, and healthcare. Further challenges include the scarcity of rich domain-specific datasets, training complexity, evaluation difficulty, and a lack of cross-domain knowledge transfer techniques. Despite the significant contributions made so far, additional effort is needed to develop robust and influential image captioning models.
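
To make the surveyed design space concrete, most deep-learning captioners pair a visual encoder with a language decoder. The following is a minimal sketch of that recipe, assuming PyTorch; the ResNet-18 backbone and all hyperparameters are illustrative assumptions and correspond to no specific model reviewed here.

```python
# Illustrative encoder-decoder captioner (a sketch only; the ResNet-18
# backbone, vocab_size=10000, embed_dim=256, and hidden_dim=512 are
# assumptions for demonstration, not values taken from the survey).
import torch
import torch.nn as nn
from torchvision import models

class CaptionerSketch(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a CNN with its classification head removed.
        backbone = models.resnet18(weights=None)  # load pretrained weights in practice
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        # Decoder: an LSTM language model conditioned on the image feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) integer token ids.
        feats = self.encoder(images).flatten(1)    # (B, 512) global image feature
        feats = self.img_proj(feats).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)               # (B, T, E)
        seq = torch.cat([feats, words], dim=1)     # image feature acts as step 0
        hidden, _ = self.lstm(seq)                 # (B, T+1, H)
        return self.out(hidden)                    # per-step vocabulary logits

model = CaptionerSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```

Training such a sketch would minimize the cross-entropy between each predicted token distribution and the corresponding ground-truth caption token.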
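
Evaluation is typically reference-based: a generated caption is scored against several human-written captions by n-gram overlap. Below is a minimal sketch using NLTK's BLEU implementation; the candidate and reference captions are invented placeholders, not data from the survey.

```python
# Sketch of reference-based caption scoring with BLEU (NLTK). The captions
# below are invented placeholders, not survey data.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # one list of reference captions per image
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "is", "running", "along", "the", "beach"]],
]
candidates = [["a", "dog", "runs", "along", "the", "beach"]]  # model outputs

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = corpus_bleu(references, candidates, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Metrics such as METEOR, ROUGE, CIDEr, and SPICE follow the same reference-based pattern with different matching criteria.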

Data availability

This research is based on a comprehensive survey of existing literature and methodologies in the field of image captioning; there are no original datasets associated with this study. The findings and conclusions presented in this manuscript are derived from a thorough analysis of publicly available research papers, articles, and related sources.

Notes

  1. https://codalab.lisn.upsaclay.fr/competitions/7404#results.

Acknowledgements

The authors extend sincere gratitude to the Editor and Reviewers, whose insightful remarks and constructive suggestions improved this work.

Author information

Contributions

H. Sharma conceptualized the survey, collected and analyzed the data, and drafted the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Himanshu Sharma.

Ethics declarations

Conflict of interest

The authors confirm that the Conflict of Interest and Informed Consent statements in this manuscript comply with the journal's guidelines.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

This appendix comprises Tables 11, 12, and 13.

Table 11 Comprehensive information about the publishers of the research articles that are cited in this survey
Table 12 A frequency distribution table of the article publishers cited in this survey
Table 13 A frequency distribution table of article types cited in this survey
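
For context, the frequency distributions in Tables 12 and 13 are simple tallies. A minimal sketch of how such a table can be produced (with a hypothetical publisher list, not the survey's actual data):

```python
# Tallying a frequency distribution of cited publishers (the sample list is
# hypothetical; it is not the data behind Table 12).
from collections import Counter

publishers = ["IEEE", "Springer", "ACL", "IEEE", "Elsevier", "IEEE", "Springer"]
for publisher, count in Counter(publishers).most_common():
    print(f"{publisher}: {count}")
```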

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, H., Padha, D. Domain-specific image captioning: a comprehensive review. Int J Multimed Info Retr 13, 20 (2024). https://doi.org/10.1007/s13735-024-00328-6
