Context-Assisted Attention for Image Captioning

Lian, Zheng; Wang, Rui; Li, Haichang; Hu, Xiaohui

doi:10.1007/978-3-031-15919-0_60

Zheng Lian (ORCID: orcid.org/0000-0003-0682-7589), Rui Wang, Haichang Li & Xiaohui Hu

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13529)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Temporal attention plays a crucial role in modelling the relationships between semantic queries and image regions in current image captioning tasks. Nevertheless, most existing attention-based methods ignore the potential effect of previously attended information on the generation of the current attention context. In this paper, we propose a simple but effective Context-Assisted Attention (CA$^2$) for image captioning, which considers the temporal coherence of the attention contexts during sequence prediction. Specifically, CA$^2$ combines the attention contexts from previous time steps with the features of image regions to form the input key-value pairs of the attention module for current context generation. This enables the sentence decoder not only to attend to the image regions, as in conventional attention, but also to focus on the historical attention contexts when necessary. Furthermore, we present a regularization method tailored to CA$^2$, namely the Weight Transferring Constraint (WTC), which restricts the total weight assigned to the historical contexts in each decoding step. Experiments on the popular MS COCO dataset demonstrate that our method consistently improves LSTM-based baselines and achieves competitive performance, with scores of 38.7 BLEU-4 and 128.5 CIDEr-D.
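
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the scoring function or the exact form of WTC, so the scaled dot-product scoring, the reuse of each vector as both key and value, the hinge-style penalty, and the `budget` parameter are all assumptions made for illustration. The function names are hypothetical as well.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_assisted_attention(query, regions, history):
    """Attend over image regions plus earlier attention contexts.

    Assumption: scaled dot-product scoring, with each vector serving
    as both key and value. The paper may use a different scorer.
    """
    keys = list(regions) + list(history)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(query)
    context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
    # Total attention weight that landed on the historical contexts.
    history_mass = sum(weights[len(regions):])
    return context, weights, history_mass

def history_weight_penalty(history_mass, budget=0.5):
    # Hypothetical stand-in for WTC: penalize history weight above a budget.
    return max(0.0, history_mass - budget)

# Toy decoding loop: three region features and three decoder queries.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
history = []
for q in queries:
    context, weights, history_mass = context_assisted_attention(q, regions, history)
    penalty = history_weight_penalty(history_mass)
    history.append(context)  # the new context becomes a key-value pair later
```

At the first step the history is empty, so the module reduces to ordinary attention over the image regions. At each later step the key-value set grows by one context vector, and `history_mass` is the quantity that a constraint such as WTC would keep in check.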

Author information

Authors and Affiliations

University of Chinese Academy of Sciences, Beijing, 100049, China
Zheng Lian
Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Zheng Lian, Rui Wang, Haichang Li & Xiaohui Hu

Corresponding author

Correspondence to Zheng Lian.

Editor information

Editors and Affiliations

University of the West of England, Bristol, UK
Elias Pimenidis
Lancaster University, Lancaster, UK
Plamen Angelov
Digital Innovation, Teesside University, Middlesbrough, UK
Chrisina Jayne
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
The University of the West of England, Bristol, UK
Mehmet Aydin

About this paper

Cite this paper

Lian, Z., Wang, R., Li, H., Hu, X. (2022). Context-Assisted Attention for Image Captioning. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13529. Springer, Cham. https://doi.org/10.1007/978-3-031-15919-0_60

DOI: https://doi.org/10.1007/978-3-031-15919-0_60
Published: 07 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15918-3
Online ISBN: 978-3-031-15919-0
eBook Packages: Computer Science, Computer Science (R0)
