Link to original content: https://api.crossref.org/works/10.3390/SYM15010190
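
The JSON record below is the Crossref REST API response for this work, as served at the URL above. As a minimal sketch (Python standard library only; no API key is required, and the field names simply mirror the response reproduced below), the same record could be fetched and inspected like this:

    # Sketch: retrieve the Crossref work record linked above and read a few fields.
    # The keys used here ("message", "title", "DOI", "reference") follow the
    # response shown below; nothing beyond the public REST endpoint is assumed.
    import json
    import urllib.request

    URL = "https://api.crossref.org/works/10.3390/sym15010190"

    with urllib.request.urlopen(URL) as resp:
        record = json.load(resp)

    work = record["message"]
    print(work["title"][0])        # Full-Memory Transformer for Image Captioning
    print(work["DOI"])             # 10.3390/sym15010190
    print(len(work["reference"]))  # 47 references listed in the record
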
{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,23]],"date-time":"2024-09-23T04:30:37Z","timestamp":1727065837381},"reference-count":47,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T00:00:00Z","timestamp":1673222400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Hubei Technology Innovation Project","award":["2019AAA045"]},{"name":"Graduate Innovative Fund of Wuhan Institute of Technology","award":["CX2021244"]},{"name":"National Natural Science Foundation of China","award":["62072350"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"The Transformer-based approach represents the state-of-the-art in image captioning. However, existing studies have shown Transformer has a problem that irrelevant tokens with overlapping neighbors incorrectly attend to each other with relatively large attention scores. We believe that this limitation is due to the incompleteness of the Self-Attention Network (SAN) and Feed-Forward Network (FFN). To solve this problem, we present the Full-Memory Transformer method for image captioning. The method improves the performance of both image encoding and language decoding. In the image encoding step, we propose the Full-LN symmetric structure, which enables stable training and better model generalization performance by symmetrically embedding Layer Normalization on both sides of the SAN and FFN. In the language decoding step, we propose the Memory Attention Network (MAN), which extends the traditional attention mechanism to determine the correlation between attention results and input sequences, guiding the model to focus on the words that need to be attended to. Our method is evaluated on the MS COCO dataset and achieves good performance, improving the result in terms of BLEU-4 from 38.4 to 39.3.<\/jats:p>","DOI":"10.3390\/sym15010190","type":"journal-article","created":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T09:42:17Z","timestamp":1673257337000},"page":"190","source":"Crossref","is-referenced-by-count":5,"title":["Full-Memory Transformer for Image Captioning"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-3900-6456","authenticated-orcid":false,"given":"Tongwei","family":"Lu","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China"},{"name":"Hubei Key Laboratory of Intelligent Robot Wuhan Institute of Technology, Wuhan 430205, China"}]},{"given":"Jiarong","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China"},{"name":"Hubei Key Laboratory of Intelligent Robot Wuhan Institute of Technology, Wuhan 430205, China"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-6846-8916","authenticated-orcid":false,"given":"Fen","family":"Min","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China"},{"name":"Hubei Key Laboratory of Intelligent Robot Wuhan Institute of Technology, Wuhan 430205, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,9]]},"reference":[{"key":"ref_1","unstructured":"Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E.P. (July, January 29). 
Toward controlled generation of text. Proceedings of the International Conference on Machine Learning, San Juan, PR, USA."},{"key":"ref_2","unstructured":"Johnson, J., Karpathy, A., and Fei-Fei, L. (1997, January 17\u201319). Densecap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_3","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (1997, January 17\u201319). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_4","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_5","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (1997, January 17\u201319). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_6","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst., 30, Available online: https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_8","unstructured":"Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2012, January 7\u201313). Spice: Semantic propositional image caption evaluation. Proceedings of the European Conference on Computer Vision, Florence, Italy."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1485","DOI":"10.1109\/JPROC.2010.2050411","article-title":"I2t: Image parsing to text description","volume":"98","author":"Yao","year":"2010","journal-title":"Proc. IEEE"},{"key":"ref_10","unstructured":"Karpathy, A., and Fei-Fei, L. (1997, January 17\u201319). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_11","unstructured":"Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T., and Daum\u00e9 III, H. (2012, January 23\u201327). Midge: Generating image descriptions from computer vision detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X., Zweig, G., and Mitchell, M. (2015). Language models for image captioning: The quirks and what works. arXiv.","DOI":"10.3115\/v1\/P15-2017"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, F., Ren, X., Liu, Y., Wang, H., and Sun, X. (2018). 
simnet: Stepwise image-topic merging network for generating detailed and comprehensive image captions. arXiv.","DOI":"10.18653\/v1\/D18-1013"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wu, B., Shen, F., Fan, Y., Zhang, Y., Shen, H.T., and Liu, W. (2019, January 15\u201320). Exact adversarial attack to image captioning via structured output learning with latent variables. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00426"},{"key":"ref_15","unstructured":"Wang, W., Chen, Z., and Hu, H. (February, January 27). Hierarchical attention network for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_16","unstructured":"Huang, L., Wang, W., Xia, Y., and Chen, J. (2019). Adaptively aligned image captioning via adaptive attention time. Adv. Neural Inf. Process. Syst., 32, Available online: https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/fecc3a370a23d13b1cf91ac3c1e1ca92-Paper.pdf."},{"key":"ref_17","unstructured":"Ramanishka, V., Das, A., Zhang, J., and Saenko, K. (1997, January 17\u201319). Top-down visual saliency guided by captions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_18","unstructured":"Dai, B., Fidler, S., Urtasun, R., and Lin, D. (1997, January 17\u201319). Towards diverse and natural image descriptions via a conditional gan. Proceedings of the IEEE International Conference on Computer Vision, San Juan, PR, USA."},{"key":"ref_19","unstructured":"Gao, Y., Beijbom, O., Zhang, N., and Darrell, T. (1997, January 17\u201319). Compact bilinear pooling. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_20","unstructured":"Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training with recurrent neural networks. arXiv."},{"key":"ref_21","unstructured":"Chen, C., Mu, S., Xiao, W., Ye, Z., Wu, L., and Ju, Q. (February, January 27). Improving image captioning with conditional generative adversarial nets. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_22","unstructured":"Gu, J., Cai, J., Wang, G., and Chen, T. (February, January 27). Stack-captioning: Coarse-to-fine learning for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_23","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (1997, January 17\u201319). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_24","unstructured":"You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (1997, January 17\u201319). Image captioning with semantic attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_25","unstructured":"Pedersoli, M., Lucas, T., Schmid, C., and Verbeek, J. (1997, January 17\u201319). Areas of attention for image captioning. Proceedings of the IEEE International Conference on Computer Vision, San Juan, PR, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Lu, J., Yang, J., Batra, D., and Parikh, D. (2018, January 17\u201319). Neural baby talk. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA.","DOI":"10.1109\/CVPR.2018.00754"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Yang, X., Tang, K., Zhang, H., and Cai, J. (2019, January 15\u201320). Auto-encoding scene graphs for image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01094"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Chen, S., Jin, Q., Wang, P., and Wu, Q. (2020, January 13\u201319). Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00998"},{"key":"ref_29","unstructured":"Mathews, A., Xie, L., and He, X. (1997, January 17\u201319). Semstyle: Learning to generate stylised image captions using unaligned text. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_30","unstructured":"Cornia, M., Baraldi, L., and Cucchiara, R. (1997, January 17\u201319). Show, control and tell: A framework for generating controllable and grounded captions. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_31","unstructured":"Chunseong Park, C., Kim, B., and Kim, G. (1997, January 17\u201319). Attend to you: Personalized image captioning with context sequence memory networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_32","unstructured":"Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (July, January 29). On layer normalization in the transformer architecture. Proceedings of the International Conference on Machine Learning, San Juan, PR, USA."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Jiang, W., Ma, L., Jiang, Y.G., Liu, W., and Zhang, T. (2018, January 8\u201314). Recurrent fusion network for image captioning. Proceedings of the European conference on computer vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"ref_34","unstructured":"Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst., 32, Available online: https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html."},{"key":"ref_35","unstructured":"Zhuang, L., Wayne, L., Ya, S., and Jun, Z. (2021, January 13\u201315). A robustly optimized BERT pre-training approach with post-training. Proceedings of the 20th Chinese National Conference on Computational Linguistics, Hohhot, China."},{"key":"ref_36","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., and Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv.","DOI":"10.18653\/v1\/P19-1285"},{"key":"ref_38","unstructured":"Ravula, A., Alberti, C., Ainslie, J., Yang, L., Pham, P.M., Wang, Q., Ontanon, S., Sanghai, S.K., Cvicek, V., and Fisher, Z. (2020, January 16\u201320). 
ETC: Encoding long and structured inputs in transformers. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Available online: https:\/\/aclanthology.org\/volumes\/2020.emnlp-demos\/."},{"key":"ref_39","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI blog"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_41","unstructured":"Herdade, S., Kappeler, A., Boakye, K., and Soares, J. (2019). Image captioning: Transforming objects into words. Adv. Neural Inf. Process. Syst., 32, Available online: https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/680390c55bbd9ce416d1d69a9ab4760d-Abstract.html."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Osolo, R.I., Yang, Z., and Long, J. (2021). An Attentive Fourier-Augmented Image-Captioning Transformer. Appl. Sci., 11.","DOI":"10.3390\/app11188354"},{"key":"ref_43","first-page":"1","article-title":"Semantic association enhancement transformer with relative position for image captioning","volume":"15","author":"Jia","year":"2022","journal-title":"Multimed. Tools Appl."},{"key":"ref_44","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2012, January 7\u201313). Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision, Florence, Italy."},{"key":"ref_45","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (1997, January 17\u201319). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA."},{"key":"ref_46","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_47","unstructured":"Lin, C.Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/15\/1\/190\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,24]],"date-time":"2024-08-24T21:16:25Z","timestamp":1724534185000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/15\/1\/190"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,9]]},"references-count":47,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["sym15010190"],"URL":"https:\/\/doi.org\/10.3390\/sym15010190","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2023,1,9]]}}}
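
For context on the abstract above: the Full-LN structure is described only at a high level, as Layer Normalization embedded symmetrically on both sides of the self-attention (SAN) and feed-forward (FFN) sublayers. A hypothetical PyTorch-style encoder layer illustrating that placement might look like the sketch below; the class name, dimensions, and residual placement are assumptions for illustration, not the authors' implementation.

    # Illustration only: LayerNorm on both the input and output side of the
    # SAN and FFN sublayers, following the description in the abstract above.
    # All hyperparameters and the residual placement are assumptions.
    import torch
    import torch.nn as nn

    class FullLNEncoderLayer(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            # Two LayerNorms per sublayer: one before and one after.
            self.ln_attn_in = nn.LayerNorm(d_model)
            self.ln_attn_out = nn.LayerNorm(d_model)
            self.ln_ffn_in = nn.LayerNorm(d_model)
            self.ln_ffn_out = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x):  # x: (batch, regions, d_model)
            # SAN sublayer: normalize, attend, residual, normalize again.
            h = self.ln_attn_in(x)
            h, _ = self.attn(h, h, h, need_weights=False)
            x = self.ln_attn_out(x + self.drop(h))
            # FFN sublayer: normalize, transform, residual, normalize again.
            h = self.ffn(self.ln_ffn_in(x))
            x = self.ln_ffn_out(x + self.drop(h))
            return x

    # Example: encode a batch of 10 image-region features of dimension 512.
    features = torch.randn(2, 10, 512)
    print(FullLNEncoderLayer()(features).shape)  # torch.Size([2, 10, 512])
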