Abstract
Scene text with occlusions is common in the real world, and recognizing occluded text is important for many machine vision applications. However, the problem remains underexplored: public datasets do not represent occlusions well, and methods designed specifically for occluded text are scarce. In this work, we discuss different kinds of occlusions and propose an occluded scene text enhancing network to improve recognition performance. The network is based on generative adversarial networks, and we design accretion blocks to help the network generate the occluded image regions. The model is independent of the recognition network, so it can be readily used in different frameworks and can be trained without text-content annotations. We also refine the training objective to improve the framework. Experiments on several public benchmarks demonstrate that the proposed method effectively enhances occluded text images, improving recognition accuracy by over 10% on several state-of-the-art frameworks. Meanwhile, the network does not noticeably degrade text images without occlusions.
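The recognizer-independent design described above can be sketched as a small PyTorch pipeline: an encoder-decoder enhancer with residual blocks standing in for the paper's accretion blocks (whose internals are not specified in the abstract), chained in front of an arbitrary recognizer. All class and parameter names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AccretionBlockStub(nn.Module):
    """Hypothetical stand-in for the paper's accretion block:
    a residual convolution block refining occluded regions."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class Enhancer(nn.Module):
    """Encoder -> accretion-style blocks -> decoder; maps an
    occluded text image to an enhanced image of the same size."""
    def __init__(self, ch=16, n_blocks=2):
        super().__init__()
        self.encode = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.Sequential(
            *[AccretionBlockStub(ch) for _ in range(n_blocks)]
        )
        self.decode = nn.Conv2d(ch, 3, 3, padding=1)
    def forward(self, x):
        return torch.sigmoid(self.decode(self.blocks(self.encode(x))))

# Because the enhancer is independent of the recognizer, it can be
# chained in front of any recognition model without retraining it:
occluded = torch.rand(1, 3, 32, 100)  # a typical text-line input size
enhanced = Enhancer()(occluded)       # same shape, values in [0, 1]
```

In this sketch, the enhancer would be trained adversarially (generator vs. discriminator on enhanced vs. clean images), so no text-content labels are needed, matching the annotation-free training claim in the abstract.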
Funding
This work was partly supported by the National Key Research and Development Program of China (Grant number 2018AAA0103203).
Ethics declarations
Conflict of interest
The authors have no competing interests relevant to the content of this article to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gong, Y., Zhang, Z., Duan, G. et al. AccNet: occluded scene text enhancing network with accretion blocks. Machine Vision and Applications 34, 1 (2023). https://doi.org/10.1007/s00138-022-01351-5