Abstract
Long-term temporal fusion is frequently employed in camera-based Bird’s-Eye-View (BEV) 3D object detection to improve the detection of occluded objects. Existing methods fall into two categories: parallel fusion and recurrent fusion. Recurrent fusion reduces inference latency and memory consumption but does not exploit long-term information as effectively as parallel fusion. In this paper, we first identify two shortcomings of the recurrent fusion paradigm: (1) gradients of previous BEV features cannot directly contribute to the fusion module; (2) semantic ambiguity is caused by the coarse granularity of the BEV grids when aligning BEV features. Based on this analysis, we propose RecurrentBEV, a novel recurrent temporal fusion method for BEV-based 3D object detection. By adopting RNN-style back-propagation and a newly designed inner-grid transformation, RecurrentBEV improves long-term fusion ability while retaining the low latency and memory consumption of recurrent inference. Extensive experiments on the nuScenes benchmark demonstrate its effectiveness, achieving a new state-of-the-art performance of 57.4% mAP and 65.1% NDS on the test set. The real-time version (25.6 FPS) achieves 44.5% mAP and 54.9% NDS without external datasets, outperforming the previous best method StreamPETR by 1.3% mAP and 0.9% NDS. The code is available at https://github.com/lucifer443/RecurrentBEV.
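To make the two fixes concrete, the following is a minimal PyTorch sketch of one recurrent BEV fusion step. It is written under stated assumptions rather than taken from the paper's code: the names `align_bev` and `RecurrentFusion`, the affine ego-motion warp, and the simple concat-conv fusion are illustrative stand-ins for the paper's fusion module and inner-grid transformation.

```python
# Hypothetical sketch of one recurrent BEV fusion step (PyTorch).
# Not the authors' implementation: align_bev, RecurrentFusion, and the
# concat-conv fusion below are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


def align_bev(prev_bev: torch.Tensor, ego_theta: torch.Tensor) -> torch.Tensor:
    """Warp the previous BEV feature into the current ego frame.

    prev_bev:  (B, C, H, W) BEV feature from the previous timestep.
    ego_theta: (B, 2, 3) affine matrices mapping current-frame BEV
               coordinates (normalized to [-1, 1]) into the previous
               frame, derived from ego motion.
    """
    grid = F.affine_grid(ego_theta, size=prev_bev.shape, align_corners=False)
    # Bilinear grid_sample interpolates between the four nearest cells;
    # the coarse BEV grid granularity at this step is the source of the
    # semantic ambiguity identified in the paper.
    return F.grid_sample(prev_bev, grid, align_corners=False)


class RecurrentFusion(nn.Module):
    """Fuses the aligned history feature with the current BEV feature."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, curr_bev, prev_bev, ego_theta):
        aligned = align_bev(prev_bev, ego_theta)
        # RNN-style back-propagation: prev_bev keeps its computation
        # graph, so losses at later frames update earlier fusion steps.
        # Calling prev_bev.detach() here would reproduce shortcoming (1).
        return self.fuse(torch.cat([curr_bev, aligned], dim=1))
```

In training, a clip of frames would be unrolled with each step's fused output fed back as `prev_bev` for the next, letting gradients flow through time; at inference only the single latest BEV feature needs to be cached, which preserves the latency and memory advantages of the recurrent paradigm.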
Notes
1. Details of evaluation metrics: https://www.nuscenes.org/object-detection.
2. Similar performance to VideoBEV [8] without stereo matching (VideoBEV-D) with 60 training epochs.
References
Ballas, N., Yao, L., Pal, C., Courville, A.: Delving deeper into convolutional networks for learning video representations. In: ICLR (2016)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
Chen, D., Li, J., Guizilini, V., Ambrus, R.A., Gaidon, A.: Viewpoint equivariance for multi-view 3D object detection. In: CVPR (2023)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI 40(4), 834–848 (2017)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)
Feng, C., Jie, Z., Zhong, Y., Chu, X., Ma, L.: Aedet: azimuth-invariant multi-view 3D object detection. In: CVPR (2023)
Han, C., et al.: Exploring recurrent long-term temporal fusion for multi-view 3D perception. arXiv preprint arXiv:2303.05970 (2023)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Huang, B., et al.: Fast-bev: towards real-time on-vehicle bird’s-eye view perception. arXiv preprint arXiv:2301.07870 (2023)
Huang, J., Huang, G.: Bevdet4d: exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054 (2022)
Huang, J., Huang, G.: Bevpoolv2: a cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111 (2022)
Huang, J., Huang, G., Zhu, Z., Ye, Y., Du, D.: Bevdet: high-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021)
Huang, L., et al.: Leveraging vision-centric multi-modal expertise for 3D object detection. In: NeurIPS (2023)
Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NeurIPS (2015)
Jiang, Y., et al.: Polarformer: multi-camera 3D object detection with polar transformer. In: AAAI (2023)
Lee, Y., Hwang, J.W., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: CVPRW (2019)
Li, H., et al.: DFA3D: 3D deformable attention for 2D-to-3D feature lifting. In: ICCV (2023)
Li, Y., Chen, Y., Qi, X., Li, Z., Sun, J., Jia, J.: Unifying voxel-based representation with transformer for 3D object detection. In: NeurIPS (2022)
Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: Bevstereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo. In: AAAI (2023)
Li, Y., et al.: Bevdepth: acquisition of reliable depth for multi-view 3D object detection. In: AAAI (2023)
Li, Z., et al.: Bevformer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: ECCV (2022)
Li, Z., Yu, Z., Wang, W., Anandkumar, A., Lu, T., Alvarez, J.M.: FB-BEV: BEV representation from forward-backward view transformations. In: ICCV (2023)
Lin, T.Y., et al.: Microsoft coco: common objects in context. In: ECCV (2014)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4d: multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581 (2022)
Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4d v2: recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018 (2023)
Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: Sparsebev: high-performance sparse 3D object detection from multi-camera videos. In: ICCV (2023)
Liu, Y., Wang, T., Zhang, X., Sun, J.: Petr: position embedding transformation for multi-view 3D object detection. In: ECCV (2022)
Liu, Y., et al.: PETRV2: a unified framework for 3D perception from multi-camera images. In: ICCV (2023)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR (2022)
Luo, Z., Zhou, C., Zhang, G., Lu, S.: DETR4D: direct multi-view 3D object detection with sparse attention. arXiv preprint arXiv:2212.07849 (2022)
Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3D object detection? In: ICCV (2021)
Park, J., et al.: Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection. In: ICLR (2023)
Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: ECCV (2020)
Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In: ICCV (2023)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning (2022)
Wang, Z., Huang, Z., Fu, J., Wang, N., Liu, S.: Object as query: lifting any 2D object detector to 3D detection. In: ICCV (2023)
Xiong, K., et al.: Cape: camera view position embedding for multi-view 3D object detection. In: CVPR (2023)
Yang, C., et al.: Bevformer V2: adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In: CVPR (2023)
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: CVPR (2021)
Zhang, J., Zhang, Y., Liu, Q., Wang, Y.: SA-BEV: generating semantic-aware bird’s-eye-view feature for multi-view 3D object detection. In: ICCV (2023)
Zhou, H., Ge, Z., Li, Z., Zhang, X.: Matrixvt: efficient multi-camera to BEV transformation for 3D perception. In: ICCV (2023)
Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492 (2019)
Acknowledgements
This work is partially supported by the National Key R&D Program of China (under Grant 2023YFB4502200), the NSF of China (under Grants U22A2028, 61925208, 62102399, 62222214, 62341411, 62102398, U20A20227, 62372436, 62302478, 62302482, 62302483, 62302480), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDB0660200, XDB0660201, XDB0660202), the CAS Project for Young Scientists in Basic Research (YSBR-029), the Youth Innovation Promotion Association CAS, and the Xplore Prize.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chang, M., Zhang, X., Zhang, R., Zhao, Z., He, G., Liu, S. (2025). RecurrentBEV: A Long-Term Temporal Fusion Framework for Multi-view 3D Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15130. Springer, Cham. https://doi.org/10.1007/978-3-031-73220-1_8
DOI: https://doi.org/10.1007/978-3-031-73220-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73219-5
Online ISBN: 978-3-031-73220-1
eBook Packages: Computer Science, Computer Science (R0)