Abstract
Deep learning methods for 6D object pose estimation based on RGB and depth (RGB-D) images have been successfully applied to robot grasping. Fusing the RGB and depth modalities is one of the main difficulties. Previous works mostly concatenate the two types of features without considering their different contributions to pose estimation. We propose a selective embedding with gated fusion structure, called SEGate, which adaptively adjusts the weights of the RGB and depth features. Furthermore, we aggregate the local features of point clouds according to the distances between points: nearby points contribute strongly to the local features, while distant points contribute little. Experiments show that our approach achieves state-of-the-art performance on both the LineMOD and YCB-Video datasets and is more robust when estimating the poses of occluded objects.
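The core fusion idea described above can be summarized in a few lines of code. The sketch below is a minimal illustration of gated RGB-D fusion under our own assumptions: a learned gate weights the two modalities per channel instead of simply concatenating them. The class and variable names are ours, and this is not the authors' SEGate implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of RGB and depth features (a sketch,
    not the authors' SEGate code): a learned gate decides, per channel,
    how much each modality contributes instead of plain concatenation."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate maps the concatenated features to per-channel weights in [0, 1].
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=-1))
        # Convex combination: g close to 1 favors RGB, g close to 0 favors depth.
        return g * rgb_feat + (1.0 - g) * depth_feat

# Example: fuse 128-dimensional per-point features for 1024 points.
fusion = GatedFusion(dim=128)
fused = fusion(torch.randn(1024, 128), torch.randn(1024, 128))
```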
References
Munaro M, Menegatti E (2014) Fast RGB-D people tracking for service robots. Auton Robot 37(3):227–242
Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, Lepetit V (2011) Gradient response maps for real-time detection of textureless objects. IEEE Trans PAMI 34(5):876–888
Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: Asian conference on computer vision, pp 548–562
Besl PJ, McKay ND (1992) Method for registration of 3-D shapes. In: Sensor fusion IV: control paradigms and data structures, vol 1611, pp 586–606
Drost B, Ulrich M, Navab N, Ilic S (2010) Model globally, match locally: efficient and robust 3D object recognition. In: IEEE computer society conference on computer vision and pattern recognition, pp 998–1005
Papazov C, Burschka D (2010) An efficient ransac for 3d object recognition in noisy and occluded scenes. In: Asian conference on computer vision, pp 135–148
Hinterstoisser S, Lepetit V, Rajkumar N, Konolige K (2016) Going further with point pair features. In: European conference on computer vision, pp 834–848
Kiforenko L, Drost B, Tombari F, Kruger N, Buch AG (2018) A performance evaluation of point pair features. Comput Vis Image Underst 166:66–80
Schnabel R, Wahl R, Klein R (2007) Efficient RANSAC for point-cloud shape detection. Comput Graph Forum 26(2):214–226
Aldoma A, Marton ZC, Tombari F, Wohlkinger W, Potthast C, Zeisl B, Vincze M (2012) Tutorial: point cloud library: three-dimensional object recognition and 6 DoF pose estimation. IEEE Robot Autom Mag 19(3):80–91
Aldoma A, Tombari F, Stefano LD, Vincze M (2012) A global hypotheses verification method for 3d object recognition. In: European conference on computer vision, pp 511–524
Guo Y, Bennamoun M, Sohel F, Lu M, Wan J, Kwok NM (2016) A comprehensive performance evaluation of 3D local feature descriptors. Int J Comput Vis 116(1):66–89
Doumanoglou A, Kouskouridas R, Malassiotis S, Kim TK (2016) Recovering 6D object pose and predicting next-best-view in the crowd. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3583–3592
Tejani A, Kouskouridas R, Doumanoglou A, Tang D, Kim TK (2017) Latent-class hough forests for 6 DoF object pose estimation. IEEE Trans PAMI 40(1):119–132
Brachmann E, Krull A, Michel F, Gumhold S, Shotton J, Rother C (2014) Learning 6d object pose estimation using 3d object coordinates. In: European conference on computer vision, pp 536–551
Brachmann E, Michel F, Krull A, Yang MY, Gumhold S (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3364–3372
Rangaprasad AS (2017) Probabilistic approaches for pose estimation. PhD thesis, Carnegie Mellon University
Rad M, Lepetit V (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: Proceedings of the IEEE international conference on computer vision, pp 3828–3836
Kehl W, Manhardt F, Tombari F, Ilic S, Navab N (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: Proceedings of the IEEE international conference on computer vision, pp 1521–1529
Xiang Y, Schmidt T, Narayanan V, Fox D (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. Preprint arXiv:1711.00199
Li C, Bai J, Hager GD (2018) A unified framework for multi-view multi-class object pose estimation. In: Proceedings of the european conference on computer vision (ECCV), pp 254–269
Wang C, Xu D, Zhu Y, Martín-Martín R, Lu C, Fei-Fei L (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3343–3352
Suwajanakorn S, Snavely N, Tompson JJ, Norouzi M (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In: Advances in neural information processing systems, pp 2059–2070
Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S (2018) Deep object pose estimation for semantic robotic grasping of household objects. Preprint arXiv:1809.10790
Kendall A, Grimes M, Cipolla R (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE international conference on computer vision, pp 2938–2946
Song S, Xiao J (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 808–816
Li C, Lu B, Zhang Y, Liu H, Qu Y (2018) 3D reconstruction of indoor scenes via image registration. Neural Process Lett 48(3):1281–1304
Qi CR, Liu W, Wu C, Su H, Guibas LJ (2018) Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 918–927
Zhou Y, Tuzel O (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4490–4499
Guo D, Li W, Fang X (2017) Capturing temporal structures for video captioning by spatio-temporal contexts and channel attention mechanism. Neural Process Lett 46(1):313–328
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Park J, Woo S, Lee JY, Kweon IS (2018) Bam: bottleneck attention module. Preprint arXiv:1807.06514
Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the european conference on computer vision (ECCV), pp 3–19
Wojek C, Walk S, Roth S, Schiele B (2011) Monocular 3D scene understanding with explicit occlusion reasoning. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1993–2000
Xu Y, Zhou X, Liu P, Xu H (2019) Rapid pedestrian detection based on deep omega-shape features with partial occlusion handling. Neural Process Lett 49(3):923–937
Sanyal R, Ahmed SM, Jaiswal M, Chaudhury KN (2017) A scalable ADMM algorithm for rigid registration. IEEE Signal Process Lett 24(10):1453–1457
Eitel A, Springenberg JT, Spinello L, Riedmiller M, Burgard W (2015) Multimodal deep learning for robust RGB-D object recognition. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 681–687
Wang W, Neumann U (2018) Depth-aware cnn for rgb-d segmentation. In: Proceedings of the european conference on computer vision (ECCV), pp 135–150
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Bell S, Lawrence Zitnick C, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883
Cheng Y, Cai R, Li Z, Zhao X, Huang K (2017) Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3029–3037
Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 652–660
Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in neural information processing systems, pp 5099–5108
Sundermeyer M, Marton ZC, Durner M, Brucker M, Triebel R (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In: Proceedings of the european conference on computer vision (ECCV), pp 699–715
Xu D, Anguelov D, Jain A (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 244–253
Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: International conference on computer vision, pp 858–865
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 61231010.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
The LineMOD dataset contains more than 18,000 real images in 13 video sequences, each of which contains several low-textured objects. It is a classic benchmark that is widely used by both traditional approaches and recent deep learning methods. Figure 7 shows the qualitative results for all remaining objects on the LineMOD dataset.
The YCB-Video dataset consists of 21 objects in 92 RGB-D video sequences, with a total of 133,827 frames containing 3 to 9 objects each. Because many of its objects are heavily occluded, the dataset is often used to evaluate robustness to severe occlusion in object pose estimation. Figure 8 compares pose estimation results on the most heavily occluded objects in the YCB-Video dataset. As shown in Fig. 8, PoseCNN+ICP and DenseFusion perform poorly on the tomato_soup_can, bowl, cracker_box, wood_block, large_clamp and pudding_box objects due to severe occlusion, whereas our approach remains more robust to the pose estimation of heavily occluded objects.
Table 4 lists the results for all 21 objects on the YCB-Video dataset under the ADD metrics. Our SEGate method outperforms DenseFusion by 1.3% on the ADD (< 2 cm) metric and is essentially on par on the ADD (AUC) metric. PoseCNN+ICP performs slightly better on the AUC metric, but it is extremely time-consuming because of the ICP refinement. Note that the MEAN values of ADD (AUC) and ADD (< 2 cm) are, respectively, the area under the accuracy curve over all objects in the dataset and the fraction of all poses in the dataset with an estimation error below 2 cm; they are not the averages of the 21 per-object values in Table 4.
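To make the two metrics concrete, the following minimal sketch shows how ADD-style errors are commonly computed. It follows the usual conventions (distances in meters, AUC thresholds swept up to 10 cm as in PoseCNN) and is our own illustration; the function names are ours, not the authors' evaluation code.

```python
import numpy as np

def add_error(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between model points transformed by the
    ground-truth pose and by the predicted pose.
    model_pts: (N, 3); R_*: (3, 3) rotations; t_*: (3,) translations (m)."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()

def accuracy_below_2cm(errors):
    """ADD (< 2 cm): fraction of test poses whose ADD error is below 2 cm."""
    return (np.asarray(errors) < 0.02).mean()

def add_auc(errors, max_threshold=0.1, steps=1000):
    """ADD (AUC): approximate area under the accuracy-vs-threshold curve,
    with thresholds swept from 0 to 10 cm (the YCB-Video convention)."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    return np.mean([(errors < t).mean() for t in thresholds])
```

Pooling the per-frame errors of all objects before applying these functions, rather than averaging per-object scores, reproduces the MEAN convention described above.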
Cite this article
Sun, S., Liu, R., Du, Q. et al. Selective Embedding with Gated Fusion for 6D Object Pose Estimation. Neural Process Lett 51, 2417–2436 (2020). https://doi.org/10.1007/s11063-020-10198-8