Abstract
Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object’s type, such as “scissors,” and/or visual attributes, such as “red,” thus limiting the robot to known object classes. We develop a model that retrieves objects based on descriptions of their usage. The model takes as input a language command containing a verb, for example “Hand me something to cut,” together with RGB images of candidate objects, and outputs the object that best satisfies the task specified by the verb. Our model directly predicts an object’s appearance from the object’s use specified by a verb phrase, without needing the object’s class label. Based on contextual information present in the language commands, our model can generalize to unseen object classes and to unknown nouns in the commands. Our model correctly selects objects out of sets of five candidates to fulfill natural language commands, achieving a mean reciprocal rank of 77.4% on a held-out test set of unseen ImageNet object classes and 69.1% on unseen object classes and unknown nouns. Our model also achieves a mean reciprocal rank of 71.8% on unseen YCB object classes, which have a different image distribution from ImageNet. We demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language descriptions of their usage (video recordings of the robot demonstrations are available at https://youtu.be/WMAdGhMmXEQ). We also present a new dataset of 655 verb-object pairs denoting object usage, covering 50 verbs and 216 object classes (the dataset and code for the project are available at https://github.com/Thaonguyen3095/affordance-language).
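To make the retrieval setup described above concrete, the following is a minimal, illustrative sketch, not the authors’ released implementation, of ranking candidate object images against a verb phrase in an assumed joint embedding space and scoring the result with mean reciprocal rank (MRR). The cosine-similarity scoring, the 300-dimensional vectors, and all function and variable names are assumptions made for this example; the actual code is available in the project repository linked above.

```python
# Illustrative sketch only: rank candidate object images against a verb phrase
# embedding and compute mean reciprocal rank (MRR). Names and dimensions are
# assumptions, not the paper's implementation.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def rank_candidates(verb_embedding: np.ndarray,
                    image_features: list) -> list:
    """Return candidate indices sorted from best to worst match.

    `verb_embedding` stands in for a language encoding of a command such as
    "Hand me something to cut"; `image_features` stands in for visual encodings
    of the candidate objects, assumed to lie in the same joint space.
    """
    scores = [cosine(verb_embedding, f) for f in image_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(rankings: list, targets: list) -> float:
    """Average of 1 / (1 + position of the correct object) over queries."""
    reciprocal_ranks = [1.0 / (ranking.index(target) + 1)
                        for ranking, target in zip(rankings, targets)]
    return float(np.mean(reciprocal_ranks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    verb = rng.normal(size=300)                             # verb-phrase embedding
    candidates = [rng.normal(size=300) for _ in range(5)]   # five candidate objects
    ranking = rank_candidates(verb, candidates)
    print("ranking:", ranking)
    print("MRR (single query):", mean_reciprocal_rank([ranking], [ranking[0]]))
```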
Notes
We did not use the COCO dataset as our training image set because it contains far fewer object classes than ImageNet, which has 1000 object classes.
Acknowledgements
The authors would like to thank Prof. James Tompkin for advice on selecting the image dataset and encoder, and Eric Rosen for help with video editing. This work is supported by the National Science Foundation under award numbers IIS-1652561 and IIS-1717569, by NASA under award number NNX16AR61G, by Hyundai NGV under the Hyundai-Brown Idea Incubation award, and by the Alfred P. Sloan Foundation.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is one of the several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.
Cite this article
Nguyen, T., Gopalan, N., Patel, R. et al. Affordance-based robot object retrieval. Auton Robot 46, 83–98 (2022). https://doi.org/10.1007/s10514-021-10008-7