
Embodied Reasoning for Discovering Object Properties via Manipulation

[PDF, Video, Source codes]


Jan Kristof Behrens (1), Michal Nazarczuk (2), Karla Stepanova (3,1), Matej Hoffmann (3), Yiannis Demiris (2), and Krystian Mikolajczyk (2)

(1) Czech Institute of Informatics, Robotics, and Cybernetics, CTU in Prague, Czech Republic, jan.kristof.behrens@cvut.cz, karla.stepanova@cvut.cz, (2) Department of Electrical and Electronic Engineering, Imperial College London, London, UK, michal.nazarczuk17@imperial.ac.uk, y.demiris@imperial.ac.uk, k.mikolajczyk@imperial.ac.uk, (3) Department of Cybernetics, Faculty of Electrical Engineering, CTU in Prague, matej.hoffmann@fel.cvut.cz


In this paper we present an integrated system that combines reasoning from visual and natural language inputs, action and motion planning, and execution by a robotic arm that manipulates objects and discovers their properties. The vision-to-action module recognises the scene with its objects and their attributes and analyses enquiries formulated in natural language. It performs multi-modal reasoning and generates a sequence of simple actions that can be executed by the embodied agent. The scene model and action sequence are sent to the planning and execution module, which generates a motion plan with collision avoidance, simulates the actions, and executes them on the embodied agent. We extensively use simulated data to train various components of the system, which makes them more robust to changes in the real environment and helps them generalise better. We focus on a tabletop scene with objects that can be grasped by our embodied agent, a 7-DoF manipulator with a two-finger gripper. We evaluate the agent on 60 representative queries, each repeated 3 times (e.g., 'Check what is on the other side of the soda can'), concerning different objects and tasks in the scene. We perform experiments in simulated and real environments and report the success rate for the various components of the system. Our system achieves up to an 80.6% success rate on challenging scenes and queries. We also analyse and discuss the challenges that such an intelligent embodied system faces.

Video attachment

The video shows the individual components of the system presented in the article. Most importantly, it describes the system architecture and contains illustrative clips of the execution of individual action sequences. The main focus is on actions where the real setup brings additional challenges to the execution and affects the outcome of these actions. In addition, we also visualise example failure cases.

Here is a link to the video.

Additional resources

Submitted manuscript [PDF]

Source codes

  • [GitHub with source codes]: Source code for execution of individual primitive robotic actions.
  • Scene 1, set of questions and corresponding generated action sequences by V2A:
    • Scene 1 - questions.json
    • Scene 1 - scenes.json
    • Scene 1 - visualisation of the scene

Related articles

  • Manuscript published at ICRA 2020: M. Nazarczuk and K. Mikolajczyk, "SHOP-VRB: A Visual Reasoning Benchmark for Object Perception", in International Conference on Robotics and Automation (ICRA), 2020. [GitHub of the dataset]
  • Manuscript published at ACCV 2020: M. Nazarczuk and K. Mikolajczyk, "V2A - Vision to Action: Learning robotic arm actions based on vision and language", in Asian Conference on Computer Vision (ACCV), 2020. [GitHub of the code (under preparation)]
  • Manuscript published at RA-L and ICRA 2019: J. K. Behrens, K. Stepanova, R. Lange, and R. Skoviera, "Specifying Dual-Arm Robot Planning Problems Through Natural Language and Demonstration", webpage including link to the manuscript and source codes: imitrob.ciirc.cvut.cz/planning.html
  • Manuscript published at ICRA 2019: J. K. Behrens, R. Lange, and M. Mansouri, "A constraint programming approach to simultaneous task allocation and motion scheduling for industrial dual-arm manipulation tasks" [PDF]. The code and the setup details: https://github.com/boschresearch/STAAMS-SOLVER

Details about the simulation setup

The state of the objects in the MuJoCo environment is determined by the physical simulation, taking into account the contact states, the resulting forces, the object geometry, and the mass properties of the objects. However, the object models must be adapted for physical simulation. Directly using the high-resolution meshes from Blender does not result in a stable simulation: the meshes are too complex (increasing the cost of collision checking) and do not consist of closed volumes, which prevents the calculation of mass properties such as the moment of inertia.

MuJoCo calculates a contact point for each pair of rigid bodies in contact. If an object consists of only a single rigid body, only a single contact point can be generated with the resting surface. Because a point contact removes 3 DoF from the object, at least two contact points are required to immobilise an object on a resting surface.

Therefore, we break down the complex meshes into convex bodies using V-HACD (Mamou and Ghorbel, 2009) and organise them as a URDF. The object URDF holds the original mesh for visualisation, the mass properties, and the convex decomposition bodies as collision geometry (joined with fixed joints). We estimate the mass properties of objects with thin walls using voxelization and convex decomposition. The mesh manipulations are executed using the trimesh library.
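The resulting per-object URDF might look like the following sketch. The object name, link name, mesh paths, and all numeric values are illustrative placeholders, not taken from the actual dataset; the point is the structure: the original mesh as visual geometry, an inertial block from the estimated mass properties, and one collision element per convex part produced by V-HACD (here attached to a single link as a compact alternative to joining separate links with fixed joints).

```xml
<robot name="soda_can">
  <link name="object">
    <!-- High-resolution mesh used only for rendering -->
    <visual>
      <geometry><mesh filename="meshes/soda_can_visual.obj"/></geometry>
    </visual>
    <!-- Mass properties estimated via voxelization (placeholder values) -->
    <inertial>
      <mass value="0.35"/>
      <inertia ixx="2.1e-4" iyy="2.1e-4" izz="1.3e-4"
               ixy="0" ixz="0" iyz="0"/>
    </inertial>
    <!-- One collision body per convex part from V-HACD -->
    <collision>
      <geometry><mesh filename="meshes/soda_can_vhacd_000.obj"/></geometry>
    </collision>
    <collision>
      <geometry><mesh filename="meshes/soda_can_vhacd_001.obj"/></geometry>
    </collision>
  </link>
</robot>
```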
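The voxelization-based mass-property estimation can be illustrated with a minimal sketch. This is not the exact pipeline from the paper (which uses trimesh); it assumes uniform density and approximates each filled voxel of the solid as a point mass at its centre. The function name `voxel_mass_properties` and the example geometry are our own.

```python
import numpy as np

def voxel_mass_properties(centers, voxel_size, density):
    """Approximate mass, centre of mass, and inertia tensor of a solid
    given the centres of its filled voxels, assuming uniform density.
    Each voxel is treated as a point mass located at its centre."""
    centers = np.asarray(centers, dtype=float)
    m_voxel = density * voxel_size ** 3          # mass of one voxel
    mass = m_voxel * len(centers)
    com = centers.mean(axis=0)                   # centre of mass
    r = centers - com
    # Inertia tensor about the COM: sum over voxels of m*(|r|^2 I - r r^T)
    inertia = np.zeros((3, 3))
    for p in r:
        inertia += m_voxel * (np.dot(p, p) * np.eye(3) - np.outer(p, p))
    return mass, com, inertia

# Example: a 2x2x2 block of unit voxels with unit density
grid = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)])
mass, com, inertia = voxel_mass_properties(grid, voxel_size=1.0, density=1.0)
```

For a thin-walled object such as a mug, the voxel grid of the watertight, filled shell would be passed in instead of the raw surface mesh, which is why the meshes must first be made into closed volumes.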