The goal of this thesis was to enable a robot to “see” an object, identify it, and figure out the best way to grab it. My focus was on improving interactions between humans and robots, specifically using soft robotic hands for assistive applications. It was performed under the supervision of Professor Xinyu Liu.
The hope for this thesis was to improve people's lives through robotic interaction. Using a soft robotic hand opens up applications in areas that require gentle and precise handling, a crucial aspect of assistive devices.
Though seemingly a simple process, this involved neural-network-driven image segmentation and 3D data classification, with a touch of inverse kinematics. I approached object recognition with Mask R-CNN, which allowed me to build a 3D model of the object, and performed decision-making with a 3D CNN.
The images were captured using the Intel RealSense LiDAR L515 camera, a depth-sensing device that combines an RGB camera with a LiDAR sensor to produce high-resolution, accurate depth maps. For the masking CNN, only the 2D RGB images were used.
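As a rough illustration, frames like these can be captured with Intel's pyrealsense2 SDK along the lines below; the stream resolutions and frame rates shown are assumptions rather than the exact settings used in the thesis.

```python
# Minimal sketch of capturing aligned RGB and depth frames with pyrealsense2.
# The resolutions and frame rates below are assumptions, not the thesis settings.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)   # map depth pixels onto the color frame
try:
    frames = align.process(pipeline.wait_for_frames())
    depth_image = np.asanyarray(frames.get_depth_frame().get_data())  # raw uint16 depth units
    color_image = np.asanyarray(frames.get_color_frame().get_data())  # BGR, uint8
finally:
    pipeline.stop()
```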
The dataset used in this study consists of ten different objects: an apple, a banana, a Pringles can, Cheez-It crackers, a coffee cup, a Coke can, a chocolate jello box, a strawberry, a pear, and a mustard bottle. These objects were chosen to represent a diverse range of shapes, sizes, and textures, enabling the evaluation of the proposed framework's performance in various real-world scenarios.
In order to train the neural network on our objects, each of the 5000 training images had to be labelled. Each label consisted of a set of points: the vertices of a polygon outlining the object in the image.
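For illustration, a polygon label of this kind can be rasterized into a binary training mask roughly as follows; the label structure and the example vertices are simplified assumptions, not the actual annotation format.

```python
# Sketch: rasterizing a polygon label (list of vertices) into a binary training
# mask with OpenCV. The label structure shown here is a simplified assumption.
import numpy as np
import cv2

def polygon_to_mask(vertices, height, width):
    """vertices: list of (x, y) points outlining the object."""
    mask = np.zeros((height, width), dtype=np.uint8)
    pts = np.array(vertices, dtype=np.int32).reshape((-1, 1, 2))
    cv2.fillPoly(mask, [pts], 1)          # fill the polygon interior with 1s
    return mask

# e.g. a rough outline of an apple in a 480x640 image (hypothetical values)
apple_mask = polygon_to_mask([(300, 200), (360, 210), (370, 280), (310, 290)], 480, 640)
```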
Mask R-CNN is like a detailed eye for the computer, helping it not just see objects in images but also understand their shapes clearly. This tool was vital for my thesis as it ensured the robot could identify and accurately outline objects, a critical step for precise grasping.
When training the model on the above dataset, special attention was given to ensuring a diverse range of object shapes and sizes to mimic real-world scenarios. The training process involved tuning parameters to balance detection accuracy and computational efficiency, making the model suitable for real-time applications.
The output of Mask R-CNN includes bounding boxes, class labels, and masks for each detected object. I used this information to create a detailed spatial map, enabling the robotic hand to understand object shapes and positions.
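As a sketch of what this output looks like, the snippet below reads boxes, labels, and masks from a forward pass of torchvision's Mask R-CNN implementation; the pretrained weights, the confidence threshold, and the placeholder image are stand-ins, not the thesis's trained ten-class model.

```python
# Sketch: reading boxes, labels, and masks from a Mask R-CNN forward pass.
# torchvision's pretrained model is used as a stand-in; the thesis model was
# trained on the ten-object dataset, so classes and weights would differ.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # placeholder RGB image in [0, 1]
with torch.no_grad():
    output = model([image])[0]            # one dict per input image

keep = output["scores"] > 0.7             # confidence threshold (assumed value)
boxes = output["boxes"][keep]             # (N, 4) corner coordinates
labels = output["labels"][keep]           # (N,) class indices
masks = output["masks"][keep] > 0.5       # (N, 1, H, W) binary masks
```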
To isolate the fingers and the object, I also computed inverse masks that outline everything except the object within a circular area scaled to the object's size. These three image set-ups allowed the 3D CNN to also analyze finger placement for more accurate grasping.
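A minimal sketch of how such an inverse mask could be computed is shown below; the centroid-based circle and its scale factor are assumed heuristics, not the exact method from the thesis.

```python
# Sketch: an "inverse" mask keeping everything except the object inside a
# circular region around it. The centroid/radius heuristic is an assumption.
import numpy as np
import cv2

def inverse_mask(object_mask, scale=1.5):
    ys, xs = np.nonzero(object_mask)
    cx, cy = int(xs.mean()), int(ys.mean())                  # object centroid
    extent = max(xs.max() - xs.min(), ys.max() - ys.min())   # object size in pixels
    radius = int(scale * extent / 2)
    circle = np.zeros(object_mask.shape, dtype=np.uint8)
    cv2.circle(circle, (cx, cy), radius, 1, -1)              # filled circle
    return np.logical_and(circle > 0, object_mask == 0)      # circle minus object
```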
After obtaining detailed object outlines from the object detection process, I transformed these 2D images into 3D point clouds. This involved mapping each pixel in the object's mask to its corresponding spatial coordinates using depth information from the LiDAR camera.
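This back-projection follows the standard pinhole camera model, sketched below; the intrinsic parameters (fx, fy, cx, cy) come from the camera's calibration, and the function shown is an illustrative helper rather than the thesis code.

```python
# Sketch: back-projecting masked depth pixels into 3D points with the pinhole
# camera model. fx, fy, cx, cy are the camera intrinsics.
import numpy as np

def depth_to_points(depth_m, mask, fx, fy, cx, cy):
    """depth_m: depth image in meters; mask: binary object mask."""
    vs, us = np.nonzero(mask)            # pixel coordinates inside the mask
    z = depth_m[vs, us]
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) point cloud in the camera frame
```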
Point clouds can be dense and noisy, making them challenging to work with directly. To address this, I applied filtering techniques to remove noise and reduce point density while preserving the essential shape and size of the objects.
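With a library like Open3D, this cleanup could look roughly like the sketch below; the neighbor count, standard-deviation ratio, and voxel size are assumed values, not the settings used in the thesis.

```python
# Sketch: denoising and thinning a point cloud with Open3D. The neighbor count,
# std ratio, and voxel size are assumed values, not the thesis settings.
import numpy as np
import open3d as o3d

points = np.random.rand(2000, 3)                 # stand-in for a real object cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)

pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
pcd = pcd.voxel_down_sample(voxel_size=0.005)    # keep roughly one point per 5 mm
filtered = np.asarray(pcd.points)
```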
Using raw point clouds for the 3D CNN proved computationally expensive. To optimize this, I converted the point clouds into voxel grids, a 3D equivalent of pixels. This representation significantly reduced the computational load, making it feasible to train the 3D CNN.
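A minimal sketch of such a point-cloud-to-voxel conversion is shown below; the 32x32x32 occupancy grid resolution is an assumption.

```python
# Sketch: converting a point cloud into a fixed-size occupancy voxel grid,
# the input format for the 3D CNN. The 32x32x32 resolution is an assumption.
import numpy as np

def voxelize(points, grid_size=32):
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    scale = (grid_size - 1) / extent.max()            # preserve the aspect ratio
    idx = ((points - mins) * scale).astype(int)
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0       # mark occupied cells
    return grid
```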
Convolutional Neural Networks (CNNs) are great at processing spatial data, which makes them perfect for interpreting the 3D voxels of the objects. Unlike the 2D CNN used for object detection, which only analyzes images by height and width, a 3D CNN also considers depth, allowing for a more accurate understanding of objects in three-dimensional space.
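To make the idea concrete, a small 3D CNN over occupancy grids might look like the PyTorch sketch below; the layer sizes and the number of output classes are illustrative assumptions, not the exact architecture used in the thesis.

```python
# Sketch of a small 3D CNN over 32x32x32 occupancy grids in PyTorch. Layer
# sizes and the number of grasp classes are illustrative assumptions.
import torch
import torch.nn as nn

class GraspNet3D(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 4 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                 # x: (batch, 1, 32, 32, 32)
        return self.classifier(self.features(x))

logits = GraspNet3D()(torch.zeros(1, 1, 32, 32, 32))
```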
With the 3D CNN trained and performing at 93% accuracy, I was able to complete the perception pipeline for the thesis. When running, the perception algorithm outputs the location of the object along with the ideal grasping direction and mode, given the orientation in which the object is being held.
For the grasping tasks, I used a UR5 robotic arm. The arm’s setup, including its controller and sensor arrangement, is designed for precision and flexibility in various grasping scenarios.
The robotic arm's movements are based on inverse kinematics calculations. Starting with the desired position and orientation for grasping, determined by the 3D CNN output, I calculated a transformation matrix that guides the hand to the right spot in 3D space, aligning it for the grasp.
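Assembling that homogeneous transform from a rotation (the grasp direction) and a translation (the object position) is sketched below; the example values are placeholders.

```python
# Sketch: building the 4x4 homogeneous transform for the target grasp pose
# from a rotation matrix and a translation vector. Example values are placeholders.
import numpy as np

def grasp_transform(rotation, position):
    """rotation: (3, 3) matrix; position: (3,) vector in the robot base frame."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

T_target = grasp_transform(np.eye(3), np.array([0.4, 0.1, 0.25]))
```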
The calculated transformation matrix yields a set of possible joint configurations for the arm. I then select the best configuration, ensuring it respects the UR5's joint limits for safe operation. If there is no suitable configuration, the system flags it, preventing unsafe movements.
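A sketch of this selection step is shown below; the candidate configurations are assumed to come from an inverse kinematics solver, and the tie-breaking rule (picking the solution closest to the current joint angles) is an assumption.

```python
# Sketch: choosing a joint configuration that respects the UR5's joint limits.
# Candidates are assumed to come from an IK solver; the "closest to current
# joints" tie-breaking rule is an assumed heuristic.
import numpy as np

UR5_JOINT_LIMITS = np.radians(np.array([[-360.0, 360.0]] * 6))   # +/- 360 deg per joint

def select_configuration(candidates, current_joints):
    valid = [q for q in candidates
             if np.all((q >= UR5_JOINT_LIMITS[:, 0]) & (q <= UR5_JOINT_LIMITS[:, 1]))]
    if not valid:
        return None   # no safe configuration: the caller flags this and aborts
    return min(valid, key=lambda q: np.linalg.norm(q - current_joints))
```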
Different grasping modes — pinch, grasp, and hold — are programmed to handle various objects, from small and delicate to large and irregular. Each mode adjusts the motor angles of the robotic hand for optimal contact with the object. Pinch mode is for precision, grasp mode for minimal contact, and hold mode for maximum grip. This adaptability ensures the robot can efficiently handle a wide range of objects.
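As a rough illustration, each mode can be represented as a preset of motor angles for the soft hand; the motor count and angle values below are hypothetical placeholders, not the hand's actual calibration.

```python
# Sketch: each grasping mode as a preset of motor angles (degrees) for the soft
# hand. The motor count and the angle values are hypothetical placeholders.
GRASP_MODES = {
    "pinch": [80, 80, 0, 0],     # precision grip with two opposing fingers
    "grasp": [45, 45, 45, 45],   # light wrap with minimal contact
    "hold":  [90, 90, 90, 90],   # full closure for maximum grip
}

def hand_command(mode):
    return GRASP_MODES[mode]     # angles would be sent to the hand's motor controller
```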