The goal of this thesis was to enable a robot to “see” an object, identify it, and figure out the best way to grab it. My focus was on improving interactions between humans and robots, specifically using soft robotic hands for assistive applications. It was performed under the supervision of Professor Xinyu Liu.
The hope for this thesis was to improve people's lives through robotic interaction. Using a soft robotic hand opens up applications in areas that require gentle and precise handling, a crucial aspect of assistive devices.
Though seemingly a simple process, this involved neural-network-driven image segmentation and 3D data classification, with a touch of inverse kinematics. I approached object recognition with Mask R-CNN, which allowed me to build a 3D model of the object, and performed decision-making with a 3D CNN.
The images were captured using the Intel RealSense LiDAR L515 camera, a depth-sensing device that combines an RGB camera with a LiDAR sensor to produce high-resolution, accurate depth maps. For the masking CNN, only the 2D RGB images were used.
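As a rough illustration, frames like these can be captured with Intel's pyrealsense2 SDK along the lines below; the stream resolutions and frame rates shown are assumptions rather than the exact settings used in the thesis.

```python
# Minimal sketch of capturing aligned RGB and depth frames with pyrealsense2.
# The resolutions and frame rates below are assumptions, not the thesis settings.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)   # map depth pixels onto the color frame
try:
    frames = align.process(pipeline.wait_for_frames())
    depth_image = np.asanyarray(frames.get_depth_frame().get_data())  # raw uint16 depth units
    color_image = np.asanyarray(frames.get_color_frame().get_data())  # BGR, uint8
finally:
    pipeline.stop()
```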
The dataset used in this study consists of ten different objects: an apple, a banana, a Pringles can, Cheez-It crackers, a coffee cup, a Coke can, a chocolate jello box, a strawberry, a pear, and a mustard bottle. These objects were chosen to represent a diverse range of shapes, sizes, and textures, enabling the evaluation of the proposed framework's performance in various real-world scenarios.
In order to train the neural network on our objects, each of the 5000 training images had to be labelled. Each label consisted of a set of points: the vertices of a polygon outlining the object in the image.
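For illustration, a polygon label of this kind can be rasterized into a binary training mask roughly as follows; the label structure and the example vertices are simplified assumptions, not the actual annotation format.

```python
# Sketch: rasterizing a polygon label (list of vertices) into a binary training
# mask with OpenCV. The label structure shown here is a simplified assumption.
import numpy as np
import cv2

def polygon_to_mask(vertices, height, width):
    """vertices: list of (x, y) points outlining the object."""
    mask = np.zeros((height, width), dtype=np.uint8)
    pts = np.array(vertices, dtype=np.int32).reshape((-1, 1, 2))
    cv2.fillPoly(mask, [pts], 1)          # fill the polygon interior with 1s
    return mask

# e.g. a rough outline of an apple in a 480x640 image (hypothetical values)
apple_mask = polygon_to_mask([(300, 200), (360, 210), (370, 280), (310, 290)], 480, 640)
```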
Mask R-CNN is like a detailed eye for the computer, helping it not just see objects in images but also understand their shapes clearly. This tool was vital for my thesis as it ensured the robot could identify and accurately outline objects, a critical step for precise grasping.
When training the model on the above dataset, special attention was given to ensuring a diverse range of object shapes and sizes to mimic real-world scenarios. The training process involved tuning parameters to balance detection accuracy and computational efficiency, making the model suitable for real-time applications.
The output of Mask R-CNN includes bounding boxes, class labels, and masks for each detected object. I used this information to create a detailed spatial map, enabling the robotic hand to understand object shapes and positions.
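As a sketch of what this output looks like, the snippet below reads boxes, labels, and masks from a forward pass of torchvision's Mask R-CNN implementation; the pretrained weights, the confidence threshold, and the placeholder image are stand-ins, not the thesis's trained ten-class model.

```python
# Sketch: reading boxes, labels, and masks from a Mask R-CNN forward pass.
# torchvision's pretrained model is used as a stand-in; the thesis model was
# trained on the ten-object dataset, so classes and weights would differ.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # placeholder RGB image in [0, 1]
with torch.no_grad():
    output = model([image])[0]            # one dict per input image

keep = output["scores"] > 0.7             # confidence threshold (assumed value)
boxes = output["boxes"][keep]             # (N, 4) corner coordinates
labels = output["labels"][keep]           # (N,) class indices
masks = output["masks"][keep] > 0.5       # (N, 1, H, W) binary masks
```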
To isolate the fingers and the object, I also computed inverse masks that outline everything except the object within a circular area scaled to the object's size. These three image set-ups allowed the 3D CNN to also analyze finger placement for more accurate grasping.
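A minimal sketch of how such an inverse mask could be computed is shown below; the centroid-based circle and its scale factor are assumed heuristics, not the exact method from the thesis.

```python
# Sketch: an "inverse" mask keeping everything except the object inside a
# circular region around it. The centroid/radius heuristic is an assumption.
import numpy as np
import cv2

def inverse_mask(object_mask, scale=1.5):
    ys, xs = np.nonzero(object_mask)
    cx, cy = int(xs.mean()), int(ys.mean())                  # object centroid
    extent = max(xs.max() - xs.min(), ys.max() - ys.min())   # object size in pixels
    radius = int(scale * extent / 2)
    circle = np.zeros(object_mask.shape, dtype=np.uint8)
    cv2.circle(circle, (cx, cy), radius, 1, -1)              # filled circle
    return np.logical_and(circle > 0, object_mask == 0)      # circle minus object
```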
After obtaining detailed object outlines from the object detection process, I transformed these 2D images into 3D point clouds. This involved mapping each pixel in the object's mask to its corresponding spatial coordinates using depth information from the LiDAR camera.
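This back-projection follows the standard pinhole camera model, sketched below; the intrinsic parameters (fx, fy, cx, cy) come from the camera's calibration, and the function shown is an illustrative helper rather than the thesis code.

```python
# Sketch: back-projecting masked depth pixels into 3D points with the pinhole
# camera model. fx, fy, cx, cy are the camera intrinsics.
import numpy as np

def depth_to_points(depth_m, mask, fx, fy, cx, cy):
    """depth_m: depth image in meters; mask: binary object mask."""
    vs, us = np.nonzero(mask)            # pixel coordinates inside the mask
    z = depth_m[vs, us]
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) point cloud in the camera frame
```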
Point clouds can be dense and noisy, making them challenging to work with directly. To address this, I applied filtering techniques to remove noise and reduce point density while preserving the essential shape and size of the objects.
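With a library like Open3D, this cleanup could look roughly like the sketch below; the neighbor count, standard-deviation ratio, and voxel size are assumed values, not the settings used in the thesis.

```python
# Sketch: denoising and thinning a point cloud with Open3D. The neighbor count,
# std ratio, and voxel size are assumed values, not the thesis settings.
import numpy as np
import open3d as o3d

points = np.random.rand(2000, 3)                 # stand-in for a real object cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)

pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
pcd = pcd.voxel_down_sample(voxel_size=0.005)    # keep roughly one point per 5 mm
filtered = np.asarray(pcd.points)
```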
Using raw point clouds for the 3D CNN proved computationally expensive. To optimize this, I converted the point clouds into voxel grids, a 3D equivalent of pixels. This representation significantly reduced the computational load, making it feasible to train the 3D CNN.
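A minimal sketch of such a point-cloud-to-voxel conversion is shown below; the 32x32x32 occupancy grid resolution is an assumption.

```python
# Sketch: converting a point cloud into a fixed-size occupancy voxel grid,
# the input format for the 3D CNN. The 32x32x32 resolution is an assumption.
import numpy as np

def voxelize(points, grid_size=32):
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    scale = (grid_size - 1) / extent.max()            # preserve the aspect ratio
    idx = ((points - mins) * scale).astype(int)
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0       # mark occupied cells
    return grid
```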
Convolutional Neural Networks (CNNs) are great at processing spatial data, which makes them perfect for interpreting the 3D voxels of the objects. Unlike the 2D CNN used for object detection, which only analyzes images by height and width, a 3D CNN also considers depth, allowing for a more accurate understanding of objects in three-dimensional space.
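To make the idea concrete, a small 3D CNN over occupancy grids might look like the PyTorch sketch below; the layer sizes and the number of output classes are illustrative assumptions, not the exact architecture used in the thesis.

```python
# Sketch of a small 3D CNN over 32x32x32 occupancy grids in PyTorch. Layer
# sizes and the number of grasp classes are illustrative assumptions.
import torch
import torch.nn as nn

class GraspNet3D(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 4 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                 # x: (batch, 1, 32, 32, 32)
        return self.classifier(self.features(x))

logits = GraspNet3D()(torch.zeros(1, 1, 32, 32, 32))
```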
With the 3D CNN trained and performing at 93% accuracy, I was able to complete the perception pipeline for the thesis. When running, the perception algorithm outputs the location of the object along with the ideal grasping direction and mode, given the orientation in which the object is being held.
For the grasping tasks, I used a UR5 robotic arm. The arm’s setup, including its controller and sensor arrangement, is designed for precision and flexibility in various grasping scenarios.
The robotic arm's movements are based on inverse kinematics calculations. Starting with the desired position and orientation for grasping, determined by the 3D CNN output, I calculated a transformation matrix that guides the hand to the right spot in 3D space, aligning it for the grasp.
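Assembling that homogeneous transform from a rotation (the grasp direction) and a translation (the object position) is sketched below; the example values are placeholders.

```python
# Sketch: building the 4x4 homogeneous transform for the target grasp pose
# from a rotation matrix and a translation vector. Example values are placeholders.
import numpy as np

def grasp_transform(rotation, position):
    """rotation: (3, 3) matrix; position: (3,) vector in the robot base frame."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

T_target = grasp_transform(np.eye(3), np.array([0.4, 0.1, 0.25]))
```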
The calculated transformation matrix yields a set of possible joint configurations for the arm. I then select the best configuration, ensuring it respects the UR5's joint limits for safe operation. If there is no suitable configuration, the system flags it, preventing unsafe movements.
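A sketch of this selection step is shown below; the candidate configurations are assumed to come from an inverse kinematics solver, and the tie-breaking rule (picking the solution closest to the current joint angles) is an assumption.

```python
# Sketch: choosing a joint configuration that respects the UR5's joint limits.
# Candidates are assumed to come from an IK solver; the "closest to current
# joints" tie-breaking rule is an assumed heuristic.
import numpy as np

UR5_JOINT_LIMITS = np.radians(np.array([[-360.0, 360.0]] * 6))   # +/- 360 deg per joint

def select_configuration(candidates, current_joints):
    valid = [q for q in candidates
             if np.all((q >= UR5_JOINT_LIMITS[:, 0]) & (q <= UR5_JOINT_LIMITS[:, 1]))]
    if not valid:
        return None   # no safe configuration: the caller flags this and aborts
    return min(valid, key=lambda q: np.linalg.norm(q - current_joints))
```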
Different grasping modes — pinch, grasp, and hold — are programmed to handle various objects, from small and delicate to large and irregular. Each mode adjusts the motor angles of the robotic hand for optimal contact with the object. Pinch mode is for precision, grasp mode for minimal contact, and hold mode for maximum grip. This adaptability ensures the robot can efficiently handle a wide range of objects.
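As a rough illustration, each mode can be represented as a preset of motor angles for the soft hand; the motor count and angle values below are hypothetical placeholders, not the hand's actual calibration.

```python
# Sketch: each grasping mode as a preset of motor angles (degrees) for the soft
# hand. The motor count and the angle values are hypothetical placeholders.
GRASP_MODES = {
    "pinch": [80, 80, 0, 0],     # precision grip with two opposing fingers
    "grasp": [45, 45, 45, 45],   # light wrap with minimal contact
    "hold":  [90, 90, 90, 90],   # full closure for maximum grip
}

def hand_command(mode):
    return GRASP_MODES[mode]     # angles would be sent to the hand's motor controller
```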