Kinect Registration    

Problem Description                    

Two Kinects, Kinect 1 and Kinect 2, related by an unknown rotation R and translation T

The Microsoft Kinect is an RGBD camera that ships as a sensor with the Xbox 360. It effectively provides us a coloured point cloud of 3D points. This project aimed at combining data from multiple Kinects, that is, at registering the point clouds obtained from them.
This requires that we estimate the rotation (R) and translation (T) that relate the coordinate frames of the individual Kinects.


Why do this?

We have identified three benefits. The Kinect is definitely mobile in the third case, but not so in the first two; however, the third case has the added benefit of significant overlap between consecutive frames. A mobile Kinect demands that the R & T be estimated in real time. In this project, we have assumed stationary Kinects, so the R & T estimation is a one-time initialization and need not run in real time.

Past Research

Point cloud registration has been studied in the context of matching laser range scans.
Iterative Closest Point
A popular algorithm in this field is the Iterative Closest Point (ICP) algorithm, along with its many variants. ICP, however, requires large overlap between the point clouds.
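For reference, a minimal ICP sketch using the Point Cloud Library (PCL, which we use elsewhere in this project); the cloud names, correspondence distance, and iteration count are illustrative, not tuned values:

    #include <pcl/point_types.h>
    #include <pcl/point_cloud.h>
    #include <pcl/registration/icp.h>

    // Aligns `source` onto `target` and returns the estimated rigid transform.
    // Assumes the two clouds already overlap heavily.
    Eigen::Matrix4f alignWithICP(pcl::PointCloud<pcl::PointXYZ>::Ptr source,
                                 pcl::PointCloud<pcl::PointXYZ>::Ptr target)
    {
        pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
        icp.setInputSource(source);            // PCL >= 1.7 naming
        icp.setInputTarget(target);
        icp.setMaxCorrespondenceDistance(0.1); // 10 cm, illustrative
        icp.setMaximumIterations(50);
        pcl::PointCloud<pcl::PointXYZ> aligned;
        icp.align(aligned);                    // iterates closest-point matching + SVD
        return icp.getFinalTransformation();
    }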
Conditional Random Fields
F. Ramos, D. Fox, and H. Durrant-Whyte. CRF-matching: Conditional random fields for feature-based scan matching. In Proc. of Robotics: Science and Systems (RSS), 2007.
The Conditional Random Fields based approach is, however, computationally expensive and not practical for point clouds with ~160k points. This makes its application to Kinect registration very difficult.
Image feature matching
P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D Mapping: Using depth cameras for dense 3D modeling of indoor environments. In RSS Workshop on Advanced Reasoning with Depth Cameras, 2010.
Given two point clouds and the corresponding 2D images, one can find image features such as SIFT and match them in pixel space itself. Using a sample consensus approach to choose good matches and propagating these to matches in the point cloud lets us compute the R & T required for the registration.
FPFH and PFH feature matching
R. B. Rusu, N. Blodow, and M. Beetz. Fast Point Feature Histograms (FPFH) for 3D registration. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2009.
Computing FPFH or PFH features in the point cloud itself and matching them also yields the required rigid transformation.
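As a sketch of what that pipeline looks like in PCL (the search radii below are illustrative, not the paper's parameters; PCL >= 1.7 naming assumed):

    #include <pcl/point_types.h>
    #include <pcl/features/normal_3d.h>
    #include <pcl/features/fpfh.h>
    #include <pcl/registration/ia_ransac.h>

    using CloudT = pcl::PointCloud<pcl::PointXYZ>;

    // Compute FPFH descriptors for a cloud (normals are a prerequisite).
    pcl::PointCloud<pcl::FPFHSignature33>::Ptr computeFPFH(CloudT::Ptr cloud)
    {
        pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
        pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
        ne.setInputCloud(cloud);
        ne.setRadiusSearch(0.03);              // 3 cm neighborhood, illustrative
        ne.compute(*normals);

        pcl::PointCloud<pcl::FPFHSignature33>::Ptr features(
            new pcl::PointCloud<pcl::FPFHSignature33>);
        pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
        fpfh.setInputCloud(cloud);
        fpfh.setInputNormals(normals);
        fpfh.setRadiusSearch(0.05);            // must exceed the normal radius
        fpfh.compute(*features);
        return features;
    }

    // Match descriptors and estimate the rigid transform by sample consensus.
    Eigen::Matrix4f alignWithFPFH(CloudT::Ptr src, CloudT::Ptr tgt)
    {
        pcl::SampleConsensusInitialAlignment<pcl::PointXYZ, pcl::PointXYZ,
                                             pcl::FPFHSignature33> sac;
        sac.setInputSource(src);
        sac.setSourceFeatures(computeFPFH(src));
        sac.setInputTarget(tgt);
        sac.setTargetFeatures(computeFPFH(tgt));
        CloudT aligned;
        sac.align(aligned);
        return sac.getFinalTransformation();
    }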

Approaches Used

Calibration target - final result

Chessboard corners detected by OpenCV

Calibration Target
Though the initial code base could handle more than one Kinect, we tested with only two at a time. In the calibration target based approach, an asymmetric chessboard was placed such that it was visible to both Kinects. OpenCV's chessboard detection code was used to detect it in the image data available from each Kinect. Since the chessboard was asymmetric, the order in which the chessboard points were detected was always the same. This immediately gives us one-to-one correspondences between image pixels. The point cloud streamed by the OpenNI Kinect driver has a one-to-one correspondence with image pixels, with NaN values filling the points where the Kinect failed to estimate the depth. The pixel correspondences across the two Kinects can therefore be easily propagated to correspondences between 3D points. With a sufficient number of correspondences we can compute the required rigid transformation.
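A condensed sketch of this pipeline, assuming OpenCV for the detection, an organized pcl::PointXYZRGB cloud from OpenNI, and Eigen's Umeyama solver for the SVD-based fit; the function names are our own, not from the original code base:

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <pcl/point_types.h>
    #include <pcl/point_cloud.h>
    #include <Eigen/Geometry>
    #include <vector>

    // Detect chessboard corners in one Kinect's RGB image and look up the
    // corresponding 3D points in its organized point cloud.
    bool cornersTo3D(const cv::Mat& rgb,
                     const pcl::PointCloud<pcl::PointXYZRGB>& cloud,
                     cv::Size board, std::vector<Eigen::Vector3f>& pts)
    {
        std::vector<cv::Point2f> corners;
        cv::Mat gray;
        cv::cvtColor(rgb, gray, cv::COLOR_BGR2GRAY);
        if (!cv::findChessboardCorners(gray, board, corners)) return false;
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                          30, 0.01));
        for (const auto& c : corners) {
            // Organized cloud: pixel (u, v) maps directly to cloud(u, v).
            const auto& p = cloud(static_cast<int>(c.x), static_cast<int>(c.y));
            if (!std::isfinite(p.z)) return false;   // depth missing at this corner
            pts.emplace_back(p.x, p.y, p.z);
        }
        return true;
    }

    // Rigid transform mapping cloud A's corners onto cloud B's, via Umeyama/SVD.
    Eigen::Matrix4f rigidFromCorrespondences(const std::vector<Eigen::Vector3f>& a,
                                             const std::vector<Eigen::Vector3f>& b)
    {
        Eigen::Matrix3Xf A(3, a.size()), B(3, b.size());
        for (size_t i = 0; i < a.size(); ++i) { A.col(i) = a[i]; B.col(i) = b[i]; }
        return Eigen::umeyama(A, B, /*with_scaling=*/false);  // maps A onto B
    }

The asymmetric board guarantees both detections enumerate corners in the same order, which is why the two point lists can be zipped into correspondences directly.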

Calibration target - chessboard corners

Results with calibration target

Pros
  • Very accurate estimate of the transform is obtained - the chessboard points are detected robustly and accurately (sub-pixel accuracy).
  • Fast

Cons
  • Placing the chessboard so as to be visible to all Kinects is difficult.
  • The chessboard occupies only a portion of the field of view. The estimate overfits these points, leading to misalignment at other points; because of error in the Kinect data, everything does not fit snugly. Features need to be spread across the entire image.
Image Feature matching

RANSAC-filtered good matches of SIFT keypoints

Following the same approach as Henry et al., we found SIFT features for the images obtained. Using Euclidean distance as the distance metric and RANSAC as the sample consensus approach, we chose corresponding image points that are spatially aligned, where spatial alignment was measured on the corresponding 3D points. The least-error transformation was chosen as the desired rigid transform. We also experimented with 3D features, namely FPFH and PFH.
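A sketch of the matching-plus-RANSAC step (cv::SIFT requires OpenCV >= 4.4; the 0.75 ratio and 3 cm inlier threshold are illustrative choices, not the values we used):

    #include <opencv2/features2d.hpp>
    #include <Eigen/Geometry>
    #include <random>
    #include <vector>

    // Detect SIFT keypoints and keep distinctive matches via Lowe's ratio test.
    std::vector<cv::DMatch> matchSIFT(const cv::Mat& img1, const cv::Mat& img2,
                                      std::vector<cv::KeyPoint>& kp1,
                                      std::vector<cv::KeyPoint>& kp2)
    {
        cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
        cv::Mat d1, d2;
        sift->detectAndCompute(img1, cv::noArray(), kp1, d1);
        sift->detectAndCompute(img2, cv::noArray(), kp2, d2);
        cv::BFMatcher matcher(cv::NORM_L2);      // Euclidean distance metric
        std::vector<std::vector<cv::DMatch>> knn;
        matcher.knnMatch(d1, d2, knn, 2);
        std::vector<cv::DMatch> good;
        for (const auto& m : knn)
            if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
                good.push_back(m[0]);            // best clearly beats runner-up
        return good;
    }

    // RANSAC over the lifted 3D correspondences: minimal samples of 3 points
    // fix a rigid transform; keep the hypothesis with the most aligned matches.
    Eigen::Matrix4f ransacRigid(const std::vector<Eigen::Vector3f>& src,
                                const std::vector<Eigen::Vector3f>& dst,
                                int iters = 500, float inlierThresh = 0.03f)
    {
        std::mt19937 rng(42);
        std::uniform_int_distribution<size_t> pick(0, src.size() - 1);
        Eigen::Matrix4f best = Eigen::Matrix4f::Identity();
        int bestInliers = 0;
        for (int it = 0; it < iters; ++it) {
            Eigen::Matrix3Xf A(3, 3), B(3, 3);
            for (int k = 0; k < 3; ++k) {
                size_t i = pick(rng);
                A.col(k) = src[i];
                B.col(k) = dst[i];
            }
            Eigen::Matrix4f T = Eigen::umeyama(A, B, false);  // rigid, no scale
            int inliers = 0;
            for (size_t i = 0; i < src.size(); ++i)
                if (((T * src[i].homogeneous()).head<3>() - dst[i]).norm() < inlierThresh)
                    ++inliers;
            if (inliers > bestInliers) { bestInliers = inliers; best = T; }
        }
        return best;   // refit on all inliers for the least-error transform
    }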

Good result using image features

Bad result using image features

Pros
  • Features are spread throughout the field of view.

Cons
  • Does not work for low overlap, as the correspondences are not good enough.
Plane matching
In an attempt to improve the results, we wanted to use both shape and color information. FPFH and PFH features use only shape information, and SIFT based feature matching uses only color information. The overall aim was to match patches of the point cloud rather than mere points. For this matching, both color (e.g., a color histogram) and shape information could be used. Since working with general patches is very challenging, we started with planes, and hence I have titled this approach plane matching. This approach is novel and involves three steps:-
 Plane extraction

Plane extraction

The Point Cloud Library (PCL) implements a RANSAC-based plane extraction algorithm. This returns the plane that fits the largest number of points in the supplied point cloud. However, it does not discriminate based on the distance between points; in other words, we get multiple patches of points lying on one plane rather than a single patch. Using multiple patches as one plane makes it difficult to rely on color matching, and not all of these patches may exist in the other Kinect's point cloud. Hence, we tried to obtain only one patch per plane. This was achieved by a recursive implementation of spatial clustering of points and plane fitting within clusters. The recursion is required because a plane fit and extract operation (pulling out the points that lie on that plane) in general breaks a cluster into multiple clusters, some of which could themselves be planar.
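A sketch of this recursion using PCL's segmentation and clustering modules; the 5 cm cluster tolerance and 2 cm plane threshold are illustrative, while the 1000-point floor matches the minimum plane size we use later:

    #include <pcl/point_types.h>
    #include <pcl/search/kdtree.h>
    #include <pcl/segmentation/sac_segmentation.h>
    #include <pcl/segmentation/extract_clusters.h>
    #include <pcl/filters/extract_indices.h>
    #include <vector>

    using Cloud = pcl::PointCloud<pcl::PointXYZRGB>;

    void extractPlanes(Cloud::Ptr cloud, std::vector<Cloud::Ptr>& planes)
    {
        if (cloud->size() < 1000) return;      // ignore small leftovers

        // Spatial clustering so each plane patch is one connected region.
        pcl::search::KdTree<pcl::PointXYZRGB>::Ptr tree(
            new pcl::search::KdTree<pcl::PointXYZRGB>);
        std::vector<pcl::PointIndices> clusters;
        pcl::EuclideanClusterExtraction<pcl::PointXYZRGB> ec;
        ec.setSearchMethod(tree);
        ec.setClusterTolerance(0.05);          // 5 cm
        ec.setMinClusterSize(1000);
        ec.setInputCloud(cloud);
        ec.extract(clusters);

        for (const auto& ci : clusters) {
            Cloud::Ptr cluster(new Cloud(*cloud, ci.indices));

            // RANSAC plane fit within the cluster.
            pcl::SACSegmentation<pcl::PointXYZRGB> seg;
            seg.setModelType(pcl::SACMODEL_PLANE);
            seg.setMethodType(pcl::SAC_RANSAC);
            seg.setDistanceThreshold(0.02);    // 2 cm
            pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
            pcl::ModelCoefficients::Ptr coeff(new pcl::ModelCoefficients);
            seg.setInputCloud(cluster);
            seg.segment(*inliers, *coeff);
            if (inliers->indices.size() < 1000) continue;

            Cloud::Ptr plane(new Cloud), rest(new Cloud);
            pcl::ExtractIndices<pcl::PointXYZRGB> ex;
            ex.setInputCloud(cluster);
            ex.setIndices(inliers);
            ex.filter(*plane);                 // the planar patch
            ex.setNegative(true);
            ex.filter(*rest);                  // remainder of this cluster
            planes.push_back(plane);
            extractPlanes(rest, planes);       // remainder may split into more
        }
    }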
 Plane description
Each plane is described by its color histogram, obtained from the RGB values of the points that lie on the plane. The plane is also described by the coefficients of its equation; this constitutes the shape part of the description.
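As a concrete (hypothetical) example of such a descriptor, a coarse RGB histogram with 4 bins per channel, together with the cosine distance used later to compare histograms:

    #include <array>
    #include <cmath>
    #include <pcl/point_types.h>
    #include <pcl/point_cloud.h>

    // Normalized 4x4x4 RGB histogram over the points of a plane patch.
    std::array<float, 64> colorHistogram(const pcl::PointCloud<pcl::PointXYZRGB>& plane)
    {
        std::array<float, 64> hist{};
        for (const auto& p : plane) {
            int r = p.r / 64, g = p.g / 64, b = p.b / 64;  // 0..3 per channel
            hist[16 * r + 4 * g + b] += 1.0f;
        }
        for (auto& h : hist) h /= static_cast<float>(plane.size());
        return hist;
    }

    // Cosine distance between histograms: 0 when identical, 1 when orthogonal.
    float cosineDistance(const std::array<float, 64>& a, const std::array<float, 64>& b)
    {
        float dot = 0, na = 0, nb = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0f - dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
    }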
 Plane matching and transform calculation

Plane matching - an illustration

The rotation came right

Rotation estimation
Rotation is obtained by aligning the normals. To make plane detection robust, we restricted ourselves to planes having at least 1000 points. In a general scene we get ~10 good planes, and these can be used for plane matching and transform calculation. Since the number of planes is very small, a brute-force search is feasible: we search for the least-error set of 4 corresponding planes and use these to calculate the transformation. The error metric is based on the cosine distance between the color histograms and the angle between the normals after rotation. Ideally the normals should align exactly, but error in Kinect data and normal estimation leads to an error of ≤ 6 degrees. We thresholded these errors individually and chose the minimum-error candidate among those below the threshold.
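A sketch of the normal-alignment step via the standard SVD (Kabsch) solution; srcN/dstN stand for a hypothetical matched set of unit normals:

    #include <Eigen/Dense>
    #include <vector>

    // Rotation that best aligns matched plane normals: srcN[i] <-> dstN[i].
    Eigen::Matrix3f rotationFromNormals(const std::vector<Eigen::Vector3f>& srcN,
                                        const std::vector<Eigen::Vector3f>& dstN)
    {
        Eigen::Matrix3f H = Eigen::Matrix3f::Zero();
        for (size_t i = 0; i < srcN.size(); ++i)
            H += dstN[i] * srcN[i].transpose();   // cross-covariance of normals
        Eigen::JacobiSVD<Eigen::Matrix3f> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
        Eigen::Matrix3f R = svd.matrixU() * svd.matrixV().transpose();
        if (R.determinant() < 0) {                // fix an improper rotation
            Eigen::Matrix3f V = svd.matrixV();
            V.col(2) *= -1.0f;
            R = svd.matrixU() * V.transpose();
        }
        return R;                                 // R * srcN[i] ~ dstN[i]
    }

The residual angle between R * srcN[i] and dstN[i] is exactly the quantity thresholded at 6 degrees above.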
Translation estimation
Translation required an additional distance metric. To estimate the translation, the candidate plane correspondences were used in the following manner:-
  1. We traversed the set of correspondences and, for each correspondence, computed the displacement along the aligned normal.
  2. This displacement was added to the translation vector and subtracted (vector subtraction) from the similar displacements between the remaining corresponding planes.
  3. At the end of this process, the net translation vector was used to translate the already rotated planes.
The distance based error metric was then defined as the sum of the residual displacements, along their aligned normals, between corresponding planes. This was weighted and added to the original error metric. This process, however, does not ensure that translations for all three degrees of freedom are determined: the correspondences may all end up being between parallel planes, leaving the translation only partially determined. To solve this problem, we ensured that the normals of the planes vary substantially in their direction. The normals were treated as points on a unit sphere and their covariance matrix was calculated. The eigenvalues of this matrix measure the spread of the normals along three distinct directions (the eigenvectors). The entire process was carried out only if all three eigenvalues were greater than a threshold (in our case 0.1).
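Our implementation is iterative, as described above. For reference, here is a compact least-squares alternative (a sketch, not our code): writing planes as n·x + d = 0, a rigid map x' = Rx + t sends (n, d) to (Rn, d - (Rn)·t), so each matched plane pair gives one linear constraint on t, guarded by the same eigenvalue test on the normals:

    #include <Eigen/Dense>
    #include <vector>

    // srcN/srcD: source plane normals and offsets; dstD: matched target offsets,
    // paired so that R * srcN[i] aligns with the target normal of match i.
    bool translationFromPlanes(const Eigen::Matrix3f& R,
                               const std::vector<Eigen::Vector3f>& srcN,
                               const std::vector<float>& srcD,
                               const std::vector<float>& dstD,
                               Eigen::Vector3f& t)
    {
        const int n = static_cast<int>(srcN.size());
        Eigen::MatrixXf A(n, 3);
        Eigen::VectorXf b(n);
        for (int i = 0; i < n; ++i) {
            A.row(i) = (R * srcN[i]).transpose();  // rotated source normal
            b(i) = srcD[i] - dstD[i];              // (R n_i) . t = d_i - e_i
        }

        // Degeneracy test from above: normals, treated as points on the unit
        // sphere, must spread along three distinct directions.
        Eigen::RowVector3f mu = A.colwise().mean();
        Eigen::MatrixXf C = A.rowwise() - mu;
        Eigen::Matrix3f cov = C.transpose() * C / static_cast<float>(n);
        Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> eig(cov);
        if (eig.eigenvalues().minCoeff() < 0.1f)   // threshold from the text
            return false;                          // translation underdetermined

        t = A.colPivHouseholderQr().solve(b);      // least-squares solution
        return true;
    }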

Cons
  • The implementation has bugs, and hence the translation is not correctly determined.
  • We need to generalize beyond mere planes. Finding matching planes along three distinct directions is a tighter condition than having large overlap, and the large-overlap problem is already solved by Henry et al.
  • Very slow - the clustering algorithm is implemented by region growing, which is very slow.