A Case for Sensor Fusion for 3D Object Annotations

4 minute read

Achieving the highest quality of annotated perception data while increasing the level of automation is a central challenge in computer vision AI. In our blog post about the use of 2D and 3D sensor data for machine perception, we established that we need both types to make this work. So how do we combine the two- and three-dimensional worlds?

What is Sensor Fusion?

In machine perception, there’s one approach that combines the benefits of 2D and 3D data while avoiding the downsides of each: Sensor Fusion - a method of merging the information from two or more sensor sources into a new representation of the environment, one superior to any of the individual ones.

We’ll illustrate this with a specific case: fusing 2D camera data with data from a 3D sensor such as LiDAR. Joining these two modalities creates a richer representation of the surrounding world and is essential for numerous Computer Vision approaches.

Take a look at this real-life example of sensor fusion of a camera image and a LiDAR point cloud from the Intelligent Vehicles Section at TU Delft:

A sensor fusion example, source: Intelligent Vehicles Section, TU Delft

A World of Computer Vision Possibilities

Computer vision approaches benefit greatly from having multiple representations of a scene as input. In the specific case of 3D object detection and localization, sensor fusion enables a much more accurate output - you get extremely good 3D annotations out of the labeling process.

Some of the many exciting examples are:

3D Point Coloring

When provided with an image and a 3D LiDAR scan of the same scene, the RGB information from the camera can be used to augment the 3D points and ‘paint’ them in the colors of the pixels that match their location. This makes the point cloud not only easier for humans to interpret, but also much more useful for the 3D Object Detection and 3D Object Recognition algorithms used to automate the annotation process.
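To make this concrete, here is a minimal sketch of the projection-and-coloring step, assuming a 4x4 extrinsic matrix T_cam_lidar and an intrinsic matrix K are already available (the names are illustrative, and lens distortion is ignored for brevity):

```python
import numpy as np

def color_point_cloud(points_lidar, image, T_cam_lidar, K):
    """Attach RGB colors to LiDAR points by projecting them into a camera image.

    points_lidar : (N, 3) XYZ points in the LiDAR frame.
    image        : (H, W, 3) RGB image from the calibrated camera.
    T_cam_lidar  : (4, 4) extrinsic transform from LiDAR to camera coordinates.
    K            : (3, 3) camera intrinsic matrix.
    Returns an (M, 6) array of XYZRGB points for the points visible in the image.
    """
    # Transform the points into the camera coordinate frame.
    ones = np.ones((len(points_lidar), 1))
    points_cam = (T_cam_lidar @ np.hstack([points_lidar, ones]).T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = points_cam[:, 2] > 0
    points_lidar, points_cam = points_lidar[in_front], points_cam[in_front]

    # Project with the pinhole model (lens distortion ignored for brevity).
    uv = (K @ points_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep projections that land inside the image.
    h, w = image.shape[:2]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Look up the pixel color at each projected location and attach it.
    colors = image[v[in_image], u[in_image]]
    return np.hstack([points_lidar[in_image], colors])
```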

Distance Estimation

The ‘depth’ information that 3D data provides for a matched 2D scene helps us estimate the distance and dimensions of objects that are visible in a 2D image but too far away to be reliably detected by a LiDAR. Predicting accurate distances is especially useful for the tracking and planning approaches that analyze and predict the behavior and trajectories of traffic participants - a necessity in autonomous driving.
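As a rough illustration of the geometry involved, here is a hedged sketch of a pinhole-model distance estimate for a camera-only detection. The object height and focal length below are assumed values for the example; in practice, such estimates (or the models that produce them) are typically calibrated and validated against LiDAR-fused data:

```python
def estimate_distance_from_bbox(bbox_height_px, real_height_m, focal_length_y_px):
    """Rough pinhole-model distance estimate for a 2D detection.

    bbox_height_px    : height of the detected bounding box in pixels.
    real_height_m     : assumed real-world height of the object class.
    focal_length_y_px : vertical focal length from the intrinsic matrix (K[1, 1]).

    Returns the approximate distance along the camera axis, in meters.
    """
    return real_height_m * focal_length_y_px / bbox_height_px


# Example: a car assumed to be ~1.5 m tall that spans 30 px in an image taken
# with a 1200 px focal length sits roughly 60 m away - a range at which LiDAR
# returns are already sparse.
distance = estimate_distance_from_bbox(bbox_height_px=30,
                                       real_height_m=1.5,
                                       focal_length_y_px=1200)
print(f"Estimated distance: {distance:.1f} m")
```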

Dense Point Clouds

LiDAR point clouds are sparse by nature, especially compared to the visual density of a standard 2D image. By utilizing the depth information and the matched 2D scene, we can augment the original point cloud with additional points and create a much denser version. Again, this is especially useful for 3D Object Detection and 3D Object Recognition.

A point cloud generated from a 2D camera image using a depth estimation model. The results are not perfectly accurate, but they show a much higher information density than LiDAR sensors provide.
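Here is a minimal sketch of the unprojection step behind a figure like this: turning a dense per-pixel depth map (for example, one predicted by a depth estimation model) into a point cloud using the camera intrinsic matrix K. The function name and array shapes are illustrative:

```python
import numpy as np

def depth_map_to_point_cloud(depth, K):
    """Unproject a dense depth map into a 3D point cloud.

    depth : (H, W) array of per-pixel depths in meters.
    K     : (3, 3) camera intrinsic matrix.
    Returns an (M, 3) array of XYZ points in the camera frame.
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Build the pixel grid.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Inverse pinhole projection: every pixel with a depth becomes a 3D point.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels without a valid depth
```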

Of course, nothing so good comes easy. In order to reap those amazing benefits, some prerequisites have to be met.

How to Enable Sensor Fusion?

If you want to make the most of Sensor Fusion, your data should meet a couple of conditions:

1. Intersecting fields of view

This one is a bit obvious, but it has to be mentioned - the sensors must have a sufficiently large overlap between their FOVs (fields of view). Sensor fusion depends on creating a match between the two views, so the augmentation can only happen in overlapping regions.
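As a quick back-of-the-envelope check, the horizontal overlap between two sensors can be estimated from their FOV angles and mounting yaws. This sketch assumes both yaws are expressed in the same vehicle frame and ignores wrap-around at ±180°:

```python
def horizontal_fov_overlap_deg(fov_a_deg, yaw_a_deg, fov_b_deg, yaw_b_deg):
    """Angular overlap (in degrees) between the horizontal FOVs of two sensors.

    fov_*_deg : horizontal field of view of each sensor.
    yaw_*_deg : mounting yaw of each sensor's optical axis in the vehicle frame.
    """
    a_min, a_max = yaw_a_deg - fov_a_deg / 2, yaw_a_deg + fov_a_deg / 2
    b_min, b_max = yaw_b_deg - fov_b_deg / 2, yaw_b_deg + fov_b_deg / 2
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


# A front camera with a 90 degree FOV and a forward-facing 120 degree LiDAR
# sector share 90 degrees of overlap - only points in that sector can be fused.
print(horizontal_fov_overlap_deg(90, 0, 120, 0))   # -> 90.0
```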

2. Calibrated sensors

Having correctly calibrated sensors is key to achieving good results. The camera needs to have both its intrinsic matrix and its distortion parameters correctly estimated. Additionally, an extrinsic calibration matrix representing the relation between the coordinate systems of the two sensors needs to be computed and provided. This is essential for bringing the two data streams into a unified representation.
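To make these ingredients concrete, here is a small sketch using OpenCV’s cv2.projectPoints, which takes exactly these pieces - the intrinsic matrix, the distortion coefficients, and the extrinsic rotation and translation - to project LiDAR points into the image. All numeric values below are made up for illustration:

```python
import numpy as np
import cv2

# A typical calibration set for camera-LiDAR fusion (values are illustrative):
K = np.array([[1200.0,    0.0, 960.0],   # intrinsic matrix: focal lengths
              [   0.0, 1200.0, 600.0],   # and principal point, in pixels
              [   0.0,    0.0,   1.0]])
dist_coeffs = np.array([-0.30, 0.10, 0.0, 0.0, 0.0])   # lens distortion

# Extrinsics: rotation (as a Rodrigues vector) and translation that map
# points from the LiDAR frame into the camera frame.
rvec = np.array([1.2, -1.2, 1.2])
tvec = np.array([0.05, -0.10, -0.20])

# Project LiDAR points into the (distorted) camera image.
points_lidar = np.array([[ 5.0,  1.0, 0.5],
                         [12.0, -2.0, 0.8]])
pixels, _ = cv2.projectPoints(points_lidar, rvec, tvec, K, dist_coeffs)
print(pixels.reshape(-1, 2))
```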

3. Synchronized recording

Last but not least - the two sensors have to be synchronized so that their recordings happen at the same time. If there’s a delay between the sensors’ recordings, it can render the data ‘infusible’: the sensors will produce different snapshots of the world around them that can’t be precisely matched, which introduces distortion into the fused version. This doesn’t play out well for Computer Vision models.
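In practice, synchronization is often handled by pairing each camera frame with the LiDAR sweep closest in time and discarding pairs that are too far apart. A minimal sketch, assuming per-frame timestamps in seconds and an illustrative 10 ms tolerance:

```python
import numpy as np

def match_frames(camera_ts, lidar_ts, max_offset_s=0.01):
    """Pair each camera frame with the closest LiDAR sweep in time.

    camera_ts, lidar_ts : sorted 1D arrays of timestamps in seconds.
    max_offset_s        : maximum tolerated offset (assumed 10 ms here);
                          pairs further apart than this are discarded.
    Returns a list of (camera_index, lidar_index) pairs.
    """
    pairs = []
    for i, t in enumerate(camera_ts):
        j = int(np.argmin(np.abs(lidar_ts - t)))
        if abs(lidar_ts[j] - t) <= max_offset_s:
            pairs.append((i, j))
    return pairs
```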


Automated Sensor Fusion Annotation

Long story short - both 2D and 3D data are useful on their own, but the true value is realized only when they are combined. By fusing the information from both sources and processing it through our automation engine pipeline, the UAI annotation tooling produces extremely accurate object annotations, consistent with the real position, size, and orientation of the objects in 2D and 3D. If the most accurate annotations are what you’re after, get in touch!

Georgi Urumov, Deep Learning engineer, technical writer, entrepreneur for understand.ai