A Matter of Perspective - 3D vs. 2D Sensor Data for Machine Perception

One of the biggest differences between humans and machines is in the way we perceive our surroundings. We both exist in a three-dimensional world, but while humans are naturally capable of figuring out complex geometry, the effects of perspective, occlusion, vanishing points, object permanence and much more, machines have quite a difficult time dealing with even the simplest of cases.

Enabling them to ‘see’ is still a field of active research. Modern sensors already surpass natural human capabilities in many respects, yet machines’ ability to understand what they perceive is still limited. They struggle with ‘connecting the dots’ between observations. We tried to teach them by mimicking our own learning behaviour - by providing a stream of video data and expecting them to learn from it. Sadly, this has proven useful only for a limited number of tasks. Extensive research in recent years indicates that the missing link might be ‘depth’ - the context added by the third dimension.

[Figure: 3D object (a car) in a lidar point cloud]

Why do you need 3D data for your perception algorithms?

Simply put, you can’t truly capture the nature of a 3D object in its 2D representation. The number gives it away - you are losing a whole dimension in the process - namely ‘depth’. Humans know how geometrical projection works and have a mental model of how distances and perspective affect the appearance of different objects. This enables us to estimate their relative position, size and orientation, even from a single image. Machines are yet to get there.
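To make the lost dimension concrete, here is a minimal sketch of an idealised pinhole camera. The function name, the focal length and the principal point are illustrative assumptions, not values from any particular camera. It shows that two points at very different distances can land on exactly the same pixel, so the image alone cannot tell them apart.

```python
import numpy as np

def project_pinhole(points_xyz, focal_px=1000.0, cx=960.0, cy=540.0):
    """Project Nx3 camera-frame points (x right, y down, z forward) to pixels.

    focal_px, cx and cy are made-up intrinsics for a 1920x1080 image.
    """
    points_xyz = np.asarray(points_xyz, dtype=float)
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    # Dividing by z is the step that throws the depth away.
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return np.stack([u, v], axis=1)

# Two points on the same viewing ray: one 10 m away, one 40 m away.
near = np.array([1.0, 0.5, 10.0])
far = 4.0 * near
pixels = project_pinhole([near, far])
same_pixel = bool(np.allclose(pixels[0], pixels[1]))
print(pixels)
print("same pixel:", same_pixel)
```

A 3D sensor would report the two points 30 m apart, while the camera sees a single pixel location.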

To bridge this gap, various sensors capable of providing a 3D representation of the surroundings are being used. LiDAR, radar and ultrasound are amongst the most popular options, each having its own advantages, disadvantages and applications.

The wonders and burdens of using 3D data

As with everything in life, there are two sides to using 3D data. Getting that extra dimension comes at a price. Some of the major challenges include:

- Complex and expensive sensors

No matter what, money is always a factor. 3D-capable sensors vary greatly in build complexity and, accordingly, in price, ranging from hundreds to thousands of dollars. Choosing them over the standard camera setup is not cheap, especially given that you usually need multiple units to guarantee a large enough field of view.

- Low-resolution data

In many cases, the data gathered by 3D sensors is nowhere near as dense or high-resolution as that from conventional cameras. A standard LiDAR, for example, discretizes the vertical space into lines (the number of lines varies), each containing a couple of hundred detection points. This produces approximately 1000 times fewer data points than a standard HD picture contains. Furthermore, the farther away an object is, the fewer samples land on it, because the laser beams spread out conically. For a fixed angular resolution, the number of points on an object falls roughly with the square of its distance, so objects become much harder to detect the farther they are from the sensor.

[Figure: A bus clearly visible in a 2D camera image compared with its sparse representation in a 3D point cloud]
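The effect of distance on sparsity can be estimated with a little trigonometry. The sketch below is a rough approximation under stated assumptions: the target is flat and faces the sensor, and the horizontal and vertical angular resolutions (0.2 and 0.4 degrees) are hypothetical values, not the specification of any real LiDAR.

```python
import math

def points_on_target(width_m, height_m, distance_m,
                     h_res_deg=0.2, v_res_deg=0.4):
    """Approximate number of returns on a flat, sensor-facing target.

    h_res_deg and v_res_deg are assumed angular resolutions.
    """
    # Angle the target subtends horizontally and vertically, in degrees.
    h_angle = 2.0 * math.degrees(math.atan(width_m / (2.0 * distance_m)))
    v_angle = 2.0 * math.degrees(math.atan(height_m / (2.0 * distance_m)))
    # Number of beams that fit into each angle.
    return int(h_angle / h_res_deg) * int(v_angle / v_res_deg)

# Front of a bus, roughly 2.5 m wide and 3.0 m high, at increasing range.
counts = {d: points_on_target(2.5, 3.0, d) for d in (10, 20, 40, 80)}
for distance_m, n_points in counts.items():
    print(f"{distance_m:>3} m: ~{n_points} points")
```

Each doubling of the distance leaves roughly a quarter of the points, which is why far-away objects are so much harder to pick out of a point cloud.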

- Technically challenging data representation

Working with 3D data is a challenge of its own. In contrast to the near-instantaneous data acquisition provided by cameras, some 3D sensors experience a delay between the signal’s emission and its registration due to round-trip time, which matters especially if the sensor is mounted on a fast-moving car. In the case of a rotating sensor, such as a 360-degree LiDAR, additional corrections are needed to undo the distortion caused by the movement. Even though solutions exist, they depend heavily on well-calibrated sensors and additional information, such as IMU (inertial measurement unit) or GPS data, which makes the whole process of preparing and using the scans much more complex than simple camera-based image acquisition.

[Figure: Non-egomotion-corrected pointcloud (the whole scene is moving)]
[Figure: Egomotion-corrected pointcloud]
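To give a feeling for what such a correction involves, here is a deliberately simplified sketch. It assumes the vehicle moves with a constant, known velocity and does not rotate during the scan, and that every point carries its own timestamp. The function name is hypothetical; real pipelines typically interpolate full poses (including rotation) from IMU or GPS data instead.

```python
import numpy as np

def deskew_constant_velocity(points_xyz, timestamps_s, velocity_mps, t_ref_s):
    """Express each point in the sensor frame at time t_ref_s.

    Assumes constant linear velocity and no rotation during the scan.
    """
    points_xyz = np.asarray(points_xyz, dtype=float)
    dt = np.asarray(timestamps_s, dtype=float) - t_ref_s
    # A point measured at time t was seen from a sensor displaced by
    # velocity * (t - t_ref) relative to the reference pose.
    return points_xyz + dt[:, None] * np.asarray(velocity_mps, dtype=float)

# A static landmark 20 m ahead, hit three times during one 0.1 s sweep
# while the car drives forward at 10 m/s.
velocity = np.array([10.0, 0.0, 0.0])
timestamps = np.array([0.0, 0.05, 0.1])
landmark_world = np.array([20.0, 0.0, 0.0])
raw = landmark_world - timestamps[:, None] * velocity  # what the moving sensor records
corrected = deskew_constant_velocity(raw, timestamps, velocity, t_ref_s=0.1)
print(raw)
print(corrected)
```

In the raw measurements the static landmark appears smeared over a metre; after the correction all three hits coincide again.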

Despite those setbacks, 3D data has some serious advantages:

+ Natural representation

When using 3D sensors, the recorded data presents a digital clone of the real world, which provides some extremely useful properties. Unlike in 2D projections, where perspective changes objects’ appearance and their perceived size, in 3D they have a consistent size true to their real-world dimensions, no matter the distance to the sensor. Furthermore, the exact orientation of the object with respect to the sensor’s position can be estimated. None of those can be achieved with such accuracy given a simple 2D representation.

[Figure: 2D bounding box changing size in sequential frames]
[Figure: 3D bounding box with a consistent size through all frames]
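The difference between the two kinds of bounding box can be reproduced in a few lines. The sketch below assumes an idealised pinhole camera with a made-up focal length and an axis-aligned box with roughly car-like dimensions; both helper functions are illustrative, not part of any library.

```python
import numpy as np

def box_corners(center_xyz, size_xyz):
    """Return the 8 corners of an axis-aligned 3D box."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center_xyz, float) + 0.5 * signs * np.asarray(size_xyz, float)

def bbox_2d_size(corners_xyz, focal_px=1000.0):
    """Width and height (pixels) of the 2D box enclosing the projected corners."""
    uv = focal_px * corners_xyz[:, :2] / corners_xyz[:, 2:3]
    return uv.max(axis=0) - uv.min(axis=0)

# Width (x), height (y) and length (z, along the viewing direction) in metres.
size = np.array([1.8, 1.5, 4.5])
results = []
for distance in (10.0, 20.0, 40.0):
    corners = box_corners([0.0, 0.0, distance], size)
    extent_3d = corners.max(axis=0) - corners.min(axis=0)
    results.append((extent_3d, bbox_2d_size(corners)))
    print(distance, extent_3d, results[-1][1])
```

The 3D extent is identical at every distance, while the 2D box shrinks as the object moves away.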

+ Accurate position measurements

3D mapping methods are frequently used to measure objects’ location with extreme precision. The lasers of a LiDAR sensor and the radio waves of a radar are both capable of providing information about the exact distance at which an obstacle is encountered, giving access to the ‘depth’ dimension. This enables algorithms to reason about the scene as a whole, the distances between the objects in it, and most importantly - the area free of obstructions.
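As a rough illustration of where the distance comes from, the sketch below assumes a pulsed time-of-flight sensor: the range is half the round-trip time multiplied by the speed of light, and together with the beam’s angles it gives a 3D point. The function names and the axis convention (x forward, y left, z up) are assumptions made for this example.

```python
import math

SPEED_OF_LIGHT_MPS = 299_792_458.0

def range_from_time_of_flight(round_trip_s):
    """Distance to the obstacle; the pulse travels there and back."""
    return 0.5 * SPEED_OF_LIGHT_MPS * round_trip_s

def to_cartesian(range_m, azimuth_deg, elevation_deg):
    """Convert range and beam angles to x (forward), y (left), z (up)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    x = range_m * math.cos(el) * math.cos(az)
    y = range_m * math.cos(el) * math.sin(az)
    z = range_m * math.sin(el)
    return x, y, z

# An echo that arrives 200 nanoseconds after the pulse was emitted.
distance_m = range_from_time_of_flight(200e-9)
point = to_cartesian(distance_m, azimuth_deg=90.0, elevation_deg=0.0)
print(distance_m, point)
```

A 200 ns round trip corresponds to an obstacle about 30 m away, which also shows how precise the timing has to be for accurate ranging.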

+ Invariant to lighting conditions

There is another major drawback when it comes to camera images - lighting conditions have an extreme effect on the quality of the acquired data. Time of day and some weather conditions can significantly reduce the effective range of the camera or render it useless altogether. In contrast, active 3D sensors such as LiDAR and radar emit their own signal, so they are largely unaffected by lighting conditions and can provide a consistent stream of high-quality data day and night.

[Figure: A pedestrian invisible in a camera image, clearly visible in lidar point cloud]

Given all that, should we ditch 2D for 3D? We’ve established that 3D data gives us a natural representation of the world, yet it is sparse and its resolution diminishes with distance. On the other hand, 2D data is easy to obtain and provides a complementary set of features. Both 3D and 2D data have a place in our hearts 💜 and, most importantly, in the sensor set-ups for most computer vision use cases, including autonomous driving.

Whether you’re using 2D, 3D, or a combination of those two to train and validate your algorithms, understand.ai offers annotation tooling with unprecedented automation rates that reduces the time and cost of your AI projects. Get in touch with us for a free UAI Annotator demo today.

Georgi Urumov, Deep Learning Engineer at understand.ai