Autonomous Driving: Synthetic Data versus Real Data

Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) have become a hot topic in the automotive industry, with many companies using their highly automated driving functions as a key differentiator from their competition. Most of these functions rely on machine learning algorithms, which require large amounts of data, both real and synthetic.

In this article, we will explore the differences between synthetic data and real data, and their respective advantages and disadvantages for the development of ADAS/AD. We also venture a prediction on the future use of real-world data in the ADAS/AD environment.

Real data is obtained from the physical world. In the context of autonomous driving, it is typically collected by equipping vehicles with cameras and other sensors that record information about their surroundings.

Real data represents the real world more precisely than synthetic data. This means that it can be used to develop and, in particular, validate machine learning algorithms that are better tailored to real-world situations. This is highly safety-relevant because accurate perception is the prerequisite for comprehensive scene understanding and safe path planning.

At the same time, real data is still expensive. Many kilometers have to be driven to collect the data needed to train an ADAS/AD system, and several times that amount is required to validate the algorithms. When it comes to edge cases (situations that occur rarely, such as people jumping onto a road or near accidents), collecting real data becomes a bottleneck that is both costly and time-consuming.

Synthetic data is generated by algorithms. In the context of autonomous driving, it is typically produced by creating virtual environments that simulate real-world driving conditions.

Synthetic data can be produced at scale and can be tailored to specific scenarios that would be difficult or impossible to replicate in the real world, the so-called edge cases. For example, it may be difficult to stage a scenario in which a child suddenly jumps out into the middle of the road, but this scenario can easily be generated via simulation. Such data can be leveraged for training and validation of scene understanding, prediction and planning algorithms.

Because the use of data in ADAS/AD is always safety-relevant, the parameters used to generate synthetic data must be carefully set and checked.
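To make the idea of parameterized edge cases more concrete, here is a minimal Python sketch of how variations of the "child jumps onto the road" scenario might be sampled and then checked against explicit bounds. All names (PedestrianScenario, PARAM_BOUNDS, sample_scenarios, check_scenario) and all value ranges are assumptions made up for illustration; they do not come from any particular simulator, and a real setup would hand such parameters to a simulation tool and choose the ranges together with domain experts.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PedestrianScenario:
    """Hypothetical parameter set for one simulated 'child jumps onto road' case."""
    ego_speed_kmh: float        # speed of the ego vehicle
    trigger_distance_m: float   # distance to the ego vehicle when the child starts moving
    pedestrian_speed_ms: float  # speed of the child
    occluded: bool              # whether the child is initially hidden, e.g. by a parked car


# Assumed plausibility bounds (illustrative values only).
PARAM_BOUNDS = {
    "ego_speed_kmh": (20.0, 60.0),
    "trigger_distance_m": (5.0, 40.0),
    "pedestrian_speed_ms": (1.0, 4.0),
}


def sample_scenarios(n: int, seed: int = 0) -> list[PedestrianScenario]:
    """Draw n scenario variations uniformly within PARAM_BOUNDS (reproducible via seed)."""
    rng = random.Random(seed)
    return [
        PedestrianScenario(
            ego_speed_kmh=rng.uniform(*PARAM_BOUNDS["ego_speed_kmh"]),
            trigger_distance_m=rng.uniform(*PARAM_BOUNDS["trigger_distance_m"]),
            pedestrian_speed_ms=rng.uniform(*PARAM_BOUNDS["pedestrian_speed_ms"]),
            occluded=rng.random() < 0.5,
        )
        for _ in range(n)
    ]


def check_scenario(scenario: PedestrianScenario) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, (low, high) in PARAM_BOUNDS.items():
        value = getattr(scenario, name)
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low}, {high}]")
    return problems


if __name__ == "__main__":
    for scenario in sample_scenarios(3, seed=42):
        print(scenario, check_scenario(scenario))
```

The fixed seed makes a generated scenario set reproducible, which helps when a particular variation needs to be re-examined later. The check is deliberately kept separate from the sampling so that hand-written or imported scenarios can be verified in the same way.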

The strongest argument for synthetic data is the extremely low cost and short time needed to create it.

For validation of the perception stack, it is still unclear whether neural networks work like the human eye and brain. From our experience at understand.ai, objects that appear similarly difficult to identify sometimes yield vastly different detection rates. Conversely, perception algorithms often detect objects easily overlooked by the human eye. Just because synthetic data looks similar to the real world to the human eye does not mean that it looks similar to a machine learning algorithm. Consequently, it is impossible to infer that a perception stack works well in the real world if it has been validated on synthetic data only.

Simon Romanski, Product Owner Computer Vision at understand.ai

Semi-synthetic data in the context of autonomous driving refers to a type of data that is generated through a combination of real-world data and augmentations.

In semi-synthetic data, real-world data is used as a base, and is then altered to create new scenarios or variations of the existing data. 

For example, in the case of image-based ADAS/AD systems, real-world images of different driving scenarios can be used as a base, and new images can then be generated by manipulating factors such as lighting or weather conditions, or by adding virtual objects such as vehicles or pedestrians. The images can also be “rotated” and “shifted” using augmentation to depict a real situation from a different point of view.
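The augmentations described above can be illustrated with a deliberately tiny Python sketch. It assumes a grayscale image stored as a plain list of pixel rows, and the helper names (adjust_brightness, shift_right, paste_object) are hypothetical. Production pipelines would normally rely on an image processing library and on far more realistic lighting, weather and object-insertion models; this only shows the principle of deriving new samples from a real base image.

```python
Image = list[list[int]]  # grayscale image: rows of pixel values in 0..255


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale every pixel by factor and clamp to 0..255 (crude lighting change)."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def shift_right(img: Image, dx: int, fill: int = 0) -> Image:
    """Shift the image dx pixels to the right, padding the left edge with fill."""
    width = len(img[0]) if img else 0
    if not 0 <= dx <= width:
        raise ValueError("dx must be between 0 and the image width")
    return [[fill] * dx + row[: width - dx] for row in img]


def paste_object(img: Image, patch: Image, top: int, left: int) -> Image:
    """Overlay patch (e.g. a rendered pedestrian) with its top-left corner at (top, left)."""
    patch_height = len(patch)
    patch_width = len(patch[0]) if patch else 0
    if top < 0 or left < 0 or top + patch_height > len(img) or left + patch_width > len(img[0]):
        raise ValueError("patch does not fit inside the image")
    out = [row[:] for row in img]  # copy so the real base image stays untouched
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


if __name__ == "__main__":
    base = [[10, 20, 30], [40, 50, 60]]      # stand-in for a real camera image
    darker = adjust_brightness(base, 0.5)    # simulate lower light
    shifted = shift_right(base, 1)           # simulate a slightly different viewpoint
    with_object = paste_object(base, [[255]], top=1, left=2)  # add a virtual object
    print(darker, shifted, with_object, sep="\n")
```

Each function returns a new image rather than modifying its input, so one recorded frame can serve as the base for many semi-synthetic variants.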

Will synthetic data replace real data in ADAS/AD in the future?

The general attitude today tends to be against the use of real data for ADAS/AD systems: it is seen as too time-consuming to collect, sensitive to poor visibility and not GDPR compliant, but above all as too expensive to obtain.

However, as more and more vehicles on our roads are connected and equipped with cameras and sensors, they could potentially collect reliable data at low cost, which can be used for further development of ADAS/AD systems. The collected data can be handled in appropriately secure environments while preserving the privacy of the individual.

At understand.ai, we are sure that real data won’t be completely replaced by synthetic data in the field of ADAS/AD in the near future. Think of crash tests: Years ago, physical crash tests were the only reliable way to test the safety of a car. Today, they have largely been replaced by simulated crash tests to bring down the cost in early development phases. To verify the simulation results and for final validation, however, real crash tests are still needed.

Most likely, a combination of real-world, synthetic, and semi-synthetic data will continue to be used in the development of ADAS/AD systems in the future. Each type of data will have its own role in the development process.

Real-world data provides important insights into the behavior of drivers and other road users, and is necessary for testing and validation of ADAS/AD systems. Synthetic data can be useful for generating scenarios that are difficult or dangerous to reproduce in the real world, and for training machine learning algorithms when real-world data is limited.

Semi-synthetic data can provide the benefits of both types of data, and can help to improve the performance and safety of ADAS/AD systems.

As autonomous driving technology continues to evolve, more data will be generated from on-board sensors and other sources, such as connected vehicles and infrastructure. This data will provide valuable insights into the driving environment and can be used to further improve the performance of ADAS/AD systems.

Moreover, regulatory bodies and safety organizations will likely require that autonomous vehicles undergo extensive testing in real-world conditions before they are allowed on public roads. This testing includes the use of real-world data to validate and verify the performance of autonomous driving systems.