What are ground truth labels?
‘Ground truth’ is the objective, humanly verifiable observation of the state of an object, or a piece of information that can be considered a fact. The term has recently risen in popularity thanks to its adoption in various Machine Learning and Deep Learning approaches. In that context we speak of ‘ground truth data annotations’ or ‘ground truth labels’: human-provided classifications of the data on which the algorithms are trained or against which they are evaluated.
Why does ground truth matter?
To better understand the significance of ground truth labels for the performance of Machine Learning approaches, we need to understand how those approaches are trained. There are three categories, depending on the ‘teaching’ method: supervised, semi-supervised and unsupervised.
For supervised approaches to ‘learn’ a target domain, they need a large amount of diverse data with the corresponding correct ground truth labels. During training, the algorithm is run on many examples for which the correct response is known. The produced results are then compared to the expected ones and the resulting error is quantified. This error serves as the learning signal: the model is modified so that the error is driven as close to zero as possible.
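The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how a production model is trained: the linear model, the four labelled examples, the learning rate and the number of epochs are all assumptions chosen to keep the example small.

```python
# Toy supervised training loop: fit y = w * x + b to labelled examples.
# Model, data, learning rate and epoch count are illustrative assumptions.

def train_linear(examples, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y_true in examples:
            error = (w * x + b) - y_true  # prediction minus ground truth
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Modify the model in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(examples, w, b):
    """Quantify how far the predictions are from the ground truth labels."""
    return sum(((w * x + b) - y) ** 2 for x, y in examples) / len(examples)


# Ground truth labels generated from y = 2x + 1.
labelled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear(labelled)
final_error = mean_squared_error(labelled, w, b)
```

With correct labels the error shrinks towards zero and the learned parameters approach the values that generated the data.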
In semi-supervised approaches, only some of the data points need to have pre-known ground truth labels. The problem is that success depends heavily on the properties of the data being learned from. These approaches perform very well on problems that require identifying distinct structures in the data, but are not well suited to complex problem spaces where the differences between categories are small.
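One common semi-supervised idea is pseudo-labelling: a few human labels are spread to the unlabelled points around them. The sketch below is a deliberately simplified, hypothetical version using one-dimensional data and nearest class centroids. The function name `pseudo_label` and the data are assumptions made for illustration.

```python
# Simplified pseudo-labelling on 1-D data (illustrative assumption, not a
# production method): unlabelled points take the label of the nearest
# class centroid, and the centroids are then refined with those labels.

def pseudo_label(labelled, unlabelled, rounds=3):
    """labelled: list of (value, label); unlabelled: list of values."""
    assigned = []
    for _ in range(rounds):
        # Pool the human labels with the current pseudo-labels.
        groups = {}
        for value, label in list(labelled) + assigned:
            groups.setdefault(label, []).append(value)
        centroids = {lab: sum(vals) / len(vals) for lab, vals in groups.items()}
        # Re-assign every unlabelled point to its nearest centroid.
        assigned = [
            (value, min(centroids, key=lambda lab: abs(value - centroids[lab])))
            for value in unlabelled
        ]
    return assigned


# Only two points carry a ground truth label.
seed_labels = [(1.0, "low"), (9.0, "high")]
unlabelled_points = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]
pseudo_labels = pseudo_label(seed_labels, unlabelled_points)
```

This works here because the two groups are far apart. If the categories overlapped, points near the boundary would easily receive the wrong pseudo-label, which is the weakness described above.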
Unsupervised approaches are mostly statistically driven and try to group examples into distinct clusters based on their features. Although no ground truth labels are needed, many of them (k-means, for example) still require the number of expected clusters to be specified in advance. This makes them unsuitable for a large number of machine learning tasks, but extremely convenient for some.
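To make the dependence on the number of clusters concrete, here is a minimal k-means sketch on one-dimensional data. It uses a simple deterministic initialisation rather than the random or k-means++ initialisation found in real libraries, and the data points are made up for the example.

```python
# Minimal 1-D k-means sketch. The deterministic initialisation and the
# sample data are assumptions made to keep the example reproducible.

def k_means_1d(values, k, iterations=20):
    """Cluster 1-D values into k groups; k must be chosen by the caller."""
    ordered = sorted(values)
    # Spread the initial centres evenly over the sorted data.
    centres = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its members (keep it if empty).
        centres = [
            sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
    return centres, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centres_3, clusters_3 = k_means_1d(points, k=3)  # matches the three visible groups
centres_2, clusters_2 = k_means_1d(points, k=2)  # forces two groups to merge
```

No labels are involved, yet the result changes completely depending on the value of `k` that is passed in.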
Both semi-supervised and unsupervised approaches are being actively researched as they represent the future of the field. Yet, only supervised algorithms provide the high-quality results needed by the industry today. They depend heavily on having good ground truth data annotations during the training process in order to learn the target data domain.
There is a saying in our field that accurately describes why you need high-quality ground truth labels: ‘garbage in, garbage out’. It’s as simple as that: a model provided with wrong, imprecise or inconsistent labels as input will not be able to learn well, and the results will be far from good. Bad annotations tend to have a detrimental effect on the learning process, slowing it down or completely disrupting it.
No matter what type of approach is being used, high-quality ground truth labels are needed for the test set of data examples. Machine Learning approaches act as a sort of ‘black box’, so the only way of evaluating how well they perform is through empirical measurement. This is done by running the algorithm on a previously unseen collection of examples and comparing the resulting predictions with the ground truth.
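In its simplest form, this comparison is just counting how often the prediction agrees with the label. The sketch below computes plain accuracy. The class names and predictions are invented for the example, and real projects typically use richer metrics.

```python
# Evaluate predictions against ground truth labels on a held-out test set.
# The labels and predictions below are made-up illustrative values.

def accuracy(predictions, ground_truth):
    """Fraction of test examples where the prediction matches the label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        raise ValueError("the test set must not be empty")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


test_ground_truth = ["car", "pedestrian", "cyclist", "cyclist"]
test_predictions = ["car", "pedestrian", "car", "cyclist"]
score = accuracy(test_predictions, test_ground_truth)  # 3 of 4 correct
```

Because the score is computed directly from the labels, any error in the test annotations distorts the measurement itself, not just the model.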
Garbage annotations in, garbage annotations out
Knowing how important the ground truth data annotations are and how the performance of any model depends to a large extent on their correctness, it is only logical to try and make sure that the labels are of high quality. There are some problems that often plague the development and annotation process.
1. Missing annotations
Creating annotations for computer vision models is usually a repetitive, tedious, manual process. It is not uncommon for the labeler to miss some objects due to occlusion, small object size or simply mental fatigue. This leads to a bad learning signal during training, as the model will get ‘punished’ even when correctly recognizing the missed object, simply because the ground truth was false.
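The effect can be illustrated with a small sketch that matches predicted bounding boxes to annotated ones by intersection over union (IoU). The boxes, the 0.5 threshold and the greedy matching are assumptions chosen for the example. Real evaluation pipelines are more involved.

```python
# How a missed annotation turns a correct detection into a "false positive".
# Boxes are (x1, y1, x2, y2); all values and the threshold are illustrative.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_false_positives(predictions, annotations, threshold=0.5):
    """Count predicted boxes that match no annotation (greedy matching)."""
    unmatched = list(annotations)
    false_positives = 0
    for pred in predictions:
        match = next((a for a in unmatched if iou(pred, a) >= threshold), None)
        if match is None:
            false_positives += 1
        else:
            unmatched.remove(match)
    return false_positives


predicted = [(12, 11, 50, 49), (100, 101, 121, 120)]   # both objects found
complete = [(10, 10, 50, 50), (100, 100, 120, 120)]    # both objects labelled
incomplete = [(10, 10, 50, 50)]                        # small object was missed

fp_complete = count_false_positives(predicted, complete)
fp_incomplete = count_false_positives(predicted, incomplete)
```

The predictions are identical in both cases. Only the missing label makes the second detection count as an error.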
2. Inconsistent annotations
As the amount of data to be annotated is usually quite large, the task is parallelized between multiple annotators. This oftentimes leads to inconsistencies in the labels, as one person might interpret things differently from another. One such example might be having to label the direction of the gaze of a person. As this is extremely hard to estimate, labels can vary wildly, and an error of up to 10 degrees between different annotators’ judgment is not uncommon.
This results in poor overall model performance, as there is no consistent learning signal. Due to the way those algorithms learn, the predictions will be strongly biased towards the statistically average example in order to minimize the error.
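The pull towards the average is easy to demonstrate for a squared-error loss. In the sketch below, three hypothetical annotators give different gaze angles for the same image, and a brute-force search finds the single prediction with the lowest error. The angles and the candidate grid are assumptions for illustration.

```python
# With conflicting labels for the same input, the prediction that minimises
# the mean squared error is the average of those labels.
# The angles below are hypothetical annotator judgments in degrees.

def mean_squared_error(prediction, labels):
    return sum((prediction - y) ** 2 for y in labels) / len(labels)


annotator_labels = [20.0, 25.0, 30.0]  # same image, three annotators

# Try every candidate prediction from 0.0 to 50.0 degrees in 0.1 steps.
candidates = [c / 10 for c in range(0, 501)]
best_prediction = min(
    candidates, key=lambda c: mean_squared_error(c, annotator_labels)
)

mean_label = sum(annotator_labels) / len(annotator_labels)
remaining_error = mean_squared_error(best_prediction, annotator_labels)
```

The best achievable prediction coincides with the mean of the labels, and even then the error stays well above zero. That residual is noise the model can never learn away.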
3. Expert knowledge and edge cases
Some tasks, such as labeling medical images, are inherently hard and require pre-existing specialized knowledge. In other cases, a previously unseen or unthought-of example is encountered, for which no labeling guidelines are provided.
Both cases call for a trained professional. This greatly restricts the throughput of the annotation pipeline, as the results have to be checked by a handful of specialized individuals who are expensive to hire.
Interested in learning more about the impact of annotation errors on Neural Networks? Read this blog by Steffen Enderes.
How do we get data quality right?
Annotation quality is paramount in autonomous driving to get as close to ‘ground truth’ as possible. By combining algorithms, workflow checks and the trained eye of Quality Assurance experts, understand.ai delivers ground truth quality annotations to autonomous driving companies all over the world. Would you like to know in detail how understand.ai secures data quality? Reach out to us!
Georgi Urumov, Deep Learning engineer, technical writer, entrepreneur for understand.ai