The Impact of Annotation Errors on Neural Networks


Data quality is one of the most critical factors in algorithm training. What impact does data annotation quality have on an algorithm's performance and output? And what is the price of not getting it right? In this post I'll point out the most common annotation errors and explain how they can affect algorithm performance.

Despite the best efforts to automate annotation, the data annotation process is still partly a manual task, performed by humans with varying experience, ranging from casual contributors in crowd annotation projects to labeling experts at dedicated annotation companies. Most annotation mistakes are therefore caused by humans.

The most common object annotation errors

In 4 years of annotation project experience with millions of annotations, understand.ai’s analysts have identified the most common errors when labeling objects:

  • Incorrect class: An object is classified incorrectly, e.g. a vehicle is labeled as pedestrian.
  • Incorrect attribute: The state of an object is not described correctly, e.g. a car in motion is labeled as parked.
  • Missing annotation: An object is not annotated even though it should be.
  • Redundant annotation: An object is annotated even though it shouldn’t be.
  • Incorrect annotation size: An object is not annotated precisely enough; the annotation does not fit its actual dimensions.
  • Incorrect annotation position: An object is not annotated precisely enough; the annotation is not placed at its actual position.

    An example of common annotation errors - incorrect class and incorrect attributes. The class should be 'truck', truncation and occlusion are not present, the vehicle is moving, and the indicators are off.
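
To make these categories more tangible, here is a minimal sketch of how an automated QA pass might compare a delivered annotation against a trusted reference annotation and map disagreements onto the error types above. The annotation schema, field names and IoU threshold are illustrative assumptions for this sketch, not understand.ai's actual format or tooling.

```python
# Minimal, illustrative QA comparison for 2D box annotations.
# The schema (dicts with "class", "attributes" and "box") is an assumption.

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def categorize_errors(label, reference, iou_threshold=0.7):
    """Compare one annotation against a trusted reference annotation and
    return the matching error categories from the list above (empty = OK)."""
    if label is None:
        return ["missing annotation"]
    if reference is None:
        return ["redundant annotation"]
    errors = []
    if label["class"] != reference["class"]:
        errors.append("incorrect class")
    if label["attributes"] != reference["attributes"]:
        errors.append("incorrect attribute")
    if iou(label["box"], reference["box"]) < iou_threshold:
        errors.append("incorrect annotation size or position")
    return errors


# Example: the mislabeled vehicle from the image above, expressed as data.
label = {"class": "car", "attributes": {"moving": False}, "box": (10, 10, 60, 40)}
reference = {"class": "truck", "attributes": {"moving": True}, "box": (12, 10, 62, 42)}
print(categorize_errors(label, reference))
# -> ['incorrect class', 'incorrect attribute']
```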

What happens when a Neural Network is fed mislabeled data?

All of these errors would have had a significant effect on our customers' AI model performance if they had not been picked up by our Quality Assurance team. And researchers agree. For the purposes of this blog I'll skip the various methodologies and approaches and focus on the results of relevant research that analyzes the consequences of the errors listed above.

Incorrect classes

In the literature, incorrect classes are generally referred to as class noise (Zhu and Wu, 2004) or label noise (Frenay and Verleysen, 2014). For mislabeled classes, the 2017 experiment by Fard et al. shows a clear dependence on whether the class is mislabeled in an unbiased or a biased way.

  • Unbiased mislabeling is “random” mislabeling: the class is accidentally replaced by any other class with equal likelihood.
  • Biased mislabeling happens when the annotator consistently confuses the class with the same other class, resulting in a constant replacement.

The experiment showed that (a) mislabeling in general has a negative impact on performance and (b) biased mislabeling degrades classification performance more than unbiased mislabeling. Fard et al. ran the experiment with two models, a convolutional neural network (CNN) and a multi-layer perceptron (MLP); the CNN performed better, especially under unbiased mislabeling.
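
To illustrate the difference between the two noise types, the following sketch injects both kinds of noise into a list of class labels. The class names, noise rate and confusion map are placeholders; this is not Fard et al.'s exact setup.

```python
import random

CLASSES = ["car", "truck", "pedestrian", "bicycle"]  # placeholder label set


def corrupt_labels(labels, noise_rate, biased=False, confusion=None):
    """Return a copy of `labels` with roughly `noise_rate` of them mislabeled.

    Unbiased: a corrupted label is replaced by a uniformly random other class.
    Biased:   a corrupted label is replaced according to `confusion`, e.g.
              {"truck": "car"} mimics an annotator who systematically
              confuses trucks with cars.
    """
    corrupted = []
    for y in labels:
        if random.random() < noise_rate:
            if biased:
                corrupted.append(confusion.get(y, y))
            else:
                corrupted.append(random.choice([c for c in CLASSES if c != y]))
        else:
            corrupted.append(y)
    return corrupted


# 20% unbiased noise vs. 20% biased "truck -> car" noise on the same labels
clean = ["car", "truck", "truck", "pedestrian", "car", "bicycle"]
print(corrupt_labels(clean, 0.2))
print(corrupt_labels(clean, 0.2, biased=True, confusion={"truck": "car"}))
```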

An experiment by Flatow and Penner (2017) examined mislabeling / subjective labeling and its impact on a CNN's accuracy. The results suggest a linear relationship between class noise and test accuracy, where an additional 10% of noise leads to a 4% reduction in accuracy. Further experiments in the literature found a negative impact of class noise on other machine learning algorithms as well, e.g. decision trees, support vector machines and k-nearest neighbors (kNN) (Nazari et al., 2018).

To be fair to the labelers, an incorrect class does not have to originate from mislabeling. Changes to the specification in the middle of an annotation project can lead to class name changes, too. When such a name change is not communicated well, the model might interpret the data differently, ultimately leading to worse output.
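
One way to keep older labels consistent after such a rename is a small migration pass over the already-finished annotations. The frame structure and the rename map below are hypothetical examples, not a prescribed workflow.

```python
# Hypothetical example: the spec renames the class "van" to "truck" mid-project.
RENAME_MAP = {"van": "truck"}


def apply_renames(frames, rename_map=RENAME_MAP):
    """Propagate spec-level class renames to already-labeled frames so that
    old and new annotations use consistent class names."""
    for frame in frames:
        for box in frame["boxes"]:
            box["class"] = rename_map.get(box["class"], box["class"])
    return frames
```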

Incorrect attributes

The impact of incorrect attributes - so-called feature noise or attribute noise - on a model's output was explored comprehensively by Zhu and Wu (2004). They considered attribute noise largely understudied, with too much attention being paid to class noise. They conducted a study with over 100,000 instances, 2 classes each, and attribute counts ranging from 0 to 60. Attribute noise, i.e. incorrectly or subjectively set attributes, was introduced to test its impact on classification.

Zhu and Wu’s most relevant conclusions from the experiment were:

  1. Feature noise is not as harmful as class noise, but it can still lead to severe classification problems.
  2. The higher the correlation between an attribute and the class, the greater the negative impact of noise in that attribute on the classifier.
  3. Eliminating instances that contain class noise (noise cleaning) will likely enhance classification accuracy.
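
To give a concrete picture of what feature noise means, the sketch below corrupts a fraction of the entries in a feature matrix while leaving the class labels untouched. The feature layout and noise rate are assumptions for illustration; this is not Zhu and Wu's original procedure.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_attribute_noise(X, noise_rate):
    """Randomly overwrite a fraction `noise_rate` of the entries in the
    feature matrix X with values drawn uniformly from the same column's
    empirical range, leaving the class labels untouched."""
    X_noisy = X.copy()
    n_rows, n_cols = X.shape
    mask = rng.random((n_rows, n_cols)) < noise_rate
    for col in range(n_cols):
        lo, hi = X[:, col].min(), X[:, col].max()
        X_noisy[mask[:, col], col] = rng.uniform(lo, hi, mask[:, col].sum())
    return X_noisy


# Example: 10% attribute noise on a small random feature matrix
X = rng.normal(size=(100, 6))
X_noisy = add_attribute_noise(X, 0.10)
print(np.mean(X != X_noisy))  # roughly 0.10 of the entries changed
```
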
Distorting VOC bounding boxes: ground truth on the left, small noise (0.08) in the middle and large noise (0.13) on the right. Source: VOC dataset

Missing annotation

A missing annotation for a relevant object can have different consequences in different contexts.

1. A model focusing on labels only

Such a model considers only the labeled objects in a frame. When an object is not labeled, there is simply less data available for training.

2. A model focusing on labels and the greater context

Here, not only the labeled objects are considered, but the wider context as well. Two examples:

  • A model that takes the whole frame as input also looks at the non-labeled parts to decide on true and false negatives. A missing annotation for a relevant object effectively tells the model that, e.g., a car is not a car, even though it is one.
  • Trajectory tracking: if a car is tracked across frames but is not annotated in some frames in between, the trajectory estimate becomes poorer.
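
A common way to study this effect, similar in spirit to the experiment discussed next, is to drop ground-truth boxes from the training set at a chosen missing rate and retrain the detector on the degraded data. The frame structure below is a hypothetical example.

```python
import random


def drop_annotations(frames, missing_rate, seed=0):
    """Return a copy of `frames` where each ground-truth box has been removed
    with probability `missing_rate`. Each frame is assumed to be a dict like
    {"image": ..., "boxes": [{"class": "car", "box": (x1, y1, x2, y2)}, ...]}."""
    rng = random.Random(seed)
    degraded = []
    for frame in frames:
        kept = [b for b in frame["boxes"] if rng.random() >= missing_rate]
        degraded.append({**frame, "boxes": kept})
    return degraded


# Example: drop roughly half of the boxes from a tiny training set.
frames = [{"image": "frame_000.png",
           "boxes": [{"class": "car", "box": (10, 10, 60, 40)},
                     {"class": "pedestrian", "box": (70, 20, 85, 60)}]}]
print(drop_annotations(frames, missing_rate=0.5))

# To sweep missing rates as in the experiment below, repeat for
# rates 0.0, 0.1, ..., 0.9 and record the detector's mAP for each.
```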

A paper by Xu et al. (2019) deals with the influence of missing labels on fully supervised object detection (FSOD) models. The experiment was conducted on R-CNN (Region-based CNN), Faster R-CNN, YOLO (You Only Look Once) and SSD (Single Shot Detector), as well as on a weakly supervised object detection (WSOD) model.

The results show that the performance of all FSOD methods drops significantly as the missing rate increases (see the graph below). It is worth mentioning that missing annotations had no impact on the WSOD model, although it suffers from generally inferior detection performance.

Mean average precision (mAP) for one WSOD detector and four classical FSOD detectors trained under different instance-level missing-label rates (Mr = 0, 0.1, ..., 0.9) of the training data set [Xu et al., 2019]

Changing labeling specs can also introduce new relevant objects. If the labeling of older data is already completed and relabeling it is forgotten, this too can lead to missing annotations.

Redundant annotation

I found no literature describing the direct impact of redundant annotations on object detection algorithms. Nevertheless, labeling irrelevant objects is wasted effort, and unnecessarily labeled objects are a potential source of other labeling errors such as incorrect classes or incorrect attributes.

Changing labeling specs cuts both ways: object classes can be removed as well as introduced. Objects that were initially relevant but are later dropped from the specification become irrelevant, and the time and resources spent annotating them are wasted.
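
A small cleanup pass can keep already-labeled data aligned with the current specification by dropping annotations whose class has been removed. The spec set and frame structure below are hypothetical, following the same illustrative schema as the earlier sketches.

```python
CURRENT_SPEC = {"car", "truck", "pedestrian"}  # hypothetical: "bicycle" was removed


def drop_out_of_spec(frames, spec=CURRENT_SPEC):
    """Remove annotations whose class was dropped from the labeling spec,
    so they do not remain as redundant (and error-prone) labels in training."""
    cleaned = []
    for frame in frames:
        kept = [b for b in frame["boxes"] if b["class"] in spec]
        cleaned.append({**frame, "boxes": kept})
    return cleaned
```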


Built-in labeling quality checks and Quality Assurance

To summarize the above, each of these frequent errors has a specific impact on data quality and on the model trained with that data. That's why building quality checks into automated annotation and adding a dedicated Quality Assurance step are essential in every data annotation project. Understand.ai can deliver both. Reach out to us for more details.

Steffen Enderes, Customer Success Manager at understand.ai


This blog is based on my master's thesis 'Enhancing data quality in annotation projects through improving specification handling and design - A Design Science approach', written in 2021 for the Karlsruhe Institute of Technology.

Resources:

  • [Zhu and Wu, 2004] Zhu, X. and Wu, X. (2004). Class noise vs. attribute noise: A quantitative study. Artif. Intell. Rev., 22:177–210.

  • [Frenay and Verleysen, 2014] Frenay, B. and Verleysen, M. (2014). Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869.

  • [Flatow and Penner, 2017] Flatow, D. and Penner, D. (2017). On the robustness of convnets to training on noisy labels.

  • [Nazari et al., 2018] Nazari, Z., Nazari, M., Danish, M. S. S., and Kang, D. (2018). Evaluation of class noise impact on performance of machine learning algorithms.

  • [Xu et al., 2019] Xu, M., Bai, Y., Ghanem, B., et al. (2019). Missing labels in object detection.