We all strive towards what is considered ‘the best’. We want to provide the best service to our clients, to have the best product, the best team and the best software. In the case of Deep Learning solutions for computer vision, the best software often means the most accurate model, one capable of distinguishing between all of the different classes and representing all of the intricate nuances. But how do you get such a model? Well, by having the best data and the best annotations, of course! This naturally leads us to our next question:
What is the best annotation type?
The simple answer is: the one that fits your needs. At understand.ai we work with all data annotation types in the visual domain, be it 2D or 3D, and we help our clients find the one that fits their use case. Most of the time, a simpler annotation type, such as a bounding box, is sufficient for proof-of-concept, MVP (minimum viable product) and early-stage projects. But what do you use when you want to get the most out of your data? Without a doubt, semantic segmentation.
What is Semantic Segmentation?
The name speaks for itself - a combination of the Greek ‘sēmantikos’ (significant) and ‘segmentation’, meaning division. This approach aims not only at finding all categories (also known as ‘classes’) in a data sample, but also at clearly marking the precise location of each. There are multiple subtypes of semantic segmentation, and each one results from choosing one option for each of two attributes - the dimensionality of the data and the granularity of the produced annotations.
Dimensionality refers to the number of dimensions present in the data source. An example of 2D data is a standard camera image - it has only two dimensions, height and width. 3D data extends the 2D case with an additional ‘depth’ dimension; examples include Lidar and Radar scans. When multiple consecutive 3D samples are stacked along the time ‘axis’, they create a 4D representation, much like a video built from 3D frames.
Depending on the dimensionality of the data, we use a different type of semantic segmentation to produce what are known as segmentation masks. In the 2D case, segmentation is performed in one of two ways - either pixel-based or polygon-based coloring. As pixels are the smallest unit in this representation, each one gets assigned to exactly one of the possible annotation classes. For 3D this translates to point-based segmentation, where each 3D point gets annotated. Given enough points on a single object, a segmentation mesh can be extracted.
Granularity refers to the level of precision of the produced annotations. There are two widespread categories - class-based and instance-aware segmentation. In the first case, the segmentation mask for a given class covers all regions that represent a member of the class. In the second case, a separate segmentation mask is created for each individual object of the selected class, which makes it possible to distinguish different instances (for example, to separate two different cars).
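To make the difference concrete, here is a minimal sketch in plain Python. The tiny scene, the `instance_class` lookup and the helper names are all made up for illustration and are not part of any particular annotation format. It shows that per-instance masks can always be collapsed into a class mask, while the class mask alone no longer tells the two cars apart.

```python
# Hypothetical 4x6 scene: 0 = background, 1 and 2 = two separate cars.
instance_ids = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# Assumed lookup from instance id to class name.
instance_class = {1: "car", 2: "car"}


def instance_masks(ids):
    """Instance-aware: one boolean mask per individual object."""
    found = sorted({v for row in ids for v in row if v != 0})
    return {i: [[v == i for v in row] for row in ids] for i in found}


def class_mask(ids, classes, name):
    """Class-based: a single boolean mask covering every object of a class."""
    return [[classes.get(v) == name for v in row] for row in ids]


def area(mask):
    """Number of pixels set in a boolean mask."""
    return sum(sum(row) for row in mask)


per_instance = instance_masks(instance_ids)
cars = class_mask(instance_ids, instance_class, "car")

print(len(per_instance))  # two separate car masks
print(area(cars))         # one mask covering the pixels of both cars
```

Going from `per_instance` to `cars` only needs a union of the masks; going the other way would require extra information about where one object ends and the next begins.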
Which semantic segmentation type is the most useful in machine learning?
Objectively speaking, to get the most out of semantic segmentation you should use the instance-aware subtype. Here are some of the reasons.
Extremely versatile format
Having segmentation masks of your data allows you to train and experiment with all types of machine learning models - classification, detection and localization, image generation, foreground/background separation, handwriting recognition, content change and many more. This is why it is used in many industries, such as autonomous driving, fashion, video creation and post-processing, and agriculture.
There is simply nothing more precise than segmentation masks, as they cover only the location of the actual object. Bounding boxes, on the other hand, often include neighbouring regions or intersect with neighbouring boxes. This happens when objects are inside or on top of other objects, or are non-rigid.
Two annotations in one
Despite segmentation masks being more precise, there are still many approaches that work with bounding boxes. Luckily, you can always estimate the enclosing bounding box from a segmentation mask. That’s how you get all your bases covered!
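As a minimal sketch of that conversion, assuming the mask is stored as a plain 2D grid of 0/1 values (the `bounding_box` helper and the sample mask are made up for this example), the enclosing box is simply the smallest and largest row and column that contain a mask pixel:

```python
def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max), inclusive.

    Returns None when the mask contains no object pixels.
    """
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical mask of a small irregular object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

box = bounding_box(mask)
print(box)  # (1, 1, 3, 3)
```

The reverse direction does not work: a box of 3 by 3 pixels says nothing about which of its 9 pixels actually belong to the object.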
Despite all these advantages, choosing semantic segmentation as your annotation type comes with some caveats.
The tricky part
1. It’s hard and slow to annotate by hand
Creating semantic masks by hand is a hard and tedious process. The labeler needs to follow the contours of each object precisely and faces great challenges when presented with irregular shapes or regions where the edge between objects is not clearly distinguishable (see pictures below). Without specialised tooling, annotating a single frame is prone to errors and inconsistencies, and can often take more than 30 minutes of work.
2. Fully automatic approaches are not capable of providing high quality
Wouldn’t it be great if we could just train a neural network to perform semantic segmentation once and then have all of our annotations with zero effort?
I have some bad news for you… such a neural network is yet to be discovered. There are many high-performing models out there, which show an impressive accuracy of more than 90% for some object classes. Despite these high numbers, the resulting quality is not at all sufficient for training precise models.
The reason behind this lies in the discrepancy between how we perceive quality and how accuracy is measured. A segmentation mask is created by finding the outline of an object, whereas the quality is measured as the percentage of area that was correctly recognized. As you can see from the example below, a network might segment more than 90% of the area of an object, but still have most of the outline wrong. In order to fulfill the standard quality requirement of a maximum deviation of 3 pixels (in the 2D case), a network would need to have an accuracy of approximately 99%. Because of this, segmentation masks produced by fully automatic approaches are not considered to be of a high enough quality to be used as ‘learning material’ for other networks.
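The gap between area-based accuracy and outline quality can be illustrated with a small sketch. The assumptions here are ours, not a description of any specific benchmark: area accuracy is measured as intersection over union (IoU), outline deviation as the symmetric Hausdorff distance between the two outlines, and the object is a made-up 40 by 40 pixel square whose prediction misses one 10 by 10 corner.

```python
from math import hypot


def rectangle(r0, r1, c0, c1):
    """Set of (row, col) pixels in the half-open rectangle [r0, r1) x [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def boundary(pixels):
    """Mask pixels that have at least one 4-neighbour outside the mask."""
    return {
        (r, c)
        for (r, c) in pixels
        if not {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} <= pixels
    }


def max_deviation(outline_a, outline_b):
    """Symmetric Hausdorff distance between two outlines, in pixels."""
    def directed(src, dst):
        return max(
            min(hypot(r - y, c - x) for (y, x) in dst) for (r, c) in src
        )
    return max(directed(outline_a, outline_b), directed(outline_b, outline_a))


# Hypothetical ground truth: a 40x40 square object.
truth = rectangle(10, 50, 10, 50)
# Hypothetical prediction: the same square with a 10x10 corner missing.
predicted = truth - rectangle(10, 20, 10, 20)

iou = len(truth & predicted) / len(truth | predicted)
deviation = max_deviation(boundary(truth), boundary(predicted))

print(f"IoU: {iou:.4f}")                         # IoU: 0.9375
print(f"max outline deviation: {deviation} px")  # 10.0 px
```

In this toy case almost 94% of the area is correct, yet the outline is off by 10 pixels at the missing corner, far beyond a 3 pixel tolerance. The exact numbers depend heavily on object size and shape, so this is an illustration of the effect rather than a derivation of the 99% figure.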
3. Correcting errors is time consuming
In both of the aforementioned approaches, mistakes can be costly. Correcting an imprecise segmentation mask also means correcting N other masks, where N is the number of neighbouring masks (we’ll come back to it in a bit). The correction procedure takes as much time as annotating the mask from scratch. That’s why manual correction of the output of a fully automatic segmentation is also not feasible. The only way of avoiding this problem is by having specialised annotation tooling and well-trained labelers.
4. The cost of semantic segmentation annotation
As you might have noticed, creating segmentation masks requires specialized annotators, tooling and automation. This drives up the costs significantly, usually to a multiple of what annotating simple bounding boxes costs, and ends up depleting the budget quite fast. So what can you do?
The understand.ai solution to your segmentation problem
UAI is the only tooling provider capable of annotating and exporting 2D segmentations in both pixel-based and polygon-based modes. UAI Annotator removes the need to adjust any neighbouring region when you make a correction.
We remove the complexity associated with semantic segmentation by automating all of the tricky parts. Choosing UAI as your annotation tooling provider reduces the costs associated with the annotation process and takes a lot of the set-up burden away. Contact our dedicated team of consultants and engineers and get a solution tailored to your needs!
Georgi Urumov, Deep Learning Engineer at understand.ai