Towards Software 2.0 - Building AI Enabled Products for Data Annotation

We believe that AI is the most powerful tool our generation has at hand. We make it accessible for real-world applications in two ways. On the one hand, we deliver annotated data - the food that teaches perception algorithms to recognize the world around them. On the other hand, since labeling billions of data points manually is not a feasible way forward, we apply machine learning models ourselves to automate annotation.

Neural Network powered Automation Engine

Collecting data and preparing datasets is one of the most labor-intensive parts of any ML project. We’ve built an automation engine for data annotation to automate this process, reducing costs and shortening development time. To do so, we had to get our AI production-ready. In this piece I’d like to share how we work on a day-to-day basis to get our models into production.

Working Prototypes as Fast as Possible

Efficiency requires more than experimentation speed; the model also has to work in the real world, and real-world performance comes from data. Let’s start with experimentation speed. Its importance was perhaps best put into words by Andrej Karpathy, Director of AI at Tesla:

“Because deep learning is so empirical, success in it is to a large extent proportional to raw experimental throughput - the ability to babysit a large number of experiments at once, staring at plots and tweaking/re-launching what works. This is necessary, but not sufficient.”

He nailed it - you need to run a lot of experiments, ideally at the same time and in a structured way, and then pick the ones that work and evolve them. This is how you get to production-grade AI, at least from an infrastructure and model point of view. We’ll come back to the “data part” soon, I promise.

So what does it take to make that happen efficiently?

10 Best Practices to Increase Your Experimentation Speed

  • 1. Don’t Let Hardware Limit You

It’s easy to be limited by hardware. A single experiment might take a couple of days to run. Meanwhile you have more ideas about what could work, but you’re not sure which of them will pay off. Run those experiments in parallel; otherwise you lose too much time. There will be peak hours when you need a lot of GPUs. That’s why we use cloud providers: we can scale up and down at any time, and we’ve built our whole infrastructure around that ability.

  • 2. Build Benchmarks First

Without measurement and benchmarks you won’t be able to identify which parts of your data processing pipeline are doing well and which aren’t. It’s easy to put time and effort into the wrong part of your pipeline and end up going nowhere. You have to be able to track all experiments and your progress towards the goal you’re trying to achieve.

When you run many experiments at the same time over a couple of weeks to find a solution that works, it’s easy to lose track. You don’t want to lose those learnings, especially when you onboard new people. We use reports to record our findings and to share them with other team members. Everyone can look through those reports and see: “okay we already tried this, we tried that, this was the result. Maybe we should do something else or maybe it’s worth the effort to look into that again.”
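
To make the idea of tracking experiments concrete, here is a minimal sketch of an append-only experiment log. It is not the reporting tooling described above; the record fields, the JSON Lines file and the function names are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class ExperimentRecord:
    # All field names are hypothetical; adapt them to your own benchmarks.
    name: str        # short, human-readable experiment id
    stage: str       # pipeline stage the experiment targets
    params: dict     # settings that were changed
    metric: float    # benchmark score, higher is better in this sketch
    notes: str = ""  # what was learned


def log_experiment(record: ExperimentRecord, log_path: Path) -> None:
    """Append one experiment as a JSON line so earlier results are never overwritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_experiments(log_path: Path) -> list:
    """Read the full experiment history back from disk."""
    with log_path.open(encoding="utf-8") as f:
        return [ExperimentRecord(**json.loads(line)) for line in f if line.strip()]


def best_per_stage(records) -> dict:
    """Return the best-scoring experiment for each pipeline stage."""
    best = {}
    for record in records:
        if record.stage not in best or record.metric > best[record.stage].metric:
            best[record.stage] = record
    return best
```

Because every experiment names the pipeline stage it targets, a new team member can see per stage what has already been tried and how well it worked.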

  • 3. Your Work Needs to be Reproducible

Archive model versions and the training data sets for these versions in a structured way so you can pull them out whenever you need to. We have a solution built in-house to organize our artifacts. We can reproduce our work - retrain the exact model on the same data - and work forward from that point automatically. Don’t do the same things over and over again; get organized and you’ll have everything at hand.
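
One way to tie a model version to the exact data it was trained on is a manifest of content hashes. The sketch below is an assumption about how such bookkeeping could look, not a description of our in-house artifact solution; the function names and the manifest layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_version: str, data_dir: Path, config: dict) -> dict:
    """Record the training config and a content hash for every training data file."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        "model_version": model_version,
        "config": config,
        "data_files": {p.relative_to(data_dir).as_posix(): file_sha256(p) for p in files},
    }


def manifest_id(manifest: dict) -> str:
    """Short fingerprint that changes when the data, the config or the version changes."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]
```

Storing the manifest next to the trained model makes it possible to check later whether a retraining run really used the same data and settings.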

  • 4. Data Processing Pipeline

When we work on customer projects, the problems we try to solve are more complex, so we break them down into a series of steps that are processed sequentially.

An example of a UAI data processing pipeline

That’s our data processing pipeline. To reach the highest experimentation speed possible, make sure it consists of separate building blocks that you can run individually. You need to be able to control each element of the pipeline individually, so you can monitor, debug and tweak it on its own.

  • 5. Use Snapshots to Quickly Iterate

Is your data processing pipeline complex and built out of various steps? Snapshots will help you focus on single steps without running the entire pipeline.

Example: Imagine your pipeline has several automation steps and you want to iterate on automation step #3. It takes an hour to run automation step #1 and two hours to run automation step #2. It’s really annoying to wait three hours every time you make a small change in automation step #3. The snapshot system keeps a snapshot of every single step of the pipeline. You can either run a step on the real data flowing through the pipeline, or you can run it on a snapshot when you only want to check a small tweak in a new iteration of your code.

The best thing about the snapshot approach at UAI is that we can use it locally on our computers. This saves us tons of time and really gets us up to speed.
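
The snapshot idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our actual snapshot system: it assumes each step is a plain function, that a step’s output can be pickled, and that snapshots live as one file per step in a local directory. `run_pipeline` and `start_at` are hypothetical names.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]  # (step name, step function)


def run_pipeline(steps: List[Step], data: Any, snapshot_dir: Path,
                 start_at: Optional[str] = None) -> Any:
    """Run the steps in order and store each step's output as a snapshot.

    If start_at names a step, the earlier steps are skipped and the stored
    output of the step just before it is used as input instead.
    """
    names = [name for name, _ in steps]
    first = 0
    if start_at is not None:
        first = names.index(start_at)
        if first > 0:
            # Resume from the snapshot written by the preceding step.
            with (snapshot_dir / f"{names[first - 1]}.pkl").open("rb") as f:
                data = pickle.load(f)
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for name, func in steps[first:]:
        data = func(data)
        with (snapshot_dir / f"{name}.pkl").open("wb") as f:
            pickle.dump(data, f)
    return data
```

After one full run, calling `run_pipeline(steps, None, snapshot_dir, start_at="step3")` re-executes only the third step on the stored output of the second, so a small change no longer costs the three hours of the earlier steps.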

  • 6. Validate Models in Shadow Mode

Say you’ve developed an automation for perception data annotation that feels quite good. You’ve run it on your validation sets and on your ground truth sets and you get really nice results. You still can’t be a hundred percent sure that it’s actually good. It might be good only on the data you tested it on, which does not necessarily cover all of the data that can come through such a data pipeline.

We deploy something called shadow mode here. Since we’re in the business of data annotation, we can always have a team annotating small samples manually in parallel to the automation engine. If the results are good enough we can scale down the manual team and can scale up our automation. This is how we achieve a higher automation rate while keeping the annotation quality high.
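
As a rough sketch of what such a comparison could look like, the snippet below measures how often the automation agrees with the manual team on a sample. It rests on simplifying assumptions that are not from our production setup: one 2D bounding box per item, agreement defined as an IoU at or above a threshold, and a hypothetical threshold of 0.7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shadow_agreement(auto: dict, manual: dict, iou_threshold: float = 0.7) -> float:
    """Share of manually annotated items for which the automated box matches.

    auto and manual map an item id to a Box. Items the automation missed
    count as disagreements.
    """
    if not manual:
        raise ValueError("manual sample must not be empty")
    hits = sum(
        1 for item_id, box in manual.items()
        if item_id in auto and iou(auto[item_id], box) >= iou_threshold
    )
    return hits / len(manual)
```

Tracking this agreement rate over time gives a simple signal for when the manual sample can shrink and the automation can take over a larger share.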

  • 7. Let’s Talk About Models

We usually start with out-of-the-box algorithms. We put them into our jobrunners, we make them run within our data pipeline and we find out what’s the baseline performance of our automation pipelines if we employ those models. Only then do we start to learn what’s needed to solve our customers’ problems.

We identify the limiting factor. Is it the post-processing, or maybe the tracking that isn’t working? That’s why it’s important to have benchmarking in place: you know where to focus.

Our recommendation is to get your pipeline up and running as quickly as possible to see where you have to put your effort with hyperparameter tuning.

  • 8. Use Hyperparameter Search to Tune your Model

If you’ve settled on a model and you think you have to tune it a little bit, our recommendation is to use tooling to get that part structured. It’s easy to get lost when tuning hyperparameters. We use, for example, Hyperparameter Sweeps from Weights and Biases to tune our models and make them work better.
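
To show what a structured search means in practice, here is a minimal random search written with the Python standard library only. It is a stand-in for illustration and not the Weights and Biases Sweeps API; the search space format, `sample_params` and `random_search` are hypothetical.

```python
import math
import random


def sample_params(space: dict, rng: random.Random) -> dict:
    """Draw one configuration from the search space.

    Each entry is either {"values": [...]} for a categorical choice or
    {"min": lo, "max": hi} for a positive value sampled on a log scale.
    """
    params = {}
    for name, spec in space.items():
        if "values" in spec:
            params[name] = rng.choice(spec["values"])
        else:
            low, high = math.log(spec["min"]), math.log(spec["max"])
            params[name] = math.exp(rng.uniform(low, high))
    return params


def random_search(objective, space: dict, n_trials: int, seed: int = 0):
    """Evaluate n_trials random configurations and return (best score, best params)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample_params(space, rng)
        trials.append((objective(params), params))
    return max(trials, key=lambda trial: trial[0])
```

Here `objective` would train and benchmark the model for one configuration and return the score to maximize. Fixing the seed keeps the search itself reproducible.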

Weights and Biases, Hyperparameter Sweeps
  • 9. Visualize, Visualize, Visualize

Make sure you visualize your model (pipeline) output at every single step to allow for efficient debugging. How long does it take to check the impact of a modification? Make sure you can access output easily and in a meaningful way so that you can draw conclusions. It’s hard to monitor a black box. That’s why we’ve created our own in-house visualization tooling to see what our models are producing.

Screenshot from the UAI automation engine visualization component
  • 10. The best recommendation for ML models? Don’t spend too much time on them.

I know - playing around with models, coming up with new architectures and trying out new stuff is really fun. But if you want to make it work, that’s not necessarily the smartest thing to spend most of your time on. It’s much better to get your pipeline up quickly, have your benchmarking in place to know where to put the effort, and only then start playing around with your models if it’s really required.

Machine Learning Model Enters the Real World

So far it’s been all about how to increase your experimentation speed. Let’s look into real-world performance. Real-world performance is the fixed performance threshold your models have to meet in all circumstances to be really usable.

There’s a well-known distinction between how AI is developed in academia and in real-world industrial applications. In academic research you’d typically have a fixed data set. That’s important so you can compare algorithms and architectures against each other and say that this one solves a complex problem better than the competing algorithm. But you might not be able to use the result in the real world because it’s not good enough; it’s not really helpful.

Academia vs. Industry: AI development in academia

In real-world applications, companies developing or purchasing AI do not care that much about the model. As long as it achieves the required performance, they don’t care what the underlying model is. The performance requirement is very high for most industrial applications, often aiming for 95-100%.

Fixed model performance

It’s the variable, ever-changing training data set where the performance comes from. You’ll still need a suitable algorithm, but what makes the difference is data. Data defines how well your model performs in a real-world scenario.

Academia vs. Industry: AI development for industrial applications

Software 2.0 - Data Defines Everything

This is the paradigm shift away from classical software engineering. You’re moving away from explicit code that solves a specific problem towards a general code approach. From now on, it’ll be important to maintain your dataset and make it representative of real-world problems. The real-world variety is what makes it totally different from the classical paradigm of software development. It’s called Software 2.0.

Data Quality

When we talk about data quality, we’re talking about the quality of the labels. We’ve already covered data annotation quality, labeling errors and how they impact neural networks on this blog before, so I’ll keep this part short.

[Figure: VOC dataset]

High labeling quality is not always required; it’s good to know your problem. There are a couple of use cases where you don’t need that label precision, for example scenario extraction in autonomous driving or some validation use cases.

Data Quantity

In general, we can say the more data the better. If your algorithms see more objects in a higher variety of circumstances, they become more and more robust.

Source: Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

Data Distribution

Let’s illustrate this with the example of autonomous driving. If you’re driving around with your car, most of the time you’re not ending up in critical or questionable situations. If you want to train a self-driving AI to reliably detect your environment and know what to do even in critical situations and edge cases, then normal scenarios are really not helpful.

Data distribution in autonomous driving: the vast majority of data is useless because it covers normal scenarios only. 90% of all trouble is caused by critical events and edge cases.

It’s hard to arrive at a data set which has enough critical and edge cases. Their share in the data set has to be much higher than what you’d see in reality. The data has to be balanced from the algorithm’s point of view, so that it can learn how to behave in such situations and what it learned is not constantly overwritten by standard scenarios.

You’ll need data selection tooling to help you manage your datasets, especially in complex projects with large amounts of data of high variety and diversity. A structured and process-led way to massage and curate your datasets is key.

IVS data selection tool by Intempora

An algorithm cannot learn a desired behavior if a critical situation is underrepresented in your data. The algorithm might forget it and overwrite it with normal scenarios and normal situations. Even if you showed the situation to the algorithm, but not often enough, you’ll have a problem.

I can’t stress that enough. A dataset should represent the real world, yet not in the sense of real-world data distribution, but in a way that your algorithm can learn from it.
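
One simple way to rebalance a dataset from the algorithm’s point of view is to sample training examples with a weight inverse to how often their scenario occurs. The sketch below is only one possible technique, shown under the assumption that every sample already carries a scenario label; it is not the data selection tooling mentioned above, and `balanced_sample` is a hypothetical name.

```python
import random
from collections import Counter


def balanced_sample(samples, k: int, seed: int = 0):
    """Draw k samples with replacement so that every scenario label is about equally likely.

    samples is a list of (sample_id, scenario_label) pairs. Each sample is
    weighted by 1 / (number of samples sharing its label), so a rare label
    gets the same total weight as a common one.
    """
    counts = Counter(label for _, label in samples)
    weights = [1.0 / counts[label] for _, label in samples]
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=k)
```

With 990 normal and 10 critical samples, roughly half of the drawn samples are critical ones. Because sampling is with replacement, the few critical samples are repeated often, which is why collecting more real edge cases still matters.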

Real-world Variance and Validation

Let’s take a best-practice data split in autonomous driving: 70-80% for the train set, another 10-15% for the validation set and another 10-15% for the test set. Your algorithm might still not perform the right way. What if your dataset covers only a small portion of your real-world variance?
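
For reference, such a split can be made deterministic by hashing a stable identifier, so a sample always lands in the same set even as the dataset grows. This is a minimal sketch with an assumed 80/10/10 split; `assign_split` and the use of a string ID are hypothetical, and the split says nothing about how much real-world variance the data covers.

```python
import hashlib


def assign_split(sample_id: str, train: float = 0.8, val: float = 0.1) -> str:
    """Map an ID to 'train', 'val' or 'test' based on a hash of the ID.

    The hash is turned into a number in [0, 1), so about `train` of all IDs
    fall into the train set, `val` into validation and the rest into test.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"
```

In driving data it is usually safer to hash the ID of the recording or sequence rather than the single frame, so that near-identical neighboring frames do not end up in both the train and the test set.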

[Figure: Real-world variance]

That’s where the validation effort comes in. Let’s say you’re working on something where not a lot is at stake. If your algorithm fails, it might cause an inconvenience. If financial risk, health risk or fatalities are involved, you’ll want to validate the hell out of your algorithm, as in so many use cases in autonomous driving, where you have to prove that your model does as good a job as a human driver.

Data Engine for AI Validation

Make sure that you understand the size of the problem you want to solve. It might be much bigger than expected, with more effort and money needed to make the solution work. You might need a data engine to boost your data collection in a structured and automated way.

Watch this short video where I explain how a data engine works.



Let’s Summarize

  1. Aim for experimentation speed so that you can run as many experiments as possible.
  2. Have a rock-solid data acquisition strategy to achieve your performance.

Only if those two things come together, and you’re very structured and efficient in both areas, can you quickly deploy results to production. And this makes your customers and colleagues happy.

Daniel Rödler, Director of Product at