How We Won the First Data-centric AI Competition: GoDataDriven
In this blog post, GoDataDriven, one of the winners of the Data-Centric AI Competition, describes techniques and strategies that led to victory. Participants received a fixed model architecture and a dataset of 1,500 handwritten Roman numerals. Their task was to optimize model performance solely by improving the dataset and dividing it into training and validation sets. The dataset size was capped at 10,000. You can find more details about the competition here.
1. Your personal AI journey
I have an academic background in econometrics, which is the study of statistics applied to economics. There we learned to create statistical models to address real-world problems and to criticize their performance. After my studies I was drawn into machine learning via Andrew Ng’s Coursera course. I became amazed by the predictive power of new machine learning techniques, in particular transfer learning. Nowadays I’m most excited about the techniques that allow us to criticize these powerful models. Fitting models with high predictive performance is easier every day, but my econometrics professors would weep if they knew that most people only look at accuracy or R².
I built a strong foundation in AI through my undergraduate and graduate studies at the University of Amsterdam, where my thesis focused on geometric deep learning for medical image applications. Specifically, I developed methods to improve data efficiency for early lung cancer detection. My focus continued to be on the intersection of AI and healthcare applications, where I quickly learned that data quality was often a deciding factor in the success of our machine learning models and a data-centric approach was key. Nowadays, I have a passion for spreading knowledge, which I do through my work as a data science educator at GoDataDriven, where we provide high-quality training on all things AI & data science, as well as through organising meetups and conferences for PyData Amsterdam & PyData Global.
My AI journey really took off when I joined GoDataDriven. I began learning AI during knowledge sharing sessions with my colleagues. I laid my foundation with a master’s degree in Computer Science and a PhD in Data Mining. But working on practical AI solutions at various companies, big and small, got me even more excited about this field. As a consultant, I have been helping clients take their next steps in AI. I like to approach a problem with a wide angle view to make sure to work on the most valuable aspects. Clearly, this isn’t always the model building part. That is why Data-Centric AI fits me like a glove.
2. Why did you decide to participate in the competition?
We all had our reasons to be excited about the competition, but Rens was the one to take the initiative. He had been working for a client on a chatbot project where the model was fixed and they were looking to improve the training data. In fact, we had all done data-centric work before there was a name or clear definition for it. Marysia, for example, worked on improving data quality and applying various realistic transformations for medical images, which greatly increased model performance. We were also intrigued to see how others approached the challenge. We hope that by joining the competition and sharing our experience we can help grow the Data-centric AI movement, so that in the future there will be many tools available to aid in this process.
3. The techniques you used
Tip 1: Use low-tech tools to get started together
We used Google Sheets to correct labels, because it allowed us to easily present all necessary information.
We used a big projector so we could label as a group, which sparked some great discussions. If this were a project for a client, we’d take the questions we came up with to the domain experts to improve our understanding of the problem.
Tip 2: Use embeddings to get a sense of typicality and style imbalance
First, we used a nice trick for detecting imbalance. Take a pre-trained computer vision model, take the embeddings in the last layer, and then project them to two dimensions with UMAP. The result is likely to show any imbalance in your data.
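The trick above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code we ran in the competition: ResNet-18 from torchvision stands in for "a pre-trained computer vision model", umap-learn provides the projection, and the function names (`chunked`, `embed_images`, `project_2d`) are hypothetical. The heavy libraries are imported inside the functions so the file can be loaded without them.

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def embed_images(image_paths, batch_size=64):
    """Return one embedding per image from a pre-trained ResNet-18.

    Assumption: ResNet-18 is an illustrative choice of pre-trained model.
    Requires torch, torchvision and pillow.
    """
    import torch
    from PIL import Image
    from torchvision.models import ResNet18_Weights, resnet18

    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights)
    model.fc = torch.nn.Identity()  # drop the classifier, keep the 512-d features
    model.eval()
    preprocess = weights.transforms()  # resizing and normalization for these weights

    batches = []
    with torch.no_grad():
        for paths in chunked(image_paths, batch_size):
            images = [preprocess(Image.open(p).convert("RGB")) for p in paths]
            batches.append(model(torch.stack(images)))
    return torch.cat(batches).numpy()


def project_2d(embeddings, random_state=42):
    """Project embeddings to two dimensions with UMAP (requires umap-learn)."""
    import umap

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(embeddings)
```

Calling `project_2d(embed_images(paths))` gives one 2D point per image, which can then be plotted and colored by split or by label.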
Below, we visualized these embeddings so you can explore them interactively with Altair. By dragging and moving a selection window over the plot you can see which images are underneath it.
This chart shows that the train/validation split is not balanced, since the validation set images are not distributed similarly to the train set (no purple on the left side).
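The same observation can be checked without a plot. The sketch below is a simple heuristic of our own choosing for this illustration, not part of the competition tooling: it lays a grid over the 2D projection and reports cells that contain training points but no validation points. The function names, the cell size, and the toy coordinates are all hypothetical.

```python
from collections import Counter


def cell_of(point, cell_size):
    """Map a 2D point to the integer grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def uncovered_cells(train_points, val_points, cell_size=1.0, min_train=1):
    """Return grid cells with at least min_train training points and no validation points."""
    train_counts = Counter(cell_of(p, cell_size) for p in train_points)
    val_cells = {cell_of(p, cell_size) for p in val_points}
    return sorted(
        cell
        for cell, count in train_counts.items()
        if count >= min_train and cell not in val_cells
    )


# Toy projection: the validation set has nothing on the left side.
train = [(-2.5, 0.2), (-2.1, 0.7), (0.3, 0.4), (1.2, 0.1)]
val = [(0.6, 0.9), (1.8, 0.3)]
gaps = uncovered_cells(train, val)
```

Any cell returned here is a region of the embedding space where the validation set tells you nothing about model performance, so it is a candidate for re-splitting.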
Another way to use these embeddings is to find different Roman numerals that are drawn in the same style. It can make a pretty plot, but it also helps with figuring out which images you should focus on when applying basic data augmentation tricks such as mirroring, rotation, zooming, cropping, or blurring.
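As a small illustration of one of these tricks, mirroring can be restricted to classes whose label survives the flip. This is a minimal sketch: images are plain 2D lists of pixel values, and the `FLIP_SAFE` allow-lists are our assumption for the example, so adjust them after looking at your own data.

```python
# Assumption: classes that keep their label under each flip.
# These allow-lists are illustrative only.
FLIP_SAFE = {
    "left_right": {"i", "ii", "iii", "v", "x"},
    "top_down": {"i", "ii", "iii", "x"},
}


def flip_left_right(image):
    """Mirror a 2D pixel grid around its vertical axis."""
    return [row[::-1] for row in image]


def flip_top_down(image):
    """Mirror a 2D pixel grid around its horizontal axis."""
    return [row[:] for row in image[::-1]]


def mirrored_variants(image, label):
    """Return the flipped copies of image that should still carry label."""
    variants = []
    if label in FLIP_SAFE["left_right"]:
        variants.append(flip_left_right(image))
    if label in FLIP_SAFE["top_down"]:
        variants.append(flip_top_down(image))
    return variants
```

A class that appears in neither allow-list gets no mirrored copies, which avoids generating images whose label no longer matches what is drawn.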
Tip 3: Use streamlit for data augmentation
We found that left-right and top-down mirroring served quite well as an augmentation technique. But not all Roman numerals are suited to this approach (we’re looking at you VII, VIII and IX). For that we came up with a simple trick: we created a streamlit app with the streamlit-drawable-canvas plugin that lets you easily augment images one after another.
The strategy we thought of was as follows:
1. Take an image of a Roman 8: VIII
2. Erase the V, and save the III
3. Next erase the rightmost I, and save the II
4. Next erase the rightmost I, and save the I
5. Use your saved images of V, I, and II to create synthetic VI and VII
We’re not hosting this app, but you can find the snippet to make this work on your own machine here.
What’s nice about this strategy is that you can create new data quite quickly. This is most interesting for those images that your previous model found quite hard (e.g. those VIIIs that were misclassified as VIIs). In a sense you’re creating counterfactual data that helps your model see the difference between the two classes.
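The last step of the strategy, combining saved pieces into a new numeral, can be sketched in a few lines. This is an illustration rather than the snippet from our app: images are plain 2D lists of grayscale values, 255 is assumed to be the background color, and `pad_to_height` and `compose` are hypothetical helper names.

```python
WHITE = 255  # assumed background value for grayscale images


def pad_to_height(image, height, fill=WHITE):
    """Vertically center a 2D pixel grid on a taller blank canvas."""
    width = len(image[0])
    extra = height - len(image)
    top = extra // 2
    bottom = extra - top
    blank = [fill] * width
    return (
        [blank[:] for _ in range(top)]
        + [row[:] for row in image]
        + [blank[:] for _ in range(bottom)]
    )


def compose(parts, gap=1, fill=WHITE):
    """Place glyph crops side by side, separated by gap blank columns."""
    height = max(len(part) for part in parts)
    padded = [pad_to_height(part, height, fill) for part in parts]
    rows = []
    for y in range(height):
        row = []
        for index, part in enumerate(padded):
            if index > 0:
                row.extend([fill] * gap)
            row.extend(part[y])
        rows.append(row)
    return rows


# Tiny stand-ins for saved crops (0 = ink, 255 = background).
V = [[0, 255, 0], [0, 255, 0], [255, 0, 255]]
I = [[0], [0], [0]]
II = [[0, 255, 0], [0, 255, 0], [0, 255, 0]]

synthetic_vi = compose([V, I])
synthetic_vii = compose([V, II])
```

With real crops you would likely also want to vary the gap and add a little jitter, so that the synthetic numerals do not all share the same spacing.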
4. Your advice for other learners
Communication is key. All people involved in getting model predictions to users need to communicate about the problem in order to improve the end result. Data-Centric AI shows how to do this through the language of data. Data-Centric AI gets data scientists out of the ivory tower (or basement) where they iterate on model hyperparameters in isolation. It starts conversations: What are we seeing here? Why is this labeled that way? Why would the model make these mistakes? These are questions you don’t answer alone, but together with domain experts.
For good data scientists this is nothing new, but we’re happy to see that there is now a label for this perspective: “Data-Centric AI”.