How I Won the First Data-centric AI Competition: Johnson Kuan

In this blog post, Johnson Kuan, one of the winners of the Data-Centric AI Competition, describes the techniques and strategies that led to victory. Participants received a fixed model architecture and a dataset of 1,500 handwritten Roman numerals. Their task was to optimize model performance solely by improving the dataset and dividing it into training and validation sets. The dataset size was capped at 10,000 images. You can find more details about the competition here.

Your personal journey in AI

My journey in AI began in 2014 when I took an introductory Machine Learning (ML) course from the California Institute of Technology (Caltech) delivered online through edX. The course was called “Learning From Data” and it was taught by Professor Yaser Abu-Mostafa. After the first few lectures and homework assignments I was hooked! The topics combined my passion for mathematics and computer science into something that seemed magical at the time: teaching machines to learn from data.

After taking my first ML course, I began reading as many books as I could on the topic (e.g. Tom Mitchell’s “Machine Learning”) and working on side projects to build ML models with free open datasets published online. After several months of self-study, I decided to pursue formal education in ML and was accepted to Georgia Tech’s MS Computer Science program (specializing in ML) where I took graduate courses in AI/ML, robotics, computer vision, and big data analytics. I was fortunate to learn from luminaries in the field such as Sebastian Thrun and get exposure to the many use-cases of AI/ML in various industries.

In my career, I first worked on AI/ML in AT&T’s marketing organization where I built ML models to predict subscriber behaviors. I was fortunate to have met leaders at AT&T who gave me new opportunities to apply what I learned at Georgia Tech. After AT&T, I joined an AI startup called Whip Media where I led the Data Science team to develop new models that transformed the global content licensing entertainment ecosystem through proprietary data and predictive insights. Fast forward to today, I am at DIRECTV leading the implementation of MLOps to accelerate the development and deployment of AI/ML models.

As I reflect on my AI journey thus far, there is a common thread: I keep learning new topics in the field and I encourage others to do the same. It’s an exciting time to be working on AI because there’s so much to learn every day and so much innovation left to be done.

Why you decided to participate in the competition

I decided to participate in the competition because I wanted to learn the Data-Centric AI approach with a hands-on project (learn-by-doing).

I stumbled on this Data-Centric AI movement by chance when browsing YouTube earlier this year. I found a video by Andrew Ng going over the shift from model-centric to data-centric approaches. Andrew shared the success he’s had applying these principles at Landing AI, and the ideas really resonated with me. This represented a paradigm shift in the way most AI practitioners build AI systems and I had a feeling that this shift would significantly accelerate the pace at which AI is adopted, allowing it to more quickly transform industries.

The techniques you used

I developed a new technique I’m calling “Data Boosting,” described in more detail in my blog post on Medium: https://medium.com/@johnson.h.kuan/how-i-won-andrew-ngs-very-first-data-centric-ai-competition-e02001268bda?sk=e239d8bc2a219a269e8939a0d14f6290

Before getting into the crux of my solution, the first thing I did was follow the common practice of fixing labels and removing bad data.

To streamline this workflow, I wrote a Python program to evaluate a given dataset (after feeding it into the fixed model and training procedure) and generate a spreadsheet with logged metrics about each image.

This spreadsheet contains the given label, the predicted label (using the fixed model), and the loss for each image, all of which help isolate inaccuracies and edge cases. Example below.

I initially used this spreadsheet to identify images that were incorrectly labeled and images that were clearly not Roman numerals from 1–10 (e.g. there was a heart image in the original training set).
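
To make the idea concrete, here is a minimal sketch of how such a per-image evaluation sheet could be built. It is not the program used in the competition. It assumes the fixed model's class probabilities are already available for each image, and the function names, column names, and CSV output format are all hypothetical choices for illustration.

```python
import csv
import math

# Hypothetical column layout for the evaluation sheet.
FIELDS = ["image", "given_label", "predicted_label", "loss", "mismatch"]


def evaluate_rows(filenames, given_labels, probabilities, class_names):
    """Return one row per image, sorted so the highest-loss images come first."""
    rows = []
    for name, given, probs in zip(filenames, given_labels, probabilities):
        # Predicted label = class with the highest probability.
        best = max(range(len(probs)), key=lambda i: probs[i])
        predicted = class_names[best]
        # Cross-entropy loss with respect to the given label (clipped to avoid log(0)).
        p_given = probs[class_names.index(given)]
        loss = -math.log(max(p_given, 1e-12))
        rows.append({
            "image": name,
            "given_label": given,
            "predicted_label": predicted,
            "loss": round(loss, 4),
            "mismatch": predicted != given,
        })
    rows.sort(key=lambda r: r["loss"], reverse=True)
    return rows


def write_csv(rows, handle):
    """Write the rows to any file-like object as CSV (opens in a spreadsheet tool)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Sorting by loss puts the most suspicious images at the top of the sheet, which is where mislabeled examples and non-numeral images tend to surface.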

Now onto the “Data Boosting” technique. Below are the high level steps:

1. I generated a very large set of randomly augmented images from the training data (treating these as “candidates” to source from).
2. I trained an initial model and predicted on the validation set.
3. I used another pre-trained model to extract features (aka embeddings) from the validation images and augmented images.
4. For each misclassified validation image, I retrieved the nearest neighbors (based on cosine similarity) from the set of augmented images using the extracted features. I added these nearest neighbor augmented images to the training set. I call this procedure “Data Boosting”.
5. I retrained the model with the added augmented images and predicted on the validation set.
6. I repeated steps 4–5 until reaching the limit of 10K images.
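
The neighbor-retrieval part of the procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the competition code: it uses an exact brute-force cosine similarity search in plain Python rather than an approximate index, embeddings are plain lists of floats, and the names `nearest_candidates`, `data_boosting_round`, `k`, and `budget` are hypothetical. Here `budget` stands for how many candidate images may still be added before the dataset cap is hit. Retraining and re-predicting between rounds depend on the fixed model and are left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_candidates(query, candidate_features, k, exclude=()):
    """Indices of the k candidates most similar to `query`, skipping `exclude`."""
    scored = [
        (cosine_similarity(query, feat), idx)
        for idx, feat in enumerate(candidate_features)
        if idx not in exclude
    ]
    # Highest similarity first; ties broken by lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def data_boosting_round(misclassified_features, candidate_features, selected, k, budget):
    """One round: add up to k unused nearest candidates per misclassified image.

    `selected` holds candidate indices already in the training set. The
    function returns the updated set and stops once `budget` is reached.
    """
    selected = set(selected)
    for query in misclassified_features:
        for idx in nearest_candidates(query, candidate_features, k, exclude=selected):
            if len(selected) >= budget:
                return selected
            selected.add(idx)
    return selected
```

In a full loop, each call would be followed by retraining on the enlarged training set and recomputing which validation images are still misclassified, so later rounds focus on the errors that remain.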

See the diagram below of this iterative procedure:

A few things to note on the procedure above:

Although I used augmented images for this competition, in practice we can use any large set of images as candidates to source from.
I generated ~1M randomly augmented images from the training set as candidates to source from.
I used the data evaluation spreadsheet to keep track of misclassified images and to annotate the data. I also spun up an instance of Label Studio with a PostgreSQL backend but decided not to use it due to the unnecessary overhead.
For the pre-trained model, I used ResNet50 trained on ImageNet.
I used the Annoy package to perform an approximate nearest neighbor search.
The number of nearest neighbors to retrieve per misclassified validation image is a hyper-parameter.
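
A short note on how a cosine-similarity search maps onto Annoy. The post does not say which Annoy metric was used; the derivation below assumes the "angular" metric, which Annoy documents as the Euclidean distance between normalized vectors. The symbols $u$, $v$ (two embeddings) and $\hat{u}$, $\hat{v}$ (their unit-length versions) are introduced here for illustration.

```latex
\hat{u} = \frac{u}{\lVert u \rVert}, \qquad
\hat{v} = \frac{v}{\lVert v \rVert}, \qquad
\cos(u, v) = \hat{u} \cdot \hat{v}

\lVert \hat{u} - \hat{v} \rVert^{2}
  = \lVert \hat{u} \rVert^{2} + \lVert \hat{v} \rVert^{2} - 2\,\hat{u} \cdot \hat{v}
  = 2 - 2\cos(u, v)

d_{\text{angular}}(u, v) = \lVert \hat{u} - \hat{v} \rVert = \sqrt{2\,\bigl(1 - \cos(u, v)\bigr)}
```

Because this distance decreases strictly as the cosine similarity increases, the neighbors with the smallest angular distance are the ones with the highest cosine similarity, so both give the same ranking (up to the approximation error of the index).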

One cool thing about extracting features from images is that we can visualize them in 2D with UMAP to better understand the feature space of the training and validation sets. In the visualization below, we can see that the distribution of the given training data does not match that of the given validation data. There’s also a region of the feature space in the bottom left corner where we don’t have validation images. This suggests that there is an opportunity here to experiment with reshuffling the training and validation data splits before running the “Data Boosting” procedure above.

To give some context, my approach was primarily motivated and inspired by two things:

Andrej Karpathy’s talk in 2019 where he describes how the large amounts of data collected from Tesla’s fleet can be efficiently sourced and labeled to address inaccuracies which are often edge cases (long tail of the distribution).
I wanted to develop a data-centric boosting algorithm (analogous to gradient boosting) where inaccuracies in the model predictions are iteratively addressed in each step by automatically sourcing data that are similar to those inaccuracies. This is why I called the approach “data boosting”.

Your advice for other learners

My advice to other learners in AI is to learn by doing and to be persistent. More often than not, the first few attempts at applying new knowledge will not be successful. However, I believe with persistence and determination, anyone can learn AI and apply their skills to build something great.