How I Won the First Data-centric AI Competition: Divakar Roy

In this blog post, Divakar Roy, one of the winners of the Data-Centric AI Competition, describes techniques and strategies that led to victory. Participants received a fixed model architecture and a dataset of 1,500 handwritten Roman numerals. Their task was to optimize model performance solely by improving the dataset and dividing it into training and validation sets. The dataset size was capped at 10,000. You can find more details about the competition here.

1. Your personal journey in AI

At our organization, I was working on foot feature detection using traditional image processing methods. In 2018, we started using Google’s AutoML for feature-detection and classification tasks, which quickly exposed me to the importance of data quality. After a while, I moved on to classification tasks with a high-level API, which gave me some key insights into hyperparameter tweaking. In 2020, I took Coursera courses that helped on the math side with the basics of deep learning, MLOps, and a hands-on approach. I’ve been building up my knowledge base with ML projects ever since. As part of that learning, I now run real-time multi-class classifiers and object detectors on mobile devices!

2. Why you decided to participate in the competition

I had a good background in computer vision, Python programming, and the data-engineering side of things, but wasn’t sure of my place in this vast AI/ML domain. Looking at the data and rules of this competition, I felt it was right up my alley. It would also give me time to explore different ideas in computer vision while learning about hyperparameter tweaking, and a chance to study data in more detail. I felt these skills would be critical for success in computer vision.

Also, data-centric AI is still a nascent field and lacks proper tools and well-defined workflow setups. So it meant venturing into unknown territory and coming up with new ideas.

3. The techniques you used

For the competition, we were given a set of around 2.9K images of Roman numerals 1–10. The data included a basic level of labelling and a fixed ResNet model. Our task was to enhance the dataset, up to a maximum of 10K images, to maximize the model’s accuracy on a hidden test set.

Here are a few examples from the dataset:

I set up different algorithms to separate images according to their noise levels. Below are some sample cases with different types of noise:

After removing the noise, we were left with two sets of images: clean ones containing only the foreground (letters), and noisy ones containing only the background (noise). We then cropped the clean images down to their letter regions. The clean images, before and after cropping, were fed to the UMAP clustering method to study their patterns, as shown below:
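The cropping step can be sketched as a bounding-box crop over thresholded foreground pixels — a minimal stand-in, assuming grayscale images with dark strokes on a light background; the threshold value here is a guess, not the author’s actual parameter:

```python
import numpy as np

def crop_to_letters(img, thresh=128):
    """Crop a grayscale image (light background, dark strokes) to the
    bounding box of its foreground pixels. `thresh` is a hypothetical
    cutoff separating ink from paper."""
    mask = img < thresh                       # dark pixels = letter strokes
    if not mask.any():
        return img                            # blank image: nothing to crop
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```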

Clusters are colored by their true labels. As can be seen, the clustering follows the native label patterns more closely with cropping than without, and also yields more precise decision boundaries between labels. We then processed these cropped images to prepare them for augmentation.

3.1 Data augmentation methods

The first augmentation stage included camera distortion methods that allowed us to generate unique shapes by mapping a numeral’s regular, 2D grid onto a grid that is skewed.
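As a rough illustration of the idea (not the author’s actual implementation), a regular grid can be remapped with a per-row sinusoidal offset; the amplitude and frequency parameters here are made up:

```python
import numpy as np

def grid_distort(img, amp=3.0, freq=0.05):
    """Resample a grayscale image so its regular pixel grid maps onto a
    skewed one: each row is shifted horizontally by a sine of its y
    position. A crude stand-in for the camera-distortion stage."""
    h, w = img.shape
    ys, xs = np.indices((h, w))
    shift = amp * np.sin(2 * np.pi * freq * ys)            # per-row offset
    src_x = np.clip(np.rint(xs + shift).astype(int), 0, w - 1)
    return img[ys, src_x]
```

Varying `amp` and `freq` produces the different skew “configurations”.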

Here are a few different configuration results:

Some other data augmentation techniques we used included scaling and shearing the numerals, and adjusting the gray boundaries around the numerals to simulate brush strokes with varied pressures. 

These augmentation processes are applied in combination to each image, controlled by a random seed. A main function calls the augmentation function iteratively with a different seed for each image. The seed values are sampled from a uniform distribution, which ensures variety across all of the augmentation methods in the entire output dataset.
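The seeding scheme might look like the sketch below, where `augment` stands in for the real distortion/scale/shear/stroke stages — the specific ops and ranges are placeholders of mine, not the author’s:

```python
import numpy as np

def augment(img, seed):
    """Apply a random combination of augmentations, fully determined by
    `seed` so every output is reproducible. The two ops below are
    placeholders for the real distortion/scale/shear/stroke stages."""
    rng = np.random.default_rng(seed)
    out = img.astype(np.float32)
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)     # gray-level jitter
    dy, dx = rng.integers(-2, 3, size=2)                   # small random shift
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    return out.astype(np.uint8)

def augment_dataset(images, n_out, master_seed=0):
    """Main loop: draw one seed per output image from a uniform
    distribution, then cycle through the source images."""
    rng = np.random.default_rng(master_seed)
    seeds = rng.integers(0, 2**31, size=n_out)
    return [augment(images[i % len(images)], int(s))
            for i, s in enumerate(seeds)]
```

Because each output is a pure function of its seed, any individual augmented image can be regenerated later for inspection.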

3.2 Overlay onto canvas

After augmentations, we needed to overlay our cropped numerals back onto canvases. We had two workflows:

  1. Use their original backgrounds, which ensures their natural feel, especially since the image sizes vary a lot.
  2. Use backgrounds extracted from other images. One way to do this is to resize them to the background sizes used in the original images. This also allows more augmentations: rotating the background images 90 degrees both clockwise and counterclockwise, and flipping them vertically and horizontally. The downside is that this approach requires heavy cleanup afterwards.

We went with the second workflow for the heavy-noise cases and the first one for the rest. The second workflow allowed N^2 augmentations, which was needed to balance the heavy-noise cases across all 10 numerals. Here’s a schematic of two heavily noisy images and their noise templates, mixed and matched with their clean-image counterparts:
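Conceptually, the mix-and-match stage pairs every clean foreground with every extracted background, giving N^2 composites. A minimal sketch, with a hypothetical ink threshold and same-size grayscale arrays assumed:

```python
import numpy as np

def overlay(foreground, background, thresh=128):
    """Paste the dark strokes of `foreground` onto `background`.
    Both are same-shape uint8 grayscale arrays; `thresh` is a
    hypothetical cutoff for what counts as ink."""
    out = background.copy()
    mask = foreground < thresh
    out[mask] = foreground[mask]
    return out

def mix_and_match(foregrounds, backgrounds):
    """N foregrounds x M backgrounds -> N*M composites."""
    return [overlay(f, b) for f in foregrounds for b in backgrounds]
```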

3.3 Data quality assessment and cleaning up

A general framework was set up to study image sets based on their shape features. We used an ImageNet-pretrained ResNet-152 model as a feature extractor to retrieve a feature array for a given image set. This is fed to t-SNE for dimensionality reduction, yielding a two-dimensional array that could be used with various workflows. In our case, this was set up on the cropped images, so the features were representative of the shapes. More features could be configured, especially based on noise levels, etc. Data points in the t-SNE output could then be selected or removed based on minimum or maximum distance parameters.
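Given such a feature array (in the competition it came from ResNet-152; any N x D array works for this sketch), the projection and a minimum-distance filter could look like the following, assuming scikit-learn is available — the greedy filter is my illustration, not necessarily the author’s exact selection rule:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, perplexity=5, random_state=0):
    """Reduce an N x D feature array to N x 2 with t-SNE."""
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state).fit_transform(features)

def drop_near_duplicates(pts_2d, min_dist):
    """Greedily keep points whose nearest already-kept neighbour is at
    least `min_dist` away — one way to realize a minimum-distance
    selection on the embedding, removing near-duplicate shapes."""
    kept_idx, kept_pts = [], []
    for i, p in enumerate(pts_2d):
        if all(np.linalg.norm(p - q) >= min_dist for q in kept_pts):
            kept_idx.append(i)
            kept_pts.append(p)
    return kept_idx
```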

Data cleanup was done at the following levels –

– Manual inspection to correct labels in the raw data.

– Removal of similarly shaped cases among the data-augmented images from section 3.1.

– Removal of ambiguous cases, via manual inspection, among the noise-overlaid images from section 3.2.

The general framework was also used to set up the validation set, to ensure variety in it.
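One way to pick a varied validation set from the 2-D embedding is greedy farthest-point sampling — my illustration of the idea, not necessarily the author’s exact method:

```python
import numpy as np

def farthest_point_sample(pts_2d, k, start=0):
    """Pick k indices that are maximally spread out in the embedding:
    start somewhere, then repeatedly add the point farthest from
    everything chosen so far."""
    chosen = [start]
    dists = np.linalg.norm(pts_2d - pts_2d[start], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        chosen.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(pts_2d - pts_2d[nxt], axis=1))
    return chosen
```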

The entire workflow, from start (raw data) to finish (good-quality data), can be summarized schematically:

3.4 Transfer strategies

So, how do we use these strategies on a general computer vision problem? Let me throw in some ideas.

– To crop into the objects in context, semantic segmentation or even bounding box detection could help.

– For data-augmentation, GANs could be used, depending on the labels and their availability.

– After segmenting out objects, leftover backgrounds with “holes” could be filled back with image inpainting or something similar.

4. Your advice for other learners

The domain of AI can feel really big to early learners. My suggestion would be to explore as many areas in it as possible and find the one that feels like a good fit for you. Bonus tip – find an ultrawide monitor for this kind of computer-vision data work and these competitions, as it helps a lot!