In this blog post, Mohammad Motamedi, one of the winners of the Data-Centric AI Competition, describes techniques and strategies that led to victory. Participants received a fixed model architecture and a dataset of 1,500 handwritten Roman numerals. Their task was to optimize model performance solely by improving the dataset and dividing it into training and validation sets. The dataset size was capped at 10,000. You can find more details about the competition here.
I am a Senior Software Engineer at NVIDIA, where I optimize deep learning libraries and computationally complex state-of-the-art deep learning models. I have focused on deep learning since 2014, when I started my Ph.D. at the University of California, Davis. At that time, in the absence of deep learning frameworks and modern GPUs, meeting the computational demands of deep neural networks was more challenging. As a result, I started my journey by looking for ways to alleviate the hardware constraints associated with AI [1-3]. While in graduate school, I interned at a startup working on self-driving cars and became familiar with object detection in 3D point clouds. During this internship, I realized how difficult it is to get labelers to draw consistent bounding cubes around objects of interest in a point cloud. We developed a tool and used techniques such as sensor fusion, auto-calibration, point synthesis, and mappings between lidar and camera coordinates to improve the consistency of data labels. I had never thought about this topic in a data-centric fashion until last May, when I watched Andrew Ng’s talk on Data-Centric AI (DCAI) [4]. Subsequently, in early August, I learned about the DCAI competition and saw it as an exciting opportunity to learn more and get involved in this emerging field.
For my submission to the first DCAI competition, I used a data optimization pipeline that examines the dataset from several angles, identifies invalid or mislabeled instances, and suggests how to address them. The pipeline steps are summarized below, and the corresponding paper can be found on arXiv [5].
Step 1 – Duplicate Detection and Elimination: The first step of the pipeline investigates the dataset to discover identical samples that may or may not have the same name. I used a multi-stage hashing approach to ensure that such an investigation has minimal execution time, even if the dataset includes millions of records. The benefits of this step are twofold: it eliminates redundancy and ensures that the training and validation sets have no overlap.
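As a rough illustration of what a multi-stage hashing pass can look like, the sketch below narrows candidates in three stages: file size, a hash of the first few kilobytes, and a hash of the full contents. The stages, the 4 KiB prefix, the use of SHA-256, and the names find_duplicates and _digest are assumptions made for this example; the post does not specify which stages the actual pipeline used.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path, limit=None):
    """SHA-256 of the first `limit` bytes of a file, or of the whole file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if limit is not None:
            h.update(f.read(limit))
        else:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of paths whose contents are byte-identical."""
    # Stage 1: bucket by file size (files of different sizes cannot match).
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Stage 2: hash only a short prefix to discard most non-matches cheaply.
        by_prefix = defaultdict(list)
        for path in same_size:
            by_prefix[_digest(path, limit=4096)].append(path)
        for candidates in by_prefix.values():
            if len(candidates) < 2:
                continue
            # Stage 3: hash the full contents of the remaining candidates.
            by_full = defaultdict(list)
            for path in candidates:
                by_full[_digest(path)].append(path)
            groups.extend(sorted(g) for g in by_full.values() if len(g) > 1)
    return groups
```

Because file names are never compared, identical images stored under different names end up in the same group. Keeping one file per group removes the redundancy and prevents the same image from landing in both the training and validation sets.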
Step 2 – Training Auxiliary Models: In this step, we randomly selected a small number of samples and manually verified that they were valid and correctly labeled. We then trained a classifier on this data. Because the dataset was small, heavy data augmentation was essential. Appropriate augmentation techniques vary with the data and the target application. In the case of Roman numerals, we used 180 samples for training and 20 samples for validation. For data augmentation, we used random rotation with a factor of 0.05, random contrast change with a factor of 0.5, random translation with a range of 20% of the image dimensions, random pen-stroke-like black spots, random white spots, and finally, random dashed lines with various lengths and orientations. The models trained in this step are auxiliary neural networks that facilitate dataset optimization; they are no longer required after step 3.
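The spot and dashed-line augmentations are less standard than rotation, contrast, or translation, so a small sketch may help. The NumPy code below is only one possible implementation: the function names, spot counts, radii, and dash lengths are assumptions chosen for illustration, and it expects a 2D uint8 grayscale image that is at least a few pixels wide. Rotation, contrast, and translation are left out because most deep learning frameworks provide them as built-in transforms.

```python
import numpy as np


def add_spots(image, rng, value, count=3, max_radius=3):
    """Paint `count` random filled discs of intensity `value` on a copy of `image`."""
    out = image.copy()
    height, width = out.shape
    yy, xx = np.ogrid[:height, :width]
    for _ in range(count):
        cy, cx = rng.integers(0, height), rng.integers(0, width)
        radius = rng.integers(1, max_radius + 1)
        out[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = value
    return out


def add_dashed_line(image, rng, value=0, dash=4, gap=3):
    """Draw one dashed line with random start, length, and orientation."""
    out = image.copy()
    height, width = out.shape
    y0, x0 = rng.integers(0, height), rng.integers(0, width)
    angle = rng.uniform(0, np.pi)
    length = rng.integers(dash, max(height, width))
    for step in range(length):
        if step % (dash + gap) >= dash:
            continue  # this step falls inside a gap between dashes
        y = int(round(y0 + step * np.sin(angle)))
        x = int(round(x0 + step * np.cos(angle)))
        if 0 <= y < height and 0 <= x < width:
            out[y, x] = value
    return out


def augment(image, rng):
    """Apply black spots, white spots, and one dashed line, in that order."""
    image = add_spots(image, rng, value=0)    # pen-stroke-like black spots
    image = add_spots(image, rng, value=255)  # white spots
    return add_dashed_line(image, rng)
```

A generator created with np.random.default_rng(seed) can be passed as rng so that augmented batches are reproducible.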
Step 3 – Dataset Investigation: We used the model prepared in the previous step to run inference on the rest of the dataset and sorted the samples by their loss values. The first K samples, those with the smallest loss values, were presumed to be valid and correctly labeled according to the model’s assessment, and a human supervisor confirmed this. We then added these K samples to the training set for use in the next iteration. In addition to the first K samples, we also investigated the last L samples, which were perceived to be invalid, had wrong labels, or were simply hard to classify. As with the first K samples, a human supervisor checked the outputs and the new labels suggested by the model and made adjustments as needed. These samples, if not removed, were also added to the training set. A new model was then trained as described in step 2, and the process alternated between these two steps. In the case of Roman numerals, because the results were submitted to a competition, we repeated the process until every sample had been certified by a human.
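The selection at the heart of this step can be sketched in a few lines. The function below assumes the per-sample losses have already been computed by the auxiliary model and are available as a dictionary; the name select_for_review and its parameters are hypothetical and not taken from the paper.

```python
def select_for_review(losses, num_clean, num_suspicious):
    """Pick the lowest-loss and highest-loss samples for human review.

    `losses` maps a sample id to the loss assigned by the auxiliary model.
    Returns (likely_clean, suspicious), both ordered by ascending loss.
    """
    ranked = sorted(losses, key=losses.get)  # ascending loss
    num_clean = min(num_clean, len(ranked))
    # Never let the two groups overlap when the pool is small.
    num_suspicious = min(num_suspicious, len(ranked) - num_clean)
    likely_clean = ranked[:num_clean]
    suspicious = ranked[len(ranked) - num_suspicious:]
    return likely_clean, suspicious
```

Here num_clean and num_suspicious play the roles of K and L. After the human reviewer confirms or corrects both groups, the accepted samples join the training set, the auxiliary model is retrained, and the function is called again on whatever remains unreviewed.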
Step 4 – Class Imbalance Resolution: In this step, excess samples from each class are moved to a surplus dataset so that all classes have the same number of samples. As we augment the data, these surplus samples have the highest priority for moving back into the training set.
Step 5 – N-fold Cross-Validation: After class imbalance resolution, the test set is selected randomly. The rest of the data is divided into N folds. For each fold, the training set and the validation set are selected randomly. For the Roman numerals, N equals 8.
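One simple way to realize this step is to trim every class to the size of the smallest one and keep the remainder aside. The sketch below does that under the assumption that samples are (sample_id, label) pairs and that the excess samples are picked at random; the function name balance_classes and the random choice are illustrative, not details confirmed by the post.

```python
import random
from collections import defaultdict


def balance_classes(samples, seed=0):
    """Trim every class to the size of the smallest class.

    `samples` is a non-empty list of (sample_id, label) pairs.
    Returns (balanced, surplus), each a list of (sample_id, label) pairs.
    """
    by_label = defaultdict(list)
    for sample_id, label in samples:
        by_label[label].append(sample_id)

    target = min(len(ids) for ids in by_label.values())
    rng = random.Random(seed)
    balanced, surplus = [], []
    for label, ids in by_label.items():
        rng.shuffle(ids)  # choose the excess samples at random
        balanced += [(i, label) for i in ids[:target]]
        surplus += [(i, label) for i in ids[target:]]
    return balanced, surplus
```

The surplus list is kept rather than discarded, so its samples can be moved back into the training set first when the dataset is later grown through augmentation.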
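The split can be sketched as a random test hold-out followed by a shuffled N-fold partition of the remaining samples. The code below reads the step as conventional N-fold cross-validation, where each part serves as the validation set once; that reading, the 10% test fraction, and the name split_folds are assumptions made for this example.

```python
import random


def split_folds(sample_ids, n_folds=8, test_fraction=0.1, seed=0):
    """Hold out a random test set, then build n_folds (train, validation) splits."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)

    n_test = int(len(ids) * test_fraction)
    test, rest = ids[:n_test], ids[n_test:]

    # Deal the shuffled remainder into n_folds nearly equal parts.
    parts = [rest[i::n_folds] for i in range(n_folds)]
    folds = []
    for i, validation in enumerate(parts):
        # Every part except the i-th one goes into the training set.
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        folds.append((train, validation))
    return test, folds
```

With the default n_folds=8, each remaining sample appears in exactly one validation set and in the training set of the other seven folds, while the test set stays untouched throughout.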
In the past decade, the research community has put forth numerous approaches for improving the accuracy of deep learning models. As a result, most domains now have several effective models with comparable accuracy. At this stage, focusing on innovative approaches to improving data quality appears to be the next step in further enhancing the performance of deep learning across domains. In my view, practicing with the Roman-MNIST dataset [6] is a great starting point for those who want to learn about Data-Centric Artificial Intelligence.
[1] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. “Hardware-oriented approximation of convolutional neural networks.” arXiv preprint arXiv:1604.03168 (2016).
[2] Mohammad Motamedi, Daniel Fong, and Soheil Ghiasi. “Machine intelligence on resource-constrained IoT devices: The case of thread granularity optimization for CNN inference.” ACM Transactions on Embedded Computing Systems (TECS) 16.5s (2017): 1-19.
[3] Mohammad Motamedi, et al. “Design space exploration of FPGA-based deep convolutional neural networks.” 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2016.
[4] Andrew Ng. “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI.” Available at https://www.youtube.com/watch?v=06-AZXmwHjo.
[5] Mohammad Motamedi, Nikolay Sakharnykh, and Tim Kaldewey. “A Data-Centric Approach for Training Deep Neural Networks with Less Data.” arXiv preprint arXiv:2110.03613 (2021).
[6] Andrew Ng, Lynn He, and Dillon Laird. “Data-Centric AI Competition.” 2021. Available at https://https-deeplearning-ai.github.io/data-centric-comp/