Benchmarks have been a significant driver of research progress in machine learning. But they've driven progress in model architecture, not in approaches to building datasets, even though how a dataset is built can have a large impact on performance in practical applications. Could a new type of benchmark spur progress in data-centric AI development?
Remember: AI System = Code (model/algorithm) + Data
Most benchmarks provide a fixed set of Data and invite researchers to iterate on the Code. This makes it possible to compare algorithms: By running many models on the same dataset, we can find the ones that perform best. To spur innovation on data-centric AI approaches, perhaps it’s time to hold the Code fixed and invite researchers to improve the Data.
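To make the idea concrete, here is a minimal sketch of what holding the Code fixed might look like. Everything in it is an assumption for illustration: the toy nearest-centroid model, the function names, and the tiny one-dimensional datasets are hypothetical and not taken from any real benchmark. The model and the test set never change; only the training data submitted by a participant does.

```python
from collections import defaultdict

# Hypothetical fixed "Code": a nearest-centroid classifier on 1-D features.
# Participants may not change this; they only submit training data.
def train_fixed_model(train):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, label in train:
        sums[label] += x
        counts[label] += 1
    # One centroid (mean feature value) per label.
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    # Pick the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def score_submission(train, test):
    """Train the fixed model on a submitted dataset and return test accuracy."""
    centroids = train_fixed_model(train)
    correct = sum(predict(centroids, x) == y for x, y in test)
    return correct / len(test)

# Held-out test set, fixed by the (hypothetical) benchmark organizers.
TEST = [(0.0, "a"), (2.0, "a"), (4.0, "a"), (6.0, "b"), (8.0, "b"), (10.0, "b")]

# Two made-up submissions: the same inputs, before and after fixing two labels.
noisy = [(0.5, "a"), (1.5, "a"), (9.5, "a"), (10.0, "a"), (8.5, "b"), (9.0, "b")]
cleaned = [(0.5, "a"), (1.5, "a"), (9.5, "b"), (10.0, "b"), (8.5, "b"), (9.0, "b")]

for name, data in [("noisy", noisy), ("cleaned", cleaned)]:
    print(f"{name}: accuracy = {score_submission(data, TEST):.2f}")
# noisy: accuracy = 0.83
# cleaned: accuracy = 1.00
```

In this toy setup, the only difference between the two scores comes from the data, which is the comparison a data-centric benchmark would aim to make possible.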
A huge amount of innovation — in algorithms, ideas, principles, and tools — is needed to make data-centric AI development efficient and effective.
When AI was shifting toward deep learning over a decade ago, I didn’t foresee how many thousands of innovations and research papers would be needed to flesh out core tenets of the field. But now I think an equally large amount of work lies ahead to support a data-centric approach. For example, we need to develop good ways to:
- Surface and address inconsistencies in data labels
- Detect and address data drift and concept drift
- Help developers with error analysis
- Select and apply the most effective data augmentation techniques
- Decide what additional data to collect (rather than collecting more of everything)
- Merge inconsistent data sources
- Track data provenance and lineage, so we can address problems in the data, such as bias, that may be discovered later
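As one small illustration of the first item in the list above, a tool for surfacing label inconsistencies could start by grouping near-identical inputs and flagging those that received conflicting labels. This is only a sketch under stated assumptions: the function name, the whitespace-and-case normalization rule, and the example data are all hypothetical, and real tools would likely need fuzzier matching.

```python
from collections import Counter, defaultdict

def find_label_conflicts(examples):
    """Return inputs that appear with more than one distinct label.

    `examples` is an iterable of (text, label) pairs. Inputs are compared
    after lowercasing and collapsing whitespace (an assumed, simplistic rule).
    """
    labels_by_input = defaultdict(Counter)
    for text, label in examples:
        key = " ".join(text.lower().split())
        labels_by_input[key][label] += 1
    # Keep only inputs whose labels disagree.
    return {
        key: dict(counts)
        for key, counts in labels_by_input.items()
        if len(counts) > 1
    }

# Made-up labeled data: the first two rows are the same input, labeled differently.
examples = [
    ("Great product!", "positive"),
    ("great  product!", "negative"),
    ("Arrived late", "negative"),
    ("Arrived late", "negative"),
]

conflicts = find_label_conflicts(examples)
print(conflicts)
# {'great product!': {'positive': 1, 'negative': 1}}
```

The flagged inputs could then be sent back to labelers for review, rather than relabeling the whole dataset.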
Benchmarks and competitions in which teams are asked to improve the data rather than the code would better reflect the workloads of many practical applications. I hope that such benchmarks will also spur research and help engineers gain experience working on data. The human-computer interaction (HCI) community also has a role in designing user interfaces that help developers and subject-matter experts work efficiently with data.
I asked on social media (Twitter, LinkedIn, Facebook) for feedback on the idea of a data-centric competition. I’ve read all the responses so far. Thanks to all who replied. If you have thoughts on this, please join the discussion there.