We just wrapped up the Data-Centric AI Workshop at the NeurIPS 2021 conference. It was packed with information about how to engineer data for AI systems. I wish the whole DeepLearning.AI community could have been there! I expect the videos to be available before long and will let you know when they’re online
Over the course of an eight-hour session, authors presented 100 papers via two-minute lightning talks and posters. Eight invited speakers described a variety of data-centric AI issues and techniques, and expert panels answered questions from the audience.
These were some of my key takeaways:
- There’s a lot going on in data-centric AI — even more than I realized. I was also surprised by the variety of ideas presented on how to measure, engineer, and improve data. Several participants expressed variations on, “I’ve been tuning the data by myself for a long time, and it’s great to finally find a like-minded and supportive community to discuss it with.”
- Many diverse applications are using data-centric AI in areas including chatbots, content moderation, healthcare, document scanning, finance, materials science, speech, and underwater imaging. They take advantage of clever techniques for spotting incorrect labels, crowdsourcing, generating data, managing technical debt, managing data pipelines, benchmarking, and more.
- An immense amount of innovation and research lies ahead. We’re working collectively to coalesce broadly useful data-centric principles and tools. But, given the richness of the problems that remain open, it will take many years and thousands of research papers to flesh out this field.
Among the invited speakers:
- Anima Anandkumar showed sophisticated synthetic data techniques.
- Michael Bernstein shared tips for making crowdsourcing much more effective.
- Douwe Kiela demonstrated DynaBench as a tool for creating new data-centric benchmarks.
- Peter Mattson and Praveen Paritosh described efforts to benchmark data including a plan by MLCommons to continue developing projects like DataPerf.
- Curtis Northcutt described the CleanLab system, which made it possible to find many labeling errors in the test sets of widely used datasets like MNIST and ImageNet.
- Alex Ratner described a programmatic approach to Data-Centric AI.
- Olga Russakovsky presented a tool for de-biasing large datasets.
- D. Scully discussed the role of data-centric AI in addressing technical debt in machine learning systems.
Thanks to everyone who participated in the workshop or submitted a paper; to the presenters, panelists, invited speakers, and poster presenters; and to the reviewers, volunteers, and co-organizers who put the program together.
I was struck by the energy, momentum, and camaraderie I felt among the participants. I came away more excited than ever to keep pushing forward the data-centric AI movement, and I remain convinced that this field will help everyone build more effective and fairer AI systems.
Keep engineering your data!