How much data do you need to collect for a new machine learning project? If you’re working in a domain you’re familiar with, you may have a sense based on experience or from the literature. But when you’re working on a novel application, it’s hard to tell. In this circumstance, I find it useful to ask not how much data to collect but how much time to spend collecting data.
For instance, I’ve worked on automatic speech recognition, so I have a sense of how much data is needed to build this kind of system: 100 hours for a rudimentary one, 1,000 hours for a basic one, 10,000 hours for a very good one, and perhaps 100,000-plus hours for an absolutely cutting-edge system. But if you were to give me a new application to work on, I might find it difficult to guess whether we need 10 or 10,000 examples.
When starting a project, it's useful to flip the question around. Instead of asking:
How many days will we need to collect m training examples?
I ask:
How many training examples can we collect in d days?
Taking a data-centric approach to model development, let’s say it takes about two days to train a model and two days to perform error analysis and decide what additional data to collect (or how to tweak the model). How many days should you spend collecting data before training and error analysis? Allocating comparable amounts of time to each step seems reasonable, so I would advocate budgeting a couple of days — a week at most — for data collection. Then iterate through the loop.
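To make the arithmetic concrete, here is a minimal Python sketch of the time budget described above. It assumes every pass through the loop costs the same fixed number of days, which is a simplification. The function name `loops_in_budget` and the 60-day total budget are hypothetical, chosen only for illustration. The two-day training and two-day error analysis figures come from the text.

```python
def loops_in_budget(total_days, collect_days, train_days=2, analysis_days=2):
    """Count the complete collect -> train -> analyze loops that fit in total_days.

    Assumes each loop costs the same number of days (a simplification).
    """
    loop_days = collect_days + train_days + analysis_days
    return total_days // loop_days


if __name__ == "__main__":
    budget = 60  # hypothetical total project budget, in days
    for collect_days in (2, 7, 56):
        n = loops_in_budget(budget, collect_days)
        print(f"{collect_days} days collecting per loop -> {n} loop(s) in {budget} days")
    # 2 days collecting per loop -> 10 loop(s) in 60 days
    # 7 days collecting per loop -> 5 loop(s) in 60 days
    # 56 days collecting per loop -> 1 loop(s) in 60 days
```

Under these assumptions, a two-day collection window leaves room for ten rounds of error analysis in the same period that a long up-front collection effort would allow only one.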
I’ve seen many teams spend far too much time collecting data before jumping into the model development loop. I’ve rarely seen a team spend too little time. If you don’t collect enough data the first time around, usually there’s time to collect more, and your efforts will be more focused because they’ll be guided by error analysis.
When I tell a team, “Let’s spend two days collecting data,” the time limit often spurs creativity and invention of scrappy ways to acquire or synthesize data. This is much better than spending two months collecting data only to realize that we weren’t collecting the right data (say, the microphone we used was too noisy, leading to high Bayes/irreducible error).
So, next time you face an unfamiliar machine learning problem, get into the model iteration loop as quickly as possible, and set a limited period of time for collecting data, at least the first time around. You’re likely to build a better model in less time.
P.S. I once created an unnecessary scramble when I asked a team to make sure that data collection took no longer than two days. Because of a bad Zoom connection, they thought I said “today.” Now I've learned to hold up two fingers whenever I say “two days” on a video call.