Training a machine learning model on a different model’s output can be a useful technique, but it also raises engineering, business, and legal questions. When is it okay?
Engineering recipes for training learning algorithms on generated data are still being developed. When I led a large automatic speech recognition (ASR) team, there were rumors — that we never proved or disproved — that a competitor was using our system to generate transcripts to train a competing system. It was said that, rather than using our ASR system’s output directly as labeled training data, our competitor used a lightweight process to manually clean up errors and make sure the data was high-quality.
Lately, I’ve seen many developers experiment with use cases such as prompting a large model (say, 175B parameters) to generate high-quality outputs specialized to an application such as customer support, and using this data to fine-tune a smaller model (say, ~10B parameters) that costs less per inference. UC Berkeley trained Koala using data from ShareGPT, and Stanford trained Alpaca by fine-tuning Meta’s LLaMA on data generated with assistance from OpenAI’s text-davinci-003.
Such recipes raise important business questions. You may have spent a lot of effort to collect a large labeled training set, yet a competitor can use your model’s output to gain a leg up. This possibility argues that, contrary to conventional tech-business wisdom, data doesn’t always make your business more defensible. Specifically, if a market leader spent significant resources to get its performance up to a certain level, and if the market leader’s product generates data that makes it cheaper for competitors to catch up, then the market leader’s initial effort spent gathering data is a weak defense against competitors.
- Are terms that restrict competitor’s access to your services enforceable in light of antitrust and fair-use laws?
(To state the obvious, I am not a lawyer. Don’t construe anything I say as legal advice!)
In the era of generative AI, we’ll see many creative use cases for intentionally using one model to generate data to train another. This is an exciting technical trend, even as we keep in mind the need to move forward in ways that are legal and fair.
P.S. On Friday, April 7, Yann LeCun and I will hold a live online discussion about a proposed six-month pause in cutting-edge AI research. The proposal raises questions about AI’s future and, if implemented, would have a huge impact on developers and businesses. Please join us.