Last week, I described trends that AI Fund, the venture studio I lead, has seen in building AI startups. I'd like to discuss another aspect of building companies that’s unique to AI businesses: the controversial topic of data moats.
A company has a data moat if its access to data makes it difficult for competitors to enter its business. Moat is a common business term used evoke the water-filled moats built around castles to make them easier to defend against attackers. For example, if a self-driving car company can acquire far more data than its competitors to train and test its system, and if this data makes a material difference in the system’s performance, then its business will be more defensible.
For a few years, some investors asked every AI startup’s founders about its data moat, as if they expected everyone to build one. But, like many things in AI, it depends. A data moat can provide protection, but its effectiveness varies depending on the specific circumstances of the business.
For instance, a data moat may not do much to protect an AI business if:
- System performance plateaus with more data. Say you're building a general-purpose speech recognizer, and human-level performance is 95 percent accurate. Collecting enough data to achieve 94 percent accuracy is hard, and getting incrementally more data will have diminishing returns. In fact, it’s much easier for a competitor to improve from 90 to 91 percent accuracy than for you to improve from 94 to 95 percent.
- Data doesn’t change over time. If the mapping from input x to output y remains the same (as in speech recognition, where the input spoken words “The Batch” will continue to map to their text equivalents for a long time), competitors will have time to accumulate data and catch up.
- The application can be built with a smaller dataset thanks to new data-centric AI development technologies, including the ability to generate synthetic data, and tools that systematically improve data quality.
In contrast, data can make an AI business more defensible if:
- Performance keeps improving materially within the range of dataset size that a company and its competitors can reasonably amass. For example, web searches form a very long tail of rare queries, which make up a large fraction of all searches. Thus, performance keeps improving for a long time as a search engine gets more clickstream data, and a dominant search engine can stay ahead of smaller outfits that try to bootstrap with little data. Generally, larger datasets tend to confer a longer-lasting benefit on applications where a large fraction of relevant data makes up a long tail of rare or hard-to-classify events.
- The data distribution varies significantly over time. In this case, access to an ongoing stream of fresh data is critical for keeping the machine learning model current, which in turn earns further access to the data stream. I believe this is one of the factors that makes social media companies especially defensible. The topics posted change regularly, and the ability to keep the system up-to-date helps increase its appeal relative to new competitors.
- The market has winner-take-all dynamics, and users have low switching costs. When a market supports only one leader, access to data that delivers even marginally better performance can be a major advantage. For instance, a ride-sharing company whose data pipeline enables passengers to reach rider destinations faster is likely to attract the most riders.
- Access to customer data significantly increases switching costs, reduces churn, or increases the ability to upsell. This is especially true if customers would have a hard time exporting or even making sense of their own data if they were to patronize a competitor.
Data strategy is important for AI companies, and thinking through how a system’s performance varies with the amount of data, the importance of fresh data, and other factors described above can help you decide how much having data adds to a business’ defensibility. Sometimes a data moat doesn't help at all. But in other cases, it's one pillar (hopefully among many) that makes it harder for competitors to catch up.