Many AI systems have been built using data scraped from the internet. Indeed, even the cornerstone dataset for computer vision research, ImageNet, was built using images taken from the public internet. With the rise of data-centric AI, access to good data continues to grow in importance to developers.
What are the limits for scraping and using public data? Earlier this year, a United States court ruled that scraping data from websites that don’t take measures to hide it from public view doesn’t violate a law designed to thwart hackers. I believe this is a positive step for AI as well as competition on the internet, and I hope it will lead to further clarity about what is and isn’t allowed.
Many companies aim to create so-called walled gardens in which they provide exclusive access to content — even though it may be visible to all — such as social media posts or user résumés (the data at the heart of the ruling). But such data is valuable to other companies as well. For example, while LinkedIn helps users display their résumés to professional contacts, other companies might use this data to recruit potential employees, predict whether employees are likely to leave their current positions (updating a résumé is a sign), or find sales leads. Scraping the web was important in the early days of the internet to make web search viable, but as new uses come up — such as using machine learning to generate novel insights — clear rules about which data can and can’t be used, and how, become even more important.
This isn’t a simple matter. There is a fine line between protecting copyright, which incentivizes businesses to create data in the first place, and making data widely available, which enables others to derive value from it. In addition, freely available data can be abused. For example, some face recognition companies have been especially aggressive in scraping face portraits, building systems that invade privacy.
The U.S. court found that scraping data that is publicly accessible doesn’t violate the Computer Fraud and Abuse Act. This is not the same as allowing unfettered access to web scrapers. Data held behind a login wall or accessible only after agreeing to restrictive terms of service may be a different matter. (Disclaimer: Please don’t construe anything I say as legal advice.)
While this ruling may hurt companies that have built businesses on data that is fully visible to the public, overall I view it as a positive step. It will increase the free flow of information and make it easier for teams to innovate in AI and beyond. Also, knocking down part of the wall that surrounds walled gardens should increase competition on the internet. On the other hand, it increases the incentives to put data behind a login wall, where it’s no longer publicly accessible.
The issues of open versus closed data aren’t new. With the rise of mobile apps over a decade ago, web search companies worried that data would be locked within mobile apps rather than accessible on the web. This is one reason why Google invested in the Android mobile operating system as a counterweight to Apple’s iOS. Although ideas about which data should be accessible continue to shift, I still believe that a more open internet will benefit more people. With the rise of AI, algorithms as well as people are hungry to see this data, making it even more important to ensure relatively free access.