Technologies that fundamentally change society only come around once every decade or so. The Internet was one. Artificial Intelligence (A.I.) is the next. A.I. has the potential to improve lives and reshape industries from healthcare to finance–but A.I. can only be as good as the quality of data it’s trained on.
The extensive growth of text, images, videos, and audio available on the public web has fueled the rise of A.I. models by providing a constantly expanding source of information. This is why researchers predict that A.I., already a $137 billion industry, will grow more than 37% each year this decade.
For instance, Meta recently released LLaMA, “a collection of foundation language models” that aims to democratize access to A.I. research. “We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively,” the Facebook parent said.
However, even as it touts the importance of publicly available data to A.I., Meta is simultaneously pursuing litigation to close access to public web data that it acknowledges it does not own.
If Big Tech is allowed to build a walled garden around data that’s present in the public domain (meaning data that isn’t behind a login), it will prevent A.I. from reaching its full potential.
Looking ahead, the volume of data and information created, captured, copied, and consumed worldwide is expected to reach 120 zettabytes this year–nearly triple what it was in just 2019.
If publicly available web data is stripped from the public and held only by the most powerful companies, the ability of A.I. to advance in a way that benefits society will be severely limited. If only a few companies develop cutting-edge A.I., its development will not be aligned with humanity’s best interests.
Publicly available data is not only the lifeblood of emerging artificial intelligence tools, but it’s also essential for current business operations. Companies and nonprofits alike rely on publicly available web data to efficiently and effectively carry out their missions, with 94% using it on a daily basis, according to a survey of 150 IT, technology, and data analytics experts from U.S. retail, technology, and nonprofit organizations. In this survey, nearly four out of five respondents stated they would be unable to operate effectively without access to public web data.
The potential for A.I. to be used for social good is equally exciting. For example, through our pro bono program, The Bright Initiative, we assist nonprofit, academic and charitable organizations, helping them tackle serious social problems such as antisemitism, hate speech, and human trafficking.
More broadly, developers must have access to the datasets they need to ethically train A.I. By providing a vast amount of diverse and up-to-date information, public web data can be used to train machine learning models, improve accuracy, and ensure A.I. is aligned with humanity’s goals.
Or Lenchner is the CEO of Bright Data, a web data platform dedicated to maintaining transparent access to public web data for all.
The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.