Venture capitalists are heavily investing in AI startups, many of which have foundational models built using data protected by U.S. copyright laws. This has led to a spate of lawsuits from copyright holders, including major music companies and media outlets. Lawmakers are taking notice.

AI and Copyright Law

In January, the Senate held a hearing to discuss AI’s impact on journalism. The hearing hinged on whether tech giants should compensate media outlets for using their work to train LLMs. Additionally, 17 heavy-hitting authors, including Game of Thrones creator George R.R. Martin and NYT Best-Seller List regulars Jodi Picoult, John Grisham, and Jonathan Franzen, joined the Authors Guild in filing a class action lawsuit against OpenAI.

Are AI models infringing copyright?

At the forefront of these very public debates are the fundamental issues of fair use, copyright infringement, and fair compensation.

The critical question is whether parties can legally sue for copyright infringement when an AI model reads and digests protected data.

If lawmakers decide that the answer is “yes,” then the price tag for acquiring training data, or for retroactively reimbursing rights holders for data that was already ingested, will quickly become prohibitive.

But despite the debate, venture capitalists seem confident that ingestion for training purposes falls firmly under fair use, and one would assume (which is a famously dangerous habit) that the largest players have thoroughly researched the legal risks.

AI Needs Good Data, and Good Data Is in Short Supply

Models are only as valuable as the data they’re trained on — poor foundational data will result in limited, erroneous, and biased models. And good data, it turns out, is in short supply.

A now-famous research paper predicts that we could run out of high-quality text data to train AI models by 2026. In response, tech companies have gotten increasingly creative. For example, Meta was reportedly considering buying the venerated 100-year-old publishing institution Simon & Schuster (!!) in order to acquire high-quality texts for model training.

It has always been dangerous to build your house on someone else's land. In the early days of influencer marketing, there was risk for any company that built its business around a connection to a social platform. If you were pulling data from Instagram, there was the eternal threat that Instagram could “shut off the pipes” and your whole business would be wiped out. For AI companies, the pipes may not be shut down, but the data flowing through them could become so expensive that they may as well be. Or, to extend the metaphor, the cost of retroactively paying for ingested data could be enough to put an insurmountable lien on the houses they’ve built.

This metaphor is already playing out in the social sphere with Reddit’s 2023 announcement of a premium API tier for AI scraping. The “largest companies in the world,” as Reddit refers to them, will easily be able to pay to play in these spaces — but newer, smaller players may not be.

The Future of AI Belongs to Data Owners

While we wait for courts and lawmakers to settle these questions, organizations with their own existing data moats will have an insurmountable head start and long-term protection in the market, by virtue of avoiding this conundrum altogether. Proprietary, first-party data is on track to become solid gold.