TL;DR
A Thorsten Meyer AI report argues that data has become the AI industry’s hardest chokepoint as public web text nears saturation and licensing costs rise. The confirmed developments include projections from Epoch AI, major copyright litigation and settlements, and growing competition for proprietary, expert and sovereign datasets.
AI companies are facing a new constraint in 2026: the most valuable training data is no longer freely available at scale, according to a Thorsten Meyer AI report that frames proprietary, licensed, expert and sovereign datasets as the industry’s next major chokepoint.
The report says the AI sector has largely used the easiest supply of public web text and is now moving toward data that is harder to obtain: paywalled content, enterprise records, expert-authored material, real-world operational data and military or state-held datasets. Unlike compute, which can be rented, such data is scarce because it is owned by specific companies, institutions, governments or individuals.
Epoch AI is cited as estimating that the public internet contains roughly 300 trillion tokens of high-quality text. Its projection says the stock of public human text could be fully used between 2026 and 2032, with a median estimate around 2028. Those figures are projections, not settled facts, and depend on model size, training methods and how much repeated training can extract from existing corpora.
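The sensitivity of such a projection to its inputs can be shown with a toy calculation. The sketch below is not Epoch AI's methodology. It assumes the largest training run's token demand grows by a fixed factor each year, and that repeated passes over the same text stretch the usable stock. Only the 300 trillion token stock comes from the figures cited above. The function name, the starting dataset size, the growth factor and the epoch counts are all hypothetical inputs chosen for illustration.

```python
import math


def exhaustion_year(stock_tokens, start_year, start_tokens, yearly_growth, epochs=1.0):
    """Toy estimate of the first year a training run reaches the text stock.

    stock_tokens  : total usable tokens (300e12 is the figure cited above)
    start_tokens  : tokens used by the largest run in start_year (hypothetical)
    yearly_growth : factor by which that demand grows per year (hypothetical)
    epochs        : how many passes over the same text still add value (hypothetical)
    """
    if start_tokens <= 0 or yearly_growth <= 1.0:
        raise ValueError("start_tokens must be positive and yearly_growth above 1")
    effective_stock = stock_tokens * epochs
    if start_tokens >= effective_stock:
        return start_year
    # Solve start_tokens * growth**years >= effective_stock for whole years.
    years = math.log(effective_stock / start_tokens) / math.log(yearly_growth)
    return start_year + math.ceil(years)


# Illustrative inputs only: a 15-trillion-token run in 2024, doubling every year.
single_pass = exhaustion_year(300e12, 2024, 15e12, 2.0)
four_passes = exhaustion_year(300e12, 2024, 15e12, 2.0, epochs=4.0)
print(single_pass, four_passes)
```

Under these made-up inputs, allowing four useful passes over the same text pushes the crossover back by two years. This is one way to see why the projected range is wide and why it depends on training methods.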
The report also points to legal and commercial shifts. It cites Anthropic’s reported $1.5 billion authors settlement over pirated books, while noting that the settlement addressed past piracy claims rather than future model training or model outputs. The New York Times case against OpenAI is described as still in discovery, while some publishers have moved toward licensing deals instead of lawsuits.
Data: The One Thing You Can’t Rent
The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.
Data was supposed to be the abundant input. It is the scarce one. It is also the chokepoint you can actually own, so guard your proprietary data and do not hand it to a provider who can become your competitor (the lesson customers learned when they left Scale). For nations, the advice is to license data the way Ukraine does: keep the model, keep the leverage.
Data Access Shapes AI Power
The shift matters because it changes what gives AI companies an advantage. If chips, models and cloud capacity become easier for rivals to obtain, exclusive or higher-quality data may become a stronger competitive barrier.
For startups, the report argues, a paid licensing market can raise entry costs. Large companies may be better able to absorb billion-dollar settlements, negotiate publisher deals and acquire data-rich suppliers. Smaller developers may need to rely more on open datasets, synthetic data, narrow partnerships or customer-owned information.
For businesses outside the AI industry, the message is direct: proprietary data can become a strategic asset. The report warns that companies should be cautious about sending unique internal data to vendors that could later use similar capabilities to compete with them. That is an interpretation from the report, not a confirmed outcome across the market.

Public Text Is Near Limits
The AI boom was built in part on large-scale web scraping, open datasets and digitized books. The Thorsten Meyer AI report says that phase is changing because the highest-quality public text has already been heavily mined by frontier labs.
Elon Musk is cited as saying in early 2025 that the cumulative sum of human knowledge had been largely exhausted for AI training. The report treats that as a blunt industry claim rather than a precise measurement. It also says synthetic data has become a common response, citing Nvidia’s reported $320 million acquisition of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens.
But the report adds a caveat: synthetic data can increase risks when answers are hard to verify, because errors may be repeated or amplified across model generations. That concern, it says, raises the value of fresh, verified human-made data, especially in fields such as law, medicine, science and defense.
“You can rent compute. You can lease power. You cannot rent data that no one else has.”
— Thorsten Meyer AI report

Open Questions On Data Scarcity
It is not yet clear how quickly public training data will become exhausted in practice. Epoch AI’s range extends from 2026 to 2032, and future efficiency gains could change how much value labs can extract from existing text.
The legal picture also remains unsettled. The Anthropic settlement, as described in the report, does not resolve all questions about future training, model outputs or other pending cases. Court rulings and licensing norms may still reshape what developers can use and what they must pay for.
It is also unclear how much synthetic data can replace human-created material without harming model quality in high-stakes domains. The report cites model-collapse risks, but the scale of that risk varies by field, verification method and training design.

Licensing Deals Set The Pace
The next phase will likely be shaped by court decisions, publisher and enterprise licensing contracts, and acquisitions of companies that control valuable datasets. The report cites Meta’s reported $14.3 billion deal for a 49% stake in Scale AI as one example of how data-related assets are being valued.
Governments may also play a larger role where data is tied to national security, public services or battlefield operations. The report points to Ukraine’s reported condition around wartime AI data: keep the model and keep the leverage. More such arrangements could turn data governance into a strategic policy issue, not only a business concern.
Key Questions
What is the main development in this report?
The report says AI competition is shifting from access to rented compute toward access to scarce datasets, including licensed media, enterprise records, expert material and sovereign data.
Is public web data already exhausted?
No. The report cites Epoch AI projections that high-quality public text could be fully used between 2026 and 2032, with a median estimate around 2028. That remains a projection.
Why does copyright litigation matter for AI data?
Copyright cases and settlements can determine whether AI developers must pay for training material, destroy improperly obtained files or change sourcing practices. The report says this is helping create a paid licensing market.
Can synthetic data solve the shortage?
Synthetic data can help, and major AI companies are already using it. The report says it may not fully replace fresh human-made data, especially where errors are hard to detect.
What should companies take from this?
The report’s practical warning is that proprietary data may be a business asset. Companies may need clearer rules for how vendors can use internal data supplied to AI systems.
Source: Thorsten Meyer AI