TL;DR

A Thorsten Meyer AI report argues that data has become the AI industry’s hardest chokepoint as public web text nears saturation and licensing costs rise. The confirmed developments include projections from Epoch AI, major copyright litigation and settlements, and growing competition for proprietary, expert and sovereign datasets.

AI companies are facing a new constraint in 2026: the most valuable training data is no longer freely available at scale, according to a Thorsten Meyer AI report that frames proprietary, licensed, expert and sovereign datasets as the industry’s next major chokepoint.

The report says the AI sector has largely used the easiest supply of public web text and is now moving toward data that is harder to obtain: paywalled content, enterprise records, expert-authored material, real-world operational data and military or state-held datasets. Unlike compute, which can be rented, such data is scarce because it is owned by specific companies, institutions, governments or individuals.

Epoch AI is cited as estimating that the public internet contains roughly 300 trillion tokens of high-quality text. Its projection says the stock of public human text could be fully used between 2026 and 2032, with a median estimate around 2028. Those figures are projections, not settled facts, and depend on model size, training methods and how much repeated training can extract from existing corpora.

The report also points to legal and commercial shifts. It cites Anthropic’s reported $1.5 billion authors settlement over pirated books, while noting that the settlement addressed past piracy claims rather than future model training or model outputs. The New York Times case against OpenAI is described as still in discovery, while some publishers have moved toward licensing deals instead of lawsuits.

AI Dispatch · The Control Series · Part 3
Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

Scarcity and value rise from the bottom tier to the top:

Sovereign / real-world: Avengers combat data · FSD · ISR (can’t be bought)
Expert-authored: PhDs, lawyers, surgeons define “good” (the new gold)
Licensed content: paywalled, deal-only, now priced (fenced)
Public web text: scraped for free, exhausting ~2028 (commoditizing)

Key figures:

~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement; scraping era ends
$14.3B: Meta for 49% of Scale; triggered an exodus
“Keep the model”: Ukraine’s condition; data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.
thorstenmeyerai.com · 03 / 06

Data Access Shapes AI Power

The shift matters because it changes what gives AI companies an advantage. If chips, models and cloud capacity become easier for rivals to obtain, exclusive or higher-quality data may become a stronger competitive barrier.

For startups, the report argues, a paid licensing market can raise entry costs. Large companies may be better able to absorb billion-dollar settlements, negotiate publisher deals and acquire data-rich suppliers. Smaller developers may need to rely more on open datasets, synthetic data, narrow partnerships or customer-owned information.

For businesses outside the AI industry, the message is direct: proprietary data can become a strategic asset. The report warns that companies should be cautious about sending unique internal data to vendors that could later use similar capabilities to compete with them. That is an interpretation from the report, not a confirmed outcome across the market.

Public Text Is Near Limits

The AI boom was built in part on large-scale web scraping, open datasets and digitized books. The Thorsten Meyer AI report says that phase is changing because the highest-quality public text has already been heavily mined by frontier labs.

Elon Musk is cited as saying in early 2025 that the cumulative sum of human knowledge had largely been exhausted as AI training material. The report treats that as a blunt industry claim rather than a precise measurement. It also says synthetic data has become a common response, citing Nvidia’s reported $320 million acquisition of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens.

But the report adds a caveat: synthetic data can increase risks when answers are hard to verify, because errors may be repeated or amplified across model generations. That concern, it says, raises the value of fresh, verified human-made data, especially in fields such as law, medicine, science and defense.

“You can rent compute. You can lease power. You cannot rent data that no one else has.”

— Thorsten Meyer AI report

Open Questions On Data Scarcity

It is not yet clear how quickly public training data will become exhausted in practice. Epoch AI’s range extends from 2026 to 2032, and future efficiency gains could change how much value labs can extract from existing text.

The legal picture also remains unsettled. The Anthropic settlement, as described in the report, does not resolve all questions about future training, model outputs or other pending cases. Court rulings and licensing norms may still reshape what developers can use and what they must pay for.

It is also unclear how much synthetic data can replace human-created material without harming model quality in high-stakes domains. The report cites model-collapse risks, but the scale of that risk varies by field, verification method and training design.

Licensing Deals Set The Pace

The next phase will likely be shaped by court decisions, publisher and enterprise licensing contracts, and acquisitions of companies that control valuable datasets. The report cites Meta’s reported $14.3 billion deal for a 49% stake in Scale AI as one example of how data-related assets are being valued.

Governments may also play a larger role where data is tied to national security, public services or battlefield operations. The report points to Ukraine’s reported condition around wartime AI data: keep the model and keep the leverage. More such arrangements could turn data governance into a strategic policy issue, not only a business concern.

Key Questions

What is the main development in this report?

The report says AI competition is shifting from access to rented compute toward access to scarce datasets, including licensed media, enterprise records, expert material and sovereign data.

Is public web data already exhausted?

Not yet, according to the projections the report cites. Epoch AI estimates that high-quality public text could be fully used between 2026 and 2032, with a median estimate around 2028. That remains a projection, not a measured fact.

Why do the copyright cases matter?

Copyright cases and settlements can determine whether AI developers must pay for training material, destroy improperly obtained files or change sourcing practices. The report says this is helping create a paid licensing market.

Can synthetic data solve the shortage?

Synthetic data can help, and major AI companies are already using it. The report says it may not fully replace fresh human-made data, especially where errors are hard to detect.

What should companies take from this?

The report’s practical warning is that proprietary data may be a business asset. Companies may need clearer rules for how vendors can use internal data supplied to AI systems.

Source: Thorsten Meyer AI
