The Data Wall — Effector Loop

One-sentence read. The old scaling recipe turned more high-quality human text into lower loss, and that trade is still real, but the easy part is ending: future gains depend less on raw token count and more on data quality, synthetic-data mixtures, and multimodal transfer.

For most of the modern language-model era, scaling had a simple operational meaning: collect more human-written text, train longer, and expect validation loss to improve along a fairly predictable curve.

That was never the whole story. Architecture, optimization, filtering, deduplication, and compute all mattered. But data had a useful property: there was always more of it. The internet, books, code, papers, forums, and archives looked large enough that “more text” could be treated like an engineering input rather than a strategic constraint.

That assumption is now weaker.

The issue is not that humanity has literally run out of words. The issue is that the most useful kind of text for pre-training — high-quality, diverse, available, deduplicated, legally usable, and not already overrepresented — is much smaller than “all text on the internet.” Once that subset becomes scarce, scaling does not stop, but it changes character. The problem becomes less about finding more tokens and more about finding tokens that still teach the model something new.

That is what I mean by the data wall.

It is not a single cliff. It is a regime change.

What running out of text looks like

Chart: The data wall. Synthetic data extends the curve, then it plateaus

Modeled validation loss vs. training tokens. Lower loss is better, so the axis is flipped — a rising line means a stronger model. Natural human text follows a clean scaling curve until the stock of public text runs out at the wall; synthetic data can train past that point, but a mostly-natural blend is far more efficient than pure synthetic. Curves are illustrative, not measured.

Modeled from arXiv anchors · as of 2026-07-03

Below: what each modeled curve does across the token axis (validation loss, lower is better). The data wall sits at ~300T tokens, where the total stock of public human text is exhausted, though the genuinely high-quality curated subset (~5–15T) runs out far earlier. All curves bend toward an irreducible floor near 1.50.

Natural human text (start of the axis to the wall): loss 2.60 → 1.95
Pure synthetic (past the wall): loss 1.95 → 1.82
Blended (1/3 synthetic + 2/3 natural, past the wall): loss 1.95 → 1.62
Selective synthetic (pruned, past the wall): loss 1.95 → 1.58

Modeled from 4 arXiv sources — see the source appendix below.

The reading is the ordering of the curves after the wall: pure synthetic data flattens almost immediately, a mostly-natural blend keeps improving well past it, and selective synthetic data adds a further margin. The anchor values are reported in the cited papers; the curves drawn through them are illustrative — trust the ordering, not the decimals.

How to read this chart

The x-axis is the number of training tokens. The y-axis is validation loss: a measure of how well the model predicts held-out text. Lower loss means the model is less surprised by the next token.

The chart flips the loss axis so that “better” points upward. A line that climbs to the right is a model improving as it trains on more data.
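To make "less surprised by the next token" concrete, here is a minimal sketch of validation loss as the mean negative log-likelihood over held-out tokens. It assumes natural logarithms (nats per token); the post does not say which unit its charts use, and the probabilities below are made-up values for illustration only.

```python
import math

def validation_loss(token_probs):
    """Mean negative log-likelihood (nats per token) over held-out tokens.

    token_probs: probabilities the model assigned to each correct next token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities, invented for this example.
confident = [0.5, 0.4, 0.6, 0.5]     # model often expects the right token
uncertain = [0.1, 0.05, 0.2, 0.1]    # model is frequently surprised

loss_confident = validation_loss(confident)
loss_uncertain = validation_loss(uncertain)

# Flipping the axis, as the chart does, amounts to plotting -loss,
# so the less-surprised model sits higher on the page.
print(f"confident: {loss_confident:.2f}  uncertain: {loss_uncertain:.2f}")
```

The model that assigns higher probability to the tokens that actually occur gets the lower loss, which is the quantity the chart tracks.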

There are two important data markers.

The first is the high-quality subset: filtered web text, books, papers, code, and other sources that are disproportionately useful for pre-training. This is the part that matters first, because frontier training runs do not benefit equally from every scraped token. A clean textbook paragraph, a useful code file, and a duplicated spam page are not interchangeable.

The second marker is the broader public-text stock. This is the larger estimate: the rough upper bound of public human-written text that could plausibly be used if quality, access, licensing, deduplication, and usefulness were not the limiting factors.

Those two markers should not be read as exact capacities. They are better read as a warning about the slope of the problem. Scaling on natural text becomes harder before the absolute text stock is exhausted, because the marginal token gets worse.

Synthetic data is not “more internet”

Synthetic data is the obvious response to a shrinking natural-text frontier: use models to generate more training material.

That works in some settings. It also fails in some settings. The important distinction is not “synthetic versus natural,” but how the synthetic data is produced, mixed, filtered, and used.

Pure synthetic data can add volume without adding enough new information. In that case, the curve extends but flattens. The model sees more tokens, but many of them are reformulations of what the generator already knew. More data is not automatically more learning.

The more interesting result is that synthetic data can be useful as part of a mixture. A training set that is mostly natural text, with a carefully chosen synthetic minority, can reach the same loss with far fewer tokens than natural data alone. Selection matters too: generated examples that target the model’s weaknesses are more valuable than a larger pile of easy or redundant examples.

That is the first major lesson of the wall: once natural data becomes scarce, data quality becomes an active part of scaling. The pipeline matters as much as the pile.

None of this means language-model progress stops. It means the easiest extrapolation stops. The next gains are more likely to come from better mixtures, better filtering, better curricula, better use of private or domain-specific data, better post-training, and better ways of extracting signal from non-text sources.

Which raises the next question: if text becomes constrained, where does the next large reservoir come from?

The next reservoir is video

The stock of public text is mostly already written. Video is still being produced, in bulk.

YouTube alone receives on the order of a trillion seconds of new footage per year (roughly 500 hours uploaded every minute). Depending on how that video is tokenized, one year of uploads can be made to look comparable to the entire public-text stock. The cumulative library is larger still.

But this comparison is easy to misuse.

A token count is not a property of the world. It is a property of a tokenizer. The same video can become a small number of compressed semantic tokens, a much larger number of “watch-rate” understanding tokens, or an even larger number of generation tokens. Those are different representations of the same footage, not three different amounts of reality.

That distinction matters because training does not benefit from tokens in the abstract. It benefits from useful signal. Video contains enormous signal about objects, motion, causality, social behavior, environments, and physical constraints. It also contains enormous redundancy. Consecutive frames are similar. Backgrounds persist. Many seconds are visually uninformative. Compression is not a detail; it is the central problem.

How many tokens is a year of video?

Chart: The next reservoir, video. Video is abundant. The tokens are not the footage.

Total available tokens (across) vs. useful-signal density (up — denser is better). The same year of YouTube becomes ~30T, ~300T, or ~600T tokens depending only on how it is tokenized; at watch-rate it lands near the ~300T public-text stock from the chart above. Bubble size tracks tokens emitted per second of footage (log scale); the density axis is modeled — read the shape, not the decimals.

Curated anchors · density modeled · as of 2026-07-02

Each data source below is listed by total tokens and modeled useful-signal density (1.0 = as dense as curated text, lower = more redundant). The 300T text wall from the chart above is matched by one year of YouTube tokenized at watch rate.

High-quality curated text: 10T tokens · density 1.00
All public human text: 300T tokens · density 1.00
1 yr of YouTube — as information (30/s): 30T tokens · density 1.00
1 yr of YouTube — as watched (300/s): 300T tokens · density 0.10
1 yr of YouTube — as generated (600/s): 602T tokens · density 0.05
All of YouTube so far (watch rate): 3,000T tokens · density 0.10
NVIDIA Cosmos training corpus: 9,000T tokens · density 0.0002

Token counts cited or computed from cited rates (5 sources); density is modeled — see the source appendix below.

The dashed line is the same one year of footage passed through three tokenizers: a ~20x spread in token count with no new video added. NVIDIA's Cosmos corpus shows where a real generation-grade pipeline lands: ~9,000T tokens from 20 million hours, or roughly 125,000 tokens per second of footage. Measured against the ~30 tokens-per-second yardstick, only about 0.02% of each token is new information. Density is modeled (information rate divided by tokenization rate); the token counts are cited or computed from cited rates.
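The arithmetic behind the bubbles is small enough to sketch. The snippet below takes the rates quoted in this post as given and recomputes token totals and the modeled density. The function and constant names are hypothetical, chosen for this example, and are not taken from the post's pipeline.

```python
SECONDS_PER_YEAR_OF_UPLOADS = 1e12   # ~1 trillion seconds, as quoted in the post
INFORMATION_RATE = 30                # tokens/sec, the post's density yardstick

# Tokenization rates quoted in the post (tokens per second of footage).
ONE_YEAR_RATES = {
    "as information": 30,
    "as watched": 300,
    "as generated": 602,
}
COSMOS_RATE = 125_000                # tokens/sec, derived in the appendix

def total_tokens(seconds, rate):
    """Tokens emitted for `seconds` of footage at `rate` tokens/sec."""
    return seconds * rate

def density(rate, information_rate=INFORMATION_RATE):
    """Modeled share of each token that is useful information."""
    return information_rate / rate

for name, rate in ONE_YEAR_RATES.items():
    tokens_t = total_tokens(SECONDS_PER_YEAR_OF_UPLOADS, rate) / 1e12
    print(f"{name:>15}: {tokens_t:>5,.0f}T tokens, density {density(rate):.2f}")

# Same footage, different tokenizers: the spread comes from the rate alone.
spread = ONE_YEAR_RATES["as generated"] / ONE_YEAR_RATES["as information"]
cosmos_density = density(COSMOS_RATE)
print(f"spread ~{spread:.0f}x, Cosmos density {cosmos_density:.5f}")
```

Nothing in this calculation looks at the video itself. Every number moves because the tokenizer rate moves, which is the point of the chart.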

The tokenizer is the bottleneck

Video changes the data problem from scarcity to compression.

With text, the token is already close to the native object of interest. Words and code symbols are lossy representations of thought, but they are already semantic artifacts. Video is different. Raw pixels are too large, too redundant, and too low-level. A useful video model has to decide what to preserve: objects, motion, affordances, captions, temporal structure, actions, camera dynamics, or visual detail.

Different choices create different token counts and different learning problems.

A compact representation may preserve the information needed for understanding but lose details needed for generation. A rich representation may support generation but consume memory and compute too quickly. A representation optimized for robotics may not be optimal for language reasoning. A representation optimized for internet video may not transfer to scientific, embodied, or industrial domains.

So video is not “the new text.” It is a larger, messier, more expensive source of structure.

The second lesson is that the post-text era is not simply a move from a small reservoir to a large one. It is a move from a relatively clean modality to a modality where representation decides everything.

Does video teach the model physics?

The strongest reason to care about video is not that it has many tokens. It is that it might teach models things text only describes indirectly.

Text can say that unsupported objects fall, that solid objects do not pass through each other, and that a ball hidden behind a screen has not disappeared. Video shows those regularities millions of times. In principle, a model trained on video could learn a more grounded representation of the world.

There is evidence that this can happen, but the details matter.

Models trained to predict pixels, or models that reason about video mostly through language, do not necessarily learn intuitive physics. On simple physical-plausibility tests, they can remain close to chance. A model trained to predict what comes next in a learned latent representation — Meta’s V-JEPA line of work — does much better on the original IntPhys setup. That suggests the objective is doing important work: predicting the meaning of what comes next is different from predicting every pixel.

But the result is not yet a general solution. When the benchmark becomes more complex — IntPhys 2 uses longer, more cluttered, more realistic scenes — performance falls back toward chance. The model has learned something real, but brittle. It can capture some simple physical regularities without matching human-like physical understanding in messier scenes.

Predicting pixels is not the same as predicting structure

Chart: Does video teach physics? Predict the pixels, or predict the meaning?

Can a model tell a physically possible video from an impossible one? (Chance = 50%.) Predicting pixels or text leaves models near chance; predicting meaning in a latent space (V-JEPA) nears human level on simple physics, then falls back toward chance on harder scenes. Reported benchmark results, not modeled.

Reported benchmarks · as of 2026-06-09

Accuracy on physical-plausibility tests (chance = 50%; humans ≈ 85–95%):

Predict the pixels / text: IntPhys ≈ chance · IntPhys 2 ≈ chance
Predict the meaning (V-JEPA, latent): IntPhys 98% · IntPhys 2 52%

Reported benchmark figures from 4 sources — see the source appendix below.

The gold band marks human performance: roughly 85–95% across these tests, and about 96% on IntPhys 2 specifically — the harder scenes barely slow people down. The ≈-chance bars stand in for the papers' qualitative finding that pixel-prediction and multimodal-language models score near 50%. These are reported results, not modeled curves, and they measure physical perception, not language reasoning.

The transfer question is still open

The third chart is the most important caveat in the post.

Video can support physical perception. It can help with motion understanding. It can support world models for robotics and embodied planning. Those are real results.

The open question is whether that kind of grounding transfers into the thing most language-model scaling cares about: better abstract reasoning, better planning in text, better tool use, better scientific reasoning, or better generalization outside visual tasks.

That transfer might happen. It would be surprising if richer perceptual training never mattered. But it should not be assumed from token counts alone, or even from better physical-plausibility benchmarks. A model can become better at video understanding without becoming proportionally better at mathematics, code, or long-horizon language reasoning.

So the conclusion is narrower than “video solves the data wall.”

Video gives scaling a new source of signal. Whether language models can use that signal efficiently is an architecture question, a memory question, and a transfer question.

What the data wall actually changes

The data wall does not mean AI scaling is over.

It means one of the simplest versions of scaling is over: take a larger scrape of human text, train a larger model, and expect the same predictable return.

The next phase is more constrained and more interesting.

Synthetic data matters, but only when generation is paired with selection, mixing, and feedback. Video matters, but only when tokenization preserves useful structure without overwhelming memory and compute. Physical grounding matters, but only if it transfers beyond perception into the capabilities people actually want from general models.

The useful question is no longer “how many tokens are left?”

It is:

Which tokens still change the model?

That question makes the wall less dramatic, but more important. It shifts attention from abundance to marginal value. In the old regime, the next token was usually worth adding. In the next regime, the next token has to earn its place.

That is the real constraint.

Not the end of data. The end of treating data as undifferentiated fuel.

Source and data appendix

The charts in this post combine reported source values with simple modeled transformations. The purpose is to make the tradeoffs visible, not to claim new benchmark measurements.

The first chart uses reported anchors from scaling-law and synthetic-data papers, then draws illustrative curves through those anchors. The natural-text curve, synthetic plateau, mixture effect, and selective-synthetic effect should be read as a schematic of the regime change rather than as measured validation-loss traces from a single experiment.

The second chart uses cited or derived token counts for text and video sources, then adds a modeled density axis. The density values are intentionally approximate. They represent the idea that video tokenization can trade off compactness against redundancy; they are not measurements of semantic value.

The third chart uses reported benchmark results where available. Unlike the first two charts, it is not a modeled loss curve. It summarizes physical-plausibility benchmark results and should be interpreted narrowly: these benchmarks measure aspects of physical perception, not general language reasoning.

Throughout the post, the main distinction is between token volume and useful signal. Token counts are necessary for scaling analysis, but they are not sufficient. A trillion redundant tokens and a trillion informative tokens are not the same training resource.

Scaling-law anchors (arXiv abstracts)

Source type: arXiv abstract pages, scraped politely (declared User-Agent, inter-request delay, raw HTML cached under posts/the-data-wall/pipeline/cache/).
Endpoint: https://arxiv.org/abs/<id>.
Fetched at: 2026-07-03.
Notes: The synthetic-plateau, optimal-blend, and speedup anchors are stated in the abstracts and scraped directly. The data-wall stock is not stated numerically in the 2211.04325 abstract, so it falls back to the bracketed default ~3×10¹⁴ (a WARNING is logged when this happens). The high-quality curated subset (~10¹³, ~5–15T) is likewise a documented default — a secondary reference marker, never scraped, so it can’t accidentally pick up the 300T total. The Deliberate-Practice result is qualitative — it informs a modeling knob (a modest extra efficiency shift for the selective curve), not a measured number. Resolved anchors are written to posts/the-data-wall/pipeline/data/scaling_data_laws.yaml.

Anchor	Value used	Resolved	arXiv source
Data wall (total public human text)	~3×10¹⁴ tokens	default — not in abstract	2211.04325 — Will we run out of data? (Villalobos et al.)
High-quality curated subset	~10¹³ tokens (~5–15T)	default — reference marker	2211.04325 — high-quality slice (2022 estimate; RefinedWeb-scale ~5T)
Synthetic plateau	~300B tokens	scraped	2503.19551 — Scaling Laws of Synthetic Data (“rectified scaling law”)
Optimal blend	1/3 synthetic : 2/3 natural	scraped	2510.01631 — Demystifying Synthetic Data in LLM Pre-training
Blend speedup	5–10×	scraped	2510.01631 — “pure synthetic alone is not faster than natural text”
Selective pruning	qualitative (modeling knob)	modeling assumption	2502.15588 — Deliberate Practice (ICML 2025): pruning informative samples beats naive volume
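The "scraped, else documented default with a WARNING" behaviour described in the notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `resolve_anchor`, the regex patterns, the sample abstract strings, and the 3e11 fallback for the plateau are all hypothetical. Only the ~3×10¹⁴ data-wall default comes from the notes above.

```python
import logging
import re

logger = logging.getLogger("anchors")

# Documented default for the data-wall stock (~3e14 tokens), per the notes.
DEFAULT_DATA_WALL_TOKENS = 3e14

def resolve_anchor(abstract_text, pattern, scale, default, name):
    """Return (value, how_resolved) for one anchor.

    pattern: hypothetical regex with one capture group holding a number.
    scale:   multiplier that converts the captured number into tokens.
    """
    match = re.search(pattern, abstract_text)
    if match:
        return float(match.group(1)) * scale, "scraped"
    # Fall back to the documented default and say so loudly.
    logger.warning("anchor %r not in abstract; using default %g", name, default)
    return default, "default"

# Made-up abstract text for illustration; these are not real quotes.
fake_abstract = "Illustrative text: gains plateau near 300 billion tokens."
plateau, plateau_how = resolve_anchor(
    fake_abstract, r"(\d+(?:\.\d+)?) billion tokens", 1e9,
    default=3e11, name="synthetic_plateau",   # hypothetical default
)
wall, wall_how = resolve_anchor(
    "No numeric stock is stated here.", r"(\d+(?:\.\d+)?) trillion tokens", 1e12,
    default=DEFAULT_DATA_WALL_TOKENS, name="data_wall",
)
print(plateau, plateau_how, wall, wall_how)
```

Returning the resolution status alongside the value is what lets a table like the one above carry a "Resolved" column without a second lookup.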

Modeled curves

Source type: synthesized (numpy), not measured.
Method: natural text is a power law L = E + A·N^(−α) fit so it passes through the modeled start and wall losses; each synthetic variant branches at the wall and decays toward its own floor with a tunable knee. Anchors are the tunable parameters; the shape knobs (floors, knees, loss scale) are illustrative modeling choices. The whole thing is one re-runnable stage: uv run --project pipeline python posts/the-data-wall/run.py (scrape → scaling_data_laws.yaml → curves → this post’s .data.json).
Note: “Curves are modeled from reported anchor values; illustrative, not measured.”
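A minimal sketch of the method described above, using only the standard library. The floor (1.50), the start and wall losses (2.60, 1.95), and the wall position (~3×10¹⁴ tokens) come from this post. The starting token count, the exact branch shape, and the per-variant floors and knees are assumptions made for this example, since the post describes them only as tunable knobs.

```python
import math

E = 1.50                      # irreducible floor quoted in the post
L_START, L_WALL = 2.60, 1.95  # natural-text losses at the start and at the wall
N_WALL = 3e14                 # data wall, ~300T tokens
N_START = 1e10                # ASSUMPTION: the post does not state where the axis starts

# Solve L = E + A * N**(-alpha) through (N_START, L_START) and (N_WALL, L_WALL).
alpha = math.log((L_START - E) / (L_WALL - E)) / math.log(N_WALL / N_START)
A = (L_START - E) * N_START ** alpha

def natural_loss(n_tokens):
    """Power-law loss for natural text."""
    return E + A * n_tokens ** (-alpha)

def branch_loss(n_tokens, floor, knee):
    """ASSUMED branch shape: exponential decay toward `floor`,
    measured in decades of tokens past the wall; `knee` sets how fast."""
    if n_tokens <= N_WALL:
        return natural_loss(n_tokens)
    decades_past_wall = math.log10(n_tokens / N_WALL)
    return floor + (L_WALL - floor) * math.exp(-decades_past_wall / knee)

# Illustrative knobs: floors borrowed from the chart's end-of-axis losses,
# knees invented so that pure synthetic flattens fastest.
VARIANTS = {
    "pure synthetic": (1.82, 0.2),
    "blended": (1.62, 0.6),
    "selective": (1.58, 0.6),
}
for name, (floor, knee) in VARIANTS.items():
    print(f"{name:>15}: loss at 1e16 tokens = {branch_loss(1e16, floor, knee):.3f}")
```

The branch is continuous at the wall by construction, so the ordering of the curves after it depends only on the chosen floors and knees. That is why the ordering, not the decimals, is the thing to trust.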

Video abundance & tokenization anchors (the Modality Map)

The second chart’s numbers are a curated, per-row-cited dataset (posts/the-data-wall/pipeline/data/video_data_anchors.yaml) rather than scraped. The headline figures live in vendor docs, newsrooms and Epoch posts that the arXiv scraper can’t reach, so the YAML is the source of truth, following the same human-in-the-loop pattern as the slope post’s curated inputs. Each row is tagged [STATED] (reported directly in the cited source) or [DERIVED] (computed from stated inputs, with the arithmetic in the row’s rationale). The transform that turns these into the bubble geometry is posts/the-data-wall/pipeline/modality.py; the same run.py regenerates both charts.

Anchor	Value used	Tag	Source
YouTube uploaded per year	~1 trillion seconds	STATED	2211.04325 — Villalobos et al., Appendix D (500 hrs/min)
Information rate	~30 tokens/sec	STATED	2211.04325 — Villalobos’s conservative information estimate
Watch rate (understanding)	~300 tokens/sec	STATED	2403.05530 — Gemini 1.5 (258 tok/frame @1fps + audio) + API docs
Generation rate	~602 tokens/sec	DERIVED	2310.05737 — MagViT-2: 1,280 tokens / 2.125 s of 128² video
NVIDIA Cosmos corpus	9,000T tokens / 20M hrs	STATED	NVIDIA Cosmos World Foundation Model Platform (CES 2025)
Cosmos rate	~125,000 tokens/sec	DERIVED	9,000T ÷ (20M hrs × 3600 s)
Largest training set today	~15T tokens	STATED	Epoch AI, “Can AI scaling continue through 2030?”
Cumulative YouTube library	~10 trillion seconds	DERIVED	2211.04325 — Appendix D, extrapolated from the annual rate
Effective human text	~300T tokens	STATED	2211.04325 — the wall above; single-sourced from the scaling anchors
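The [DERIVED] rows above are plain arithmetic on the stated inputs, so they can be re-checked in a few lines. This is a sketch of that arithmetic only; the variable names are made up for the example, and a 365-day year is assumed.

```python
# Stated inputs, copied from the table above.
HOURS_UPLOADED_PER_MINUTE = 500          # YouTube upload rate
MAGVIT2_TOKENS, MAGVIT2_SECONDS = 1280, 2.125
COSMOS_TOKENS = 9e15                     # 9,000T tokens
COSMOS_HOURS = 20e6                      # 20 million hours
CUMULATIVE_SECONDS = 10e12               # ~10 trillion seconds of YouTube so far
WATCH_RATE = 300                         # tokens/sec

MINUTES_PER_YEAR = 60 * 24 * 365         # assumes a 365-day year

# Seconds of footage uploaded per year: hours/min -> seconds/min -> seconds/year.
upload_seconds_per_year = HOURS_UPLOADED_PER_MINUTE * 3600 * MINUTES_PER_YEAR

# Generation rate: tokens per second of video for the MagViT-2 row.
generation_rate = MAGVIT2_TOKENS / MAGVIT2_SECONDS

# Cosmos rate: corpus tokens divided by corpus seconds.
cosmos_rate = COSMOS_TOKENS / (COSMOS_HOURS * 3600)

# Cumulative library tokenized at watch rate, in trillions of tokens.
cumulative_tokens_t = CUMULATIVE_SECONDS * WATCH_RATE / 1e12

print(f"uploads/year: {upload_seconds_per_year:.3e} s")
print(f"generation rate: {generation_rate:.1f} tok/s")
print(f"Cosmos rate: {cosmos_rate:,.0f} tok/s")
print(f"all of YouTube at watch rate: {cumulative_tokens_t:,.0f}T tokens")
```

The upload figure lands slightly under a trillion seconds, which is why the table rounds it to "~1 trillion".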

Modeled axis. Only the y-axis is modeled: density = 30 ÷ actual-rate, i.e. the share of each emitted token that is useful information if Villalobos’s ~30 tokens/sec is taken as the information rate. That single assumption is the modeled part; every x-position is cited or arithmetic from cited rates. Watch-rate video lands on the 300T wall because 1 trillion seconds × ~300 tokens/sec ≈ 300T — a coincidence of the deployed tokenization, not a measured equivalence of content.
Re-run: uv run --project pipeline python posts/the-data-wall/run.py --offline (no network — the video anchors are committed, not fetched).

Grounding benchmarks (the physics chart)

Reported benchmark figures — measured, not modeled — curated in posts/the-data-wall/pipeline/data/grounding_benchmarks.yaml, validated by pipeline/grounding.py, shaped into the grouped bars by pipeline/physics.py. Each row is tagged [STATED] (a reported number) or [DERIVED] (a stand-in for a qualitative finding — e.g. “near chance,” plotted at 50%).

Benchmark / quantity	Value	Tag	Source
V-JEPA on IntPhys (latent prediction)	98%	STATED	2502.11831 — Garrido et al., intuitive physics
V-JEPA on InfLevel / GRASP	62% / 66%	STATED	same — harder variants already pull it down
Generative video + multimodal LLMs on IntPhys	≈ chance (50%)	DERIVED	same — “remain close to chance levels or untrained networks”
V-JEPA 2 on IntPhys 2 (harder scenes)	~52%	STATED	2506.09849 — IntPhys 2 benchmark
Humans — IntPhys 2 / across benchmarks	96.4% / 85–95%	STATED	2506.09849; Meta blog
V-JEPA 2 zero-shot robot pick-and-place	65–80%	STATED	2506.09985 — V-JEPA 2
V-JEPA 2 — Something-Something v2 / PerceptionTest	77.3 / 84.0	STATED	same

Chance = 50% (a 2-way possible-vs-impossible classification). The “predict the pixels/text” bars are plotted at the chance line as a faithful stand-in for the papers’ qualitative “near chance” result — not a precise per-model measurement.
Scope, not precision, is the caveat. These measure physical-plausibility perception, not language reasoning. The clean wins are physical/embodied; whether video grounding improves text reasoning is unproven — the section ends open on purpose.
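To see why 50% is the chance line, it helps to sketch how a 2-way plausibility score can be computed. The snippet assumes a pairwise setup in which each test item pairs a possible clip with an impossible one and the model is credited when it is more "surprised" by the impossible clip. That is an assumption about the general shape of such evaluations, not the benchmarks' exact protocol, and the surprise scores are invented.

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the impossible clip gets the higher surprise.

    pairs: iterable of (surprise_possible, surprise_impossible).
    Ties count as half, so a model with no signal scores exactly 0.5.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    credit = sum(
        1.0 if imp > pos else 0.5 if imp == pos else 0.0
        for pos, imp in pairs
    )
    return credit / len(pairs)

# Hypothetical surprise scores, made up for this example.
some_signal = [(0.2, 0.9), (0.3, 0.8), (0.1, 0.7), (0.4, 0.35)]
no_signal = [(0.5, 0.5)] * 4   # identical surprise for both clips

acc_signal = pairwise_accuracy(some_signal)
acc_blind = pairwise_accuracy(no_signal)
print(f"some signal: {acc_signal:.0%}  no signal: {acc_blind:.0%}")
```

A score near 50% therefore says the model's surprise carries almost no information about physical plausibility, however fluent its predictions look.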