Effector Loop

The Data Wall


One-sentence read. The old scaling recipe turned more high-quality human text into lower loss, and that trade is still real, but the easy part is ending: future gains depend less on raw token count and more on data quality, synthetic-data mixtures, multimodal transfer, and memory.

For most of the modern language-model era, scaling had a simple operational meaning: collect more human-written text, train longer, and expect validation loss to improve along a fairly predictable curve.

That was never the whole story. Architecture, optimization, filtering, deduplication, and compute all mattered. But data had a useful property: there was always more of it. The internet, books, code, papers, forums, and archives looked large enough that “more text” could be treated like an engineering input rather than a strategic constraint.

That assumption is now weaker.

The issue is not that humanity has literally run out of words. The issue is that the most useful kind of text for pre-training — high-quality, diverse, available, deduplicated, legally usable, and not already overrepresented — is much smaller than “all text on the internet.” Once that subset becomes scarce, scaling does not stop, but it changes character. The problem becomes less about finding more tokens and more about finding tokens that still teach the model something new.

That is what I mean by the data wall.

It is not a single cliff. It is a regime change.

Synthetic data helps, but mostly under constraints

Synthetic data extends the curve, then it plateaus

Modeled validation loss versus training tokens. Lower loss is better, and the vertical axis is flipped so that improvement moves upward. Natural human text follows the familiar scaling curve until the available text stock becomes binding. Synthetic data can extend training beyond that point, but pure synthetic data plateaus quickly in the modeled setup. Mixtures of natural and synthetic data are more efficient, and selective synthetic data performs best in this illustration. The curves are modeled from reported anchors, not measured; read the shape, not the exact decimals.

The data wall sits at ~300T tokens, where the total stock of public human text is exhausted, though the genuinely high-quality curated subset (~5–15T) runs out far earlier. All curves bend toward an irreducible floor near 1.50. Loss range for each modeled curve (the synthetic variants branch from the natural curve at the wall):

  • Natural human text: loss 2.60 → 1.95
  • Pure synthetic: loss 1.95 → 1.82
  • Blended (1/3 synthetic + 2/3 natural): loss 1.95 → 1.62
  • Selective synthetic (pruned): loss 1.95 → 1.58

Modeled from 4 arXiv sources as of 2026-06-09; see the source appendix below.

How to read this chart

The x-axis is the number of training tokens. The y-axis is validation loss: a measure of how well the model predicts held-out text. Lower loss means the model is less surprised by the next token.
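As a concrete reading of that definition, validation loss is commonly computed as the mean negative log-probability the model assigns to each correct held-out token. The sketch below assumes natural-log units (nats per token) and uses made-up probabilities; the post does not state which units its loss scale uses, and the function name is illustrative.

```python
import math


def validation_loss(token_probs):
    """Mean negative log-likelihood (nats per token) over held-out tokens.

    token_probs: the probability the model gave to each correct next token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities, for illustration only.
confident = [0.5, 0.4, 0.6, 0.5]   # model is rarely surprised
unsure = [0.1, 0.05, 0.2, 0.1]     # model is often surprised

loss_confident = validation_loss(confident)
loss_unsure = validation_loss(unsure)

# Perplexity is the same quantity on an exponential scale.
perplexity_confident = math.exp(loss_confident)
```

The less surprised model gets the lower loss, which is the direction the chart rewards.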

The chart flips the loss axis so that “better” points upward. A line that climbs to the right is a model improving as it trains on more data.

There are two important data markers.

The first is the high-quality subset: filtered web text, books, papers, code, and other sources that are disproportionately useful for pre-training. This is the part that matters first, because frontier training runs do not benefit equally from every scraped token. A clean textbook paragraph, a useful code file, and a duplicated spam page are not interchangeable.

The second marker is the broader public-text stock. This is the larger estimate: the rough upper bound of public human-written text that could plausibly be used if quality, access, licensing, deduplication, and usefulness were not the limiting factors.

Those two markers should not be read as precise dates or exact capacities. They are better read as a warning about the slope of the problem. Scaling on natural text becomes harder before the absolute text stock is exhausted, because the marginal token gets worse.

Synthetic data is not “more internet”

Synthetic data is the obvious response to a shrinking natural-text frontier: use models to generate more training material.

That works in some settings. It also fails in some settings. The important distinction is not “synthetic versus natural,” but how the synthetic data is produced, mixed, filtered, and used.

Pure synthetic data can add volume without adding enough new information. In that case, the curve extends but flattens. The model sees more tokens, but many of them are reformulations of what the generator already knew. More data is not automatically more learning.

The more interesting result is that synthetic data can be useful as part of a mixture. A training set that is mostly natural text, with a carefully chosen synthetic minority, can reach the same loss with far fewer tokens than natural data alone. Selection matters too: generated examples that target the model’s weaknesses are more valuable than a larger pile of easy or redundant examples.
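A minimal sketch of the two ideas in that paragraph: holding synthetic data to a fixed share of the mixture, and keeping only the generated examples the current model finds hardest. The one-third share follows the blend ratio cited in the appendix. The function names, the per-example loss scores, and the keep fraction are hypothetical, and a real pipeline would likely select on richer signals than a single loss value.

```python
def synthetic_budget(natural_tokens, synthetic_share=1 / 3):
    """Synthetic tokens needed so that they form `synthetic_share` of the mix."""
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    return natural_tokens * synthetic_share / (1 - synthetic_share)


def select_hardest(examples, losses, keep_fraction=0.5):
    """Keep the generated examples with the highest current-model loss.

    Per-example loss is used here as a stand-in for "targets the model's
    weaknesses"; that choice is an assumption of this sketch.
    """
    if len(examples) != len(losses):
        raise ValueError("examples and losses must have the same length")
    if not examples:
        return []
    keep = max(1, int(len(examples) * keep_fraction))
    ranked = sorted(zip(losses, examples), key=lambda pair: pair[0], reverse=True)
    return [example for _, example in ranked[:keep]]


# Hypothetical inputs: 10T natural tokens and four scored synthetic examples.
needed = synthetic_budget(10e12)
kept = select_hardest(["a", "b", "c", "d"], [0.2, 1.5, 0.9, 2.1])
```

With 10T natural tokens, a one-third synthetic share means about 5T synthetic tokens, so selection decides which 5T rather than how many more.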

That is the first major lesson of the wall: once natural data becomes scarce, data quality becomes an active part of scaling. The pipeline matters as much as the pile.

What the first chart says

The old regime was dominated by quantity. The next one is dominated by data engineering.

That does not mean language-model progress stops. It means the easiest extrapolation stops. The next gains are more likely to come from better mixtures, better filtering, better curricula, better use of private or domain-specific data, better post-training, and better ways of extracting signal from non-text sources.

Which raises the next question: if text becomes constrained, where does the next large reservoir come from?

The next reservoir is video

Text is finite in a way video is not.

A large video platform produces a continuing stream of new data. Depending on how video is tokenized, one year of uploads can be made to look comparable to the entire public-text stock: roughly 1 trillion seconds of footage at ~300 tokens per second comes to about 300T tokens. The cumulative video library is larger still.

But this comparison is easy to misuse.

A token count is not a property of the world. It is a property of a tokenizer. The same video can become a small number of compressed semantic tokens, a much larger number of “watch-rate” understanding tokens, or an even larger number of generation tokens. Those are different representations of the same footage, not three different amounts of reality.
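The arithmetic behind that point is short. Using the per-second rates and the ~1 trillion seconds of annual uploads listed in the source appendix, the same footage yields three very different totals. The variable and function names are illustrative.

```python
SECONDS_UPLOADED_PER_YEAR = 1e12  # ~1 trillion seconds (appendix figure)

# Tokens emitted per second of footage under each tokenization (appendix rates).
TOKENS_PER_SECOND = {
    "information": 30,
    "watch": 300,
    "generation": 602,
}


def yearly_tokens(rate_per_second, seconds=SECONDS_UPLOADED_PER_YEAR):
    """Total tokens for one year of uploads at a given tokenization rate."""
    return rate_per_second * seconds


totals = {name: yearly_tokens(rate) for name, rate in TOKENS_PER_SECOND.items()}

for name, total in totals.items():
    # Report in trillions of tokens (T).
    print(f"{name:>11}: {total / 1e12:,.0f}T tokens")
```

Only the rate changes between rows; the footage is identical in all three.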

That distinction matters because training does not benefit from tokens in the abstract. It benefits from useful signal. Video contains enormous signal about objects, motion, causality, social behavior, environments, and physical constraints. It also contains enormous redundancy. Consecutive frames are similar. Backgrounds persist. Many seconds are visually uninformative. Compression is not a detail; it is the central problem.

Video is abundant. Useful video tokens are expensive.

Video is abundant. The tokens are not the footage.

Total available tokens (horizontal axis) versus modeled useful-signal density (vertical axis; denser is better). The same year of YouTube becomes ~30T, ~300T, or ~600T tokens depending only on how it is tokenized; at watch rate it lands near the ~300T public-text stock from the chart above. The more densely the video is compressed, the fewer tokens it produces; the more exhaustively it is tokenized, the more redundancy it carries. Bubble size represents tokens emitted per second of footage. The density axis is modeled, so read the relative tradeoff rather than the exact values.

Each source by total tokens and modeled density (1.0 = as dense as curated text; lower = more redundant):

  • High-quality curated text: 10T tokens · density 1.00
  • All public human text: 300T tokens · density 1.00
  • 1 yr of YouTube, as information (30/s): 30T tokens · density 1.00
  • 1 yr of YouTube, as watched (300/s): 300T tokens · density 0.10
  • 1 yr of YouTube, as generated (600/s): 602T tokens · density 0.05
  • All of YouTube so far (watch rate): 3,000T tokens · density 0.10
  • NVIDIA Cosmos training corpus: 9,000T tokens · density 0.24

Token counts are cited or computed from cited rates (5 sources) as of 2026-06-09; density is modeled. See the source appendix below.

The tokenizer is the bottleneck

Video changes the data problem from scarcity to compression.

With text, the token is already close to the native object of interest. Words and code symbols are lossy representations of thought, but they are already semantic artifacts. Video is different. Raw pixels are too large, too redundant, and too low-level. A useful video model has to decide what to preserve: objects, motion, affordances, captions, temporal structure, actions, camera dynamics, or visual detail.

Different choices create different token counts and different learning problems.

A compact representation may preserve the information needed for understanding but lose details needed for generation. A rich representation may support generation but consume memory and compute too quickly. A representation optimized for robotics may not be optimal for language reasoning. A representation optimized for internet video may not transfer to scientific, embodied, or industrial domains.

So video is not “the new text.” It is a larger, messier, more expensive source of structure.

The second lesson is that the post-text era is not simply a move from a small reservoir to a large one. It is a move from a relatively clean modality to a modality where representation decides everything.

Does video teach the model physics?

The strongest reason to care about video is not that it has many tokens. It is that it might teach models things text only describes indirectly.

Text can say that unsupported objects fall, that solid objects do not pass through each other, and that a ball hidden behind a screen has not disappeared. Video shows those regularities millions of times. In principle, a model trained on video could learn a more grounded representation of the world.

There is evidence that this can happen, but the details matter.

Models trained to predict pixels, or models that reason about video mostly through language, do not necessarily learn intuitive physics. On simple physical-plausibility tests, they can remain close to chance. A model trained to predict in a learned latent representation does much better on the original IntPhys setup. That suggests the objective is doing important work: predicting the meaning of what comes next is different from predicting every pixel.

But the result is not yet a general solution. When the benchmark becomes more complex, performance falls sharply. The model has learned something real, but brittle. It can capture some simple physical regularities without matching human-like physical understanding in messier scenes.
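One way to make the chance baseline concrete: in a paired plausibility test, the model scores a possible and an impossible version of the same scene, and a pair counts as correct when the impossible clip is the more surprising one. The sketch below is an assumption about the general shape of such scoring, not the cited benchmarks' evaluation code, and the surprise scores are random numbers rather than model outputs.

```python
import random


def pairwise_accuracy(surprise_possible, surprise_impossible):
    """Fraction of pairs where the impossible clip gets the higher surprise."""
    if len(surprise_possible) != len(surprise_impossible):
        raise ValueError("need one impossible score per possible score")
    if not surprise_possible:
        raise ValueError("need at least one pair")
    correct = sum(
        impossible > possible
        for possible, impossible in zip(surprise_possible, surprise_impossible)
    )
    return correct / len(surprise_possible)


rng = random.Random(0)
n_pairs = 2000
possible_scores = [rng.gauss(0, 1) for _ in range(n_pairs)]

# A model whose surprise ignores physics: same distribution for both clips.
uninformed = pairwise_accuracy(
    possible_scores, [rng.gauss(0, 1) for _ in range(n_pairs)]
)

# A model whose surprise tends to rise on impossible clips.
informed = pairwise_accuracy(
    possible_scores, [rng.gauss(2, 1) for _ in range(n_pairs)]
)
```

A score that carries no physical information lands near 50% by construction, which is why "near chance" is the meaningful floor for these tests.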

Predicting pixels is not the same as predicting structure

Predict the pixels, or predict the meaning?

Accuracy on physical-plausibility tests: can a model tell a physically possible video from an impossible one? Chance is 50%; humans score roughly 85–95%. Pixel- or text-oriented approaches remain near chance in the cited setup. Latent prediction (V-JEPA) nears human level on the simpler IntPhys benchmark, then falls back near chance on the harder IntPhys 2 benchmark. These are reported benchmark results, not modeled curves, and they measure physical perception rather than general language reasoning.

  • Predict the pixels / text: IntPhys ≈ chance · IntPhys 2 ≈ chance
  • Predict the meaning (V-JEPA, latent): IntPhys 98% · IntPhys 2 52%

Reported benchmark figures from 4 sources as of 2026-06-09; see the source appendix below.

The transfer question is still open

The third chart is the most important caveat in the post.

Video can support physical perception. It can help with motion understanding. It can support world models for robotics and embodied planning. Those are real results.

The open question is whether that kind of grounding transfers into the thing most language-model scaling cares about: better abstract reasoning, better planning in text, better tool use, better scientific reasoning, or better generalization outside visual tasks.

That transfer might happen. It would be surprising if richer perceptual training never mattered. But it should not be assumed from token counts alone, or even from better physical-plausibility benchmarks. A model can become better at video understanding without becoming proportionally better at mathematics, code, or long-horizon language reasoning.

So the conclusion is narrower than “video solves the data wall.”

Video gives scaling a new source of signal. Whether language models can use that signal efficiently is an architecture question, a memory question, and a transfer question.

What the data wall actually changes

The data wall does not mean AI scaling is over.

It means one of the simplest versions of scaling is over: take a larger scrape of human text, train a larger model, and expect the same predictable return.

The next phase is more constrained and more interesting.

Synthetic data matters, but only when generation is paired with selection, mixing, and feedback. Video matters, but only when tokenization preserves useful structure without overwhelming memory and compute. Physical grounding matters, but only if it transfers beyond perception into the capabilities people actually want from general models.

The useful question is no longer “how many tokens are left?”

It is:

Which tokens still change the model?

That question makes the wall less dramatic, but more important. It shifts attention from abundance to marginal value. In the old regime, the next token was usually worth adding. In the next regime, the next token has to earn its place.

That is the real constraint.

Not the end of data. The end of treating data as undifferentiated fuel.

Source and data appendix

The charts in this post combine reported source values with simple modeled transformations. The purpose is to make the tradeoffs visible, not to claim new benchmark measurements.

The first chart uses reported anchors from scaling-law and synthetic-data papers, then draws illustrative curves through those anchors. The natural-text curve, synthetic plateau, mixture effect, and selective-synthetic effect should be read as a schematic of the regime change rather than as measured validation-loss traces from a single experiment.

The second chart uses cited or derived token counts for text and video sources, then adds a modeled density axis. The density values are intentionally approximate. They represent the idea that video tokenization can trade off compactness against redundancy; they are not measurements of semantic value.

The third chart uses reported benchmark results where available. Unlike the first two charts, it is not a modeled loss curve. It summarizes physical-plausibility benchmark results and should be interpreted narrowly: these benchmarks measure aspects of physical perception, not general language reasoning.

Throughout the post, the main distinction is between token volume and useful signal. Token counts are necessary for scaling analysis, but they are not sufficient. A trillion redundant tokens and a trillion informative tokens are not the same training resource.

Scaling-law anchors (arXiv abstracts)

  • Source type: arXiv abstract pages, scraped politely (declared User-Agent, inter-request delay, raw HTML cached under posts/the-data-wall/pipeline/cache/).
  • Endpoint: https://arxiv.org/abs/<id>.
  • Fetched at: 2026-06-09T14:39:13.432187+00:00.
  • Notes: The synthetic-plateau, optimal-blend, and speedup anchors are stated in the abstracts and scraped directly. The data-wall stock is not stated numerically in the 2211.04325 abstract, so it falls back to the bracketed default ~3×10¹⁴ (a WARNING is logged when this happens). The high-quality curated subset (~10¹³, ~5–15T) is likewise a documented default — a secondary reference marker, never scraped, so it can’t accidentally pick up the 300T total. The Deliberate-Practice result is qualitative — it informs a modeling knob (a modest extra efficiency shift for the selective curve), not a measured number. Resolved anchors are written to posts/the-data-wall/pipeline/data/scaling_data_laws.yaml.
| Anchor | Value used | Resolved | arXiv source |
|---|---|---|---|
| Data wall (total public human text) | ~3×10¹⁴ tokens | default (not in abstract) | 2211.04325: Will we run out of data? (Villalobos et al.) |
| High-quality curated subset | ~10¹³ tokens (~5–15T) | default (reference marker) | 2211.04325: high-quality slice (2022 estimate; RefinedWeb-scale ~5T) |
| Synthetic plateau | ~300B tokens | scraped | 2503.19551: Scaling Laws of Synthetic Data (“rectified scaling law”) |
| Optimal blend | 1/3 synthetic : 2/3 natural | scraped | 2510.01631: Demystifying Synthetic Data in LLM Pre-training |
| Blend speedup | 5–10× | scraped | 2510.01631: “pure synthetic alone is not faster than natural text” |
| Selective pruning | qualitative (modeling knob) | modeling assumption | 2502.15588: Deliberate Practice (ICML 2025): pruning informative samples beats naive volume |
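The fallback behavior described in the notes (use a scraped value when the abstract states one, otherwise a documented default with a logged warning) could look roughly like the sketch below. The function name, the regular expressions, and the sample abstract strings are hypothetical; the post does not show the pipeline's actual code, and the sample strings are not quotes from any paper.

```python
import logging
import re

logger = logging.getLogger("anchors")


def resolve_anchor(name, abstract_text, pattern, default):
    """Return (value, how_resolved).

    The value is scraped when `pattern` matches the abstract text;
    otherwise the documented default is used and a warning is logged.
    """
    match = re.search(pattern, abstract_text)
    if match:
        return float(match.group(1)), "scraped"
    logger.warning("%s not found in abstract; using default %s", name, default)
    return default, "default"


# Made-up abstract text, for illustration only.
plateau_billions, plateau_how = resolve_anchor(
    "synthetic_plateau",
    "Improvements level off near 300 billion tokens.",
    r"(\d+(?:\.\d+)?)\s*billion tokens",
    default=300.0,
)

# No number in the text, so this anchor falls back to its default.
wall_trillions, wall_how = resolve_anchor(
    "data_wall",
    "This abstract states no token count.",
    r"(\d+(?:\.\d+)?)\s*trillion tokens",
    default=300.0,
)
```

Returning how each value was resolved alongside the value itself is what lets a table like the one above carry a "Resolved" column.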

Modeled curves

  • Source type: synthesized (numpy), not measured.
  • Method: natural text is a power law L = E + A·N^(−α) fit so it passes through the modeled start and wall losses; each synthetic variant branches at the wall and decays toward its own floor with a tunable knee. Anchors are the tunable parameters; the shape knobs (floors, knees, loss scale) are illustrative modeling choices. The whole thing is one re-runnable stage: uv run --project pipeline python posts/the-data-wall/run.py (scrape → scaling_data_laws.yaml → curves → this post’s .data.json).
  • Note: “Curves are modeled from reported anchor values; illustrative, not measured.”
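The method above can be sketched in a few lines. This version uses only the standard library and makes several assumptions the post does not state: the token count where the natural curve starts, the exact functional form of the synthetic branches (here an exponential decay in decades of tokens past the wall), and the branch floor and knee values. It illustrates the described shape and is not the pipeline's code.

```python
import math

FLOOR = 1.50          # irreducible floor named in the chart description
WALL_TOKENS = 3e14    # ~300T tokens
WALL_LOSS = 1.95
START_LOSS = 2.60
START_TOKENS = 1e9    # assumption: the post does not say where the curve starts


def fit_power_law(n0, l0, n1, l1, floor):
    """Solve L = floor + A * N**(-alpha) through two (tokens, loss) points."""
    alpha = math.log((l0 - floor) / (l1 - floor)) / math.log(n1 / n0)
    a = (l0 - floor) * n0 ** alpha
    return a, alpha


def natural_loss(n, a, alpha, floor=FLOOR):
    """Natural-text loss at n training tokens."""
    return floor + a * n ** (-alpha)


def synthetic_branch(n, branch_floor, knee):
    """Loss past the wall, decaying from WALL_LOSS toward branch_floor.

    `knee` is measured in decades of tokens past the wall (assumed form).
    """
    decades_past_wall = math.log10(n / WALL_TOKENS)
    gap = WALL_LOSS - branch_floor
    return branch_floor + gap * math.exp(-decades_past_wall / knee)


A, ALPHA = fit_power_law(START_TOKENS, START_LOSS, WALL_TOKENS, WALL_LOSS, FLOOR)

# Hypothetical branch parameters: a high floor for pure synthetic data,
# a lower one for the selective blend.
pure_at_10x = synthetic_branch(10 * WALL_TOKENS, branch_floor=1.82, knee=1.0)
selective_at_10x = synthetic_branch(10 * WALL_TOKENS, branch_floor=1.58, knee=1.0)
```

Because the power law is solved through two points for a fixed floor, changing any anchor (start loss, wall loss, wall position) moves the whole natural curve, which is consistent with treating anchors as the tunable parameters.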

Video abundance & tokenization anchors (the Modality Map)

The second chart’s numbers come from a curated dataset with a citation on every row (posts/the-data-wall/pipeline/data/video_data_anchors.yaml) rather than from scraping. The headline figures live in vendor docs, newsrooms, and Epoch posts that the arXiv scraper can’t reach, so the YAML is the source of truth, maintained by hand in the same way as the slope post’s curated inputs. Each row is tagged [STATED] (reported directly in the cited source) or [DERIVED] (computed from stated inputs, with the arithmetic in the row’s rationale). The transform that turns these into the bubble geometry is posts/the-data-wall/pipeline/modality.py; the same run.py regenerates both charts.

| Anchor | Value used | Tag | Source |
|---|---|---|---|
| YouTube uploaded per year | ~1 trillion seconds | STATED | 2211.04325: Villalobos et al., Appendix D (500 hrs/min) |
| Information rate | ~30 tokens/sec | STATED | 2211.04325: Villalobos’s conservative information estimate |
| Watch rate (understanding) | ~300 tokens/sec | STATED | 2403.05530: Gemini 1.5 (258 tok/frame @1fps + audio) + API docs |
| Generation rate | ~602 tokens/sec | DERIVED | 2310.05737: MagViT-2, 1,280 tokens / 2.125 s of 128² video |
| NVIDIA Cosmos corpus | 9,000T tokens / 20M hrs | STATED | NVIDIA Cosmos World Foundation Model Platform (CES 2025) |
| Cosmos rate | ~125 tokens/sec | DERIVED | 9,000T ÷ (20M hrs × 3600 s) |
| Largest training set today | ~15T tokens | STATED | Epoch AI, “Can AI scaling continue through 2030?” |
| Cumulative YouTube library | ~10 trillion seconds | DERIVED | 2211.04325: Appendix D, extrapolated from the annual rate |
| Effective human text | ~300T tokens | STATED | 2211.04325: the wall above; single-sourced from the scaling anchors |
  • Modeled axis. Only the y-axis is modeled: density = 30 ÷ actual-rate, i.e. the share of each emitted token that is useful information if Villalobos’s ~30 tokens/sec is taken as the information rate. That single assumption is the modeled part; every x-position is cited or arithmetic from cited rates. Watch-rate video lands on the 300T wall because 1 trillion seconds × ~300 tokens/sec ≈ 300T — a coincidence of the deployed tokenization, not a measured equivalence of content.
  • Re-run: uv run --project pipeline python posts/the-data-wall/run.py --offline (no network — the video anchors are committed, not fetched).
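The modeled density axis reduces to one division. The sketch below applies the stated rule (density = 30 ÷ actual rate) to the three YouTube tokenization rates from the table. Capping the result at 1.0 is an assumption of this sketch, as are the names.

```python
INFORMATION_RATE = 30.0  # tokens/sec, taken as the information rate


def density(tokens_per_second, information_rate=INFORMATION_RATE):
    """Modeled share of emitted tokens that is useful information.

    Capped at 1.0 so a tokenizer at or below the information rate
    reads as fully dense (an assumption of this sketch).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return min(1.0, information_rate / tokens_per_second)


rates = {"as information": 30, "as watched": 300, "as generated": 602}
densities = {name: round(density(rate), 2) for name, rate in rates.items()}
# densities == {"as information": 1.0, "as watched": 0.1, "as generated": 0.05}
```

Every density value therefore inherits its meaning from the single 30 tokens/sec assumption; changing that one number rescales the whole axis.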

Grounding benchmarks (the physics chart)

Reported benchmark figures — measured, not modeled — curated in posts/the-data-wall/pipeline/data/grounding_benchmarks.yaml, validated by pipeline/grounding.py, shaped into the grouped bars by pipeline/physics.py. Each row is tagged [STATED] (a reported number) or [DERIVED] (a stand-in for a qualitative finding — e.g. “near chance,” plotted at 50%).

| Benchmark / quantity | Value | Tag | Source |
|---|---|---|---|
| V-JEPA on IntPhys (latent prediction) | 98% | STATED | 2502.11831: Garrido et al., intuitive physics |
| V-JEPA on InfLevel / GRASP | 62% / 66% | STATED | same: harder variants already pull it down |
| Generative video + multimodal LLMs on IntPhys | ≈ chance (50%) | DERIVED | same: “remain close to chance levels or untrained networks” |
| V-JEPA 2 on IntPhys 2 (harder scenes) | ~52% | STATED | 2506.09849: IntPhys 2 benchmark |
| Humans, IntPhys 2 / across benchmarks | 96.4% / 85–95% | STATED | 2506.09849; Meta blog |
| V-JEPA 2 zero-shot robot pick-and-place | 65–80% | STATED | 2506.09985: V-JEPA 2 |
| V-JEPA 2, Something-Something v2 / PerceptionTest | 77.3 / 84.0 | STATED | same |
  • Chance = 50% (a 2-way possible-vs-impossible classification). The “predict the pixels/text” bars are plotted at the chance line as a faithful stand-in for the papers’ qualitative “near chance” result — not a precise per-model measurement.
  • Scope, not precision, is the caveat. These measure physical-plausibility perception, not language reasoning. The clean wins are physical/embodied; whether video grounding improves text reasoning is unproven — the section ends open on purpose.