The Real AI Gold: Why Data Infrastructure Will Outperform Models

As model architectures stabilize, the next competitive moat is the messy work of data pipelines, labeling and marketplaces — and investors are starting to notice.

Pedro Marini
June 21, 2026 · 4 min read

Illustration by IMF Alpha editorial · Reviewed by Pedro Marini

Why the fuss about data now?

Model headlines — bigger transformers, fancier multimodal tricks — grab attention. But the quiet, costly work that rarely makes the front page is data sourcing, cleaning and governance. For years the obvious levers were compute and model size; now the pendulum is swinging toward data quality and access. Think less oil and more soil: fertile, tended, and compounding over time. Sounds boring. It also matters more than most people realize.

A short history to keep in mind

  • Early machine learning ran on clever models and scarce labeled sets. Teams squeezed performance out of small, curated datasets.
  • The last five years flipped the script: scale of data plus scale of compute produced foundation models that generalize in surprising ways.
  • With foundational architectures widely available, the next frontier is the proprietary data and the plumbing that converts raw logs, images and documents into training-grade datasets. Yes, the messy ETL and annotation work.

Where value is being created today

  • Data marketplaces and clean rooms are turning private and third-party datasets into monetizable, privacy-aware assets. Snowflake Marketplace is a clear example of data being packaged and transacted.
  • Labeling and annotation platforms are the bridge between raw data and usable training examples. Cloud vendors are bundling labeling into their ML suites, which makes the path to production smoother and stickier.
  • Synthetic data and programmatic labeling lower dependence on costly human annotation and reduce exposure of sensitive records. They help, but they are not a perfect substitute.
  • Data ops — versioning, lineage, cataloging — are becoming table stakes for any enterprise that wants reproducible ML at scale.

What's interesting here is how these pieces interact: better marketplaces drive more labeling demand, which in turn increases the need for governance and lineage.

Why investors should care

  • Margins. Data infrastructure often sells like software with recurring revenue. Once datasets and pipelines are embedded, switching costs go up.
  • Durability. Proprietary data partnerships and curated marketplaces create moats that are harder to replicate than a model checkpoint on the internet.
  • Diversification. Backing data suppliers and tooling spreads exposure beyond a single-model bet and ties value to real-world signals across industries.

It is not a slam dunk, but it carries a different risk profile from betting only on model architects.

Counterpoints and risks

  • Commoditization is real. If third-party data becomes indexed and standardized, pricing pressure follows.
  • Regulation on provenance and consent is tightening. Compliance will add cost and may restrict useful datasets.
  • Overreliance on synthetic data can mislead models if the generated distributions drift from reality.
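
The synthetic-data risk above is one that teams can at least measure. Here is a minimal sketch, assuming a single numeric feature and using the Population Stability Index (PSI) as the drift score. The function name `psi`, the bin count and the sample data are illustrative choices, not something the vendors discussed here are known to use.

```python
import math
import random


def psi(real, synthetic, bins=10):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the real sample. A score of 0
    means the binned distributions match; larger scores mean more drift.
    """
    lo, hi = min(real), max(real)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the real sample is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = shares(real), shares(synthetic)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: one faithful synthetic sample, one that has drifted.
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
faithful = [rng.gauss(0.0, 1.0) for _ in range(5000)]
drifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print("faithful:", round(psi(real, faithful), 4))
print("drifted: ", round(psi(real, drifted), 4))
```

A score like this covers one feature at a time and says nothing about joint structure, so it is a tripwire rather than proof that synthetic data is safe to train on.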

Practical signals to watch

  • Partnerships between cloud giants and niche data providers; those deals often set distribution patterns and pricing power.
  • Growth in data marketplace transactions and in-platform billing on exchanges — billing is a surprisingly good leading indicator.
  • Adoption metrics for governance tools: active catalogs, production lineage coverage, and the rate of labeled dataset growth in production.

A quick investor map

  • Big cloud and data-platform incumbents are folding data services into their stacks. That gives them optionality: either capture more of the data value chain or partner with specialists.
  • Specialist vendors — labeling, synthetic data, data ops — can expand multiples if they prove enterprise retention and pricing power. Watch churn and net revenue retention closely.
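
For readers who want to check those retention figures themselves, the sketch below shows one common way to compute net revenue retention, gross revenue retention and logo churn from two snapshots of recurring revenue. Companies define these metrics differently in their filings, so the formulas, the function name `retention_metrics` and the customer figures are assumptions for illustration only.

```python
def retention_metrics(start, end):
    """Compute retention metrics for the cohort present at the start.

    start, end: dicts mapping customer id -> recurring revenue in that period.
    Customers who appear only in `end` (new logos) are ignored.
    """
    start_total = sum(start.values())
    if not start or start_total <= 0:
        raise ValueError("start cohort must have positive revenue")

    # Net retention: what the starting cohort pays now, expansion included.
    net_total = sum(end.get(c, 0.0) for c in start)
    # Gross retention: cap each customer at its starting revenue (no expansion credit).
    gross_total = sum(min(end.get(c, 0.0), rev) for c, rev in start.items())
    # Logo churn: share of starting customers that no longer pay anything.
    churned = sum(1 for c in start if end.get(c, 0.0) <= 0)

    return {
        "nrr": net_total / start_total,
        "grr": gross_total / start_total,
        "logo_churn": churned / len(start),
    }


# Hypothetical cohort: "a" expands, "b" contracts, "c" churns, "d" is a new logo.
start = {"a": 100.0, "b": 50.0, "c": 50.0}
end = {"a": 130.0, "b": 40.0, "d": 70.0}
metrics = retention_metrics(start, end)
print(metrics)
```

Reading the three numbers together matters: a vendor can post healthy net retention while losing a third of its logos, as long as the survivors expand fast enough.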

The shiny race to build bigger models will go on. But the organizations that control the messy, sticky plumbing of data are likely to be the strategic winners over the next five years. For investors tired of putting all chips on model makers, the quietly growing world of data infrastructure offers a more defensible, revenue-bearing route into the AI era.

Author note: I track product integrations and billing signals more than press releases. The shift toward data-first strategies feels like the end of one chapter and the start of a more boring — but far more profitable — one.
