Synthetic Data

Banks Are Buying Fiction: How Synthetic Data Is Rewiring Finance AI

Synthetic and curated datasets are emerging as the missing link between privacy, model performance, and regulatory pressure — and investors should pay attention.

Pedro Marini
June 14, 2026 · 4 min read
Illustration by IMF Alpha editorial · Reviewed by Pedro Marini


Here’s a simple, unsettling premise: financial institutions are training mission‑critical AI on data that never happened. It sounds like science fiction, but it’s a pragmatic response to three forces colliding on Wall Street and in Silicon Valley — tighter privacy rules, a shortage of labeled rare events, and the growing cost of chewing through ever‑larger raw datasets.

Think of synthetic data as a dress rehearsal for the market. Not the live performance, but a controlled room where you can trigger rare failures without blowing up a client portfolio.

Why this matters now

  • Privacy and compliance stopped being checkboxes and joined the boardroom conversation. Banks need datasets that keep analytic value while stripping out personally identifiable details.
  • Tail events are, by definition, uncommon. Synthetic generation lets models encounter more of the extreme scenarios they’ll have to endure.
  • Cloud pricing, plus new platforms for on‑demand generation and curated marketplaces, means this is now practical for mid‑tier banks and fintechs — not only the hyperscalers.

Who’s building the plumbing

  • Data marketplaces and platforms are positioning themselves as the distribution layer for curated and synthetic sets. Expect friction between data‑native startups and the cloud incumbents.
  • Model‑infrastructure vendors and GPU suppliers are selling the flip side: more compute to train on generated data, along with validation tooling.
  • You’ve seen early examples: Snowflake‑style marketplaces hosting curated datasets; lakehouse vendors packaging labeled financial feeds; niche startups offering synthetic transaction streams tuned for fraud detection. Components exist. The winner will be whoever stitches generation, validation, and deployment into an automated workflow.

The tradeoffs — because there are always tradeoffs

  • Upside: faster iteration, fewer privacy headaches, better stress‑testing of edge cases.
  • Downside: synthetic data can bake in designer biases. Train on worlds that are too neat and models will stumble hard when the real, messy world returns.
  • Regulatory risk: supervisors care about outcomes, not methodology. A discriminatory or unstable model invites scrutiny even if the training data were entirely synthetic.

A useful analogy: synthetic data is like a flight simulator. Pilots can practice for hurricanes they hope never to face. But a simulator only helps if it mirrors the physics; otherwise you ingrain the wrong instincts.

What this means for investors and executives

  • Keep an eye on partnerships between data marketplaces and model vendors. They’re an early indicator of product‑market fit.
  • Demand verifiable validation tooling: methods that compare synthetic distributions against held-out real data, plus independent audits of model behavior.
  • Be skeptical of vendors promising perfect privacy and perfect accuracy. The practical winners will accept imperfect guarantees in exchange for rigorous measurement and governance.
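
To make "compare synthetic distributions against held-out real data" concrete, here is a minimal sketch of one such check, the two-sample Kolmogorov-Smirnov statistic, written in plain Python. It is an illustration under stated assumptions, not a description of any vendor's tooling: the function names, the log-normal "transaction amounts", and the seeds are all hypothetical, and the "real" holdout is itself simulated so the example stays self-contained.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0 = identical samples, 1 = no overlap)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def sample_amounts(seed, sigma, n=2000):
    """Hypothetical stand-in for transaction amounts (log-normal draws)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(3.0, sigma) for _ in range(n)]


# Assumption: in practice real_holdout would be a small, audited sample of
# real data; here it is simulated so the sketch runs on its own.
real_holdout = sample_amounts(seed=0, sigma=1.0)
faithful_synth = sample_amounts(seed=1, sigma=1.0)   # same shape as the holdout
too_neat_synth = sample_amounts(seed=2, sigma=0.5)   # tails thinner than reality

ks_faithful = ks_statistic(real_holdout, faithful_synth)
ks_too_neat = ks_statistic(real_holdout, too_neat_synth)
print(f"faithful: {ks_faithful:.3f}  too neat: {ks_too_neat:.3f}")
```

The "too neat" generator should score a visibly larger gap than the faithful one. A single-column statistic like this is only a first screen: it says nothing about correlations between fields or about how a downstream model behaves, which is where independent audits come in.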

A short playbook

  • For risk officers: require adversarial testing of any model trained on synthetic sets, and insist on a small, audited real‑data holdout to check performance.
  • For CTOs: build pipelines that can label, generate, version, and reproduce datasets. Prioritize reproducibility over one‑off gains.
  • For investors: favor firms selling verifiable tooling and strong data governance, not those pitching synthetic generation as a checkbox feature.
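
What "version and reproduce" might look like in practice can be sketched in a few lines: generate from an explicit seed and configuration, then fingerprint the configuration and the output together so any later run can be checked against the original. Everything here is an assumption made for illustration, including the function names, the config fields, and the toy fraud-versus-normal amount distributions; a production pipeline would also pin the interpreter and library versions, since random streams are only guaranteed stable within a given environment.

```python
import hashlib
import json
import random


def generate_transactions(seed, n_rows, fraud_rate):
    """Hypothetical generator: seeded, so the same config yields the same rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        is_fraud = rng.random() < fraud_rate
        # Toy assumption: fraudulent amounts are drawn from a larger scale.
        amount = round(rng.lognormvariate(6.0 if is_fraud else 3.0, 1.0), 2)
        rows.append({"id": i, "amount": amount, "is_fraud": is_fraud})
    return rows


def dataset_fingerprint(config, rows):
    """SHA-256 over the config and the rows, serialized with sorted keys."""
    payload = json.dumps({"config": config, "rows": rows}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = {"seed": 42, "n_rows": 1000, "fraud_rate": 0.02}

first_run = generate_transactions(**config)
second_run = generate_transactions(**config)

version_id = dataset_fingerprint(config, first_run)
reproduced = dataset_fingerprint(config, second_run) == version_id
print(f"dataset version {version_id[:12]}  reproduced: {reproduced}")
```

Storing the fingerprint next to every trained model gives auditors a way to confirm which dataset a model actually saw, which is the kind of lineage that outlasts a one-off accuracy gain.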

Why the history matters

The last decade in finance was about amassing data: tick feeds, alternative datasets, web scraping. The next phase is about stewardship. Volume still matters, but lineage, quality, and defensible privacy practices will be the differentiators. This isn’t another arms race to hoard logs; it’s a move toward curated, explainable inputs.

Where we land

Synthetic data won’t replace real‑world signals. Nor should it. But as a bridge between privacy, cost, and robustness, it’s reshaping how financial AI gets built. Expect a messy transition — new vendors, new audit regimes, and a fair bit of pushback when synthetic‑trained models misbehave. For now, the sensible play is not to accept the fiction blindly, but to fold it into governance, measurement, and skeptical testing.
