Synthetic Data

Why Synthetic Data Is the New Commodity Banks Are Buying

Financial firms are swapping raw customer records for algorithmically generated datasets. The swap lowers legal risk and speeds model building, but it also forces new trade-offs.

Pedro Marini
June 15, 2026 · 4 min read
Illustration by IMF Alpha editorial · Reviewed by Pedro Marini


The shift is underway. Over the past 18 months a quiet but consequential migration has begun inside banks, payments firms, and fintechs: engineering teams that once begged for more real customer rows are now asking for better synthetic ones.

This is not a novelty stunt. Synthetic datasets—produced by GANs, variational autoencoders, and probabilistic samplers—let institutions train models without moving sensitive personal records. For compliance officers juggling CPRA and GLBA, and for ML teams constantly hitting data access bottlenecks, that combination is hard to ignore.
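
To make the simplest of those generator families concrete, here is a minimal sketch of a probabilistic sampler in Python. It is an illustration under stated assumptions, not how any bank or vendor builds its generator: the record fields (amount, channel), the toy rows, and the function names fit_sampler and sample_rows are all hypothetical, and the model treats each column independently.

```python
import random
import statistics

# Hypothetical toy records standing in for real transactions (not real data).
REAL_ROWS = [
    {"amount": 12.50, "channel": "card"},
    {"amount": 80.00, "channel": "card"},
    {"amount": 43.20, "channel": "wire"},
    {"amount": 19.99, "channel": "card"},
    {"amount": 250.00, "channel": "wire"},
]


def fit_sampler(rows):
    """Fit independent per-column models: a normal distribution for
    amounts and observed frequencies for channels."""
    amounts = [r["amount"] for r in rows]
    channels = [r["channel"] for r in rows]
    mu = statistics.mean(amounts)
    sigma = statistics.stdev(amounts)  # needs at least two rows
    levels = sorted(set(channels))
    weights = [channels.count(c) for c in levels]
    return mu, sigma, levels, weights


def sample_rows(model, n, seed=0):
    """Draw n synthetic rows from the fitted model (seeded for repeatability)."""
    mu, sigma, levels, weights = model
    rng = random.Random(seed)
    return [
        {
            # Clamp at zero so the sketch never emits a negative amount.
            "amount": round(max(0.0, rng.gauss(mu, sigma)), 2),
            "channel": rng.choices(levels, weights=weights, k=1)[0],
        }
        for _ in range(n)
    ]


if __name__ == "__main__":
    model = fit_sampler(REAL_ROWS)
    for row in sample_rows(model, 3):
        print(row)
```

Because each column is modeled on its own, a sampler like this discards relationships between columns. More expressive generators such as GANs and variational autoencoders aim to learn that joint structure instead.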

Why now

  • Speed: Generating labeled samples collapses weeks of wrangling into days. Iteration cycles get tighter.
  • Privacy: Well-constructed synthetic data reduces the attack surface for leaks and makes consent and sharing simpler.
  • Cost: Annotation and procurement budgets stretch further when synthetic sets can stand in for expensive labeling campaigns.

Still, the upside comes with some important caveats.

What synthetic data buys you—and where it falls short

Use cases where synthetic data really helps

  • Fraud and anomaly detection: you can synthesize rare-but-important attack scenarios that barely appear in production logs.
  • Model validation and stress testing: generate extreme yet plausible customer behaviors to probe resilience.
  • Feature engineering and prototyping: get early signals without waiting for legal sign-offs or full production datasets.

Limits and risks

  • Tail fidelity: generators often smooth over rare interactions, and those long-tail quirks are exactly what trigger real-world fraud.
  • Model leakage: a sloppy generator can end up regurgitating private rows, trading one privacy problem for another.
  • Regulatory skepticism: auditors and examiners still prefer evidence backed by line-level provenance or strict data lineage.
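
The leakage risk in the list above can be screened for with a basic check. The sketch below assumes rows are tuples of numeric features; the function names exact_copy_rate and min_distance_to_real are hypothetical, and the choice of Euclidean distance is an assumption for illustration.

```python
import math


def exact_copy_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    if not synthetic_rows:
        return 0.0
    real_set = {tuple(r) for r in real_rows}
    copies = sum(1 for s in synthetic_rows if tuple(s) in real_set)
    return copies / len(synthetic_rows)


def min_distance_to_real(real_rows, synthetic_row):
    """Euclidean distance from one synthetic row to its closest real row.
    Very small values suggest a near-duplicate of a real record."""
    return min(math.dist(real, synthetic_row) for real in real_rows)


if __name__ == "__main__":
    # Hypothetical toy rows: (amount, hour_of_day).
    real = [(12.5, 9.0), (80.0, 14.0), (250.0, 23.0)]
    synthetic = [(12.5, 9.0), (79.0, 14.0), (400.0, 3.0)]
    print("exact copy rate:", exact_copy_rate(real, synthetic))
    for row in synthetic:
        print(row, "closest real distance:", min_distance_to_real(real, row))
```

A clean result here is a screen, not proof: a zero copy rate does not amount to a formal privacy guarantee, which is why independent privacy validation still matters.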

A short history helps explain the rush

Ten years ago, ML projects in finance were routinely stalled by legal reviews and slow anonymization. That split teams into two camps: conservative risk managers who tightly controlled data, and product groups that quietly built shadow datasets to move faster. Synthetic data promises a reconciliation: analytical freedom without handing out raw PII. That promise, unsurprisingly, has attracted investor interest and vendor activity. Cloud providers and GPU makers, the firms that power large-scale generation, stand to benefit as enterprises adopt these tools.

Where the market may split

Over the next 12–24 months I expect two distinct paths to form:

  • Enterprise-grade platforms that prioritize governance, audit trails, and provable privacy guarantees. Those will appeal to banks and regulated fintechs.
  • Lightweight, open-source toolkits that startups and research labs prefer for flexibility and speed, trading governance for agility.

It mirrors a familiar division in enterprise software: the safer, compliance-first route, and the scrappier innovation track.

What this means for investors and execs

  • Watch for real adoption signals: vendor deals with major banks, SOC 2 / ISO certifications, and independent privacy audits will be more telling than flashy demos.
  • Monitor compute demand. Large-scale synthetic generation is GPU-heavy, so rising adoption should lift demand for hardware and cloud providers.
  • Be skeptical of marketing claims. Ask for concrete metrics: how closely do synthetic distributions match the originals? Do stress tests include rare-event fidelity?
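
The metrics questions in the last bullet can be made concrete. Here is a minimal sketch, assuming a single numeric column such as transaction amount: a two-sample Kolmogorov-Smirnov statistic for overall distribution match, and a simple exceedance rate as a crude proxy for rare-event fidelity. The function names, toy values, and threshold are hypothetical. In practice a tested implementation such as scipy.stats.ks_2samp would be the safer choice.

```python
import bisect


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0.0 = identical, 1.0 = fully separated)."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def tail_rate(values, threshold):
    """Share of values strictly above a threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


if __name__ == "__main__":
    # Hypothetical toy amounts; the real column has one extreme value.
    real = [10, 12, 15, 18, 20, 22, 25, 30, 45, 5000]
    synthetic = [11, 13, 14, 17, 21, 23, 24, 29, 40, 60]
    print("KS statistic:", ks_statistic(real, synthetic))
    print("real tail rate:", tail_rate(real, 1000))
    print("synthetic tail rate:", tail_rate(synthetic, 1000))
```

Reporting both numbers matters because the two can disagree: a generator may track the bulk of a distribution closely while producing no extreme values at all.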

The practical upshot

Synthetic data is not a magic bullet, but it is a useful lever. For US financial firms constrained by privacy rules and competitive pressure, it offers a way to move faster without shrugging off compliance. Expect cautious, measured adoption driven by governance features rather than a wholesale replacement of live-data pipelines.

If you work in ML or risk at a bank, start small: pilot synthetic datasets on noncritical models, require independent privacy validation, and treat governance as the gating factor. The first teams that get that balance right will unlock real advantage without inviting regulatory headaches.
