Your Next Phone Will Think Offline: The Race to Run LLMs On-Device
From Apple and Google to open-source Llama hacks, a new wave of on-device LLMs promises private, low-latency AI—here's what consumers, developers and investors should watch.

Illustration by IMF Alpha editorial · Reviewed by Pedro Marini
The headline is simple: phones and laptops are about to do serious language understanding without a round trip to the cloud. It sounds like a small engineering tweak, but it changes privacy guarantees, latency math, and the business model for AI services in ways that feel bigger the closer you look.
In 2024–25 we slid from cloud-first toward a hybrid reality. Big models trained on server farms still drive frontier research, but a fast-growing group of smaller, optimized LLMs plus mobile neural engines now make genuinely useful offline assistants possible. Engineers are combining quantization tricks, distilled 7B–13B models, and runtimes like llama.cpp and Core ML to get meaningful responses in a few hundred milliseconds on modern phones. It’s not magic; it’s a stack of practical compromises that suddenly add up.
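To see why quantization matters so much on a phone, it helps to do the memory arithmetic. The sketch below is a back-of-envelope estimate, not a benchmark: the function name `weight_memory_gib` is made up for illustration, and it counts model weights only. Real quantized formats store extra scaling data, so their effective bits per weight run a little above the nominal figure, and runtime memory for the context cache and activations comes on top.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Quantization metadata, the KV cache, and activations are not counted.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Compare the model sizes mentioned above at common precisions.
for params in (7, 13):
    for bits in (16, 8, 4):
        size = weight_memory_gib(params, bits)
        print(f"{params}B parameters at {bits}-bit: {size:.1f} GiB")
```

Under these assumptions a 7B model shrinks from about 13 GiB at 16-bit precision to roughly 3.3 GiB at 4-bit, which is the difference between not fitting in a phone's memory at all and fitting alongside the operating system and other apps.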
Why now
The practical payoff: instant drafting, transcription correction, and contextual search feel native when latency falls from a cloud round trip of a second or more to a few hundred milliseconds locally. For a personal finance app that parses spending in natural language, that can mean sensitive statements never leave the device. That is a privacy win, and it makes for a cleaner regulatory story.
Trade-offs are real
Market and product implications
A few things that surprise people
This feels like a partial reversal of the cloud-first swing after GPT-3. The pendulum is nudging back toward edge computing as mobile silicon and model engineering converge. Think early smartphone days: when capabilities moved onto devices, new winners and new regulations followed.
Watch for
Expect your next phone to do more of your thinking offline. That shifts user experience and privacy economics, and it repositions where value accrues—toward silicon and platform trust as much as cloud compute. For consumers this is mostly good news; for investors and product teams it means rethinking which layers of the stack actually capture value.
Pedro Marini