Apple Silicon's bet on on-device AI hasn't moved — only the chip range that runs it

Days before WWDC, Apple's silicon team argues that its on-device AI bet hasn't shifted in nine years. It has only compounded across the range.


Three days before WWDC opens in Cupertino, and at the end of a Computex week in which NVIDIA, Qualcomm and Intel all took turns on the Taipei stages arguing, each in its own words, for hybrid AI split between cloud and device, Apple's silicon team is making a quieter point: its bet on running AI on the device hasn't really moved.

The Neural Engine has been in Apple chips since 2017. Unified memory has been a foundational design choice since the first A-series part. What's changed in nine years isn't the philosophy — it's the silicon range that now delivers it, which today stretches from a $599 MacBook Neo to an M5 Max workstation chip with two dies bonded into a single package.

That, in essence, was Doug Brooks's argument in a briefing on the sidelines of Computex. Brooks is the senior product manager for Apple Silicon, and the conversation was on the record and product-focused: no demo, no marketing slide deck, no roadmap. Just the chip.

His opening line is worth quoting, because it's the spine of everything that followed:

If you want to build a great device for AI, you need to build a great computer.

You can read that as evasion — Apple deflecting the AI conversation back to the hardware it already makes — or you can read it as the actual thesis: on-device AI doesn't get bolted on to a phone or a Mac; it gets designed in from the silicon up, and Apple has been doing exactly that since the A11 Bionic introduced the Neural Engine in 2017. As Brooks put it, with a flicker of dryness, "well before the AI PC trend broke."

The same week, three silicon CEOs took the stage

Brooks's case is worth setting against the rest of Computex. The show's keynote slots belonged to NVIDIA's Jensen Huang, Qualcomm's Cristiano Amon and Intel's Lip-Bu Tan, and each of them spent two hours arguing — in their own words — for the same thing: that AI is not going to live entirely in the cloud.

Huang announced the NVIDIA RTX Spark, a new Windows PC superchip built with MediaTek that uses NVLink-C2C to connect a Blackwell GPU die to a Grace CPU die with coherent memory across both. "The PC is being reinvented," he said. The same keynote put NVIDIA's Vera Rubin AI-factory platform into full production. The bet is plainly on both ends — cloud-scale data centres and on-device PC silicon, working as a continuum.

Amon's framing was sharper. He declared 2026 the "Year of the Agent," introduced Snapdragon C for $300 entry-level Windows AI PCs and Snapdragon X2 Elite for premium ones, and unveiled Dragonfly, a brand-new data-centre AI inference chip line that marks Qualcomm's full entry into the cloud silicon market. He called the underlying strategy the "Computing Continuum" — workloads dynamically allocated across devices, the edge, and the cloud. "The agent isn't tied to the device," he said. "It moves with the user."

Tan's Intel keynote landed the most explicit version of the case. On stage with Perplexity CEO Aravind Srinivas, Intel ran a live hybrid-inference demo on a Core Ultra Series 3 laptop: the local model flagged what was sensitive, kept that on the device, and sent only non-sensitive material to the cloud. Tan put it plainly afterwards — "privacy, security, compliance and cost are driving the need for hybrid compute."

Three companies, three stages, three flavours of the same argument. AI is not all going to live in the cloud. Some of it has to live on the device, for privacy or for cost or for both, and the silicon has to deliver that across a range from a phone to a workstation.

That is essentially the case Apple has been making since the A11 Bionic introduced the Neural Engine in 2017, with the small detail that Apple did it without naming a strategy. Brooks did not use the phrases "Computing Continuum" or "hybrid compute" or "AI factories." His vocabulary was scalable, balanced, unified, and foundational — words Apple has used since the A11. Whether the rest of the industry has caught up by reinventing language or by adopting Apple's positions wholesale is the open question. What is no longer the question this Computex week is whether on-device AI matters.

The fundamentals haven't moved

Brooks returned repeatedly to four foundations: a scalable architecture, a balanced architecture, unified memory, and an insistence on performance per watt. None of those words have changed since the early days of Apple Silicon. What has changed is how aggressively each has been scaled.

The Neural Engine, in particular, has gone from a single block on a phone chip to a feature shared across A-series and M-series, with billions of Apple devices now AI-accelerated by Brooks's count. The framework story rode along in parallel — Core ML in 2017, then today's MLX open-source project, Metal Performance Shaders, the Foundation Models API and the Apple Intelligence APIs. They all sit atop the Neural Engine, the CPU, or the GPU.

Asked what is actually hard about this, what keeps the team awake, Brooks gave an answer that was less about technology and more about discipline.

It's really easy to throw more performance and more transistors at problems without bounding it. You have to continue to keep that focus on power efficiency. We're not willing to sacrifice a huge amount of battery life for raw performance on its own.

It's a reply that quietly draws a line between Apple's approach and an industry currently celebrating multi-kilowatt AI workstations. As Brooks framed it, Apple's bound is the phone in your pocket. Everything else is engineered downwards from that constraint.

Unified memory: designed for efficiency, kept for AI

Apple chose unified memory long before generative AI was a consumer concern. The original case was efficiency — eliminating wasteful data shuffling between a CPU's large, slow memory pool and a GPU's smaller, faster one. As Brooks described the old architecture, the system spent meaningful time "copying data back and forth, which really doesn't benefit the user, but it was kind of a necessity."

Unified memory replaced that with a single pool every part of the chip can address: CPU, GPU, Neural Engine. When asked at what point Apple realised this efficiency play had quietly become an AI advantage, Brooks's answer was, accurately, that the realisation came after the fact. "Unified memory just makes so much sense right now," he said. "After Apple re-invented it" was the obvious response, and he laughed.

That single pool is what lets each Apple chip run an AI model up to its memory capacity, with whichever piece of silicon — Neural Engine, GPU, or CPU — is the right tool for the workload. No copies, no domain transfers. As Brooks put it, when the Neural Engine produces a result and the CPU needs it next, without unified memory, you're moving data around; with it, you're not.

Neural Accelerators: the recent compounder

The newest piece of the on-device AI stack is Neural Accelerators, and they tell the most interesting story about how Apple's silicon strategy actually compounds.

Brooks traced the lineage. Apple first added Neural Accelerators to the CPU in 2019 with the A13, as dedicated matrix and vector math units. Their sweet spot, in his words, is "low-latency, bursty AI tasks — things like speech recognition, speech generation." They handle the workloads where the model needs to be small, and the answer needs to be instant.

The bigger move came last year, when Apple put Neural Accelerators on every GPU core. That arrived first on the A19 Pro, the chip in the iPhone 17 Pro, and then propagated through the M5 family, with M5 Pro and M5 Max picking them up this spring. The result, per Brooks: "a huge boost in AI performance in the GPU domain. That's been a huge benefit for a wide range of AI workloads, from LLMs to image generation."

Taken together, the picture is a chip in which AI acceleration is no longer a single block — the Neural Engine — but spread across three domains: the CPU's matrix units, the GPU's per-core accelerators, and the Neural Engine itself. The point is choice. Different on-device workloads land best on different silicon. The chip provides all three.

What that looks like in practice

The case Brooks made in the abstract becomes sharper when you look at the apps already in the field. Apple shared comments from four developers whose products lean heavily on the on-device AI Brooks described.

Whisper Transcription, from Jordi Bruin, owner of Good Snooze, runs the entire transcription workflow on-device using the Neural Engine and GPU. The first version shipped in 2023, when M-series MacBooks had only just made long-form on-device transcription practical. The progress since then is the kind of compounding the silicon team would point to.

Model improvements, machine learning frameworks such as Core ML, and the Neural Accelerators in the M and A-series chips sped up the transcription workflow enormously since then. Users can now transcribe files at more than 300 times real-time speed, meaning an hour-long interview can be transcribed locally in under 15 seconds right on their iPhone or Mac.

Whisper Transcription has been my go-to app for a couple of years, and for anyone who works with audio, the privacy story matters as much as the speed. The app, in Bruin's words, "takes advantage of Apple silicon and Apple frameworks to run the whole transcription process on device, so users can keep sensitive recordings private instead of sending them to a remote server." Apple Intelligence integration, alongside other local AI options like Ollama and LM Studio, means downstream features — chat, summaries, follow-up actions — can also stay local.
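The speed figure in that quote is easy to sanity-check. The sketch below is only the arithmetic behind "300 times real-time": the function name is made up for illustration, and the 300x factor is treated as a constant, which real workloads will not match exactly.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to transcribe audio at a given multiple of real-time speed."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 60 * 60  # seconds of audio in an hour-long interview
at_300x = transcription_seconds(one_hour, 300)

# At exactly 300x, an hour of audio takes 12 seconds, so "more than 300x"
# is consistent with the "under 15 seconds" claim in the quote.
print(f"{at_300x:.0f} s")
```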

SwingVision, from CEO Swupnil Sahai, is the app Apple mentioned in its "sports tracking" on-device AI use case. It turns a single smartphone into a pro-quality stats and officiating system for tennis and pickleball, generating TV-quality highlights from court-side video. It is also one of the more demanding workloads anyone runs on a phone: 1080p footage at 60 frames per second, processed in real time, often on a court in direct sun.

To maximise ball tracking accuracy, SwingVision's AI models process 1080p video at 60 frames per second. Thanks to our models running on Apple silicon, tennis and pickleball players can experience real-time line calling and audio feedback on any court in the world — all without worrying about battery life or overheating.

That last clause — without worrying about battery life or overheating — is the iPhone-envelope answer we did not get directly from Brooks. SwingVision is the existence proof.
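To put numbers on how demanding that workload is, here is a rough back-of-the-envelope sketch. It assumes "1080p" means the standard 1920×1080 frame and that every frame is processed in full; SwingVision's actual pipeline is not described in the source and may downscale or crop.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60  # assumed: standard 1080p frame at 60 fps

frame_budget_ms = 1000 / FPS             # time available to process each frame
pixels_per_frame = WIDTH * HEIGHT        # 2,073,600 pixels
pixels_per_second = pixels_per_frame * FPS  # 124,416,000 pixels

print(f"per-frame budget: {frame_budget_ms:.2f} ms")
print(f"pixels per second: {pixels_per_second:,}")
```

Under those assumptions, real-time processing means finishing each frame in under 17 milliseconds, continuously, which is why the battery and thermal clause carries so much weight.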

LM Studio makes the most widely used local-LLM desktop app and recently acquired Locally AI from its creator, Adrien Grondin, bringing the same model-running capability to iPhone and iPad. Its experience is the cleanest endorsement of MLX, the open-source framework Brooks listed alongside Core ML and Metal Performance Shaders.

"MLX is incredibly performant and highly optimised for Apple silicon, and we get to work with an inference framework that leverages the hardware and software best," said LM Studio CEO Yagil Burowski. Bringing the experience across iPhone and Mac, he added, was "straightforward", which is exactly the build-once, scale-everywhere case Brooks was making.

Grondin's framing gets at the iPad-and-iPhone side of the case. "LM Studio and the Locally app democratise access to advanced models, enabling users to easily download and run specific LLMs locally on device. Apple silicon and MLX allowed us to enable a truly seamless on-device AI experience across iPhone, iPad and Mac."

CollaNote, Quoc Huy's iPad-first note-taking app, makes the multimodal version of the same argument. AI assistants inside notes, lecture transcription linked to handwriting, sketches turned into illustrations — these are use cases that were not previously practical on a tablet.

Apple silicon is making a new generation of AI features possible directly on iPad. For CollaNote, this means we can build experiences that used to feel impossible to run on device — from an AI assistant inside your notes, to lecture transcription connected with handwriting, to creative tools that turn simple sketches into beautiful illustrations — helping students learn, review, and express ideas more naturally.

The interesting bit is the layering: handwriting recognition, speech-to-text and image generation all happening on the same chip, all without leaving the device. That is exactly what Brooks's "AI acceleration across the entire chip" architecture is meant to enable, and CollaNote is the app showing what it looks like when it works.

Why developers don't get direct Neural Engine access

The MLX story LM Studio described — same framework, same model code, running across iPhone, iPad and Mac — is exactly what Apple's framework abstraction is designed to deliver, and it is also why Apple doesn't let developers program the Neural Engine directly. They reach the AI silicon through Core ML, the Foundation Models framework or MLX — not by writing to the NPU. Asked whether there's a case for opening up lower-level access, Brooks's answer was a defence of the abstraction itself.

Abstraction brings a lot of flexibility. On one class of device you might want to send a workload to the Neural Engine because that's the fastest, most efficient way to run it. On a newer chip with Neural Accelerators in the GPU, that might be the better way. The abstraction lets developers trust the system to do the best thing at the time.

The user-facing pitch is that developers don't have to rewrite code when Apple changes the silicon underneath. The flip side, which Brooks didn't say but is true, is that this also makes the platform deeply Apple-shaped: a developer building an on-device AI feature on Core ML is committing to the framework rather than to a chip, and the chip can rewire itself without breaking them. That's flexibility for both sides — and it's also lock-in by another name.
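A toy sketch can make the shape of that abstraction concrete. Everything here is hypothetical: the `Chip` type, the workload labels and the selection rules are invented for illustration, loosely following the examples Brooks gave, and are not Apple's scheduling logic or any real Core ML or MLX API.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    CPU = "cpu"
    GPU = "gpu"
    NEURAL_ENGINE = "neural_engine"


@dataclass(frozen=True)
class Chip:
    name: str                          # hypothetical label, not a real part name
    gpu_has_neural_accelerators: bool


def pick_domain(chip: Chip, workload: str) -> Domain:
    """Hypothetical policy: the runtime, not the app, decides where work lands."""
    if workload == "bursty_low_latency":
        return Domain.CPU              # small models that need an instant answer
    if workload == "llm" and chip.gpu_has_neural_accelerators:
        return Domain.GPU              # newer chips: per-core GPU accelerators
    return Domain.NEURAL_ENGINE        # otherwise fall back to the Neural Engine


def run(chip: Chip, workload: str) -> str:
    """What the app calls; the call is identical on every chip."""
    return f"{workload} on {chip.name} -> {pick_domain(chip, workload).value}"


older = Chip("older-chip", gpu_has_neural_accelerators=False)
newer = Chip("newer-chip", gpu_has_neural_accelerators=True)
print(run(older, "llm"))
print(run(newer, "llm"))
```

The application code calls `run` the same way on both chips. Only the policy behind it changes, which is the property Brooks was defending and also the source of the lock-in.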

Fusion, and the problem of two dies

The Mac side of the story is shorter, in keeping with Apple's emphasis on iPhone and iPad on-device AI ahead of WWDC, but the architectural moment worth understanding is Fusion. The M5 Pro and M5 Max announced this spring are the first Apple SoCs that aren't a single die — they bond two dies together inside the package.

For AI specifically, that ought to be a problem. Splitting a chip across multiple dies typically fragments memory into separate pools, and an on-device LLM requires a single coherent pool large enough to hold the entire model. Asked how Apple kept unified memory intact across two physical dies, Brooks was emphatic that this wasn't optional.

Unified memory is foundational to Apple Silicon. It was critically important that when we went to that dual-die architecture with Fusion, unified memory was preserved. Fusion doesn't change the fundamentals of unified memory — it preserves it across both dies.

He pointed back to UltraFusion, the earlier interconnect Apple used to bond two complete SoCs into the M-series Ultra chips, as the original solution to the same problem. The two architectures share a goal: to make the result look like a single bank of unified memory, with a single set of processing domains, to the software running on top. As far as the OS, frameworks, and apps are concerned, there are no dies — there is a chip.

What that ultimately buys is room. As Brooks put it, what determines how large a model a device can run is the memory capacity enabled on that class of chip. Fusion's job is to lift that ceiling on the high end without breaking the rule that has held since the first A-series part.
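The link between memory capacity and model size can be sketched with simple arithmetic. The figures below are illustrative assumptions, not Apple specifications: the parameter counts, the 4-bit quantisation, the memory sizes and the 75% headroom fraction are all made up for the example, and the estimate counts weights only, ignoring the KV cache and activations.

```python
def weights_gib(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of a model's weights in GiB (weights only)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billions: float, bits_per_weight: int,
         memory_gib: float, headroom: float = 0.75) -> bool:
    """True if the weights fit in an assumed usable fraction of unified memory."""
    return weights_gib(params_billions, bits_per_weight) <= memory_gib * headroom


# Hypothetical models and memory sizes, chosen only to show the scaling.
print(f"8B  @ 4-bit: {weights_gib(8, 4):.1f} GiB")
print(f"70B @ 4-bit: {weights_gib(70, 4):.1f} GiB")
print(fits(8, 4, memory_gib=8))     # small model, small memory pool
print(fits(70, 4, memory_gib=8))    # large model does not fit the same pool
print(fits(70, 4, memory_gib=128))  # a larger single pool lifts the ceiling
```

The point of the sketch is the one Brooks made: the ceiling is set by how much memory sits in one coherent pool, so splitting that pool across dies would lower it.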

The Neo, and what the A-series quietly proved

The other end of the range is the MacBook Neo, the $599 Mac released in March that runs the A18 Pro — the same chip that debuted in the iPhone 16 Pro back in 2024. It's the first Mac to ship with an iPhone-class A-series part rather than an M-series, and the silicon-engineering question it raises is when an A-series chip became Mac-worthy in the first place.

Brooks's answer reached back further than the Neo. The A-series, he noted, has been "a 64-bit desktop-class architecture for years." Crucially, it's also what convinced Apple internally that it could move the Mac off Intel.

The strength of the A-series is what proved to us that we could bring Mac to Apple Silicon. It was really how capable those chips were that allowed us to continue the endeavour with the M1.

That bit is worth sitting with. The MacBook Neo isn't a curiosity — it's the visible end of a long-running engineering argument that the A-series was always close enough to a Mac-grade chip that the gap was a packaging and binning question, not a capability one. Brooks framed the Neo as a success. For our purposes, it's also an existence proof that the same on-device AI capability now reaches Apple's cheapest Mac.

The one question that didn't get answered

With time running short, I asked Brooks about the NVIDIA RTX Spark — the Windows PC superchip Huang had announced at Computex that same week. Apple's Fusion solves the same coherent-memory-across-silicon problem as NVIDIA's NVLink-C2C, just with in-package die bonding instead. Why was Apple's approach the right answer, and where does the trade-off sit — latency, bandwidth, power, or something more fundamental about how developers should see the system?

Brooks declined cleanly.

I'm probably not the best person to respond to that. We'll see if we can get you an answer to that.

Apple has offered to follow up with someone on the chip-architecture side. As of publication, that conversation is pending, and we'll come back to it if and when Apple does.

The verdict, days before WWDC

What Brooks did not say, but what the interview, taken together, suggests, is that Apple's on-device AI position going into WWDC is not a pivot. It is a compounding strategy that has been running since 2017 and now spans a chip range from a $599 fanless laptop to a two-die workstation SoC. Every part of that range runs the same Neural Engine, the same unified memory architecture, the same developer frameworks. WWDC is where the software story gets told. The silicon story, on Brooks's reading, is that there isn't a turn coming; there is just more of what Apple has been doing.

That is either reassuring or unambitious, depending on what you wanted to hear. What is true either way is that no other major silicon vendor has been making the on-device-first argument as long as Apple has, and now that the rest of the industry has decided AI is going to live partly on the device, Apple's case is essentially that it already does.
