Apple Sharpens Its GenAI Profile

Apple updates its on-device and cloud AI models, introduces a new developer API


Apple revamped two vision-language models in a bid to catch up with fast-moving competitors.

What’s new: Apple updated the Apple Foundation Models (AFM) family, including smaller on-device and larger server-hosted versions, to improve their capabilities, speed, and efficiency. It also released the Foundation Models framework, an API that enables developers to call the on-device model on Apple devices that have Apple Intelligence enabled (a usage sketch follows the list below).

  • Input/output: Text, images in (up to 65,000 tokens), text out
  • Architecture: AFM-on-device: 3-billion-parameter transformer with a 300-million-parameter vision transformer. AFM-server: custom mixture-of-experts transformer (parameter count undisclosed) with a 1-billion-parameter vision transformer.
  • Performance: Strong in non-U.S. English, image understanding
  • Availability: AFM-on-device for developers to use via the Foundation Models framework; AFM-server not available for public use
  • Features: Tool use, 15 languages, vision
  • Undisclosed: Output token limit, AFM-server parameter count, details of training datasets, vision adapter architecture, evaluation protocol
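Calling the on-device model through the framework takes only a few lines of Swift. Below is a minimal sketch based on the API Apple presented at WWDC 2025 (SystemLanguageModel, LanguageModelSession); exact names and signatures may vary by OS version.

```swift
import Foundation
import FoundationModels

// Minimal sketch of calling AFM-on-device via the Foundation Models
// framework. Names follow Apple's WWDC 2025 presentation; exact
// signatures may differ across OS versions.
func summarize(_ note: String) async throws -> String {
    // The model is available only on devices with Apple Intelligence enabled.
    guard case .available = SystemLanguageModel.default.availability else {
        throw NSError(domain: "ModelUnavailable", code: 1)
    }
    // A session carries conversation state and optional instructions.
    let session = LanguageModelSession(instructions: "Answer in one concise sentence.")
    let response = try await session.respond(to: "Summarize this note: \(note)")
    return response.content
}
```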

How it works: Introduced last year, AFM models use a vision encoder to produce an image embedding, which a vision adapter modifies for the LLM. The LLM takes the modified image embedding and text prompt and generates a response. The team trained the systems to predict the next token, align embeddings produced by the vision encoder and LLM, and align responses with human feedback. They trained the models on text and image-text data from publicly available datasets, data scraped from the web, and data licensed from publishers.
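In code-shaped terms, the flow looks like the sketch below. The types and function signatures are hypothetical stand-ins for components whose details Apple hasn't disclosed.

```swift
// Schematic of the AFM vision-language pipeline described above.
struct Embedding { var values: [Float] }

func generate(image: [Float], prompt: String,
              visionEncoder: ([Float]) -> Embedding,
              visionAdapter: (Embedding) -> Embedding,
              llm: (Embedding, String) -> String) -> String {
    let imageEmbedding = visionEncoder(image)     // vision transformer encodes the image
    let adapted = visionAdapter(imageEmbedding)   // adapter maps it into the LLM's input space
    return llm(adapted, prompt)                   // LLM conditions on both and generates a response
}
```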

  • Quantization: The team used quantization-aware training (simulating quantization during training to improve the quantized model’s performance at inference) to compress AFM-on-device to 2 bits per weight (except for the embedding layer, which was compressed to 4 bits per weight). They used Adaptive Scalable Texture Compression, a method initially designed for graphics pipelines, to compress AFM-server to an average of 3.56 bits per weight (again excepting the embedding layer, which was compressed to 4 bits per weight). A sketch of the fake-quantization step appears after this list.
  • LoRA adapters: They trained LoRA adapters to recover performance lost to compression and to adapt the model to specific tasks including summarization, proofreading, replying to email, and answering questions (a minimal forward pass is sketched below).
  • MoE architecture: While AFM-on-device uses a transformer architecture, AFM-server uses a custom mixture-of-experts (MoE) architecture. A typical MoE can be viewed as splitting a portion of its fully connected layers into a number of parallel fully connected layers, of which it uses only a subset at inference. In contrast, AFM-server first splits the model into groups of layers, then splits each group into parallel blocks. Each block is a separate multi-layer transformer outfitted with MoE layers and processed on a small number of hardware devices. While a typical MoE combines results across all devices at every mixture-of-experts layer, Apple’s architecture combines them only at the end of each block, which saves communication overhead during processing (see the schematic sketch after this list).
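To make the quantization step concrete, here is a minimal sketch of the fake-quantization at the heart of quantization-aware training. The symmetric per-tensor grid is a simplifying assumption; Apple hasn't published its exact scheme.

```swift
// Fake-quantization: the forward pass sees weights rounded to a low-bit
// grid, so the model learns to tolerate quantization error. A real
// trainer back-propagates through this with a straight-through estimator.
func fakeQuantize(_ weights: [Float], bits: Int) -> [Float] {
    let levels = Float((1 << (bits - 1)) - 1)   // e.g., 2 bits -> grid {-1, 0, 1}
    guard let maxAbs = weights.map({ abs($0) }).max(), maxAbs > 0 else { return weights }
    let scale = maxAbs / levels
    return weights.map { w in
        let q = min(max((w / scale).rounded(), -levels), levels)   // quantize and clamp
        return q * scale                                           // dequantize for the forward pass
    }
}
```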
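The LoRA adapters mentioned above follow a standard recipe: a frozen weight matrix is augmented with a trainable low-rank update. A sketch of the forward pass, assuming plain dense matrices:

```swift
// Matrix-vector product over nested arrays (illustrative only).
func matVec(_ m: [[Float]], _ v: [Float]) -> [Float] {
    m.map { row in zip(row, v).reduce(0) { $0 + $1.0 * $1.1 } }
}

// LoRA forward pass: W is frozen; only the low-rank factors A and B
// are trained for each task (summarization, proofreading, and so on).
func loraForward(W: [[Float]], A: [[Float]], B: [[Float]],
                 x: [Float], alpha: Float) -> [Float] {
    let base = matVec(W, x)                  // frozen pretrained path
    let update = matVec(B, matVec(A, x))     // low-rank adapter path
    return zip(base, update).map { $0 + alpha * $1 }
}
```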
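And the block-level combining that distinguishes AFM-server's design can be sketched schematically. The types and the combine rule (averaging) are illustrative assumptions, not Apple's implementation.

```swift
typealias Layer = ([Float]) -> [Float]

// Run one block (a multi-layer transformer) on its own group of devices.
func runBlock(_ layers: [Layer], _ input: [Float]) -> [Float] {
    layers.reduce(input) { hidden, layer in layer(hidden) }
}

// Parallel blocks run independently; cross-device communication happens
// only at each block boundary, not at every mixture-of-experts layer.
func blockwiseForward(groups: [[[Layer]]], input: [Float]) -> [Float] {
    var h = input
    for parallelBlocks in groups {
        let outputs = parallelBlocks.map { runBlock($0, h) }   // no traffic here
        var combined = [Float](repeating: 0, count: h.count)
        for output in outputs {                                // combine at the boundary
            combined = zip(combined, output).map { $0 + $1 }
        }
        h = combined.map { $0 / Float(parallelBlocks.count) }  // average (an assumption)
    }
    return h
}
```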

Performance: In human evaluations, the AFM models achieved mixed results compared with selected models of similar or greater size. The tests included language tasks in U.S. English, non-U.S. English (including Canadian and UK English), and a basket of European and Asian languages.

  • AFM-on-device: The on-device model outperformed the competitors at language tasks in non-U.S. English and at image understanding. For instance, when answering questions about images, AFM-on-device bested Qwen2.5-VL-3B more than 50 percent of the time and was judged worse 27 percent of the time.
  • AFM-server: The server model’s performance was not decisively better than that of the competitors. For instance, AFM-server outperformed Qwen3-235B 25.8 percent of the time but was judged worse 23.7 percent of the time. It underperformed GPT-4o in all reported tests.

Behind the news: Apple dominated social media last week with a controversial paper that purported to show that 5 state-of-the-art reasoning models couldn’t solve puzzles beyond a certain level of complexity.

  • The researchers prompted the models with four puzzles that let them control complexity: swapping the positions of red and blue checkers on a one-dimensional board, Tower of Hanoi, River Crossing, and Blocks World. For all puzzles and models, they found that performance fell to zero once a puzzle reached a certain degree of complexity (for example, a certain number of checkers to swap). A sketch of how complexity scales in Tower of Hanoi appears after this list.
  • A rebuttal paper quickly appeared, penned by Open Philanthropy senior program associate Alex Lawsen with help from Claude 4 Opus. Lawsen contended that Apple’s conclusions were unfounded because its tests included unsolvable puzzles, didn’t account for token output limits, and posed unrealistic criteria for judging outputs. However, he later posted a blog, “When Your Joke Paper Goes Viral,” in which he explained that he intended his paper as “obvious satire” of authors who use LLMs to write scientific papers, and that he hadn’t checked Claude 4 Opus’ output. He updated his paper to correct errors in the original version but maintained his fundamental critique.
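For a sense of how quickly the puzzles' difficulty grows, Tower of Hanoi with n disks requires 2^n - 1 moves, so the solution a model must write out grows exponentially with n:

```swift
// Recursively solve Tower of Hanoi, collecting the required moves.
func hanoi(_ n: Int, from: String, to: String, via: String,
           moves: inout [(String, String)]) {
    guard n > 0 else { return }
    hanoi(n - 1, from: from, to: via, via: to, moves: &moves)
    moves.append((from, to))    // move the largest remaining disk
    hanoi(n - 1, from: via, to: to, via: from, moves: &moves)
}

var moves: [(String, String)] = []
hanoi(10, from: "A", to: "C", via: "B", moves: &moves)
print(moves.count)   // 1023 moves, i.e., 2^10 - 1
```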

Why it matters: Apple has been viewed as falling behind in AI. A promised upgrade of Siri, Apple’s AI assistant, is delayed indefinitely, and the lack of advanced AI features in new iPhones has led to a class-action lawsuit. Meanwhile, Google and its Android smartphone platform are racing ahead. The new models, especially the Foundation Models framework, look like a bid for a reset.

We’re thinking: Apple may be behind in AI, but its control over iOS is a huge advantage. If the operating system ships with a certain model and loads it into the limited memory by default, developers have a far greater incentive to use that model than an alternative. Limited memory on phones and the large size of good models make it impractical for many app developers to bundle models with their software, so if a model is favored by Apple (or by Google on Android), it’s likely to gain significant adoption for on-device uses.
