A Single Tokenizer for Visual Media

Apple’s AToken, a multimodal model with a single encoder and tokenizer for images, videos, and 3D objects


Multimodal models typically use different tokenizers to embed different media types, and different encoders when training to generate media rather than classify it. A team at Apple created a multidimensional tokenizer that maps not just images and videos but also 3D objects into a shared token space, along with a shared encoder that performs well at both identifying and generating all three media types.

What’s new: Jiasen Lu, Liangchen Song, and colleagues at Apple trained AToken, a transformer model with an all-purpose visual tokenizer. The new model can both generate and classify images, videos, and 3D objects, approaching the performance of specialized models for each of these input and output types.

Key insight: Image generation models use encoders (like VAEs or VQ-VAEs) that preserve visual details (is the cat’s or ball’s surface orange?) but discard semantics (is it a cat or a ball?), and therefore don’t recognize objects as well as classification models. Image classification models, on the other hand, use encoders (like CLIP or SigLIP) that capture types of objects (say, “cat” or “ball”) but miss visual details, so they are worse at generation. Moving from still images to video and 3D complicates matters further: Before they’re encoded, video and 3D typically require separate tokenizers to break down inputs into data an encoder can process, each with its own architecture and embedding space. If the three media types are expressed in a single format by the same tokenizer, one transformer can learn to work with all of them. Further, training the model to reconstruct these media types and align them with matching text descriptions forces embeddings to retain both fine visual details and semantic meaning, eliminating the need for separate generation and classification models.

How it works: AToken consists of a pretrained SigLIP2 vision encoder (400 million parameters) — here extended from two dimensions to four — and an untrained decoder of the same size. The authors trained AToken to reconstruct inputs and align their embeddings to text using three image sets (two public and one private), three public sets of videos, and two public sets of 3D objects, all paired with matching text. They trained on this data in three stages: first images, then videos, and last 3D.

  • The authors split every input into tokens with space-time coordinates (t, x, y, z). Images were mapped as a single two-dimensional (x, y) slice, setting t = z = 0. Videos added a time coordinate (t, x, y, with z = 0), and 3D objects mapped onto an (x, y, z) grid, with t = 0. A single linear layer turned each token into an embedding.
  • They also used 4D Rotary Position Embedding, to encode angular position along time, height, width, and depth (t, x, y, and z). The encoder took each linear embedding along with its relative position to produce embeddings that represented the input.
  • The decoder reconstructed the inputs from the encoder’s embeddings: It generated RGB pixels for images and videos and Gaussian splats — small colored 3D blobs that, when rendered together, form a coherent 3D shape — for 3D objects. AToken learned to reconstruct inputs via four losses: (i) a pixel loss, which minimizes the difference between predicted and ground-truth pixels; (ii) LPIPS, which minimizes the distance between the AlexNet embeddings of the original and those of the reconstruction; (iii) CLIP perceptual loss, which minimizes the distance between the CLIP embeddings of the reconstruction and those of the original; and (iv) Gram matrix loss, which minimizes differences in style and individual features by maximizing the correlation between embeddings of the reconstruction and of the original image, drawn from an unnamed pretrained network.
  • To align embeddings of visual inputs with the embeddings of their corresponding captions, the authors combined all the encoder’s embeddings through attention, producing one global embedding that summarized the input, a process called attention pooling. They used contrastive loss to increase the similarity between the global visual embedding and the SigLIP2 text embedding of the matching caption, while decreasing the similarity to embeddings of mismatched captions.
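The shared space-time coordinate scheme in the first bullet can be sketched in a few lines. This is an illustrative toy (the grid sizes and function name are assumptions, not Apple’s actual patching code), but it shows how images, videos, and 3D objects all become lists of tokens indexed by the same (t, x, y, z) coordinates:

```python
import itertools

def grid_coords(media_type, nx=4, ny=4, nt=2, nz=4):
    """Return (t, x, y, z) coordinates for each patch token.

    Grid sizes are illustrative placeholders; AToken's real patch
    grids depend on input resolution and are much larger.
    """
    if media_type == "image":   # a single 2D slice: t = z = 0
        return [(0, x, y, 0)
                for x, y in itertools.product(range(nx), range(ny))]
    if media_type == "video":   # add a time axis, keep z = 0
        return [(t, x, y, 0)
                for t, x, y in itertools.product(range(nt), range(nx), range(ny))]
    if media_type == "3d":      # a spatial (x, y, z) grid, t = 0
        return [(0, x, y, z)
                for x, y, z in itertools.product(range(nx), range(ny), range(nz))]
    raise ValueError(f"unknown media type: {media_type!r}")
```

Because every token carries the same four coordinates regardless of media type, a single linear projection and a single 4D rotary position embedding can serve all three modalities.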
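The last bullet’s attention pooling and contrastive alignment can also be sketched. This is a simplified single-query, NumPy-only version (the learned query, temperature value, and function names are assumptions; AToken’s actual implementation uses trained transformer components):

```python
import numpy as np

def attention_pool(token_embs, query):
    """Collapse per-token embeddings (n_tokens, d) into one global
    embedding (d,) via softmax attention against a learned query."""
    scores = token_embs @ query                      # (n_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax
    return weights @ token_embs                      # weighted sum

def contrastive_loss(visual, text, temperature=0.07):
    """InfoNCE-style loss over a batch of aligned (visual, text) pairs:
    pull matching rows together, push mismatched rows apart."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature                   # (B, B) similarities
    # cross-entropy with the diagonal (matching pairs) as the target
    m = logits.max(axis=1, keepdims=True)
    logp = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(logp))
```

The loss is lowest when each pooled visual embedding is most similar to its own caption’s embedding and dissimilar to every other caption in the batch, which is what drives the encoder to keep semantic information alongside the reconstruction losses.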

Results: AToken matched or closely approached state-of-the-art models that process images, videos, or 3D. 

  • When classifying images, AToken reached 82.2 percent ImageNet classification accuracy, close to that of dedicated image encoders like the stand-alone SigLIP2 (83.4 percent). When generating images, it achieved 0.21 rFID (a measure of reconstruction quality, lower is better). This outperformed previous models with unified tokenizers such as UniTok, which uses a specialized encoder for each type of input, and approached specialized models like FLUX.1 [dev], which achieved 0.18 rFID.
  • When reconstructing videos from the TokenBench dataset, AToken achieved 36.07 PSNR (frame-by-frame pixel similarity to the original videos, averaged over all frames — higher is better) and 3.01 rFVD (a measure of reconstruction quality — lower is better). It outperformed specialized video models such as Wan2.2 (36.39 PSNR, 3.19 rFVD) and HunyuanVideo (36.37 PSNR, 3.78 rFVD) on rFVD but trailed them slightly on PSNR. When asked to retrieve videos from a text prompt, AToken found the correct video 40.2 percent of the time on the MSR-VTT dataset, still below strong video-focused encoders like VideoPrism-g, which reached 52.7 percent.
  • When reconstructing 3D objects from Toys4K data, it reached 28.28 PSNR (a measure of how close reconstructed 3D renderings are to the originals – higher is better). This exceeded the specialized 3D tokenizer Trellis-SLAT, which achieved 26.97 PSNR.

Why it matters: A major innovation of large language models is their use of a single tokenizer for all language inputs, whether code, dialogue, tables, or books. This generality makes it easier for a model to transfer knowledge from one data source to another during training: When models get better at understanding or generating text, they get better at code, too. AToken offers a similar generality for vision models. Its strong performance at generating and reconstructing multiple visual media types suggests that here, too, a shared tokenizer and encoder could allow improvements in one modality to carry over to the others.

We’re thinking: A model like AToken may prove helpful for generating synthetic 3D and video data. Models that generalize from one media type to another tend to reduce the total amount of data needed to train for each task. That matters because high-quality, well-labeled, two-dimensional image data is abundant compared to video and 3D data, both of which are essential to robotics applications.
