Mar 20, 2026
A Single Tokenizer for Visual Media: Apple’s AToken, a multimodal model with a single encoder and tokenizer for images, videos, and 3D objects
Multimodal models typically use a different tokenizer to embed each media type, and different encoders depending on whether they are trained to generate media or to classify it.