Object Detection
Vision and Language Tightly Bound: Training on a single loss function improves multimodal AI.
Recent multimodal models process both text and images as sequences of tokens, but they learn to represent these distinct data types using separate loss functions. New work unifies the loss function as well.