Enabling a pretrained large language model (LLM) to process a data type other than text (say, images), possibly in a specialized domain (say, radiology), typically requires thousands to millions of examples that pair the non-text data (perhaps x-rays) with text. Researchers devised an approach that requires only a small number of examples.
What’s new: Sample-Efficient Modality Integration (SEMI) enables an LLM to process any input data type in any specialized domain based on as few as 32 examples. Given a suitable pre-existing encoder, a single projector plus a dynamically generated set of LoRA adapters translates input embeddings into the LLM’s embedding space. Osman Batur İnce developed the method with colleagues at the University of Edinburgh, Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, and Unbabel, a machine translation company.
Key insight: Typically, adapting an LLM to accept multimodal inputs requires training a separate projector for each data type and/or domain. But the ability to adapt to unfamiliar input data types/domains can be considered a general, learnable skill. A projector can learn this skill by training on data types/domains for which examples are plentiful. Then LoRA adapters can adjust it to new data types/domains for which few examples are available. Better yet, a separate network can generate LoRA adapters that adjust the projector to new data types/domains as needed.
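In code, the idea might look like the minimal sketch below (illustrative dimensions and names, not the authors' implementation): a projector whose linear layers accept externally supplied LoRA factors, so a generator network can retarget it to a new data type or domain without touching its base weights.

```python
# Minimal sketch (assumed dimensions and names, not the authors' code):
# a projector whose linear layers accept externally supplied LoRA factors.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Linear layer whose output can be shifted by a low-rank (LoRA) update."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        # Low-rank factors, zero by default; in a SEMI-like setup a separate
        # generator network would supply these per data type/domain.
        self.register_buffer("lora_a", torch.zeros(rank, d_in))
        self.register_buffer("lora_b", torch.zeros(d_out, rank))

    def set_lora(self, a: torch.Tensor, b: torch.Tensor) -> None:
        self.lora_a, self.lora_b = a, b  # swap in generated factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank adjustment x @ A^T @ B^T.
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T


class Projector(nn.Module):
    """Maps encoder embeddings (e.g., CLIP outputs) to LLM embedding vectors."""

    def __init__(self, d_enc: int = 768, d_llm: int = 4096, d_hidden: int = 2048):
        super().__init__()
        self.fc1 = LoRALinear(d_enc, d_hidden)
        self.act = nn.GELU()
        self.fc2 = LoRALinear(d_hidden, d_llm)

    def forward(self, enc_emb: torch.Tensor) -> torch.Tensor:  # (batch, tokens, d_enc)
        return self.fc2(self.act(self.fc1(enc_emb)))           # (batch, tokens, d_llm)
```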
How it works: The authors aimed to connect pre-existing, pretrained domain-specific encoders (CLIP for images, CLAP for audio, VideoCLIP-XL for video, and others) to a pretrained LLM (Llama 3.1 8B). To that end, they trained a projector (a vanilla neural network) plus a LoRA generator (a network made up of a single attention layer).
- The authors trained the projector using datasets (around 50,000 to 600,000 examples) that paired text with images, audio, and video. They connected the projector to the LLM, kept the LLM frozen, and minimized the difference between the LLM’s outputs and the ground-truth text (a minimal training-loop sketch appears after this list).
- They froze the projector and trained the LoRA generator to produce LoRA adapters from a description of the task at hand plus 128 examples of each data type/domain involved, drawn from other datasets of text paired with images, audio, and video. To simulate a wider variety of data types/domains, they took a subset of 128 examples and applied a mathematical transformation to the encoder’s output embeddings that preserved the geometric relationships between the vectors, such as their relative distances and angles (both the transformation and the generator are sketched after this list).
- At inference, the authors used other pretrained encoders to embed data types/domains the system hadn’t been trained on (for example, MolCA for graphs of molecules). Given a few examples that paired a particular data type/domain with text descriptions, the LoRA generator produced an appropriate adapter.
- To further improve performance, they fine-tuned the projector with each generated adapter applied, using the same few examples and keeping all other weights frozen.
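Here’s a hedged sketch of the projector pretraining step described in the first bullet, under the assumption that the loss is standard next-token cross-entropy on the ground-truth text: the projected embeddings are prepended to the text embeddings and fed to the frozen LLM. The helpers `encode`, `embed_text`, and `llm_forward` are placeholders for real model calls, not the authors’ API.

```python
# Hedged sketch of projector pretraining with a frozen LLM. The helpers
# (encode, embed_text, llm_forward) are placeholders for real model calls.
import torch
import torch.nn.functional as F


def projector_train_step(projector, optimizer, batch, encode, embed_text, llm_forward):
    """One update of the projector; encoder and LLM weights stay frozen.

    optimizer:         holds the projector's parameters only
    batch:             dict with raw inputs (e.g., images) and tokenized target text
    encode(raw)        -> (B, T_in, d_enc)   frozen modality encoder
    embed_text(ids)    -> (B, T_txt, d_llm)  frozen LLM token embeddings
    llm_forward(embs)  -> (B, T, vocab)      frozen LLM, embeddings in, logits out
    """
    with torch.no_grad():                                  # encoder is frozen
        enc = encode(batch["inputs"])
    proj = projector(enc)                                  # (B, T_in, d_llm)
    txt = embed_text(batch["text_ids"])                    # (B, T_txt, d_llm)
    inputs = torch.cat([proj, txt], dim=1)                 # prepend projected tokens
    logits = llm_forward(inputs)                           # frozen LLM
    # Each text token is predicted from the previous position (next-token loss);
    # positions that correspond to the projected, non-text tokens are skipped.
    t_in = proj.shape[1]
    pred = logits[:, t_in - 1 : -1, :]                     # positions predicting text
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           batch["text_ids"].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```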
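The transformation in the second bullet preserves the vectors’ relative distances and angles, which is exactly what an orthogonal (rotation/reflection) map does. The following is one plausible way to implement such an isometry; it may not match the authors’ exact recipe.

```python
# Hedged sketch: apply a random orthogonal (distance- and angle-preserving)
# transformation to encoder embeddings to simulate an unseen data type/domain.
import torch


def random_isometry(embeddings: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Rotate/reflect embeddings of shape (..., d) with a random orthogonal matrix."""
    d = embeddings.shape[-1]
    gen = torch.Generator().manual_seed(seed)
    gaussian = torch.randn(d, d, generator=gen)
    q, r = torch.linalg.qr(gaussian)            # Q is orthogonal
    q = q * torch.sign(torch.diagonal(r))       # sign fix for a well-spread rotation
    return embeddings @ q                       # norms, distances, angles unchanged


# Quick check: pairwise distances are (numerically) preserved.
x = torch.randn(5, 768)
y = random_isometry(x)
assert torch.allclose(torch.cdist(x, x), torch.cdist(y, y), atol=1e-4)
```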
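Finally, a guess at the shape of the LoRA generator: a single attention layer that pools embeddings of the task description and support examples into a summary vector, then maps it to low-rank factors for one adapted layer. The pooling scheme, shapes, and the `support` input format are assumptions for illustration.

```python
# Hedged sketch: a single-attention-layer LoRA generator that turns a handful
# of support-example embeddings (plus a task-description embedding) into
# low-rank factors for one LoRALinear layer. Shapes and pooling are assumptions.
import torch
import torch.nn as nn


class LoRAGenerator(nn.Module):
    def __init__(self, d_model: int = 768, d_in: int = 768, d_out: int = 2048,
                 rank: int = 8, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling query
        self.to_a = nn.Linear(d_model, rank * d_in)
        self.to_b = nn.Linear(d_model, d_out * rank)
        self.rank, self.d_in, self.d_out = rank, d_in, d_out

    def forward(self, support: torch.Tensor):
        """support: (1, n_examples, d_model) embeddings of task description + examples."""
        pooled, _ = self.attn(self.query, support, support)    # (1, 1, d_model)
        pooled = pooled.squeeze(0).squeeze(0)                  # (d_model,)
        lora_a = self.to_a(pooled).view(self.rank, self.d_in)
        lora_b = self.to_b(pooled).view(self.d_out, self.rank)
        return lora_a, lora_b


# Hypothetical usage with the Projector sketched earlier:
# gen = LoRAGenerator()
# a, b = gen(support_embeddings)     # support_embeddings: (1, k, 768)
# projector.fc1.set_lora(a, b)       # retarget the projector to the new domain
```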
Results: The authors compared SEMI to three baselines: training a projector from scratch, fine-tuning their pretrained projector, and fine-tuning their pretrained projector with a bespoke LoRA adapter. They tested on astronomical images from their own dataset, satellite images, IMU sensor data, and graphs of molecular structures, each paired with an appropriate pre-existing encoder. They measured performance using metrics that include CIDEr (higher is better), which gauges how well a generated caption matches a set of human-written ones.
- SEMI beat all baselines in nearly all tests, across numbers of examples from 32 to the complete datasets (2,500 to 26,000 examples).
- For instance, on astronomical images with 32 examples, SEMI achieved over 215 CIDEr, while the next-best method achieved 105.
- The sole exception: On graphs of molecular structures, when a few thousand examples were available, the fine-tuned projector outperformed SEMI.
Why it matters: Large language models are of limited use in many technical fields because little text-paired data is available and building large text-paired datasets is expensive. This work could accelerate adoption of AI in such fields by taking advantage of knowledge in data-rich domains to bootstrap AI training in data-poor ones.
We’re thinking: For AI models to generalize to novel data types, they usually need to be trained on diverse, high-quality data. To that end, it’s helpful to squeeze more learning out of less data.