Streamlined Inference
Deja Vu, a method that boosts LLM speed by activating only essential neural parts

Published: May 8, 2024
Reading time: 2 min read

It’s not necessary to activate all parts of a large language model to process a given input. Activating only the necessary parts saves computation.

What’s new: Zichang Liu and collaborators at Rice University, Zhejiang University, Stanford, University of California San Diego, ETH Zürich, Adobe, Meta, and Carnegie Mellon proposed Deja Vu, an algorithm that accelerates inference in large language models (LLMs) by using small vanilla neural networks to predict which parts of the model to activate for a given input.

Key insight: A transformer-based neural network can save a lot of time at inference by activating only a fraction of its (i) attention heads and (ii) neurons in fully connected layers. But it must activate the right ones, because different parts of the network learn to handle different patterns of input. By using the input to decide which parts of the network to activate, the model can maintain accuracy while running only the parts relevant to the current input.
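
Here’s a minimal sketch of that insight, assuming a standard fully connected block of the form W2·relu(W1·x); the shapes, the 25 percent keep ratio, and the oracle top-k selection are illustrative assumptions, and the random weights only demonstrate the indexing, not the paper’s accuracy claim.

```python
# Minimal sketch of contextual sparsity in a fully connected block (assumed form:
# W2 @ relu(W1 @ x)). Here the "active" neurons are picked by an oracle (top-k of
# the true activations); in Deja Vu, a predictor supplies the indices instead.
import torch

torch.manual_seed(0)
d_model, d_ff = 512, 2048                        # assumed layer sizes
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
W2 = torch.randn(d_model, d_ff) / d_ff ** 0.5
x = torch.randn(d_model)                         # the block's input

h = torch.relu(W1 @ x)                           # every neuron's output for this input
dense = W2 @ h                                   # full computation

k = d_ff // 4                                    # keep 25 percent of neurons
idx = torch.topk(h, k).indices                   # the highest-output neurons
sparse = W2[:, idx] @ torch.relu(W1[idx] @ x)    # compute only those neurons

# Relative error of the sparse approximation; the paper reports it stays small
# for pretrained LLMs, whereas random weights here merely show the mechanics.
print((torch.linalg.norm(dense - sparse) / torch.linalg.norm(dense)).item())
```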

How it works: The authors used pretrained OPT models of various sizes (175 billion, 66 billion, and 30 billion parameters). They built a dataset by feeding examples from OpenBookQA and WikiText to the OPTs and recording the outputs of all attention heads and fully-connected-layer neurons. By activating various portions of these networks, they found that, for a given input, they could discard most of an OPT’s lowest-output attention heads and fully-connected-layer neurons without degrading its performance.
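
As a hedged illustration of the data collection, the sketch below pairs each layer input with a 0/1 label over neurons marking which ones produced the largest outputs; a random toy layer stands in for a pretrained OPT layer, random vectors stand in for OpenBookQA and WikiText inputs, and the 25 percent keep ratio is an assumption.

```python
# Hypothetical sketch of building the predictor training set: record each layer
# input together with a binary label over neurons (1 = among the top 25 percent
# by output magnitude). A random toy layer stands in for a pretrained OPT layer.
import torch

torch.manual_seed(0)
d_model, d_ff, keep_ratio = 512, 2048, 0.25      # keep ratio is an assumption
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5

examples = []
for _ in range(1000):                            # stand-in for corpus examples
    x = torch.randn(d_model)                     # the layer's input (the predictor's feature)
    h = torch.relu(W1 @ x)                       # every neuron's output
    label = torch.zeros(d_ff)
    label[torch.topk(h, int(keep_ratio * d_ff)).indices] = 1.0  # 1 = keep this neuron active
    examples.append((x, label))
```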

  • The authors used their dataset to train a sparsity predictor for each of an OPT’s fully connected layers. This small vanilla neural network classified which neurons in a fully connected layer to activate (because they produced large outputs), given the output of the previous fully connected layer.
  • Using the same dataset, they trained, for each attention layer, a small vanilla neural network to classify which attention heads to activate (because they produced large outputs), given the output of the previous attention layer.
  • At inference, an OPT and its predictor networks ran in parallel. While the OPT computed an attention layer, a predictor network predicted the neurons to activate in the following fully connected layer. Similarly, while the OPT computed each fully connected layer, a predictor network predicted the heads to activate in the following attention layer. A sketch of one such predictor appears after this list.
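
The sketch below shows one such predictor for a fully connected layer, trained on (layer input, active-neuron label) pairs like those above. The hidden size, threshold, training loop, and synthetic data are illustrative assumptions rather than the authors’ settings; an analogous predictor per attention layer would score heads instead of neurons.

```python
# Sketch of one sparsity predictor: a small vanilla network that scores every
# neuron in a fully connected layer; at inference, only neurons scoring above a
# threshold are computed. Sizes, threshold, and data are illustrative assumptions.
import torch
from torch import nn

torch.manual_seed(0)
d_model, d_ff, hidden = 512, 2048, 256

# Stand-in training pairs: (layer input, 0/1 label of the highest-output neurons),
# as in the data-collection sketch above.
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5
def make_pair():
    x = torch.randn(d_model)
    label = torch.zeros(d_ff)
    label[torch.topk(torch.relu(W1 @ x), d_ff // 4).indices] = 1.0
    return x, label
pairs = [make_pair() for _ in range(500)]

predictor = nn.Sequential(                       # the "small vanilla neural network"
    nn.Linear(d_model, hidden),
    nn.ReLU(),
    nn.Linear(hidden, d_ff),                     # one logit per fully-connected-layer neuron
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for x, label in pairs:                           # train to classify which neurons to activate
    loss = loss_fn(predictor(x), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference, the predictor looks one layer ahead: while the model computes an
# attention layer, it scores the following fully connected layer's neurons, and
# only the selected rows/columns of that layer's weights are used.
with torch.no_grad():
    x = torch.randn(d_model)
    active = torch.sigmoid(predictor(x)) > 0.5   # assumed threshold
    idx = active.nonzero(as_tuple=True)[0]       # indices of neurons to compute
    print(f"computing {idx.numel()} of {d_ff} neurons")
```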

Results: Deja Vu running OPT (175 billion parameters) produced a sequence of 128 tokens in 20 milliseconds, while an Nvidia implementation of OPT of the same size needed 40 milliseconds and a Hugging Face implementation needed 105 milliseconds. Moreover, Deja Vu achieved these speedups without reducing accuracy. On WikiText and C4, its ability to predict the next word held steady while it activated only 25 percent of attention heads and fully-connected-layer neurons. On datasets such as WinoGrande and OpenBookQA, it maintained accuracy while activating only 35 percent of attention heads and fully-connected-layer neurons.

Why it matters: Efficient use of processing power becomes increasingly important as models become larger. Moreover, faster token generation benefits agentic workflows, which can consume large numbers of tokens. 

We’re thinking: Deja Vu’s design is in the spirit of the mixture of experts (MoE) architecture: For each transformer layer, MoE uses a small neural network to choose which of several fully connected layers (experts) to use. In contrast, Deja Vu uses small neural networks to decide which individual attention heads and fully-connected-layer neurons to activate.
