Softmax commonly computes probabilities in a classifier’s output layer. But softmax isn’t always accurate in complex tasks — say, in a natural-language task, when the length of word vectors is much smaller than the number of words in the vocabulary. A new function renders more accurate predictions with lower computational cost than earlier alternatives.
What’s new: Zhilin Yang, Thang Luong, and Ruslan Salakhutdinov at Carnegie Mellon University, with Quoc Le at Google Brain, developed an efficient solution to the so-called softmax bottleneck: Mixtape.
Key insight: A previous proposal, Mixture of Softmaxes (MoS), is a weighted sum of multiple softmaxes, and thus slow to train. Mixtape reformulates MoS as a single softmax of weighted sums. With a clever way of calculating the weights, that rearrangement avoids the bottleneck with much speedier execution.
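To make the rearrangement concrete, here is a minimal sketch in plain Python that contrasts the two orderings. The function names, the toy logits, and the hand-picked weights are illustrative assumptions, not the authors' code. In the real models, the logits and weights come from learned layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_softmaxes(logit_sets, weights):
    """MoS-style ordering: one softmax per component, then a weighted sum
    of the resulting probability vectors (K softmax calls)."""
    probs = [softmax(logits) for logits in logit_sets]
    n_classes = len(logit_sets[0])
    return [sum(w * p[i] for w, p in zip(weights, probs))
            for i in range(n_classes)]

def softmax_of_mixture(logit_sets, weights_per_class):
    """Mixtape-style ordering: mix the logits first, using weights that may
    differ for each output class, then apply a single softmax."""
    n_classes = len(logit_sets[0])
    n_components = len(logit_sets)
    mixed = [sum(weights_per_class[i][k] * logit_sets[k][i]
                 for k in range(n_components))
             for i in range(n_classes)]
    return softmax(mixed)

# Toy example: 2 components, 3 output classes (values are made up).
logit_sets = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
mos = mixture_of_softmaxes(logit_sets, weights=[0.7, 0.3])
mixtape_like = softmax_of_mixture(
    logit_sets, weights_per_class=[[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
```

Both functions return a valid probability distribution, but the second calls softmax only once regardless of the number of components, which is where the time savings over MoS would come from.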
How it works: Mixtape’s weighted sum depends on the word it is evaluating — a not-so-obvious way to formulate the problem. The weights must be generated efficiently to avoid losing the computational advantage over MoS.
- Mixtape calculates weights for the weighted sum using a sigmoid tree decomposition. The sigmoid tree is a binary tree in which each node is a sigmoid. The tree’s leaves provide the weights. This is more efficient than using a softmax to calculate weights.
- Some of the weights are shared among infrequent output classes, which further boosts efficiency.
- This sharing does reintroduce some potential for a bottleneck, but a much smaller one, and with less loss of accuracy, than plain softmax.
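The bullets above can be sketched as follows, assuming four mixture components and therefore a tree of three sigmoids. The pre-activation inputs, the `num_frequent` cutoff, and all function names are hypothetical placeholders. How a real model computes the pre-activations from its hidden state is outside this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_tree_weights(a, b, c):
    """Turn three pre-activations into four mixture weights.

    The root sigmoid splits probability mass between the left and right
    subtrees, and each child sigmoid splits its share between two leaves.
    The four leaves always sum to 1, with no softmax needed.
    """
    g_root, g_left, g_right = sigmoid(a), sigmoid(b), sigmoid(c)
    return [
        g_root * g_left,
        g_root * (1.0 - g_left),
        (1.0 - g_root) * g_right,
        (1.0 - g_root) * (1.0 - g_right),
    ]

def weights_for_class(class_id, per_class_preacts, shared_preacts, num_frequent):
    """Hypothetical sharing scheme: classes with id below num_frequent get
    their own pre-activations; all rarer classes reuse one shared triple."""
    if class_id < num_frequent:
        return sigmoid_tree_weights(*per_class_preacts[class_id])
    return sigmoid_tree_weights(*shared_preacts)

# Toy usage with made-up numbers: two frequent classes, the rest shared.
per_class = [(0.5, -1.0, 2.0), (-0.3, 0.8, 0.1)]
shared = (0.0, 0.0, 0.0)
frequent_weights = weights_for_class(0, per_class, shared, num_frequent=2)
rare_weights = weights_for_class(7, per_class, shared, num_frequent=2)
```

Computing one weight vector this way takes three sigmoid evaluations and a handful of multiplications, and every infrequent class reuses the same vector rather than computing its own.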
Results: The researchers compared transformer-based models with output layers employing Mixtape, MoS-15, or softmax. The tasks included recreating a text sample and translating a sentence from English to German or French. On text generation, MoS-15 (which entails 15 softmax calculations) and Mixtape both lowered perplexity (a measure of how uncertain the model is about the next word; lower is better) by around 3, achieving a score of 56. MoS-15 slightly outperformed Mixtape. However, Mixtape required only slightly more training time than softmax, whereas MoS-15 took twice as long.
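For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability that a model assigns to the correct next token. The snippet below is a generic sketch of that standard definition, with made-up probabilities. It is not the researchers' evaluation code.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each correct token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always spreads its bet evenly over 4 choices has perplexity
# of about 4; a more confident, correct model scores lower.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
more_confident = perplexity([0.9, 0.8, 0.95, 0.7])
```

Read this way, a perplexity of 56 roughly means the model is as uncertain as if it were choosing evenly among 56 candidate words at each step.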
Why it matters: Much research has focused on extracting meaningful features of input, but features are less useful if the output layer can’t classify them properly. Mixtape should allow models to take better advantage of features they extract without sacrificing AWS credits.
We’re thinking: Mixtape can do better than softmax with only a little more training time. We may see Mixtape overtake softmax in some applications.