Fine-Tune Your Fine-Tuning New method optimizes training for few shot NLP models.

Published

Feb 23, 2022

Reading time

3 min read

Let’s say you have a pretrained language model and a small amount of data to fine-tune it to answer yes-or-no questions. Should you fine-tune it to classify yes/no or to fill in missing words — both viable approaches that are likely to yield different results? New work offers a way to decide.

What’s new: Yanan Zheng and collaborators at Beijing Academy of Artificial Intelligence, Carnegie Mellon University, DeepMind, Massachusetts Institute of Technology, and Tsinghua University proposed FewNLU, a method that compares fine-tuning algorithms in few-shot natural language understanding, or language comprehension tasks in which a model must learn from a few examples. They also provide a toolkit for optimizing fine-tuned performance.

Key insight: Previous comparisons of fine-tuning algorithms used fixed hyperparameter values; the researchers chose values known to work with a particular algorithm and maintained them with other algorithms. But different combinations of algorithm and architecture require different hyperparameter values to achieve their optimal performance. So, to compare fine-tuning algorithms, it’s best to determine hyperparameter values separately for each combination.

How it works: The authors compared various data-split strategies and hyperparameter values for different fine-tuning algorithms applied to DeBERTa and ALBERT. They fine-tuned the models on 64 labeled examples for each of seven tasks in the SuperGLUE benchmark (such as answering yes-or-no questions about a text passage or multiple-choice questions about causes of events) to find the best data-split strategy and most important hyperparameters. Then they compared fine-tuning algorithms using different values for the most important hyperparameters.

The authors considered three data-split strategies: minimum description length, K-fold cross validation, and one they created called Multi-Splits. Whereas K-fold cross validation splits the dataset into K parts and uses a different part for validation K times, Multi-Splits shuffles and splits the data randomly into training and validation sets according to a fixed ratio K times.
They compared different values for six hyperparameters, varying one at a time: the order in which they provided the 64 labeled examples during training, the pattern used to convert various types of examples into fill-in-the-blank examples, training batch size, learning rate, evaluation frequency, and maximum training steps.
They compared the performance of four fine-tuning algorithms on ALBERT and DeBERTa using the best data-split strategy (Multi-Splits) and various combinations of hyperparameter values. The algorithm known as CLS adds a special token at the beginning of an input example, and the model uses the token’s representation to classify it. PET, ADAPET, and P-tuning change the classification task into a fill-in-the-blank procedure.

Results: Multi-Splits led to superior test performance on 4 of the 7 tasks, and it had the greatest correlation between validation and test performance on 5 of the 7 tasks. Changes in the prompt pattern led to the greatest standard deviation in performance across hyperparameters (average of 5.5 percent accuracy, compared to the next-highest, training order, at 2.0 percent), suggesting that it was the most important hyperparameter to optimize. Using Multi-Splits and the optimal hyperparameter values for each fine-tuning algorithm (specific to each model and task), PET, ADAPET, and P-tuning performed similarly and typically outperformed CLS by 15 to 20 percentage points in accuracy and F1 score. There was no clear winner among PET, ADAPET, and P-tuning, each of which achieved the highest accuracy or F1 score on one task or another, often within 1 standard deviation of each other.

Why it matters: It’s certainly good to know how to get the most out of fine-tuning. Beyond that, this work reinforces the notion that, since the only way to know the best hyperparameter values is to find them empirically, it pays to keep guessing to a minimum.

We’re thinking: Here’s a puzzler: If the choice of a fine-tuning algorithm changes a model’s optimal hyperparameter values, is the choice itself a hyperparameter?

Subscribe to The Batch