Transforming Pixels: An image generation model using the GPT architecture

Published

Aug 05, 2020

Language models like BERT, ERNIE, and ELMo have achieved spectacular results based on clever pre-training approaches. New research applies some of those Sesame Street lessons to image processing.

What’s new: OpenAI researchers led by Mark Chen took techniques developed for processing words and adapted them to pixels in Image Generative Pre-Training (iGPT).
Key insight: Language models based on the transformer architecture learn to predict the next word, or missing words, in text by unsupervised pre-training on an enormous corpus followed by supervised fine-tuning. The same approach can train models to predict the next pixel in an image.

How it works: iGPT uses the GPT-2 architecture that made waves in natural language processing. However, it learns from sequences of pixels instead of sequences of words.

The researchers preprocessed images by flattening them into one-dimensional vectors.
The researchers trained iGPT either to predict the next pixel in a sequence (an autoregressive task) or to predict a group of pixels missing from a sequence (a masked task they call BERT, after the language model).
Pre-trained NLP models often are fine-tuned on a supervised task such as question answering. Similarly, the researchers fine-tuned iGPT on image classification. They found that hiding pixels from the model during fine-tuning improved performance.
To evaluate the learned features, the researchers provided all intermediate-layer features and labels to a new output layer and trained only that new layer’s parameters, leaving the pre-trained network unchanged.
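
The preprocessing and the two pre-training objectives described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the researchers' code: the function names, the 4x4 toy image, the 15 percent mask fraction, and the -1 mask value are all assumptions made for the example.

```python
import numpy as np

def flatten_image(image):
    """Flatten an (H, W) or (H, W, C) array into a 1-D sequence in raster order."""
    return np.asarray(image).reshape(-1)

def autoregressive_pairs(seq):
    """Next-pixel prediction: the input at position t is paired with pixel t + 1."""
    return seq[:-1], seq[1:]

def masked_pairs(seq, mask_fraction=0.15, mask_value=-1, seed=0):
    """Masked prediction: hide a random subset of pixels and keep them as targets.

    mask_fraction and mask_value are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_masked = max(1, int(round(mask_fraction * seq.size)))
    positions = np.sort(rng.choice(seq.size, size=n_masked, replace=False))
    corrupted = seq.copy()
    corrupted[positions] = mask_value
    return corrupted, positions, seq[positions]

# Toy 4x4 grayscale "image" whose pixel values are 0..15 (hypothetical data).
image = np.arange(16, dtype=np.int64).reshape(4, 4)
seq = flatten_image(image)

inputs, targets = autoregressive_pairs(seq)
corrupted, positions, hidden = masked_pairs(seq)
```

In either case the model sees only a one-dimensional sequence. Nothing in the input tells it explicitly which pixels were neighbors in the original two-dimensional grid.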

Results: Using features extracted by the intermediate layers in the autoregressive task, iGPT achieved 72 percent accuracy on ImageNet, just behind the state-of-the-art 76.5 percent achieved by SimCLR, a popular unsupervised approach. iGPT outperformed SimCLR when fine-tuned and evaluated on the CIFAR datasets.
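
Evaluating extracted features with a single trainable output layer is often called a linear probe. The sketch below shows the general idea with a softmax classifier trained by gradient descent on fixed features. The feature values, learning rate, and step count are made up for illustration, and the function names are hypothetical; none of it is taken from the iGPT work.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Fit a softmax classifier on fixed features; only W and b are updated."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the highest-scoring class for each row of features."""
    return np.argmax(features @ W + b, axis=1)

# Toy stand-in for features produced by a frozen, pre-trained network.
features = np.array([[2.0, 0.0], [1.5, 0.2], [0.0, 2.0], [0.1, 1.8]])
labels = np.array([0, 0, 1, 1])

W, b = train_linear_probe(features, labels, num_classes=2)
predictions = predict(features, W, b)
```

Because the feature extractor never changes, the probe's accuracy reflects how much class information the pre-trained features already contain.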
Yes, but: The researchers had to downsample ImageNet examples to about 7 percent of their original size to accommodate GPT-2. They suspect that iGPT would stack up better against SimCLR if it could accept larger images.
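
A rough calculation suggests why image size is a bottleneck for a transformer. Self-attention compares every position in a sequence with every other, so its attention matrix grows with the square of the sequence length. The resolutions below (224x224 and 32x32) and the choice of one sequence position per color value are illustrative assumptions, not figures from the research.

```python
def sequence_length(height, width, channels=3):
    """Positions in a flattened image, assuming one position per color value."""
    return height * width * channels

def attention_entries(seq_len):
    """Entries in one self-attention matrix for a sequence of length seq_len."""
    return seq_len * seq_len

full = sequence_length(224, 224)  # hypothetical full-resolution input
small = sequence_length(32, 32)   # hypothetical downsampled input

length_ratio = full // small
attention_ratio = attention_entries(full) // attention_entries(small)
print(full, small, length_ratio, attention_ratio)
```

Under these assumptions, shrinking each side by a factor of 7 shortens the sequence by a factor of 49 and the attention matrix by a factor of 2,401.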

Why it matters: iGPT isn’t a convolutional neural network. It doesn’t even use the convolutional filter that’s fundamental to current image processing methods. This work shows the value of applying architectures proven in one domain to others.

We’re thinking: We’ve been encouraged by the progress in self-supervised learning using methods like Contrastive Predictive Coding and variations thereof, in which a neural network is trained on a supervised learning task that is created from unlabeled data. iGPT appears to be a new line of attack on this problem.
