A pretrained text-to-image generator enabled researchers to see — roughly — what other people looked at based on brain scans.
What's new: Yu Takagi and Shinji Nishimoto developed a method that uses Stable Diffusion to reconstruct images viewed by test subjects from scans of their brains that were taken while they were looking at the images.
Diffusion model basics: During training, a text-to-image generator based on diffusion takes a text description and an image that has been adulterated with noise. An embedding model embeds the description, and a diffusion model learns to use the embedding to remove the noise in successive steps. At inference, the system starts with a text description and pure noise, and it iteratively removes noise according to the text embedding to generate an image. A variant known as a latent diffusion model saves computation by embedding the image as well and removing noise from noisy versions of the embedding instead of a noisy image.
Key insight: Stable Diffusion, like other latent diffusion text-to-image generators, uses separate embeddings of corresponding images and text descriptions to generate an image. In an analogous way, the region of the human brain that processes input from the eyes can be divided into areas that process the input’s purely sensory and semantic aspects respectively. In brain scans produced by functional magnetic resonance imaging (fMRI), which depicts cortical blood oxygenation and thus indicates neuron activity, these areas can be embedded separately to substitute for the usual image and text embeddings. Given these embeddings, Stable Diffusion can generate an image similar to what a person was looking at when their brain was scanned.
How it works: The authors trained a simple system to produce input for Stable Diffusion based on fMRI. They trained a separate version of the system for each of four subjects whose brains were scanned as they looked at 10,000 images of natural scenes.
- Given a photo with associated text, Stable Diffusion’s encoders separately embedded the photo and the text.
- The authors trained two linear regression models. One learned to reproduce Stable Diffusion’s image embedding from the part of the fMRI scan that corresponds to the brain’s early visual cortex (which detects the orientation of objects), and the other learned to reproduce Stable Diffusion’s text embedding from the part of the fMRI scan that corresponds to the ventral visual cortex (which decides the meaning of objects).
- At inference, given an fMRI scan, the linear regression models produced image and text embeddings. The authors added noise to the image embedding and fed both embeddings to Stable Diffusion, which generated an image.
Results: The authors concluded that their approach differed so much from previous work that quantitative comparisons weren’t helpful. Qualitatively, the generated images for all four subjects depict roughly the same scenes as the ground-truth images, though the details differ. For instance, compared to the ground-truth image of an airplane, the generated images appear to show something airplane-shaped but with oddly shaped windows, a cloudier sky, and blurred edges.
Why it matters: Previous efforts to reproduce visual images from brain scans required training a large neural network. In this case, the authors trained a pair of simple linear models and used a large pretrained model to do the heavy lifting.
We’re thinking: The generated images from models trained on brain scans of different subjects showed different details. The authors suggest that this disagreement may have arisen from differences in the subjects’ perceptions or differences in data quality. On the contrary, they may relate to the noise added during image generation.