Noam Shazeer helped spark the latest NLP revolution. He developed the multi-headed self-attention mechanism described in “Attention Is All You Need,” the 2017 paper that introduced the transformer network. That architecture became the foundation of a new generation of models that have a much firmer grip on the vagaries of human language. Shazeer’s grandparents fled the Nazi Holocaust to the former Soviet Union, and he was born in Philadelphia in 1976 to a multi-lingual math teacher turned engineer and a full-time mom. He studied math and computer science at Duke University before joining Google in 2000. Below, he discusses the transformer and what it means for the future of deep learning.**The Batch: ***How did you become interested in machine learning?***Shazeer: **I always liked messing around with the computer and probability was one of my favorite topics. My favorite course in grad school was a seminar where the class collaborated to write a crossword puzzle solver. We got to put together all kinds of different techniques in language processing and probabilities.**The Batch: ***Was that your gateway to NLP?***Shazeer: **It was a great introduction to the field. They say a picture is worth 1,000 words, but it’s also 1 million times as much data. So language is 1,000 times more information dense. That means it’s a lot easier to do interesting stuff with a given amount of computation. Language modeling feels like the perfect research problem because it’s very simple to define (what’s the next word in the sequence?), there’s a huge amount of training data available, and it’s AI-complete. It’s great working at Google because it’s a language company.**The Batch: ***How did the idea of self-attention evolve?***Shazeer: **I’d been working with LSTMs, the state-of-the-art language architecture before transformer. There were several frustrating things about them, especially computational problems. Arithmetic is cheap and moving data is expensive on today’s hardware. If you multiply an activation vector by a weight matrix, you spend 99 percent of the time reading the weight matrix from memory. You need to process a whole lot of examples simultaneously to make that worthwhile. Filling up memory with all those activations limits the size of your model and the length of the sequences you can process. Transformers can solve those problems because you process the entire sequence simultaneously. I heard a few of my colleagues in the hallway saying, “Let’s replace LSTMs with attention.” I said, “Heck yeah!”**The Batch: ***The transformer’s arrival was hailed as “NLP’s ImageNet moment.” Were you surprised by its impact?***Shazeer: **Transformer is a better tool for understanding language. That’s very exciting, and it’s going to affect a lot of applications at Google like translation, search, and accessibility. I’ve been very pleasantly surprised by transfer learning for transformers, which really kicked off with BERT. The fact that you could spend a lot of computation and train a model once, and very cheaply use that to solve all sorts of problems.**The Batch: ***One outcome is an ongoing series of bigger and bigger language models. Where does this lead?***Shazeer: **According to the papers OpenAI has been publishing, they haven’t seen any signs that the quality improvements plateau as they make the models bigger. So I don’t see any end in sight.**The Batch: ***What about the cost of training these enormous models?***Shazeer: **At this point, computation costs 10^{-17} to 10^{-18} dollars per operation. GPT-3 was trained using 3×10^{23} operations, which would mean it cost on the order of $1 million to train. The number of operations per word is roughly double the parameter count, so that would be about 300 billion operations per word or roughly 1 millionth of a dollar per word that you analyze or produce. That doesn’t sound very expensive to me. If you buy a paperback book and read it, that costs around one ten-thousandth of a dollar per word. You can still see significant scaling up possible while finding cost-effective applications.**The Batch: ***Where do you find inspiration for new ideas?***Shazeer: **Mostly building on old ideas. And I often find myself looking at the computational aspects of deep learning and trying to figure out if you could do something more efficiently, or something better equally efficiently. I wasted a lot of time in my first few years in deep learning on things that would never work because fundamentally they weren’t computationally efficient. A lot of the success of deep learning is because it runs many orders of magnitude faster than other techniques. That’s important to understand.**The Batch: ***What’s on the horizon for NLP?***Shazeer: **It’s hard to predict the future. Translation of low-resource languages is one fun problem, and a very useful one to give way more people the opportunity to understand each other.**The Batch: ***Who is your number-one NLP hero?***Shazeer: **There have been a massive number of people standing on each other’s shoulders.**The Batch: ***But who stands at the bottom?***Shazeer: **I don’t know! From here, it looks like turtles all the way down.

# AI Transformed

Published

Reading time

3 min read