Mixture of Experts (MoE)


[Image: GLaM model architecture]

Efficiency Experts: Mixture of Experts Makes Language Models More Efficient

The emerging generation of trillion-parameter language models takes significant computation to train. Activating only a portion of the network at a time can cut that requirement dramatically and still achieve exceptional results.
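To make the sparse-activation idea concrete, here is a minimal sketch of a mixture-of-experts feed-forward layer with top-2 routing. It is written in NumPy with made-up dimensions and names (w_router, experts, moe_ffn), not GLaM's actual implementation; the point is only that each token touches two of the eight experts, so most of the layer's parameters sit idle on any given token.

```python
# Minimal sketch (illustrative, not GLaM's code) of a sparsely activated
# feed-forward layer: a router picks the top-2 experts per token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts, top_k = 64, 256, 8, 2

# Router and expert weights (shapes and names are assumptions for the sketch).
w_router = rng.normal(0, 0.02, (d_model, num_experts))
experts = [
    (rng.normal(0, 0.02, (d_model, d_ff)), rng.normal(0, 0.02, (d_ff, d_model)))
    for _ in range(num_experts)
]

def moe_ffn(x):
    """x: (tokens, d_model) -> (tokens, d_model), using only top-k experts per token."""
    logits = x @ w_router                                  # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]           # indices of the top-k experts
    out = np.zeros_like(x)
    for e in range(num_experts):
        mask = (top == e).any(-1)                          # tokens routed to expert e
        if not mask.any():
            continue
        w_in, w_out = experts[e]
        h = np.maximum(x[mask] @ w_in, 0.0)                # expert feed-forward block (ReLU)
        out[mask] += probs[mask, e:e + 1] * (h @ w_out)    # weight output by router probability
    return out

tokens = rng.normal(size=(16, d_model))
print(moe_ffn(tokens).shape)  # (16, 64): each token used only 2 of 8 experts
```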
[Image: Graphs of Switch Transformer results]

Bigger, Faster Transformers: Increasing parameters without slowing down transformers

Performance on language tasks rises with model size, yet as a model's parameter count grows, so does the time it takes to generate output. New work pumps up the number of parameters without slowing down the network.
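As a rough illustration of that trade-off, the snippet below uses made-up layer dimensions and helper names (ffn_params, per_token_flops; none of this is from the paper) to show why Switch-style top-1 routing lets total parameters grow with the number of experts while per-token compute stays roughly flat, router overhead aside.

```python
# Back-of-the-envelope arithmetic with illustrative numbers: more experts
# means more parameters, but each token still passes through exactly one
# expert, so per-token FLOPs in the layer do not grow.
d_model, d_ff = 1024, 4096

def ffn_params(num_experts):
    # Each expert is a standard two-matrix feed-forward block.
    return num_experts * (d_model * d_ff + d_ff * d_model)

def per_token_flops():
    # Top-1 routing: one expert per token, so per-token compute matches
    # a dense feed-forward layer (router cost ignored).
    return 2 * (d_model * d_ff + d_ff * d_model)

for n in (1, 8, 64):
    print(f"{n:3d} experts: {ffn_params(n):>12,} params, "
          f"{per_token_flops():>12,} FLOPs/token")
```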
