In collaboration with Predibase
Efficiently Serving LLMs
Gain a ground-up understanding of how to serve LLM applications in production.
- Learn how Large Language Models (LLMs) generate text by repeatedly predicting the next token, and how techniques like KV caching can greatly speed up generation (see the KV-cache sketch below).
- Write code to serve LLM applications efficiently to a large number of users, and examine the tradeoff between returning one user's output quickly (latency) and serving many users at once (throughput), illustrated in the batching sketch below.
- Explore the fundamentals of Low-Rank Adapters (LoRA) and see how Predibase built its LoRAX framework, an inference server that serves many fine-tuned models at once (see the LoRA sketch below).
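To make the first bullet concrete, here is a minimal sketch of greedy decoding with a KV cache, written against the Hugging Face `transformers` API. GPT-2, greedy decoding, and the 20-token budget are illustrative choices, not necessarily what the course uses; the point is that each step feeds the model only the newest token while the cache supplies the attention keys and values for everything before it.

```python
# A minimal sketch of greedy decoding with a KV cache (illustrative model choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate(prompt, max_new_tokens=20):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None  # the KV cache: attention keys/values for all prior tokens
    next_input = input_ids  # first step processes the full prompt
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # With a cache, keys/values for earlier tokens are reused, not recomputed.
            out = model(next_input, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            next_input = next_token  # subsequent steps feed only the new token
    return tokenizer.decode(generated[0])

print(generate("The quick brown fox"))
```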
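For the batching tradeoff, a toy simulation with no real model can show the essentials. The cost model below, a fixed per-step overhead plus a small per-sequence cost, is an assumption made for illustration: batching amortizes the fixed overhead across requests, raising throughput, but early requests must wait for their whole batch to finish, raising latency.

```python
# A toy simulation of the throughput/latency tradeoff in batched serving.
# All constants are hypothetical units chosen for illustration.
STEP_OVERHEAD = 1.0       # fixed cost of one forward pass, regardless of batch size
PER_REQUEST = 0.1         # marginal cost of one extra sequence in the batch
TOKENS_PER_REQUEST = 100  # tokens generated per request

def serve(num_requests, batch_size):
    total_time = 0.0
    first_response_time = None
    for start in range(0, num_requests, batch_size):
        batch = min(batch_size, num_requests - start)
        # Decode all tokens for this batch before starting the next one.
        total_time += TOKENS_PER_REQUEST * (STEP_OVERHEAD + PER_REQUEST * batch)
        if first_response_time is None:
            first_response_time = total_time
    throughput = num_requests * TOKENS_PER_REQUEST / total_time
    return throughput, first_response_time

for bs in (1, 4, 16):
    tput, latency = serve(num_requests=16, batch_size=bs)
    print(f"batch_size={bs:2d}  throughput={tput:5.2f} tokens/unit  "
          f"first responses at t={latency:6.1f}")
```

Running this, batch size 16 generates roughly six times more tokens per unit of time than batch size 1, but the first users wait more than twice as long for their output.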
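And for LoRA, a minimal plain-PyTorch sketch of the core idea: a frozen base weight W plus a per-adapter low-rank update (alpha / r) * B A, so a server can keep one base model in memory and swap tiny adapter pairs per request, the idea behind multi-adapter servers such as LoRAX. The names, sizes, and adapter values here are all hypothetical.

```python
# A minimal sketch of LoRA: frozen base weight plus a low-rank update per adapter.
import torch

d, r = 1024, 8                       # hidden size and LoRA rank (r much smaller than d)
W = torch.randn(d, d)                # frozen base weight, shared by every adapter
adapters = {                         # hypothetical per-tenant (A, B) pairs, tiny vs. W
    "customer_a": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
    "customer_b": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
}

def lora_forward(x, adapter_name, alpha=16.0):
    A, B = adapters[adapter_name]
    # Base path plus low-rank update: x @ (W + (alpha/r) * B A)^T,
    # computed without ever materializing the merged weight.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = torch.randn(2, d)
print(lora_forward(x, "customer_a").shape)  # torch.Size([2, 1024])
```

Each adapter stores 2 * r * d parameters instead of d * d, which is why one GPU can hold one base model alongside many fine-tuned variants.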