Fast & Efficient LLM Inference with vLLM
Instructor: Cedric Clyburn
- Intermediate
- 1 Hour 38 Minutes
- 9 Video Lessons
- 3 Code Examples
What you'll learn
Apply quantization to shrink a model's memory footprint, then measure the accuracy tradeoff.
Serve a model with vLLM and see how efficiently it handles many concurrent requests using techniques like continuous batching and PagedAttention.
Benchmark your deployment and measure model quality so you can make informed tradeoffs between speed, cost, and accuracy.
About this course
Who should join?
ML engineers, platform engineers, and developers who need to deploy open-source LLMs efficiently. Familiarity with Python and basic LLM concepts (tokens, inference, GPU usage) is recommended.
Course Outline
9 Lessons・3 Code Examples
Introduction
Video・3 mins
Why Efficient LLM Deployment Matters
Video・6 mins
Inference & Memory Fundamentals
Video・14 mins
LLM Optimization Fundamentals
Video・14 mins
Optimizing a Model with LLM Compressor
Video with code examples・11 mins
Serving LLMs Efficiently with vLLM - Part I
Video・10 mins
Serving LLMs Efficiently with vLLM - Part II
Video with code examples・7 mins
Measuring What Matters: Benchmarking and Evaluation
Video with code examples・15 mins
Conclusion: Putting it All Together
Video・4 mins
Quiz
Reading・10 mins
