Short Course・Intermediate・1 Hour 38 Minutes

Fast & Efficient LLM Inference with vLLM

Instructor: Cedric Clyburn

Red Hat
  • Intermediate
  • 1 Hour 38 Minutes
  • 9 Video Lessons
  • 3 Code Examples

What you'll learn

  • Apply quantization to shrink a model's memory footprint, then measure the accuracy tradeoff.

  • Serve a model with vLLM and see how efficiently it handles many concurrent requests using techniques like continuous batching and PagedAttention.

  • Benchmark your deployment and measure model quality so you can make informed tradeoffs between speed, cost, and accuracy.
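The first point above, shrinking the weights and then measuring what was lost, can be illustrated with a toy example. This is a minimal sketch of symmetric round-to-nearest int8 quantization in plain Python. It is not the algorithm LLM Compressor uses, and the function names and the randomly generated "weights" are assumptions made purely for illustration.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # One shared scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weight tensor: 1024 small Gaussian values, not taken from a real model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Footprint if the originals were stored as 16-bit floats versus 8-bit integers.
bytes_fp16 = len(weights) * 2
bytes_int8 = len(q) * 1

# A simple stand-in for "accuracy tradeoff": the worst-case reconstruction error.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"fp16: {bytes_fp16} bytes, int8: {bytes_int8} bytes")
print(f"max reconstruction error: {max_error:.6f} (scale = {scale:.6f})")
```

Round-to-nearest keeps every reconstruction error within half a quantization step (`scale / 2`), which is why a single large outlier weight, by inflating the scale, hurts precision for all the other weights.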

About this course

Introducing Fast & Efficient LLM Inference with vLLM, a short course built in partnership with Red Hat and taught by Cedric Clyburn, Senior Developer Advocate at Red Hat.

Serving open-source LLMs efficiently, for many users at low latency and reasonable cost, comes down mostly to memory management. Two things compete for that memory: the model weights and the KV cache. A 70-billion-parameter model takes around 140 GB of memory at 16-bit precision just for the weights, while the KV cache grows with every request you serve. In this course, you'll learn to shrink the weights through quantization and to serve the model with vLLM, a widely adopted open-source serving system, taking advantage of its memory management techniques, such as PagedAttention and prefix caching.

You'll run the full optimize-deploy-benchmark workflow on a real model: compressing an open-source Qwen model with LLM Compressor, serving it with vLLM, and benchmarking your deployment under realistic traffic using GuideLLM and lm-eval.

In detail, you'll:

  • Understand why efficient LLM deployment matters, what happens during inference, what the KV cache is, and how the GPU memory hierarchy shapes performance.

  • Explore LLM optimization fundamentals and how compression techniques like weight and activation quantization improve a model's throughput and latency while preserving accuracy.

  • Use LLM Compressor to quantize a full-precision model, compare its size before and after, and use perplexity to measure whether the compressed model still performs well.

  • Learn the three core techniques behind modern LLM serving: continuous batching to keep the GPU busy, PagedAttention to manage the KV cache without waste, and prefix caching to skip recomputation when requests share content.

  • Connect to a vLLM inference server, send requests through the OpenAI-compatible API, and watch vLLM's memory management techniques working live in the metrics.

  • Benchmark your deployment under load with GuideLLM and evaluate model quality with lm-eval.

By the end, you'll have run the full optimize-deploy-benchmark workflow on a real model and built the intuition to navigate the tradeoffs between accuracy, speed, and cost.
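The memory arithmetic behind the 140 GB figure can be checked in a few lines. The sketch below assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes. The KV cache estimate uses a hypothetical model shape (80 layers, 8 key/value heads, head dimension 128) and a hypothetical workload, neither of which comes from the course, so treat the numbers as illustrative only.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Weights: 70 billion parameters at different precisions.
params = 70e9
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label} weights: {weight_memory_gb(params, bits):.0f} GB")

# KV cache: hypothetical model shape, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Hypothetical workload: 32 concurrent requests, each holding 4096 tokens of context.
concurrent_requests = 32
tokens_per_request = 4096
kv_total_gb = per_token * tokens_per_request * concurrent_requests / 1e9

print(f"KV cache per token: {per_token} bytes")
print(f"KV cache for the workload: {kv_total_gb:.1f} GB")
```

Unlike the weights, which are a fixed cost, the KV cache term scales with both the number of concurrent requests and their context lengths, which is why serving-side memory management matters as much as compression.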

Who should join?

ML engineers, platform engineers, and developers who need to deploy open-source LLMs efficiently. Familiarity with Python and basic LLM concepts (tokens, inference, GPU usage) is recommended.

Course Outline

9 Lessons・3 Code Examples
  • Introduction

    Video・3 mins

  • Why Efficient LLM Deployment Matters

    Video・6 mins

  • Inference & Memory Fundamentals

    Video・14 mins

  • LLM Optimization Fundamentals

    Video・14 mins

  • Optimizing a Model with LLM Compressor

    Video with code examples・11 mins

  • Serving LLMs Efficiently with vLLM – Part I

    Video・10 mins

  • Serving LLMs Efficiently with vLLM – Part II

    Video with code examples・7 mins

  • Measuring What Matters: Benchmarking and Evaluation

    Video with code examples・15 mins

  • Conclusion: Putting it All Together

    Video・4 mins

  • Quiz

    Reading・10 mins

Instructor

Cedric Clyburn

Senior Developer Advocate at Red Hat

Additional learning features, such as quizzes and projects, are included with DeepLearning.AI Pro.