Short Course・Intermediate・1 hour 24 mins

AI Agents for Image and Video Generation

Instructors: Katie Nguyen, Wafae Bakkali

Earn an accomplishment with PRO

  • 9 Video Lessons
  • 6 Code Examples
  • 1 Graded Assignment PRO

What you'll learn

  • Build AI agents that generate images and video, evaluate the output automatically, and loop back to iterate when results miss the mark.

  • Apply three evaluation techniques: image-text similarity, LLM-based judges, and structured rubrics.

  • Build two agents: a UI mockup agent driven by brand guidelines, and a multi-scene video explainer agent.
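
The generate, evaluate, and iterate pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: `refine_loop`, `Attempt`, and the three stub functions are hypothetical names, and the stubs stand in for a real image model, a real evaluator, and a real prompt rewriter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    prompt: str
    output: str    # stand-in for a generated image or video artifact
    score: float   # evaluator score, assumed to lie in [0, 1]


def refine_loop(
    brief: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    revise: Callable[[str, float], str],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Attempt:
    """Generate, score, and retry until a score clears the threshold."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    best: Optional[Attempt] = None
    prompt = brief
    for _ in range(max_rounds):
        output = generate(prompt)
        # Always judge against the original brief, not the revised prompt.
        score = evaluate(brief, output)
        if best is None or score > best.score:
            best = Attempt(prompt, output, score)
        if score >= threshold:
            break
        prompt = revise(prompt, score)
    return best


# Hypothetical stubs so the sketch runs without any model or API access.
def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def fake_evaluate(brief: str, output: str) -> float:
    # Pretend that each added detail improves the result.
    return min(1.0, 0.5 + 0.2 * output.count("[detail]"))


def fake_revise(prompt: str, score: float) -> str:
    return prompt + " [detail]"


result = refine_loop(
    "login screen in brand colors", fake_generate, fake_evaluate, fake_revise
)
print(result.prompt, round(result.score, 2))
```

Keeping the best attempt rather than the last one means the loop still returns something usable when no round reaches the threshold within `max_rounds`.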

About this course

Join our new short course, AI Agents for Image and Video Generation, built in partnership with Google and taught by Katie Nguyen, Developer Relations Engineer at Google Cloud AI, and Wafae Bakkali, Staff Generative AI Specialist at Google.

Most agents you’ve worked with probably produce text. But whether you’re building a product demo, a website asset, or an explainer video, you’re working with visual media. With models like Google’s Nano Banana for images and Veo for video, generating a single output from a prompt is straightforward. The harder problem is producing high-quality results consistently at scale, and the bottleneck there is evaluation: there is no single correct answer to compare against, so quality depends on context and use case.

In this course, you’ll learn three complementary evaluation techniques, then combine them with image and video generation to build autonomous media agents. You’ll build an image agent that turns brand guidelines into UI mockups, and a video agent that plans multi-scene explainers, animates reference frames with synchronized audio, and checks consistency across scenes. In the final lesson, you’ll use Gemini CLI to build a generative media agent in natural language, packaging what you’ve learned into reusable agent skills.

In detail, you’ll:

  • Get a clear mental model of the generative media landscape and the architectures behind image, video, and audio generation.
  • Engineer prompts for high-quality images and video, using techniques like LLM-enhanced prompting, reference images, and starting frames.
  • Build evaluation pipelines that combine SigLIP image-text similarity scores, LLM-based judges, and structured rubrics to assess output at scale.
  • Build an image agent that turns brand guidelines into UI mockups, generating, evaluating, and iterating until designs pass your bar.
  • Build a video agent that plans multi-scene explainers, generates and animates reference frames with audio, and evaluates temporal consistency.
  • Package what you’ve learned into reusable agent skills, and use Gemini CLI to build a generative media application from natural language prompts.
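
As a rough illustration of how several evaluation signals might be combined into one structured rubric, the sketch below averages per-criterion scores by weight and flags any criterion that falls under a hard floor. Everything here is an assumption made for the example: the `Criterion` class, the `score_rubric` function, the criterion names, and the idea that each signal (a similarity score, a judge rating) has already been normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float    # relative importance in the overall score
    minimum: float   # hard floor; scoring below it fails the criterion


def score_rubric(
    criteria: List[Criterion], scores: Dict[str, float]
) -> Tuple[float, List[str]]:
    """Return the weighted average score and the names of failed criteria."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * scores[c.name] for c in criteria) / total_weight
    failed = [c.name for c in criteria if scores[c.name] < c.minimum]
    return weighted, failed


# Hypothetical rubric for a UI mockup.
rubric = [
    Criterion("prompt_alignment", weight=2, minimum=0.3),  # e.g. image-text similarity
    Criterion("brand_colors", weight=2, minimum=0.5),      # e.g. LLM judge rating
    Criterion("legibility", weight=1, minimum=0.5),        # e.g. LLM judge rating
]

# Made-up scores for one generated mockup, for illustration only.
example_scores = {"prompt_alignment": 0.6, "brand_colors": 0.9, "legibility": 0.4}

overall, failed = score_rubric(rubric, example_scores)
print(round(overall, 2), failed)
```

Reporting failed criteria separately from the average lets an agent reject an output that scores well overall but misses one requirement, and tells it which aspect to target on the next iteration.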

By the end, you’ll be ready to build agents that generate visual media, evaluate it, and iterate to improve outputs.

Who should join?

This course is for AI builders who want to extend agentic workflows beyond text into visual media. Familiarity with Python and basic experience working with LLM APIs are recommended.

Course Outline

9 Lessons・6 Code Examples

Instructors

Katie Nguyen

Developer Relations Engineer at Google Cloud AI

Wafae Bakkali

Staff Generative AI Specialist at Google

Additional learning features, such as quizzes and projects, are included with DeepLearning.AI Pro.
