Beginner
2 Hours
Erwin Huizenga
Learn state-of-the-art techniques for getting the most out of multimodal AI with Google’s Gemini model family.
Leverage the power of Gemini’s cross-modal attention to fuse information from text, images, and video for complex reasoning tasks.
Extend Gemini’s capabilities with external knowledge and live data via function calling and API integration.
About this course
Multimodal models like Gemini are pushing the boundaries of what’s possible by unifying traditionally siloed data modalities. With Gemini, you can build applications that seamlessly understand and reason across text, images, and videos, enabling a new class of intelligent systems. For example, you could build a virtual interior designer that analyzes a user’s room images, understands their style preferences from a text description, and generates personalized design recommendations. Or you could create a smart document processing pipeline that extracts structured data from complex PDFs, answers questions based on the content, and generates human-like summaries.
You’ll learn prompt engineering techniques to guide Gemini’s behavior and optimize its performance for diverse use cases, from creative story generation to analytical report writing. And you’ll discover how to integrate Gemini with external APIs and databases using function calling, enabling you to infuse your applications with real-time data and dynamic content.
What you’ll learn, in detail:
- Introduction to Gemini Models: Explore the Gemini model family and understand the key differences and use cases for Gemini Nano, Pro, Flash, and Ultra. Learn how to select the optimal model based on capability, latency, and cost.
- Multimodal Prompting and Parameter Control: Learn advanced techniques for structuring effective text-image-video prompts to elicit the desired model behavior. Fine-tune key parameters such as temperature, top_p, and top_k to balance creativity against determinism (a minimal parameter sketch follows this list).
- Best Practices for Multimodal Prompting: Gain hands-on experience with prompt engineering for Gemini multimodal models, including best practices around role assignment, task decomposition, and formatting. Analyze how prompt-image ordering affects model performance for different objectives.
- Creating Use Cases with Images: Build engaging multimodal applications like interior design assistants and receipt itemization tools. Leverage Gemini’s cross-modal reasoning capabilities to analyze relationships between entities across multiple images (see the image-prompting sketch after this list).
- Developing Use Cases with Videos: Implement “needle in a haystack” semantic video search powered by Gemini’s large context window. Explore techniques for long-form video question answering and content summarization.
- Integrating Real-Time Data with Function Calling: Extend Gemini with external knowledge and live data via function calling and API integration. Combine Gemini’s Natural Language Understanding (NLU) capabilities with APIs for up-to-date facts and interactive services (a function-calling sketch also follows this list).
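To make parameter control concrete, here is a minimal sketch using the google-generativeai Python SDK. The model name, API key placeholder, prompt, and parameter values are illustrative assumptions, not the course’s own notebook code:

```python
import google.generativeai as genai

# Assumes an API key from Google AI Studio; placeholder value.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model choice

# Lower temperature makes output more deterministic; higher makes it more
# creative. top_p and top_k restrict sampling to the most probable tokens.
response = model.generate_content(
    "Write a two-sentence product description for a smart desk lamp.",
    generation_config=genai.GenerationConfig(
        temperature=0.4,        # mostly deterministic output
        top_p=0.95,             # nucleus-sampling probability cutoff
        top_k=40,               # sample from the 40 most likely tokens
        max_output_tokens=128,  # cap the response length
    ),
)
print(response.text)
```

Raising temperature toward 1.0 trades consistency for variety, which suits creative generation better than analytical report writing.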
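For image-based use cases, a sketch along the same lines; the filename and prompt are hypothetical, and the same generate_content call accepts a mixed list of text and images:

```python
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model choice

# A mixed list of text and PIL images forms a single multimodal prompt.
room_photo = PIL.Image.open("room.jpg")  # hypothetical input image
response = model.generate_content([
    "You are an interior design assistant. Based on this photo, suggest "
    "three improvements in a modern Scandinavian style.",
    room_photo,
])
print(response.text)
```

Note how the text assigns a role and a task before the image; the best-practices module examines how this prompt-image ordering affects results.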
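Finally, a minimal function-calling sketch, again using the google-generativeai SDK. The get_exchange_rate function and its hard-coded return value are hypothetical stand-ins for a live API:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def get_exchange_rate(currency_from: str, currency_to: str) -> dict:
    """Return the latest exchange rate between two currencies."""
    # Hypothetical stand-in: a real app would call a live rates API here.
    return {"from": currency_from, "to": currency_to, "rate": 0.92}

# Plain Python functions passed as tools give the model declarations it
# can use to decide when, and with which arguments, to call them.
model = genai.GenerativeModel("gemini-1.5-pro", tools=[get_exchange_rate])

# Automatic function calling executes the requested function and feeds
# the result back, so the final reply reflects the returned data.
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message("How many euros is 100 US dollars?")
print(response.text)
```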
Through this course, you’ll become well-versed in Gemini’s capabilities and how to maximize them across different use cases, and you’ll build a portfolio of practical techniques for architecting advanced multimodal AI applications.
Note that due to technical requirements, the notebooks for this course are available on the learning platform as downloads only. You are free to download, review, and run them on your own.
Who should join?
Whether your goal is to build next-gen document understanding systems, intelligent video search tools, or interactive virtual assistants, this course will equip you with the skills to develop transformative applications.
Instructor
Erwin Huizenga
Course access is free for a limited time during the DeepLearning.AI learning platform beta!