ChatGPT is going multimodal with help from DALL·E.
What’s new: ChatGPT will accept spoken input and reply aloud, OpenAI announced. It will also accept and generate images, thanks to integration with DALL·E 3, a new version of the company’s image generator.
How it works: The updates expand ChatGPT into a voice-controlled, interactive system that interprets and produces both text and images. New safety features are designed to protect the legal rights of artists and public figures.
- Voice input/output will give ChatGPT functionality similar to that of Apple Siri or Amazon Alexa. OpenAI’s Whisper speech recognition system will transcribe voice input into text prompts, and a new text-to-speech model will render spoken output in five distinct voice profiles. Voice interaction will be available to ChatGPT Plus and Enterprise subscribers within a couple of weeks.
- A new model called GPT-4 with Vision (GPT-4V) handles ChatGPT’s image input, a capability OpenAI demonstrated at GPT-4’s debut. Users can include images in a conversation to, say, analyze mathematical graphs or plan a meal around the photographed contents of a refrigerator. Like voice, image features will be available to paid subscribers within weeks.
- DALL·E 3 will use ChatGPT to refine prompts, and it will generate images from much longer prompts than the previous version could handle. It will produce legible text within images (rather than garbled characters or invented words). Among other safety features, it will decline prompts that name public figures or ask for art in the style of a living artist. The update will be available to paid subscribers in early October, and Microsoft Bing’s Image Creator will switch from DALL·E 2 to DALL·E 3.
- All of the new functionality will eventually roll out to free and API users.
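To make the image-in-conversation flow concrete, here is a minimal sketch of how a developer might pair a photo with a text prompt once image input reaches the API. The message shape (a `content` list mixing `text` and `image_url` parts) and the commented model name are assumptions based on OpenAI's chat API conventions, not details from the announcement:

```python
import base64

def build_vision_message(prompt: str, image_path: str) -> dict:
    """Build a chat message pairing text with an image, using the
    multimodal content-list format (an assumed payload shape)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Images can be passed inline as a base64 data URL.
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
        ],
    }

# The message would then go to the chat completions endpoint, e.g.
# (model name hypothetical):
# client.chat.completions.create(
#     model="gpt-4-vision-preview",
#     messages=[build_vision_message("What can I cook with this?",
#                                    "fridge.jpg")],
# )
```

This is the pattern behind the refrigerator example above: the model receives the photo and the question in a single turn and answers in text.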
Yes, but: OpenAI said the new voice and image capabilities are limited to English. Moreover, the system’s ability to understand and generate highly technical images is limited.
Behind the news: OpenAI introduced GPT-4 in March with a demo that translated a napkin sketch of a website into code, but Google was first to make visual input and output to a large language model widely available. Google announced visual features at May’s Google I/O conference and the public could use them by midsummer.
Why it matters: ChatGPT has already redefined the possibilities of AI for the general public, businesses, and the technical community alike. Voice input opens a world of new applications in any setting where English is spoken, and the coupling of language and vision is bound to spark applications in the arts, sciences, industry, and beyond. DALL·E 3’s safety features sound like an important step forward for image generation.
We’re thinking: The notion of generative models that "do everything" has entered the public imagination. Combining text, voice, and image generation is an exciting step in that direction.