Refining Words in Pictures
Z.ai’s GLM-Image blends transformer and diffusion architectures for better text in images

Reading time: 3 min read
Image: Collage with comic strip, concert poster, diagrams on water cycle and trash sorting, and movie poster.

Image generators often mangle text. An open-weights model outperforms open and proprietary competitors in text rendering.

What’s new: Z.ai released GLM-Image, an open-weights image generator that works in two stages. One stage determines an image’s layout while the second fills in details. You can try it here.

  • Input/output: Text or text plus image in; image out (1,024x1,024 to 2,048x2,048 pixels) 
  • Architecture: Autoregressive transformer (9 billion parameters) fine-tuned from the earlier GLM-4-9B-0414; decoder (7 billion parameters) based on the earlier diffusion transformer CogView4; Glyph-ByT5 text encoder
  • Features: Image alteration, style transfer, identity consistency, multi-subject consistency
  • Availability: Weights free to download for noncommercial and commercial use under an MIT license; API access $0.015 per image
  • Undisclosed: Training data

How it works: Given a text or text-and-image prompt, GLM-Image’s autoregressive model generates approximately 256 low-resolution tokens that represent the output image’s layout patch by patch, then, depending on the output resolution, 1,000 to 4,000 higher-resolution tokens that represent proportionately smaller patches. To improve text rendering, a Glyph-ByT5 text encoder produces tokens that represent the shape of each character to be rendered. The decoder takes the high-resolution tokens and glyph tokens and produces an image.
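
To make the token flow concrete, here’s a minimal Python sketch of that two-stage pipeline. Every function name, dimension, and count below is a hypothetical stand-in rather than Z.ai’s actual code or API; the stubs only mimic the shapes of the data passed between stages.

```python
# Conceptual sketch of GLM-Image's two-stage flow (hypothetical names, not the real API).
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_layout(prompt: str, n_layout_tokens: int = 256, dim: int = 64) -> np.ndarray:
    """Stand-in for the 9B autoregressive transformer: emits ~256 coarse layout tokens patch by patch."""
    return rng.normal(size=(n_layout_tokens, dim))

def autoregressive_detail(layout: np.ndarray, out_res: int = 1024, dim: int = 64) -> np.ndarray:
    """Emits 1,000 to 4,000 higher-resolution tokens; the count scales with output resolution."""
    n_detail = int(np.interp(out_res, [1024, 2048], [1000, 4000]))
    return rng.normal(size=(n_detail, dim))

def glyph_byt5_encode(text_to_render: str, dim: int = 64) -> np.ndarray:
    """Stand-in for Glyph-ByT5: one token per character, encoding its shape."""
    return rng.normal(size=(len(text_to_render), dim))

def diffusion_decode(detail_tokens: np.ndarray, glyph_tokens: np.ndarray, out_res: int = 1024) -> np.ndarray:
    """Stand-in for the 7B diffusion decoder: conditions on detail and glyph tokens, outputs RGB pixels."""
    return rng.uniform(0, 255, size=(out_res, out_res, 3)).astype(np.uint8)

prompt = 'A concert poster that reads "OPEN WEIGHTS TOUR"'
layout = autoregressive_layout(prompt)                # ~256 coarse layout tokens
detail = autoregressive_detail(layout, out_res=1024)  # 1,000 to 4,000 finer tokens
glyphs = glyph_byt5_encode("OPEN WEIGHTS TOUR")       # per-character shape tokens
image = diffusion_decode(detail, glyphs, out_res=1024)
print(layout.shape, detail.shape, glyphs.shape, image.shape)
```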

  • The team trained the two components separately using GRPO, a reinforcement learning method; a rough sketch of the reward setup appears after this list.
  • The autoregressive model learned from three rewards: (i) an unspecified vision-language model judged how well images matched prompts; (ii) an unspecified optical character recognition (OCR) model scored the legibility of generated text; and (iii) HPSv3, a model trained on human preferences, evaluated visual appeal.
  • The decoder learned from three rewards related to details: LPIPS scored how closely outputs matched reference images, an unspecified OCR model scored the legibility of generated text, and an unspecified hand-correctness model scored the anatomy of generated hands.
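
Below is a rough Python sketch of how such a composite reward could feed a GRPO-style update: several candidate images are sampled for one prompt, each is scored by stand-in judges, and each sample’s advantage is its reward relative to the rest of its group. The scorer stubs, weights, and group size are illustrative assumptions, not Z.ai’s published recipe.

```python
# Sketch of GRPO-style training with a composite reward (hypothetical scorers and weights).
import numpy as np

rng = np.random.default_rng(0)

def vlm_alignment(image, prompt) -> float:        # stand-in for the unspecified VLM judge
    return float(rng.uniform())

def ocr_legibility(image, target_text) -> float:  # stand-in for the unspecified OCR scorer
    return float(rng.uniform())

def aesthetic_score(image) -> float:              # stand-in for an HPSv3-like preference model
    return float(rng.uniform())

def composite_reward(image, prompt, target_text, w=(0.4, 0.4, 0.2)) -> float:
    """Weighted sum of prompt alignment, text legibility, and visual appeal (weights assumed)."""
    return (w[0] * vlm_alignment(image, prompt)
            + w[1] * ocr_legibility(image, target_text)
            + w[2] * aesthetic_score(image))

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO's core idea: score each sample relative to the other samples for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Sample a group of candidate images for one prompt, then compute advantages.
prompt, target_text = "A recycling infographic", "SORT YOUR TRASH"
group = [rng.uniform(0, 255, size=(64, 64, 3)) for _ in range(8)]  # placeholder "generations"
rewards = np.array([composite_reward(img, prompt, target_text) for img in group])
print(group_relative_advantages(rewards))  # positive values mark above-average samples
```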

Performance: In Z.ai’s tests, GLM-Image led open-weights models in rendering English and Chinese text, while showing middling performance in its adherence to prompts. Z.ai didn’t publish results of tests for aesthetic quality.

  • On CVTG-2K, a benchmark that tests English text rendering, GLM-Image achieved 91.16 percent average word accuracy, better than the open-weights Z-Image (86.71 percent) and Qwen-Image-2512 (86.04 percent). It also outperformed the proprietary model Seedream 4.5 (89.9 percent).
  • LongText-Bench evaluates rendering of long and multi-line text in English and Chinese. In Chinese, GLM-Image (97.88 percent) outperformed the open-weights Qwen-Image-2512 (96.47 percent) and the proprietary Nano Banana 2.0 (94.91 percent) but fell behind Seedream 4.5 (98.73 percent). In the English portion, GLM-Image (95.24 percent) nearly matched Qwen-Image-2512 (95.61 percent) but fell behind Seedream 4.5 (98.9 percent) and Nano Banana 2.0 (98.08 percent).
  • On DPG-Bench, which uses a language model to judge how well generated images match prompts that describe multiple objects with various attributes and relationships, GLM-Image (84.78 percent) outperformed Janus-Pro-7B (84.19 percent) but underperformed Seedream 4.5 (88.63 percent) and Qwen-Image (88.32 percent).

Behind the news: Zhipu AI says GLM-Image is the first open-source multimodal model trained entirely on Chinese hardware, specifically Huawei’s Ascend Atlas 800T A2. The company, which OpenAI has identified as a rival, framed the release as proof that competitive AI models can be built without Nvidia or AMD chips amid ongoing U.S. export restrictions. However, Zhipu AI did not disclose how many chips it used or how much processing was required for training, which makes it difficult to compare Huawei’s efficiency to Nvidia’s.

Why it matters: Many applications of image generation, such as producing marketing materials, presentation slides, infographics, or instructional content, require the ability to generate text. Traditional diffusion models have struggled with this. GLM-Image provides an option that developers can fine-tune or host themselves.

We’re thinking: Division of labor can yield better systems. A workflow in which an autoregressive module sketches a plan and a diffusion decoder paints the image plays to the strengths of each approach.
