Grok 4 Shows Impressive Smarts, Questionable Behavior

Grok 4 launches with benchmark records and idiosyncratic behavior

Grok 4 achieves high scores on benchmarks for reasoning, coding, and science, outperforming Gemini, Claude, and OpenAI models.

xAI updated its Grok vision-language model and published impressive benchmark results. But, like earlier versions, Grok 4 showed questionable behavior right out of the gate.

What’s new: The update to xAI’s flagship vision-language model, which operates the chatbot integrated with the X social media platform, comes in two versions: Grok 4, which improves the earlier version’s knowledge, reasoning, and voice input/output, and Grok 4 Heavy, an agentic mode intended to solve more-demanding reasoning tasks. Like its predecessor, Grok 4 is designed to produce output that may challenge conventional wisdom, particularly by weighing posts written by X users, including X owner Elon Musk.

  • Input/output: Text, images in and out (app up to 128,000 tokens; API up to 256,000 tokens) 
  • Architecture: Mixture of experts transformer, 1.7 trillion parameters
  • Features: Reasoning, web search, code execution, structured outputs, improved voice mode 
  • Availability: Grok 4 $30 per month, Grok 4 Heavy $300 per month, API $3.00/$0.75/$15.00 per 1 million input/cached/output tokens
  • Undisclosed: Architectural details, training methods, training datasets, pretraining knowledge cutoff
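The per-token rates above translate directly into per-request costs. Here is a minimal sketch of that arithmetic in Python, assuming the listed rates; the function and call pattern are illustrative, not part of any xAI SDK.

```python
# Rates in USD per 1 million tokens, taken from the pricing above.
RATES = {"input": 3.00, "cached": 0.75, "output": 15.00}

def estimate_cost(input_tokens, cached_tokens=0, output_tokens=0):
    """Estimate the USD cost of one API request at the listed rates.

    Illustrative only; a real bill depends on xAI's actual metering.
    """
    return (
        input_tokens * RATES["input"]
        + cached_tokens * RATES["cached"]
        + output_tokens * RATES["output"]
    ) / 1_000_000

# Example: 100k fresh input tokens, 50k cached, 10k generated.
print(f"${estimate_cost(100_000, 50_000, 10_000):.4f}")  # → $0.4875
```

Note that cached input is billed at a quarter of the fresh-input rate, so prompt caching dominates savings for long, repeated system prompts.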

How it works: xAI has not yet published a model card or described how it built Grok 4. However, it did reveal broad outlines.

  • Training the new model consumed more than an order of magnitude more processing power than training the previous version.
  • Grok 4 was pretrained to predict the next token in math, coding, and other data. It was fine-tuned via reinforcement learning on chain-of-thought reasoning. Unlike Grok 3, it was trained to use certain tools. In a launch video, Musk promised to provide more sophisticated tools, such as finite element analysis and flow dynamics, later in the year.
  • Grok 4 Heavy spawns multiple agents that process input independently, in parallel. The agents compare findings and decide on the best answer. Musk said they determine the best answer not by majority vote but by “comparing notes.”
  • On the day of Grok 4’s launch, users reported that the model, when asked its opinion on the Israeli-Palestinian conflict, searched X for Musk’s statements on these issues and replied accordingly. Later, asked to give its surname with no other text, Grok 4 consistently replied “Hitler.” A subsequent report explored the model’s lack of conventional guardrails.
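The Grok 4 Heavy pattern described above can be sketched in a few lines: several agents answer in parallel, then a reconciliation step selects a final answer. xAI has not published its mechanism, so the agent function and the confidence-based selection below are stand-ins for whatever “comparing notes” actually does.

```python
# Sketch of parallel agents plus a reconciliation step. The agent
# function and the scoring heuristic are hypothetical stand-ins;
# xAI has not disclosed how Grok 4 Heavy reconciles its agents.
from concurrent.futures import ThreadPoolExecutor

def agent_answer(question, seed):
    # Stand-in for an independent model call; a real agent would
    # sample the model with different settings per seed.
    answers = {0: "42", 1: "42", 2: "41"}
    return {"answer": answers[seed], "confidence": 0.9 - 0.1 * seed}

def reconcile(results):
    # "Comparing notes" rather than majority vote: here, a simplistic
    # stand-in that trusts the agent reporting the highest confidence.
    return max(results, key=lambda r: r["confidence"])["answer"]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(agent_answer, "What is 6 * 7?", s) for s in range(3)]
    results = [f.result() for f in futures]

print(reconcile(results))  # → 42
```

Because the agents run independently, this pattern trades roughly N times the compute for a chance to catch errors any single run would make, which is consistent with Grok 4 Heavy’s higher price and benchmark gains.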

Performance: Tests conducted by xAI and third parties show that Grok 4’s performance on popular benchmarks matches or exceeds that of leading AI models.

  • Tested by Artificial Analysis, Grok 4 outperformed Anthropic Claude 4 Opus, Google Gemini 2.5 Pro, OpenAI o3-pro, and DeepSeek-R1 on GPQA Diamond (scientific reasoning), LiveCodeBench (coding), and AIME 2024 (competition math). It tied with Claude 4 Opus for the top spot on MMLU-Pro, came in behind o4-mini set to high reasoning effort on SciCode (coding), and came in fourth on HumanEval (coding).
  • In xAI’s tests, on ARC-AGI-2, a test of abstract reasoning, Grok 4 (15.9 percent) set a new state of the art, nearly double that of its closest competitor, Claude Opus 4 (8.6 percent). On Humanity’s Last Exam (PhD-level questions in subjects that include math, engineering, and physics), Grok 4 (25.4 percent without tools, 38.6 percent with tools) outperformed Google’s Gemini 2.5 Pro (21.6 percent without tools, 26.9 percent with tools) and OpenAI’s o3 (21 percent without tools, 24.9 percent with tools). On the same test, Grok 4 Heavy without tools achieved 44.4 percent.  
  • In speed tests by Artificial Analysis, Grok 4 (73 tokens per second) fell well behind the speediest models, such as Google Gemini 2.5 Flash-Reasoning (374 tokens per second), but came in ahead of Claude 4 Opus Thinking (68 tokens per second) and DeepSeek-R1 0528 (24 tokens per second).

Behind the news: Grok 4’s debut was clouded by reports the previous week that Grok 3 had posted antisemitic statements and praised Adolf Hitler. xAI said a code update caused the model to rely too heavily on extremist views from users of the X platform. The company deleted the offensive posts and apologized. That mishap follows a series of similar outputs in recent months. xAI attributed some of them to rogue employees who had circumvented the company’s code-review process to modify the chatbot.

Why it matters: The xAI team has built a series of high-performing models in record time. If its performance lives up to the promise of its benchmark results, Grok 4 could set new standards. That said, the previous version has been fragile and prone to misbehavior, and xAI has shown a worrisome tendency to modify its models without following its own stated protocols.

We’re thinking: Last year, Musk said that xAI “will open source its models, including weights and everything,” and as it created each new version, it would open the prior version. Open source is a huge boon to AI, and we hope xAI will resume its open releases.
