Large Language Models (LLMs)

129 Posts


One Weird Trick for Better Reasoning: Researchers fine-tune LLM for reasoning with only 1,000 examples

Researchers showed that supervised fine-tuning on as few as 1,000 examples can enable a pretrained large language model to reason — and a clever gambit can boost its performance to rival that of top reasoning models.
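
A minimal sketch of the general recipe (not the authors' exact pipeline): fine-tune an off-the-shelf instruction-tuned model on a small set of reasoning traces with Hugging Face Transformers. The base model, data format, and hyperparameters below are illustrative assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any capable instruction-tuned base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Stand-in for the ~1,000 curated (question, reasoning trace, answer) examples.
records = [
    {"text": "Question: What is 17 * 24?\n"
             "Reasoning: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
             "Answer: 408"},
    # ... roughly 1,000 such examples in the real recipe
]
dataset = Dataset.from_list(records)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-reasoning",
                           num_train_epochs=3,
                           per_device_train_batch_size=1,
                           learning_rate=1e-5),
    train_dataset=tokenized,
    # Causal-LM collator: labels are built from the input ids (no masking objective).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```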

Qwen3 Takes On DeepSeek-R1: Alibaba releases the Qwen3 family of open LLMs with optional reasoning

Alibaba’s new model family may end DeepSeek-R1’s four-month reign as the top open-weights large language model.
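
For readers who want to try the optional reasoning mode, here is a hedged sketch using Hugging Face Transformers; the enable_thinking flag follows Qwen3’s model card at release and should be treated as an assumption that may change between versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # one of the released sizes
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are below 50?"}]

# enable_thinking toggles the chain-of-thought mode in Qwen3's chat template
# (per the model card at release; treat the flag name as an assumption).
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```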

Inferring Customer Preferences: LLMs boost shopping recommendations by decoding what users want

Large language models can improve systems that recommend items to purchase by inferring customer preferences.
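
The sketch below illustrates the general idea rather than the paper’s specific method: prompt a chat model to distill a purchase history into a preference profile, then use that profile to rank candidate items. The OpenAI client, model name, and prompts are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

history = ["trail running shoes", "merino wool socks", "hydration vest"]
candidates = ["GPS sports watch", "office desk lamp", "energy gels", "dress shoes"]

# Step 1: infer a short preference profile from the purchase history.
profile = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model
    messages=[{
        "role": "user",
        "content": "Purchases: " + ", ".join(history) +
                   "\nIn one sentence, describe what this customer seems to want.",
    }],
).choices[0].message.content

# Step 2: rank candidate items against the inferred profile.
ranking = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Customer profile: {profile}\n"
                   f"Rank these items from most to least relevant: {candidates}",
    }],
).choices[0].message.content

print(profile)
print(ranking)
```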

OpenAI Launches Cost-Effective Alternatives: OpenAI replaces GPT-4.5 with the GPT-4.1 family plus o3 and o4-mini, new models focused on reasoning and coding

OpenAI refreshed its roster of models and scheduled the largest, most costly one for removal.

Toward LLMs That Understand Misspellings: New byte-based model beats Llama 3 on spelling, noise, and translation

Researchers built a model that’s more robust to noisy inputs like misspellings, smarter about character-level information like the number of R's in strawberry, and potentially better able to understand unfamiliar languages that might share groups of letters with familiar languages.
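
A quick illustration of why byte-level inputs help with character-level questions; the tiktoken encoder below is only for contrast and is unrelated to the paper’s model.

```python
import tiktoken

word, typo = "strawberry", "strawbery"

# Subword view: a BPE tokenizer may merge characters, so a one-letter typo
# can yield a quite different token sequence.
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(word), enc.encode(typo))

# Byte view: the model sees every character, so the typo changes exactly one
# position, and character-level questions (e.g., counting the letter "r")
# stay answerable directly from the input.
print(list(word.encode("utf-8")))
print(word.count("r"))  # 3
```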

The Fall and Rise of Sam Altman: Inside Sam Altman’s brief ouster from OpenAI

A behind-the-scenes account provides new details about the abrupt firing and reinstatement of OpenAI CEO Sam Altman in November 2023.

Open Standard for Tool Use and Data Access Gains Momentum: OpenAI adopts Model Context Protocol to boost LLM tool integration

OpenAI embraced Model Context Protocol, providing powerful support for an open standard that connects large language models to tools and data.
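
A minimal sketch of an MCP server that exposes a single tool, assuming the FastMCP helper from the official Python SDK; module paths and decorator names follow the SDK’s README and may differ across versions.

```python
# A minimal MCP server that exposes one tool over stdio (pip install mcp).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a (canned) temperature reading for a city."""
    # A real server would call a weather API or read a local data source.
    return f"It is 21°C in {city}."

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport, so an MCP client can launch it
```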

Google Unveils Gemini 2.5: Google’s Gemini 2.5 Pro Experimental outperforms top AI models

Google’s new flagship model raised the state of the art in a variety of subjective and objective tests.

Llama’s Mixture of Vision-Language Experts: Meta releases Llama 4 models, claims edge over AI competitors

Meta updated its popular open-weights models, claiming performance superior to closed competitors in three size classes.

LLM Support for Tutors: GPT-4 boosts remote tutors’ performance in real time, study finds

Students benefit from tutoring, but training tutors is expensive. A study shows that large language models can boost tutors’ effectiveness in real time.

Vision-Language, Compact and Open: Google releases Gemma 3 vision-language models with open weights

Google updated its open-weights family of large language models to include versions that handle image and video inputs.

Some AI-Generated Works Are Copyrightable: U.S. Copyright Office says that no new laws are needed for AI-generated works

The United States Copyright Office determined that existing laws are sufficient to decide whether a given AI-generated work is protected by copyright, making additional legislation unnecessary.

DeepSeek-R1 Uncensored: Perplexity launches uncensored version of DeepSeek-R1

Large language models built by developers in China may, in some applications, be less useful outside that country because they avoid topics its government deems politically sensitive. A developer fine-tuned DeepSeek-R1 to widen its scope without degrading its overall performance.

Compact Reasoning: QwQ-32B challenges DeepSeek-R1 and other larger reasoning models

Most models that have learned to reason via reinforcement learning have been huge. A much smaller model now competes with them.

Budget for Reasoning to the Token: Claude 3.7 Sonnet adds extended thinking mode

Anthropic’s Claude 3.7 Sonnet implements a hybrid reasoning approach that lets users decide how much thinking they want the model to do before it renders a response.
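
A hedged sketch of setting a thinking budget through the Anthropic Python SDK; the model ID and the thinking parameter follow Anthropic’s documentation at launch and should be treated as assumptions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",      # model ID as published at launch
    max_tokens=4096,                          # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},  # cap on reasoning tokens
    messages=[{"role": "user", "content": "How many times does the digit 7 appear from 1 to 100?"}],
)

# The response interleaves "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```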