Large Language Models (LLMs)

144 Posts

Diagram comparing LLM answers with and without hints. Hints may influence LLM output without being mentioned in reasoning traces.

Reasoning for No Reason: Anthropic finds chain-of-thought reasoning traces may omit key influences

Does a reasoning model’s chain of thought explain how it arrived at its output? Researchers found that often it doesn’t.

BitNet b1.58 matrix multiplication shows ternary weights enabling faster neural network computation.

Large Language Models (LLMs)

Low Precision, High Performance: Researchers at Microsoft and Tsinghua researchers propose 1.58-bit AI model that rivals full-precision competitors

Reducing the number of bits used to represent each parameter in a neural network from, say, 16 bits to 8 bits shrinks the network’s size and boosts its speed. Researchers took this approach to an extreme: They built a competitive large language model whose weights are limited to three values.

Apple AI models outperform rivals in instruction accuracy and human text evaluations across devices and servers.

Large Language Models (LLMs)

Apple Sharpens Its GenAI Profile: Apple updates its on-device and cloud AI models, introduces a new developer API

Apple revamped two vision-language models in a bid to catch up with fast-moving competitors.

Diagram showing AI pipeline using OCR and LLMs to detect racist clauses in historic California property deeds.

Large Language Models (LLMs)

LLM Rights Historical Wrongs: Stanford and Princeton researchers fine-tune a language model to identify racial discrimination in property

In Northern California, old property deeds may still include racial clauses: language, made illegal decades ago, that was designed to ban people of color from owning or living in certain homes.

OpenAI o3-pro outperforms o3 and o1-pro on math, science, and coding benchmarks, but responds much more slowly.

Large Language Models (LLMs)

Better Video, Fewer Tokens: STORM Processes Fewer Tokens And Still Beats GPT-4o On Video Understanding Benchmarks

Researchers reduced the number of tokens needed to represent video frames to be fed to a transformer.

Charts from Mary Meeker’s “Trends — Artificial Intelligence” report show global AI user growth, rising AI job demand, and revenue lagging behind compute costs.

Large Language Models (LLMs)

AI Market Trends in Charts and Graphs: Venture Capitalist Mary Meeker Revives Her Trend Reports With a Deep Dive Into the AI Boom

Renowned investment analyst Mary Meeker is back with a report on the AI market, six years after publishing her last survey of the internet.

Bar graph comparing AI model accuracies for AIME 2024-2025, GPQA, LiveCodeBench, Aider, and Humanity's Last Exam.

Large Language Models (LLMs)

Next-Level DeepSeek-R1: DeepSeek-R1’s update leads all open models and brings it up to date with the latest from Google and OpenAI

DeepSeek updated its groundbreaking DeepSeek-R1 large language model to strike another blow for open-weights performance.

DeepSeek computation diagram showing transformer blocks, multi-head attention, and routing, using FP8 and BF16 precision.

Large Language Models (LLMs)

How DeepSeek Did It: Researchers describe training methods and hardware choices for DeepSeek’s V3 and R1 models

DeepSeek made headlines late last year, when it built a state-of-the-art, open-weights large language model at a cost far lower than usual. The upstart developer shared new details about its method.

AI model performance comparison chart: Claude Opus 4, Sonnet 4, Sonnet 3.7, OpenAI o3, GPT-4.1, and Gemini 2.5 Pro.

Large Language Models (LLMs)

Claude 4 Advances Code Generation: Anthropic debuts new Claude 4 Sonnet and Claude 4 Opus models, featuring top benchmarks in coding

Anthropic continued its tradition of building AI models that raise the bar in coding tasks.

Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.

Large Language Models (LLMs)

4-Bit Efficiency, 16-Bit Accuracy: Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.

Chat interface discussing code error with special character filenames. Terminal shows Unix commands for troubleshooting.

Large Language Models (LLMs)

Your Robot Dev Team: OpenAI introduces Codex, a multi-agent cloud-based software engineering tool in ChatGPT

OpenAI launched an agentic software-development system.

Dual line graphs showing factual QA accuracy and NLL against memory size for NQ and TQA datasets in AI models.

Large Language Models (LLMs)

Memory Layers for More-Factual Output: Meta researchers build Llama-style models that recall details without needing more computing resources

Improving a large language model’s factual accuracy typically requires making it bigger, which in turn, involves more computation. Researchers devised an architecture that enables models to recall relevant details without significantly increasing the amount of computation required.

Comparison table of AI models ranked by LCB score and Codeforces rating with percentiles for competitive programming.

Large Language Models (LLMs)

Open, Compact Code Generator: DeepCoder-14B-Preview further fine-tunes reasoning models for coding

An open-source code generator performs comparably to the reasoning models DeepSeek-R1 and OpenAI o1 with a much smaller model.

Table comparing AI model accuracy on math and reasoning benchmarks including AIME, HMMT, OmniMath, GPQA-D, and Codeforces.

Large Language Models (LLMs)

Reasoning Models With Recipes: Microsoft unveils training details for Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning

Microsoft published its latest recipe for training reasoning models, substantially expanding what is still a fairly small base of public knowledge.

Large Language Models (LLMs)

Reasoning for No Reason: Anthropic finds chain-of-thought reasoning traces may omit key influences

Low Precision, High Performance: Researchers at Microsoft and Tsinghua researchers propose 1.58-bit AI model that rivals full-precision competitors

Apple Sharpens Its GenAI Profile: Apple updates its on-device and cloud AI models, introduces a new developer API

LLM Rights Historical Wrongs: Stanford and Princeton researchers fine-tune a language model to identify racial discrimination in property

More Reasoning for Harder Problems: OpenAI debuts o3-pro, an updated reasoning model that applies more tokens at inference

Better Video, Fewer Tokens: STORM Processes Fewer Tokens And Still Beats GPT-4o On Video Understanding Benchmarks

AI Market Trends in Charts and Graphs: Venture Capitalist Mary Meeker Revives Her Trend Reports With a Deep Dive Into the AI Boom

Next-Level DeepSeek-R1: DeepSeek-R1’s update leads all open models and brings it up to date with the latest from Google and OpenAI

How DeepSeek Did It: Researchers describe training methods and hardware choices for DeepSeek’s V3 and R1 models

Claude 4 Advances Code Generation: Anthropic debuts new Claude 4 Sonnet and Claude 4 Opus models, featuring top benchmarks in coding

4-Bit Efficiency, 16-Bit Accuracy: Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision

Your Robot Dev Team: OpenAI introduces Codex, a multi-agent cloud-based software engineering tool in ChatGPT

Memory Layers for More-Factual Output: Meta researchers build Llama-style models that recall details without needing more computing resources

Open, Compact Code Generator: DeepCoder-14B-Preview further fine-tunes reasoning models for coding

Reasoning Models With Recipes: Microsoft unveils training details for Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning

Subscribe to The Batch