Transformers


Graph shows Ernie-4.5 outperforming competitors in document understanding and visual reasoning tasks.

Baidu’s Multimodal Bids: Giant Ernie 5 natively generates multiple media; Ernie-4.5-VL-28B-A3B-Thinking tops Vision-Language metrics

Baidu debuted two models: a lightweight, open-weights, vision-language model and a giant, proprietary, multimodal model built to take on U.S. competitors.
Chart highlights Kimi K2’s top performance in agentic tasks, outperforming rivals in reasoning and coding.

Top Agentic Results, Open Weights: Kimi K2 Thinking outperforms proprietary models with new techniques for agentic tool use

The latest open-weights large language model from Moonshot AI challenges top proprietary LLMs at agentic tasks by executing hundreds of tool calls sequentially and pausing to think between each.
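
As a concrete illustration of that loop, here's a minimal sketch of an agent that alternates between private reasoning and sequential tool calls over many steps. call_model and run_tool are stubs invented for the example; they stand in for a real LLM API and real tools, not Kimi K2's actual interface.

```python
# Minimal agentic loop: the model "thinks", optionally requests a tool,
# observes the result, and repeats for up to hundreds of sequential steps.
# call_model and run_tool are stubs (assumptions for the example).

import json

def call_model(messages):
    """Stub LLM call. A real system would query an LLM API here."""
    if len(messages) < 6:
        return {
            "thinking": "I still need more data before answering.",
            "tool_call": {"name": "search", "arguments": {"query": "example"}},
        }
    return {"thinking": "I have enough information.", "answer": "Done."}

def run_tool(name, arguments):
    """Stub tool executor. A real system would dispatch to real tools."""
    return f"results for {json.dumps(arguments)} from {name}"

def agent_loop(task, max_steps=300):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # many sequential steps
        reply = call_model(messages)
        # Keep the model's intermediate reasoning in context between calls.
        messages.append({"role": "assistant", "content": reply["thinking"]})
        if "answer" in reply:
            return reply["answer"]
        tool = reply["tool_call"]
        observation = run_tool(tool["name"], tool["arguments"])
        messages.append({"role": "tool", "content": observation})
    return "step budget exhausted"

print(agent_loop("Compile a brief report on solar capacity trends."))
```
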
Series of graphs transformed via tokenization and transformer layers, resulting in predicted outputs.

Forecasting Multiple Time Series: Amazon’s Chronos-2 sorts out tangled variables to make better predictions

Transformers are well suited to predicting future values of time series like energy prices, wages, or weather, but often, as in those examples, multiple time series influence one another. Researchers built a model that can forecast multiple time series simultaneously.
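
For intuition, here's a toy sketch of joint forecasting: every time step carries the values of all series, so the encoder can let one series inform predictions for another. The architecture and sizes are illustrative assumptions, not Chronos-2's design.

```python
# Toy joint forecaster: all series enter the model together, so each
# forecast can depend on every other series. Not Chronos-2's architecture.

import torch
import torch.nn as nn

NUM_SERIES, CONTEXT, HORIZON = 3, 48, 12   # e.g., price, demand, temperature

class JointForecaster(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(NUM_SERIES, d_model)   # one time step = all series
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, NUM_SERIES * HORIZON)

    def forward(self, x):                    # x: (batch, CONTEXT, NUM_SERIES)
        h = self.encoder(self.embed(x))      # mixes information across time and series
        out = self.head(h[:, -1])            # predict the horizon from the last position
        return out.view(-1, HORIZON, NUM_SERIES)

model = JointForecaster()
history = torch.randn(8, CONTEXT, NUM_SERIES)   # fake batch of joint histories
print(model(history).shape)                      # torch.Size([8, 12, 3])
```
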
A performance comparison table highlights Ling-1T's success in reasoning and coding tasks against rivals.

Reasoning Without “Thinking”: All about Ant Group’s Ling-1T, an open, non-reasoning model that outperforms closed competitors

Reasoning models typically learn to undertake a separate process of “thinking” through their output before they produce a final response. Ant Group built a top non-reasoning model that can take similar steps as part of its immediate response.
Graphs compare DeepSeek models showing reduced cost per million tokens with V3.2-Exp over V3.1-Terminus.

DeepSeek Cuts Inference Costs: DeepSeek-V3.2-Exp streamlines processing using a "lightning indexer," boosting efficiency

DeepSeek’s latest large language model can cut inference costs by more than half and process long contexts dramatically faster than its predecessor.
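
The sketch below illustrates the general idea of indexer-guided sparse attention under simple assumptions: a cheap scorer ranks earlier tokens, and full attention runs only over each query's top-k picks. It's a rough illustration, not DeepSeek's implementation of the lightning indexer.

```python
# Illustrative top-k sparse attention guided by a cheap scorer. The scorer's
# output is random here; in a real system it would be a lightweight learned
# module. This is a general sketch, not DeepSeek's implementation.

import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, index_scores, top_k=64):
    """q, k, v: (T, d). index_scores: (T, T) cheap relevance scores per query."""
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = index_scores.masked_fill(~causal, float("-inf"))
    keep = scores.topk(min(top_k, T), dim=-1).indices   # top-k past tokens per query

    # Full attention is computed only over the selected (and causal) positions.
    allowed = torch.zeros(T, T, dtype=torch.bool)
    allowed.scatter_(1, keep, torch.ones_like(keep, dtype=torch.bool))
    allowed &= causal
    attn = (q @ k.T / d ** 0.5).masked_fill(~allowed, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

T, d = 512, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
cheap_scores = torch.randn(T, T)      # stand-in for the indexer's low-cost scores
print(sparse_attention(q, k, v, cheap_scores).shape)   # torch.Size([512, 64])
```
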
Energy-Based Transformer refines predictions step by step, lowering energy for higher context compatibility.

Transformers Energized: Energy-Based Transformers (EBTs) use gradient descent to gradually predict the next token

A new type of transformer can check its work. Instead of guessing the next output token in one shot like a typical transformer, it starts with a rough version of the token and improves it step by step.
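
Here's a minimal sketch of that refinement loop, with a tiny toy network standing in for the model's learned energy function: a candidate embedding is nudged downhill for a few gradient steps until it fits the context better, that is, until its energy is lower.

```python
# Energy-based refinement: start from a rough candidate and take gradient
# steps that lower a learned energy measuring how well the candidate fits
# the context. The small energy network is a toy stand-in.

import torch
import torch.nn as nn

d = 32
# Toy energy function: maps (context, candidate) to a scalar "incompatibility".
energy_net = nn.Sequential(nn.Linear(2 * d, 64), nn.GELU(), nn.Linear(64, 1))

def refine_prediction(context_vec, steps=10, lr=0.1):
    candidate = torch.randn(d, requires_grad=True)        # rough initial guess
    for _ in range(steps):
        energy = energy_net(torch.cat([context_vec, candidate])).squeeze()
        grad, = torch.autograd.grad(energy, candidate)    # d(energy)/d(candidate)
        with torch.no_grad():
            candidate -= lr * grad                        # step toward lower energy
    return candidate.detach()

context = torch.randn(d)                  # stand-in for the encoded context
print(refine_prediction(context).shape)   # torch.Size([32])
```
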
Diagram of Qwen3-Next architecture with Mixture of Experts, Gated Attention, and Gated DeltaNet layers.

Qwen3-Next Accelerates: Alibaba’s new model uses hybrid attention layers and a sparse MoE architecture for speed and performance

Alibaba updated its popular Qwen3 open-weights models with a number of fresh, speed-boosting tweaks.
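
The sketch below shows the general shape of such a hybrid stack under stated assumptions: a cheap, linear-time token mixer in most blocks (a depthwise convolution here, not Qwen3-Next's Gated DeltaNet), full attention every fourth block, and a top-k mixture-of-experts feed-forward in each block. The 3:1 ratio and all sizes are placeholders.

```python
# Hybrid stack sketch: cheap mixers in most blocks, full attention in a few,
# and a top-k MoE feed-forward everywhere. Mixer choice, ratio, and sizes
# are assumptions for illustration, not Qwen3-Next's configuration.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_experts))
        self.router = nn.Linear(d, num_experts)
        self.k = k

    def forward(self, x):
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):            # each token visits only k experts
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(1) * expert(x[mask])
        return out

def make_block(d, use_full_attention):
    mixer = (nn.MultiheadAttention(d, 4, batch_first=True) if use_full_attention
             else nn.Conv1d(d, d, 3, padding=1, groups=d))   # cheap stand-in mixer
    return nn.ModuleDict({"mixer": mixer, "moe": TopKMoE(d)})

d = 64
blocks = [make_block(d, use_full_attention=(i % 4 == 3)) for i in range(8)]

x = torch.randn(2, 128, d)
for block in blocks:
    m = block["mixer"]
    if isinstance(m, nn.MultiheadAttention):
        x = x + m(x, x, x, need_weights=False)[0]          # full attention block
    else:
        x = x + m(x.transpose(1, 2)).transpose(1, 2)       # linear-time mixer block
    x = x + block["moe"](x)
print(x.shape)    # torch.Size([2, 128, 64])
```
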
Diagram comparing sliding window attention and ATLAS memory, showing wider context tracking in ATLAS.

10 Million Tokens of Input Context: ATLAS, a transformer-like architecture, can process a context window as large as ten million tokens

An alternative to attention enables large language models to track relationships among words across extraordinarily wide spans of text.
STORM pipeline overview: Mamba layers link the image encoder and LLM, adding temporal info to tokens and reducing image tokens without losing key details.

Better Video, Fewer Tokens: STORM processes fewer tokens and still beats GPT-4o on video understanding benchmarks

Researchers reduced the number of tokens needed to represent video frames to be fed to a transformer.
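
As a rough illustration of the token-reduction step, the sketch below mixes information across time and then pools groups of consecutive frames before handing tokens to the language model. The 1-D convolution is a placeholder for STORM's Mamba layers, and the pooling factor is an assumption.

```python
# Token reduction sketch: mix information across time, then pool consecutive
# frames so far fewer visual tokens reach the language model. The temporal
# mixer here is a placeholder, not STORM's Mamba layers.

import torch
import torch.nn as nn

frames, tokens_per_frame, d = 32, 196, 512
visual_tokens = torch.randn(1, frames, tokens_per_frame, d)   # from an image encoder

temporal_mixer = nn.Conv1d(d, d, kernel_size=3, padding=1)    # stand-in for Mamba

# Mix information across time for each spatial token position.
x = visual_tokens.permute(0, 2, 3, 1).reshape(-1, d, frames)  # (196, d, frames)
x = temporal_mixer(x).reshape(1, tokens_per_frame, d, frames).permute(0, 3, 1, 2)

# Pool every 4 consecutive frames into one, cutting tokens 4x before the LLM.
pooled = x.reshape(1, frames // 4, 4, tokens_per_frame, d).mean(dim=2)
print(pooled.reshape(1, -1, d).shape)   # torch.Size([1, 1568, 512]) vs. 6,272 tokens before
```
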
Dual line graphs showing factual QA accuracy and NLL against memory size for NQ and TQA datasets in AI models.

Memory Layers for More-Factual Output: Meta researchers build Llama-style models that recall details without needing more computing resources

Improving a large language model’s factual accuracy typically requires making it bigger, which, in turn, involves more computation. Researchers devised an architecture that enables models to recall relevant details without significantly increasing the amount of computation required.
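
Here's a toy sketch of a key-value memory layer: a large table of trainable keys and values that each token queries, keeping only its top-k matches. For clarity, this version scores every key by brute force; the published design organizes keys so that the lookup itself stays cheap. Sizes are illustrative assumptions.

```python
# Toy memory layer: trainable keys and values queried with a top-k lookup,
# so only a handful of memory slots contribute to each token's update.
# Brute-force key scoring is used here for clarity only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model=128, num_slots=65_536, top_k=8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, d_model)
        scores = x @ self.keys.T                 # similarity to every memory key
        top = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top.values, dim=-1)            # (batch, seq, k)
        picked = self.values[top.indices]                  # (batch, seq, k, d_model)
        return x + (weights.unsqueeze(-1) * picked).sum(-2)

layer = MemoryLayer()
tokens = torch.randn(2, 16, 128)
print(layer(tokens).shape)    # torch.Size([2, 16, 128])
```
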
Diagram of LLM-based preference approximation and multimodal sequential recommendation for personalized product suggestions.

Inferring Customer Preferences: LLMs boost shopping recommendations by decoding what users want

Large language models can improve systems that recommend items to purchase by inferring customer preferences.
Diagram of latent transformer model using byte-level encoding, patching, and cross-attention for next-byte prediction.

Toward LLMs That Understand Misspellings: New byte-based model beats Llama 3 on spelling, noise, and translation

Researchers built a model that’s more robust to noisy inputs like misspellings, smarter about character-level information like the number of R's in strawberry, and potentially better able to understand unfamiliar languages that might share groups of letters with familiar languages.
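
The sketch below shows the byte-patching idea in its simplest form, with fixed-size patches (an assumption for the example): raw bytes are embedded individually, then pooled into patch vectors, so the main transformer sees far fewer positions than one per byte.

```python
# Byte patching sketch: group raw UTF-8 bytes into patches and embed each
# patch by pooling its byte embeddings. Fixed-size patches and mean pooling
# are simplifications for illustration.

import torch
import torch.nn as nn

d_model, patch_size = 64, 4
byte_embed = nn.Embedding(256, d_model)          # one embedding per possible byte value

def embed_patches(text):
    data = list(text.encode("utf-8"))
    data += [0] * ((-len(data)) % patch_size)    # pad to a whole number of patches
    ids = torch.tensor(data).view(-1, patch_size)    # (num_patches, patch_size)
    return byte_embed(ids).mean(dim=1)               # pool bytes into patch vectors

patches = embed_patches("strawberry has three r's")
print(patches.shape)   # torch.Size([6, 64]): 6 patches instead of 24 byte positions
```
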
TabPFN neural network diagram showing synthetic training, prediction on real-world tabular data, and attention layers.

Better Than Trees for Tabular Data: Transformers can outperform decision trees at predicting unlabeled spreadsheet cells

If you have a collection of variables that describe, say, a patient, and you want to classify the patient’s illness as likely cancer or not, algorithms based on decision trees, such as gradient-boosted trees, typically perform better than neural networks.
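
Here's a toy sketch of the in-context approach a transformer can take: labeled rows and unlabeled rows go through a single forward pass, and class predictions are read off at the unlabeled positions. It's untrained and purely illustrative; the actual model is pretrained on synthetic tables before it ever sees real-world data.

```python
# In-context tabular prediction sketch: encode labeled and unlabeled rows,
# run them through one transformer pass, and classify the unlabeled rows.
# Untrained toy; outputs are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_features, num_classes, d_model = 10, 2, 64

encode_row = nn.Linear(num_features + num_classes, d_model)   # features + label slot
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
classify = nn.Linear(d_model, num_classes)

def predict(train_X, train_y, test_X):
    labeled = torch.cat([train_X, F.one_hot(train_y, num_classes).float()], dim=-1)
    unlabeled = torch.cat([test_X, torch.zeros(len(test_X), num_classes)], dim=-1)
    rows = encode_row(torch.cat([labeled, unlabeled])).unsqueeze(0)
    h = backbone(rows)[0, len(train_X):]          # hidden states of the unlabeled rows
    return classify(h).argmax(dim=-1)

train_X, train_y = torch.randn(100, num_features), torch.randint(0, 2, (100,))
test_X = torch.randn(5, num_features)
print(predict(train_X, train_y, test_X))
```
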
Diagram of Coconut, a method training LLMs to process thought chains as vectors, comparing it to Chain-of-Thought (CoT).

Reasoning in Vectors, Not Text: Meta introduces Chain of Continuous Thought (Coconut) to improve next-token prediction

Large language models can improve their performance by generating a chain of thought (CoT), intermediate text tokens that break down the process of responding to a prompt into a series of steps. Meta’s Coconut lets a model carry out those intermediate steps as vectors rather than text.
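
The sketch below shows the core mechanism with a toy backbone standing in for an LLM: at each intermediate step, the model's last hidden state is appended back to the input as a vector rather than being decoded into a token, and text is decoded only at the end. Training details are omitted, and the tiny model here is an assumption for illustration.

```python
# Continuous "thought" steps: feed the last hidden state back as the next
# input embedding instead of decoding a token, then decode text at the end.
# Toy backbone; causal masking and training are omitted for brevity.

import torch
import torch.nn as nn

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab)

def answer_with_latent_thoughts(prompt_ids, num_thoughts=4):
    seq = embed(prompt_ids)                       # (1, T, d_model)
    for _ in range(num_thoughts):
        h = backbone(seq)
        thought = h[:, -1:, :]                    # last hidden state = one "thought"
        seq = torch.cat([seq, thought], dim=1)    # fed back as a vector, not a token
    logits = lm_head(backbone(seq)[:, -1])        # decode text only at the end
    return logits.argmax(dim=-1)

prompt = torch.randint(0, vocab, (1, 8))
print(answer_with_latent_thoughts(prompt))
```
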
X-CLR loss: training models to link text captions and image similarity.

Calibrating Contrast: X-CLR, an approach to contrastive learning for better vision models

Contrastive loss functions make it possible to produce good embeddings without labeled data. A twist on this idea makes even more useful embeddings.
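
Here's a minimal sketch of that twist, under the assumption of a softened contrastive objective: instead of binary same/different targets, the target distribution over a batch of images comes from how similar their captions are. The temperatures and the exact form of the loss are assumptions for the example.

```python
# Soft contrastive loss sketch: image-image similarities are trained to
# match a graded target distribution derived from caption similarities.
# Temperatures and the loss form are illustrative assumptions.

import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_emb, caption_emb, tau_img=0.1, tau_txt=0.1):
    """image_emb, caption_emb: (N, d) embeddings for the same N samples."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(caption_emb, dim=-1)
    logits = img @ img.T / tau_img                   # image-image similarities
    targets = F.softmax(txt @ txt.T / tau_txt, -1)   # graded targets from captions
    # Cross-entropy between the two distributions over the batch.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

images = torch.randn(16, 128)     # stand-ins for image-encoder outputs
captions = torch.randn(16, 128)   # stand-ins for text-encoder outputs
print(soft_contrastive_loss(images, captions))
```
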