Machine Learning Research

438 Posts


Text-Only LLM Goes Multimodal: LLMs learn to caption images, video, and audio without further training

Large language models excel at processing text but can’t interpret images, video, or audio directly without further training on those media types. Researchers devised a way to overcome this limitation.
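The method pairs a text-only LLM (the generator) with a frozen multimodal scorer in a loop: the LLM proposes captions, the scorer rates them against the test image, and the top candidates are fed back as context for the next round. A minimal sketch of that loop follows; `refine_caption`, `toy_generate`, and `toy_score` are hypothetical stand-ins, since the real system would call an actual LLM and an actual image-text scorer such as CLIP:

```python
def refine_caption(generate, score, rounds=5, k=4):
    """Generator-scorer loop: `generate` proposes k candidate captions
    given feedback from earlier rounds, `score` rates each candidate
    against the image, and the best captions become feedback."""
    best, best_score = None, float("-inf")
    feedback = []
    for _ in range(rounds):
        candidates = generate(feedback, k)
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        feedback = [c for _, c in scored[:2]]  # top-2 captions as in-context hints
        if scored[0][0] > best_score:
            best_score, best = scored[0]
    return best, best_score

# Toy stand-ins: the "scorer" rewards overlap with words describing the image,
# and the "generator" extends the best previous caption by one word.
TARGET = {"a", "cat", "sitting", "on", "mat"}
VOCAB = ["a", "cat", "dog", "sitting", "on", "mat", "grass"]

def toy_score(caption):
    return len(set(caption.split()) & TARGET)

def toy_generate(feedback, k):
    base = feedback[0] if feedback else ""
    return [(base + " " + w).strip() for w in VOCAB[:k]]
```

With these toys, the loop steadily accumulates words the scorer rewards, mirroring how the real system climbs the scorer's gradient through text alone, with no gradient updates to either model.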

Hugging Face Rolls Out Open Robot: Hugging Face acquires Pollen Robotics, launches Reachy 2 robot for open-source research

Hugging Face has made a name by providing open AI models. Now it’s providing an open robot.

OpenAI Launches Cost-Effective Alternatives: OpenAI replaces GPT-4.5 with the GPT-4.1 family, plus o3 and o4-mini, new models focused on reasoning and coding

OpenAI refreshed its roster of models and scheduled the largest, most costly one for removal.

Toward LLMs That Understand Misspellings: New byte-based model beats Llama 3 on spelling, noise, and translation

Researchers built a model that’s more robust to noisy inputs like misspellings, smarter about character-level information like the number of R's in strawberry, and potentially better able to understand unfamiliar languages that might share groups of letters with familiar languages.

Open Standard for Tool Use and Data Access Gains Momentum: OpenAI adopts Model Context Protocol to boost LLM tool integration

OpenAI embraced Model Context Protocol, providing powerful support for an open standard that connects large language models to tools and data.

Google Unveils Gemini 2.5: Google’s Gemini 2.5 Pro Experimental outperforms top AI models

Google’s new flagship model raised the state of the art in a variety of subjective and objective tests.

Better Than Trees for Tabular Data: Transformers can outperform decision trees at predicting unlabeled spreadsheet cells

If you have a collection of variables that describe, say, a medical patient, and you want to classify whether the patient likely has cancer, algorithms based on decision trees, such as gradient-boosted trees, typically perform better than neural networks.
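To make that baseline claim concrete, here is an illustrative comparison (not the transformer method discussed in the article) using scikit-learn's built-in breast-cancer dataset. On raw tabular features, gradient-boosted trees typically edge out an off-the-shelf multilayer perceptron:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# 569 patients, 30 tabular features, binary malignant/benign labels.
X, y = load_breast_cancer(return_X_y=True)

gbt_acc = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5).mean()
mlp_acc = cross_val_score(MLPClassifier(max_iter=1000, random_state=0), X, y, cv=5).mean()
print(f"gradient-boosted trees: {gbt_acc:.3f}  vs  MLP: {mlp_acc:.3f}")
```

Exact numbers vary with preprocessing (neural networks improve with feature scaling, while trees need none), which is part of why tree methods are the convenient default for tabular data.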

Better Multimodal Performance With Open Weights: Qwen2.5-Omni 7B raises the bar for small multimodal models

Alibaba’s latest open-weights system raises the bar for multimodal tasks in a relatively small model.

Llama’s Mixture of Vision-Language Experts: Meta releases Llama 4 models, claims edge over AI competitors

Meta updated its popular open-weights models, claiming performance superior to closed competitors in three size classes.

Ordinary LLMs Implicitly Take Reasoning Steps: Anthropic experiment finds Claude shows signs of unprompted reasoning

Even without explicit training in reasoning, large language models “think” in ways that may be more deliberate than previously understood.

Human Action in 3D: Stanford researchers use generated video to animate 3D interactions without motion capture

AI systems designed to generate animated 3D scenes that include active human characters have been limited by a shortage of training data, such as matched 3D scenes and human motion-capture examples. Generated video clips can get the job done without motion capture.

Interactive Voice-to-Voice With Vision: MoshiVis adds image understanding to voice-first conversations

Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.

Faster Learning for Diffusion Models: Pretrained embeddings accelerate diffusion transformers’ learning

Diffusion transformers learn faster when they can look at embeddings generated by a pretrained model like DINOv2.
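The idea is an auxiliary loss that aligns the diffusion transformer's intermediate representations with frozen embeddings from a pretrained encoder, added to the usual diffusion objective. A minimal NumPy sketch of such an alignment term follows; the shapes and the projector `W` (a stand-in for a small learned projection network) are illustrative assumptions:

```python
import numpy as np

def cosine_alignment_loss(hidden, target, W):
    """Representation-alignment term (sketch): project the diffusion
    transformer's hidden states with a learned projector W and push
    them toward frozen pretrained patch embeddings (e.g. DINOv2's)
    by maximizing cosine similarity. Total training loss would be
    diffusion_loss + lambda * this term."""
    h = hidden @ W                                     # (n_patches, d_target)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)   # unit-normalize rows
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float(-np.mean(np.sum(h * t, axis=1)))      # negative mean cosine sim

rng = np.random.default_rng(0)
hidden = rng.standard_normal((16, 32))   # toy transformer hidden states
target = rng.standard_normal((16, 8))    # toy frozen pretrained embeddings
W = rng.standard_normal((32, 8)) * 0.1   # stand-in for the learned projector
loss = cosine_alignment_loss(hidden, target, W)
```

The pretrained encoder stays frozen, so the extra term only steers the diffusion transformer's features toward representations that are already semantically organized.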

Better Images in Fewer Steps: Researchers introduce shortcut models to speed up diffusion

Diffusion models usually take many noise-removal steps to produce an image, which takes time at inference. There are ways to reduce the number of steps, but the resulting systems are less effective. Researchers devised a streamlined approach that doesn’t sacrifice output quality.
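The key move in shortcut models is conditioning the network on the desired step size as well as the timestep, so one set of weights can denoise in many small steps or a few large jumps. A toy sampler sketch under that assumption (the signature `model(x, t, d)` is illustrative, not the paper's API):

```python
def shortcut_sample(model, x, n_steps):
    """Sample with a step-size-conditioned model: `model(x, t, d)` is
    assumed to predict the update direction for a jump of size d from
    time t, so the same model supports coarse or fine step schedules."""
    d = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + d * model(x, t, d)   # one Euler-style jump of size d
        t += d
    return x
```

Because `d` is an input, a single step (`n_steps=1`) and many small steps should land near the same result; for a constant predicted direction the two schedules agree exactly.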

Vision-Language, Compact and Open: Google releases Gemma 3 vision-language models with open weights

Google updated its open-weights family of large language models to include versions that handle image and video inputs.