Reasoning models in the 1 billion to 2 billion parameter range typically require more than 1 gigabyte of RAM to run. Liquid AI released one that runs in less than 900 megabytes and does so with exceptional speed and efficiency.
What’s new: Liquid AI’s LFM2.5-1.2B-Thinking is designed to run on small devices. It complements base, instruction-tuned, Japanese, vision-language, and audio-language LFM2.5 variants, which debuted in January.
- Input/output: Text in (up to 32,768 tokens), text out.
- Architecture: Hybrid transformer-convolutional neural network, 1.17 billion parameters
- Performance: Matched or exceeded Qwen3-1.7B on most reasoning benchmarks while running twice as fast, requiring less memory, and generating fewer output tokens
- Features: Reasoning, tool use, eight languages (English, Arabic, Chinese, French, German, Japanese, Korean, Spanish)
- Availability: Free web user interface; weights available for download and licensed for noncommercial and commercial uses by organizations with up to $10 million in annual revenue
- Undisclosed: Training data
How it works: The architecture mixes attention layers with convolutional layers that, given a new token, process only a nearby group of tokens — rather than the entire input sequence, as attention does — and thus use less computation and memory. Small models can develop issues such as forgetting as they’re trained on successive domains. To overcome such problems, the team trained LFM2.5-1.2B-Thinking in phases, detailed after the sketch below.
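The tradeoff between convolution and attention can be illustrated with a minimal, hypothetical PyTorch sketch of such a hybrid stack. It assumes a causal depthwise short convolution and plain multi-head attention; Liquid AI’s actual layer design (gating, normalization, feed-forward blocks, and so on) isn’t described here and may differ.

```python
import torch
import torch.nn as nn

class ShortConvBlock(nn.Module):
    """Causal depthwise convolution: each new token mixes only the previous
    (kernel_size - 1) tokens, so per-token cost stays constant regardless of
    context length and no key-value cache is needed."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)  # pad so output stays causal

    def forward(self, x):                  # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2))   # (batch, dim, seq + kernel_size - 1)
        return y[..., : x.size(1)].transpose(1, 2)  # trim extra right-side positions

class AttentionBlock(nn.Module):
    """Full self-attention: each new token attends to every earlier token,
    so compute and memory grow with the whole sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

# A hybrid stack interleaves cheap convolutional blocks with a few attention blocks.
hybrid = nn.Sequential(ShortConvBlock(512), ShortConvBlock(512), AttentionBlock(512))
x = torch.randn(1, 128, 512)   # batch of 1, 128 tokens, 512-dimensional embeddings
print(hybrid(x).shape)         # torch.Size([1, 128, 512])
```

In a stack like this, the convolutional blocks handle local token mixing at fixed cost per token, while the occasional attention blocks let information flow across the entire sequence.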
- The team pretrained the model on 28 trillion tokens, up from 10 trillion for earlier variants.
- They introduced step-by-step reasoning data during mid-training, a phase after pretraining that typically uses mid-size datasets to sharpen distinct skills prior to fine-tuning.
- They continued with supervised fine-tuning on synthetic reasoning data.
- During the reinforcement-learning (RL) phase, the team produced 25 versions of the model, each specialized for a domain such as reasoning, mathematics, or tool use, and merged them into a single model. (The authors don’t describe the model-merging method; a generic example of one common approach appears after this list.) For example, after RL training in tool use, they merged the tool-use version with a math version to restore any degraded math capacity.
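One common way to merge specialized checkpoints is weighted averaging of their parameters, sketched below. This is a generic “model soup”-style example that assumes identically shaped checkpoints, not Liquid AI’s undisclosed method, and the file paths are purely illustrative.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Merge checkpoints by weighted-averaging every parameter tensor.
    Plain linear merging; Liquid AI hasn't said which method it used."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: blend a tool-use checkpoint with a math checkpoint
# to recover math skill that degraded during tool-use RL.
# tool_sd = torch.load("rl_tool_use.pt")
# math_sd = torch.load("rl_math.pt")
# merged = merge_state_dicts([tool_sd, math_sd], weights=[0.6, 0.4])
```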
Results: On Artificial Analysis’ Intelligence Index, a weighted average of 10 benchmarks, LFM2.5-1.2B-Thinking matched models of similar and larger sizes, including Qwen3-1.7B in thinking mode.
- In tests performed by Liquid AI, LFM2.5-1.2B-Thinking outperformed or matched Qwen3-1.7B in thinking mode on GPQA Diamond, IFEval, IFBench, Multi-IF, GSM8K, MATH-500, and BFCLv3. It underperformed that model on MMLU-Pro and AIME 2025.
- On all benchmarks mentioned above, LFM2.5-1.2B-Thinking outperformed Google Gemma 3 1B IT, IBM Granite-4.0-1B, IBM Granite-4.0-H-1B (a hybrid transformer/Mamba architecture), and Meta Llama 3.2 1B Instruct.
- In Liquid AI’s tests of inference speed, LFM2.5-1.2B-Thinking led the field. Running on CPUs (Samsung Galaxy S25 Ultra and AMD Ryzen AI Max+ 395), it generated output tokens roughly twice as fast as Qwen3-1.7B (without thinking mode) while using around 45 percent less memory.
Yes, but: Small models struggle with hallucinations, and LFM2.5-1.2B-Thinking underperforms competing models in this regard.
- Artificial Analysis’ AA-Omniscience test penalizes hallucinations to evaluate models on a scale from -100 to 100 (higher is better). LFM2.5-1.2B-Thinking (-83) came in behind Qwen3-1.7B in thinking mode (-78) and LFM2.5-1.2B-Instruct (-75). In contrast, Qwen3-8B in thinking mode achieved -66, and DeepSeek v3.2 in thinking mode achieved -23.
- Consequently, Liquid AI recommends using the model for “agentic tasks, data extraction, and RAG” and against using it for “knowledge-intensive tasks and programming.”
Why it matters: LFM2.5-1.2B-Thinking is well suited to drive on-device agents that orchestrate tool calls, extract data, or query local databases. Such agents need the ability to follow instructions more than encyclopedic knowledge, since they’re likely to fetch external information. They also benefit from speed to handle lengthy chains of requests and a small memory footprint that leaves room for other applications.
We’re thinking: While many developers try to pack the most intelligence into their models, LFM2.5-1.2B-Thinking strikes a balance among intelligence, inference speed, and memory requirements.