An open-source code generator performs comparably to the reasoning models DeepSeek-R1 and OpenAI o1 with a much smaller model.
What’s new: A team at the model platform Together.AI and Agentica, an open-source project devoted to reinforcement learning (RL), released DeepCoder-14B-Preview. The release includes weights, code, dataset, training logs, and data optimizations under an MIT license that allows noncommercial and commercial uses.
How it works: The team fine-tuned DeepSeek-R1-Distill-Qwen-14B, which distills knowledge from DeepSeek-R1 (671 billion parameters) into Qwen-14B (14 billion parameters).
- The authors curated 24,000 coding problems from TACO Verified, SYNTHETIC-1, and LiveCodeBench. They removed duplicates, problems with fewer than five unit tests, problems whose solutions failed to pass all associated unit tests, and problems that appeared in both the training and test sets (a filtering pass along these lines is sketched in the first code example after this list).
- They fine-tuned DeepSeek-R1-Distill-Qwen-14B using a streamlined reinforcement learning approach that enhanced Group Relative Policy Optimization (GRPO) with training optimizations from Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). Among other optimizations, they (i) removed the KL loss (typically used to keep the new model’s outputs from straying too far from the base model’s outputs), which eliminated the need to compute the base model’s outputs at each training step, and (ii) ignored the loss for outputs that exceeded the output length limit (16,000 tokens in the first training phase, 32,000 tokens in the second), which kept the model from being penalized for generating programs that didn’t work properly only because they had been truncated (see the loss sketch after this list).
- The authors updated the reinforcement learning library verl to improve how sampling, reward computation, and training are parallelized. Instead of alternating between sampling new outputs, computing rewards, and training (as verl does), they sampled new outputs while training on the previous batch, computing each reward immediately after sampling the corresponding output (see the pipelining sketch after this list). For coding problems, this cut total training time in half.
- To keep the model from learning behaviors that exploit flaws in the reward signal, the reward function granted a reward only when DeepCoder-14B-Preview’s output passed all 15 of a problem's most challenging unit tests (selected by input length) within 6 to 12 seconds (see the reward sketch after this list). Otherwise, the model received no reward.
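For readers who want a concrete picture of the curation step, here is a minimal sketch of such a filtering pass. The problem schema (`id`, `prompt`, `solution`, `tests`) and the `passes_all_tests` checker are illustrative assumptions, not the released pipeline.

```python
def curate_problems(problems, test_set_ids, passes_all_tests, min_tests=5):
    """Illustrative filtering in the spirit of the curation described above:
    deduplicate, require at least `min_tests` unit tests, require the provided
    solution to pass every test, and drop problems that appear in the test set.
    `passes_all_tests(solution, tests)` is a caller-supplied checker."""
    seen_prompts = set()
    curated = []
    for p in problems:
        if p["prompt"] in seen_prompts:                      # duplicate problem
            continue
        if len(p["tests"]) < min_tests:                      # too few unit tests
            continue
        if p["id"] in test_set_ids:                          # train/test contamination
            continue
        if not passes_all_tests(p["solution"], p["tests"]):  # unreliable solution
            continue
        seen_prompts.add(p["prompt"])
        curated.append(p)
    return curated
```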
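The loss modifications are easier to see in code. Below is a minimal PyTorch-style sketch of a clipped policy-gradient loss with no KL term and with truncated (overlong) outputs masked out. It assumes per-token log-probabilities and group-normalized advantages have already been computed; names and shapes are illustrative, not the DeepCoder training code.

```python
import torch

def grpo_style_loss(logprobs_new, logprobs_old, advantages, response_mask,
                    truncated, clip_eps=0.2):
    """Clipped policy-gradient loss in the GRPO/DAPO spirit described above.

    logprobs_new, logprobs_old: (batch, seq) per-token log-probabilities
    advantages: (batch,) group-normalized advantages, one per sampled output
    response_mask: (batch, seq) 1 for response tokens, 0 for padding
    truncated: (batch,) bool, True if the output hit the length limit
    """
    # Mask out truncated generations so the model isn't punished for
    # programs that failed only because they were cut off.
    keep = (~truncated).float().unsqueeze(-1) * response_mask

    ratio = torch.exp(logprobs_new - logprobs_old)            # importance ratio
    adv = advantages.unsqueeze(-1)                            # broadcast per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Note: no KL(new || base) penalty, so the base model never has to be
    # queried during training.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```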
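The pipelining idea can be sketched with a producer thread that samples and scores the next batch while the main loop trains on the previous one. The callables and the single-slot queue are an illustrative interface, not verl-pipeline's actual API.

```python
import queue
import threading

def pipelined_rl_loop(sample_batch, compute_rewards, train_step, num_iters):
    """Overlap generation and training instead of strictly alternating them.
    `sample_batch`, `compute_rewards`, and `train_step` are caller-supplied
    callables (assumed interface)."""
    ready = queue.Queue(maxsize=1)  # at most one batch sampled ahead

    def producer():
        for _ in range(num_iters):
            batch = sample_batch()           # generate rollouts
            batch = compute_rewards(batch)   # score them right away
            ready.put(batch)                 # hand off to the trainer

    threading.Thread(target=producer, daemon=True).start()

    for _ in range(num_iters):
        batch = ready.get()                  # next batch is usually already waiting
        train_step(batch)                    # train while the producer samples ahead
```

Because the producer stays at most one batch ahead, the sampler works with policy weights that are only one update stale, which is the trade-off that buys the overlap.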
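A sparse, all-or-nothing reward along these lines might look like the following sketch, which runs a candidate program against the hardest tests and returns 1.0 only if every one passes. The test schema, the per-test time limit, and the subprocess harness are assumptions for illustration.

```python
import os
import subprocess
import sys
import tempfile

def sparse_code_reward(program: str, unit_tests: list[dict],
                       num_tests: int = 15, timeout_s: float = 12.0) -> float:
    """Return 1.0 only if `program` passes the `num_tests` hardest unit tests
    (ranked by input length), otherwise 0.0. No partial credit, which
    discourages gaming easy tests. Each test is assumed to be a dict with
    'input' and 'expected' strings; the time limit is applied per test here
    for simplicity."""
    hardest = sorted(unit_tests, key=lambda t: len(t["input"]), reverse=True)[:num_tests]

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        for test in hardest:
            try:
                result = subprocess.run(
                    [sys.executable, path], input=test["input"],
                    capture_output=True, text=True, timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return 0.0
            if result.returncode != 0 or result.stdout.strip() != test["expected"].strip():
                return 0.0
        return 1.0
    finally:
        os.unlink(path)
```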
Results: DeepCoder-14B-Preview is competitive on several coding benchmarks with DeepSeek-R1 as well as proprietary models including OpenAI o3-mini and OpenAI o1, which is believed to be much larger.
- On LiveCodeBench (regularly updated coding problems), DeepCoder-14B-Preview (60.6 percent Pass@1 accuracy) was just shy of o3-mini-2025-01-31 set to low effort (60.9 percent) and slightly ahead of o1-2024-12-17 set to low effort (59.5 percent).
- On Codeforces (competitive coding problems), DeepCoder-14B-Preview (1936 CodeElo, higher is better) performed significantly better than DeepSeek-R1-Distill-Qwen-14B (1791 CodeElo). It performed comparably to o3-mini-2025-01-31 set to low effort (1918 CodeElo), o1-2024-12-17 set to low effort (1991 CodeElo), and DeepSeek-R1 (1948 CodeElo).
Why it matters: Applying reinforcement learning to coding works, but it poses two big challenges: (i) training examples of verifiable code are relatively scarce, and (ii) computing reward signals for code is time-consuming, since it requires evaluating many test cases. The team’s optimizations reduced this overhead, shrinking RL training from months to weeks. Those optimizations are built into verl-pipeline, an open-source RL library from Together.AI and Agentica, giving developers a powerful tool for model training.
We’re thinking: Kudos to the DeepCoder team for open-sourcing their reasoning recipe! A handful of companies have developed the know-how to execute RL well, but many teams still have trouble implementing it successfully. Open recipes for RL training methods and data-curation techniques are important for moving the field forward.