DeepSeek open-sourced DeepSeek-R1, an LLM fine-tuned with reinforcement learning (RL) to improve reasoning capability. DeepSeek-R1 achieves results on par with OpenAI's o1 model on several benchmarks, including MATH-500 and SWE-bench.
DeepSeek-R1 is based on DeepSeek-V3, a mixture of experts (MoE) model recently open-sourced by DeepSeek. This base model is fine-tuned using Group Relative Policy Optimization (GRPO), a reasoning-oriented variant of RL. The research team also distilled DeepSeek-R1 into open-source Qwen and Llama models and released several sizes of each; these distilled models outperform larger models, including GPT-4o, on math and coding benchmarks.
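GRPO dispenses with the separate value model used in PPO-style RL: for each prompt, a group of candidate answers is sampled, and each answer's advantage is its reward measured against the group's mean and standard deviation. A minimal sketch of that group-relative advantage calculation (the function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each sampled answer's reward
    against the mean and std of its group (all samples for one prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four candidate answers to one math prompt, scored 1 if correct, else 0.
# Correct answers receive a positive advantage, incorrect ones a negative advantage.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```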
According to the DeepSeek team: [DeepSeek-R1 is] our first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process… DeepSeek-R1… excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.
To develop the model, DeepSeek started with DeepSeek-V3 as a base. They first tried fine-tuning it only with RL, without any supervised fine-tuning (SFT), producing a model called DeepSeek-R1-Zero, which they have also released. While this model develops "powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing."
To address this, the team used a short stage of SFT to prevent the "cold start" problem of RL. They collected several thousand examples of chain-of-thought reasoning to use in SFT of DeepSeek-V3 before running RL. After the RL process converged, they then collected more SFT data using rejection sampling, resulting in a dataset of 800k samples. This dataset was used for further fine-tuning and to produce the distilled Llama and Qwen models.
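Rejection sampling in this context means generating many candidate answers per prompt from the RL-tuned checkpoint and keeping only those that pass a quality or correctness check, which then serve as new SFT examples. A rough sketch of that data-collection loop, where generate and is_correct are hypothetical placeholders for the model call and the answer checker:

```python
def collect_sft_data(prompts, generate, is_correct, samples_per_prompt=16):
    """Rejection sampling: draw several answers per prompt and keep only
    the ones that pass the checker, forming new (prompt, answer) SFT pairs."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        accepted = [c for c in candidates if is_correct(prompt, c)]
        dataset.extend((prompt, answer) for answer in accepted)
    return dataset
```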
DeepSeek evaluated their model on a variety of reasoning, math, and coding benchmarks and compared it to other models, including Claude-3.5-Sonnet, GPT-4o, and o1. DeepSeek-R1 outperformed all of them on several of the benchmarks, including AIME 2024 and MATH-500.
DeepSeek-R1 Performance.
Within a few days of its release, the Chatbot Arena leaderboard showed that DeepSeek-R1 was ranked #3 overall in the arena and #1 in coding and math. It was also tied for #1 with o1 in the "Hard Prompt with Style Control" category.
Simon Willison, a co-creator of the Django framework, shared a blog post about his experiments with one of the DeepSeek distilled Llama models:
Each response starts with a <think>…</think> pseudo-XML tag containing the chain of thought used to help generate the response. [Given the prompt] "a joke about a pelican and a walrus who run a tea room together"… The joke then emerged after 20 more paragraphs of thought… [T]he joke is awful. However, the process of getting there revealed a lot about how these new models operate.
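Because the chain of thought is wrapped in a <think>…</think> tag, it can be split from the final answer before the output is shown to a user. A small illustrative sketch of that parsing step (the tag format follows Willison's description; the code itself is not from his post):

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate the <think>...</think> chain of thought from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>Pelicans have large beaks...</think>Here is the joke: ...")
```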
Andrew Ng's newsletter The Batch wrote about DeepSeek-R1:
DeepSeek is quickly gaining ground as a reliable builder of open models. These models are excellent performers, and their license allows the use of their outputs for distillation, potentially advancing the development of the state of the art for language models (and multimodal models) of all sizes.
The DeepSeek-R1 models are available on HuggingFace.
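Readers who want to try one of the distilled checkpoints can load them with the Hugging Face transformers library in the usual way. The sketch below assumes the 7B Qwen distillation; the exact model ID should be verified against the DeepSeek-R1 collection on HuggingFace:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID -- check the exact name on the DeepSeek-R1 HuggingFace page.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The generated text includes the <think>...</think> reasoning before the answer.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```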