How We’ve Been Training Foundation Models
If you wanted to train a state-of-the-art foundation model over the past few years, the playbook was pretty well established. Start with pre-training: grab a massive, largely unlabeled corpus of internet-scale text, feed it to a gigantic transformer, and let the model soak up the statistical patterns of language. This stage is computationally expensive but crucial: it’s what gives models like GPT-4, Llama, and Claude their broad generalization abilities.
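For concreteness, here is a minimal sketch of the underlying objective: next-token prediction with a cross-entropy loss. The toy tensors below stand in for a real transformer and a real corpus.

```python
# Minimal sketch of the pre-training objective: next-token prediction with
# cross-entropy over a stream of unlabeled text. Random "logits" and random
# token ids stand in for a real transformer and a real corpus.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))               # a slice of "text"
logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)  # model outputs

# Each position is trained to predict the *next* token in the stream.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),               # targets are positions 1..n-1
)
loss.backward()
print(float(loss))
```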
Once you’ve got a decently capable language model, the next step is Supervised Fine-Tuning (SFT). Here, you refine the model on high-quality labeled datasets for specific tasks. The idea dates back to GPT-1 (2018), which established fine-tuning as the standard transfer-learning step for large pre-trained transformers. Other key milestones include decaNLP (2018) and T5 (2019), which popularized multi-task, text-to-text training built on the pre-train-then-fine-tune recipe, and ULMFiT (2018), an early demonstration of how pre-trained language models could be adapted efficiently to downstream tasks.
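To make the SFT step concrete, here is a minimal sketch using the Hugging Face transformers API; the model name, toy examples, and hyperparameters are placeholders, not anything a particular lab actually used.

```python
# Minimal SFT sketch (illustrative): fine-tune a pre-trained causal LM on
# labeled prompt/response pairs with the standard next-token loss.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a much larger pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy labeled examples: in practice this is a curated instruction dataset.
examples = [
    {"prompt": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
    {"prompt": "Translate to French: Hello.", "response": "Bonjour."},
]

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for ex in examples:
    text = ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # labels = input_ids: the model learns to reproduce the target tokens.
    # (In practice the prompt tokens are usually masked out of the loss.)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```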
Finally, there’s RLHF—Reinforcement Learning from Human Feedback. This phase is what makes modern chatbots feel human-like. RLHF starts with human annotators ranking multiple model outputs; those rankings are used to train a reward model, and the language model is then optimized with reinforcement learning (typically PPO) to generate responses that the reward model, and by proxy humans, prefer.
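The ranking step is typically turned into a learned reward model via a pairwise preference loss, which PPO then maximizes. Below is a hedged sketch of that loss; the toy scoring network stands in for a real "LM backbone plus scalar head" reward model and is not any particular lab's implementation.

```python
# Sketch of the pairwise preference loss used to train an RLHF reward model.
# Given a human ranking (chosen vs. rejected response), the model is trained
# so the chosen response scores higher: loss = -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, features):  # features: pooled response embeddings
        return self.score(features).squeeze(-1)

reward_model = ToyRewardModel()
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)  # fake embeddings
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
# The trained reward model then supplies the reward signal that PPO maximizes.
```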
This stack—pre-training → SFT → RLHF—has been the standard. DeepSeek is now challenging that paradigm.
The DeepSeek Breakthrough: Reinforcement Learning First
DeepSeek-R1 flips the script. Instead of fine-tuning on a curated dataset before reinforcement learning, DeepSeek-R1-Zero skips supervised fine-tuning entirely. The model is trained via pure RL from the get-go, incentivizing reasoning capabilities directly. This is a big deal.
What’s Special About R1-Zero?
Without any SFT, DeepSeek-R1-Zero still manages to exhibit strong reasoning capabilities, particularly in Chain-of-Thought (CoT) style problem-solving. The model achieves impressive benchmarks purely through RL optimization, with no initial human-labeled data.
A few notable highlights:
CoT emerges naturally—not because it was explicitly trained on examples, but because RL incentivized longer reasoning chains.
Reflection as an emergent behavior—the model learns to step back, reassess its approach, and self-correct, all through RL.
GRPO (Group Relative Policy Optimization)—an RL technique that drops the separate critic (value) model and instead scores each sampled response relative to the rest of its group for the same prompt, improving training efficiency (see the sketch after this list).
Majority voting and test-time compute—DeepSeek-R1-Zero leverages two key test-time compute strategies to enhance accuracy and reasoning depth:
Extended Generation Length—The model autonomously allocates more compute per response, generating longer reasoning chains (hundreds to thousands of tokens) as it refines its thought process. This emergent behavior leads to higher accuracy on complex reasoning tasks.
Majority Voting—For benchmarks like AIME 2024, 64 candidate responses were generated per question, and accuracy was computed from the majority-voted final answer (the paper’s cons@64 metric).
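For the GRPO bullet above, here is a minimal sketch of the core idea: sample a group of responses per prompt, score them, and use the group-normalized reward as the advantage in place of a learned critic. Clipping, the KL penalty to a reference policy, and the real reward design are omitted, and the numbers are illustrative.

```python
# Sketch of GRPO's critic-free advantage estimate: for each prompt, sample a
# group of G responses, score them with a reward function, and normalize the
# rewards within the group. The normalized reward serves as the advantage for
# that response, replacing a learned value model.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,) — one scalar reward per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses to the same prompt, rewarded 1.0 if the final answer
# is correct and 0.0 otherwise (a simple rule-based reward, for illustration).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct responses get positive advantages, incorrect negative

# The policy gradient then weights each response's log-probabilities by its
# advantage, pushing the model toward the better responses in the group.
```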
Benchmark results for R1-Zero show remarkable performance gains across multiple reasoning tasks. On AIME 2024, it achieved a pass@1 score of 71.0%, and when using majority voting, the accuracy jumped to 86.7%, rivaling OpenAI’s o1-0912. The model also demonstrated strong performance on math (MATH-500 at 95.9%) and coding (Codeforces percentile of 60.0%). These results underscore how powerful reinforcement learning alone can be in developing advanced reasoning capabilities without the need for initial supervised fine-tuning.
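As a concrete illustration of how the majority-voting numbers are computed, the snippet below takes the final answers extracted from k sampled responses and returns the consensus answer; the answer strings are made up.

```python
# Sketch of majority voting (consensus) over k sampled answers to one question.
# In practice the strings below come from parsing the final answer out of k
# independently sampled responses.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]

sampled_answers = ["42", "41", "42", "42", "17", "42", "41", "42"]
print(majority_vote(sampled_answers))  # -> "42"
# pass@1 averages per-sample correctness; cons@k scores the voted answer instead.
```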
DeepSeek-R1: Adding a Cold Start
DeepSeek-R1 improves upon R1-Zero by incorporating a small amount of high-quality “cold start” data before RL. This helps stabilize training and improve readability. The pipeline looks like this:
Cold Start Data—a small, curated set of long Chain-of-Thought samples (on the order of thousands) used to fine-tune the base model before RL begins, stabilizing early training and improving readability.
Two RL stages—
Stage 1: Reasoning Optimization—The first RL stage focuses purely on enhancing reasoning capabilities, driving emergent CoT-style responses and improving problem-solving efficiency.
Stage 2: Alignment with Human Preferences—Once reasoning optimization stabilizes, a second RL stage fine-tunes outputs for readability and human alignment, reducing incoherence while preserving strong reasoning skills.
SFT after RL—Unlike traditional approaches, DeepSeek uses SFT to refine RL-learned behaviors rather than kickstart them. This round sits between the two RL stages and uses roughly 800K carefully curated samples: about 600K reasoning samples generated via rejection sampling from the RL-trained checkpoint (keeping only coherent, readable reasoning traces), plus about 200K samples covering non-reasoning tasks such as factual QA, writing, self-cognition, and translation.
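To ground the rejection-sampling step that produces the reasoning portion of that SFT set, here is a hedged sketch: sample several candidates from the RL-trained checkpoint and keep only those that pass a quality filter. The generation and filtering functions are placeholders for the real model call and the real acceptance criteria.

```python
# Hedged sketch of rejection sampling for SFT data curation. `generate_candidates`
# and `is_acceptable` are placeholders standing in for the RL checkpoint and the
# actual filtering criteria (e.g., correct final answer, readable reasoning).
import random

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    # Placeholder: in practice, sample n responses from the RL-trained checkpoint.
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def is_acceptable(response: str) -> bool:
    # Placeholder filter: verify the answer and reject incoherent output.
    return random.random() > 0.5

def curate_sft_data(prompts: list[str]) -> list[dict]:
    dataset = []
    for prompt in prompts:
        kept = [r for r in generate_candidates(prompt) if is_acceptable(r)]
        if kept:  # keep an accepted candidate as a training pair
            dataset.append({"prompt": prompt, "response": kept[0]})
    return dataset

print(len(curate_sft_data(["prove 1+1=2", "sort [3,1,2]"])))
```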
Distillation: Making It Even More Efficient
One of the most exciting results from DeepSeek-R1 is that distillation alone, without reinforcement learning, still produces highly capable models. By using the same 800K high-quality samples curated from RL-trained outputs, DeepSeek was able to train distilled versions of both Llama and Qwen models that demonstrated remarkable reasoning performance; the recipe is sketched after the list below.
Llama and Qwen models distilled—Distillation was successfully applied to both Llama and Qwen, producing models that outperformed their non-distilled counterparts in reasoning tasks. The distilled Qwen-32B and Llama-70B models demonstrated superior pass@1 scores in math, coding, and general reasoning benchmarks.
Distillation alone yields strong Chain-of-Thought (CoT) reasoning—The distilled models retained CoT capabilities without needing explicit reinforcement learning, highlighting how easily these reasoning skills transfer.
Distillation vs RL on Qwen—DeepSeek compared training RL directly on Qwen vs distilling R1’s output into Qwen. The latter consistently outperformed RL-only models, suggesting that knowledge transfer from a stronger model is more effective than RL on smaller models alone.
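As referenced above, the distillation recipe here is essentially supervised fine-tuning on teacher-generated traces. The sketch below shows the data-construction side under that assumption; the teacher call, the tag format, and the example are placeholders.

```python
# Sketch of the distillation recipe: a smaller "student" (e.g., a Qwen or Llama
# model) is fine-tuned with plain supervised learning on reasoning traces
# generated by the stronger "teacher". No RL is involved; the training loop is
# the same cross-entropy SFT loop shown earlier.

def build_distillation_example(prompt: str, teacher_generate) -> dict:
    """Pair a prompt with the teacher's full reasoning trace and final answer."""
    trace = teacher_generate(prompt)  # e.g., "<think>...step by step...</think>391"
    return {"prompt": prompt, "target": trace}

def fake_teacher(prompt: str) -> str:
    # Placeholder: in practice this is a call to the teacher reasoning model.
    return "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391</think>391"

example = build_distillation_example("What is 17 * 23?", fake_teacher)
print(example["target"])
# The student is then trained to reproduce `target` given `prompt`,
# exactly as in ordinary SFT.
```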
Prompting Simplicity
A major advantage across all DeepSeek models is prompting simplicity. Unlike traditional models that require extensive prompt engineering, DeepSeek models leverage a straightforward "Think-Answer" format, making them easier to integrate into applications with minimal adaptation.
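As an illustration of why integration is simple, the snippet below parses a tagged think-then-answer response so an application can hide the reasoning and surface only the final answer; the exact tag names and the example string are illustrative, not a guaranteed API contract.

```python
# Sketch of consuming a "Think-Answer" style response: the model emits its
# reasoning inside <think>...</think> tags followed by the final answer, so
# a caller can log or hide the reasoning and show only the answer.
import re

raw = "<think>The user asks for 2+2. Adding gives 4.</think>The answer is 4."

match = re.match(r"<think>(.*?)</think>(.*)", raw, flags=re.DOTALL)
reasoning, answer = (match.group(1).strip(), match.group(2).strip()) if match else ("", raw)

print("reasoning:", reasoning)
print("answer:", answer)
```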
What This Might Mean for Startups
Simple prompt—DeepSeek’s "Think-Answer" format significantly reduces the need for complex prompt engineering. Efforts in tuning prompts could instead be pushed up the stack into distilling reasoning models.
Ease of training CoT—With RL-first approaches yielding strong reasoning capabilities, smaller teams may be able to bypass expensive supervised fine-tuning, allowing them to deploy domain-specific reasoning models quickly. The bar for training high-quality reasoning models has dropped significantly.
Distillation on a small dataset—The ability to fine-tune reasoning models effectively with a relatively small dataset (800K samples) demonstrates that high performance can be achieved without massive data collection. Spending resources building a high-quality data moat seems correct.
Test-time compute optimization—The trend continues. Product teams should delineate which AI tasks require high-quality, infrequent responses (where test-time compute strategies like extended generation and majority voting can be leveraged) versus high-throughput, latency-sensitive tasks better served by smaller, fine-tuned models.
The research community is moving quickly to validate and expand on these findings, with teams like Hugging Face already diving in. Expect rapid iteration, fresh insights, and potential breakthroughs as more teams test and refine DeepSeek’s approach. Hats off to the DeepSeek team for this work!