What the Hell is DeepSeek V4’s Training Recipe? A 58-Page Paper Deep Dive

I just spent two days chewing through DeepSeek’s 58-page V4 training paper. Not gonna lie—my first reaction was a mix of relief and “why didn’t they say this earlier?” Because for months, everyone’s been arguing over DeepSeek V3’s cost numbers, its MoE architecture, whether it’s really cheaper, whether it’s just a flash in the pan. The paper doesn’t answer all that directly, but it does something more interesting: it shows you how they actually built the thing, and more importantly, what they learned along the way.

Most AI model papers are either marketing brochures dressed up as science, or technical manuals for people who already know the answer. This one sits somewhere in between, which is exactly where it should be. It’s dense but not opaque. You can tell the authors really care about reproducibility, which is rare these days.

Let me walk through the three parts that genuinely surprised me.


1. The MoE Training War: They Didn’t Go Full “Dense + Sparse” Like Everyone Assumed

For the last year, the conventional wisdom has been: you train a dense model, you activate a subset of experts per token, and that’s your MoE. DeepSeek V2 and V3 did something similar. But V4? They went back to a hybrid approach that feels almost retro: they kept a dense trunk (my words) alongside the sparse experts. Why? Because pure sparse training suffers from “expert collapse”—some experts simply never get trained because the routing is too greedy. To fix that, they introduced something called a “load-aware auxiliary loss” that actively redistributes tokens across experts. Not new in theory, but the implementation details are clever: they compute a per-expert utility score every few steps and nudge routing toward under-utilized experts so those experts keep receiving tokens. This prevents collapse without the usual overhead of injecting random noise or stochastic sampling.
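To make that concrete, here’s a minimal sketch of what a load-aware balancing loss plus a periodic rebalancing step could look like. To be clear: the paper doesn’t ship code, so the utility definition, the bias update, and every name below are my guesses at the mechanism, not DeepSeek’s implementation.

```python
import torch
import torch.nn.functional as F

def load_aware_aux_loss(router_logits, top_k=2, alpha=0.01):
    """Sketch of a load-aware balancing loss (my reconstruction, not DeepSeek's code).

    router_logits: [num_tokens, num_experts] raw routing scores.
    The term is minimized when routed load and router probability mass are both
    spread evenly across experts, which keeps greedy routing from starving some of them.
    """
    probs = F.softmax(router_logits, dim=-1)                     # [T, E]
    num_experts = router_logits.size(-1)
    top_idx = probs.topk(top_k, dim=-1).indices                  # [T, k]

    # Fraction of routing slots each expert actually received (its "load").
    load = torch.zeros(num_experts, dtype=probs.dtype, device=probs.device)
    load.scatter_add_(0, top_idx.flatten(),
                      torch.ones_like(top_idx.flatten(), dtype=probs.dtype))
    load = load / top_idx.numel()

    # Average router probability assigned to each expert (its "importance").
    importance = probs.mean(dim=0)

    # Switch-Transformer-style balance term.
    return alpha * num_experts * torch.sum(load * importance)


def update_routing_bias(recent_load, bias, lr=0.1):
    """Every few steps, nudge a per-expert routing bias so experts whose recent
    load is below uniform get a small boost. This is my guess at what
    "redistribute tokens toward under-utilized experts" means operationally."""
    target = torch.full_like(recent_load, 1.0 / recent_load.numel())
    return bias + lr * (target - recent_load)   # starved experts get a positive bias
```

The point of the second function is the “every few steps” part: rather than adding noise to every routing decision, you periodically adjust a bias term and let routing stay deterministic in between.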

The kicker: they show a chart (Figure 7, page 23) where the traditional top-k routing with uniform noise actually degrades performance by ~3% on code generation tasks. Their load-aware method gives a flat 0.5% improvement over the baseline. That’s not huge, but it’s stable—no weird spikes, no collapse. For a production model, stability > theoretical perfection.

So the real story isn’t sparse vs. dense. It’s how you keep the sparse parts alive without overhead. DeepSeek V4 chose to trade a tiny bit of sparsity (keeping a dense trunk) for a whole lot of stability. Pragmatic engineering, not architectural grandstanding.


2. The Training Data Mix: They Shuffled the Deck Way More Than Expected

Everyone knows data quality matters. But the paper reveals a very specific data-mixing strategy that I haven’t seen in other large-scale training papers. They don’t just concatenate curated datasets. They actually reweight the data per training stage—not by domain, but by difficulty.

Here’s the clever bit: they built a small “scout” model (1.3B params) and trained it on a fixed set of benchmarks. Then they used that scout to score each training example: if the scout can already predict the example’s tokens with high confidence, that example gets a lower weight in the next phase. The idea is to push the big model (V4) to focus on examples where the small model still struggles. Essentially, it’s a form of curriculum learning: start easy, then gradually tilt toward harder, more ambiguous cases.
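Here’s a rough sketch of what scout-based difficulty scoring might look like. The paper only describes the idea, so the weighting function, the `scout_model` interface, and the hyperparameters are my assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def scout_difficulty_weights(scout_model, token_ids, temperature=1.0, floor=0.1):
    """Sketch of scout-based difficulty scoring (my guess at the mechanism).

    scout_model: a small causal LM returning [batch, seq_len, vocab] logits.
    token_ids:   [batch, seq_len] token ids of the candidate training examples.

    Examples the scout already predicts confidently get weights near `floor`;
    examples it struggles on get weights near 1, so the next training phase
    can emphasize them.
    """
    logits = scout_model(token_ids)                               # [B, T, V]
    # Per-example mean next-token cross-entropy under the scout.
    ce = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),                           # [B, V, T-1]
        token_ids[:, 1:],                                         # [B, T-1]
        reduction="none",
    ).mean(dim=-1)                                                # [B]

    # Low loss (easy) -> small weight, high loss (hard) -> weight close to 1.
    weights = 1.0 - torch.exp(-ce / temperature)
    return weights.clamp(min=floor)
```

In practice these weights would feed the sampler or the loss for the next phase; the paper doesn’t say whether they resample, reweight the loss, or both.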

The result? On the same budget, this data reweighting gave a consistent 1–2% lift across MMLU, HumanEval, and GSM8K. That’s not huge in terms of headline numbers, but when you’re squeezing a fixed compute budget, every percentage point is gold. More importantly, it suggests that how you filter data matters more than how much you collect. The paper lists their data sources (CommonCrawl, books, code repos, etc.), but the real secret sauce is the scoring filter.


3. The Reinforcement Learning Phase: They Killed the Reward Model

Most recent models (GPT-4, Claude 3.5, etc.) use a separate reward model trained on human preferences to guide RLHF. DeepSeek V4 did something different: they replaced the reward model with a direct preference optimization (DPO) variant that doesn’t need a separate model at all. Instead, they exploit the fact that their own model can be its own judge—given two outputs for the same prompt, the model itself assigns a preference score through an internal ranking head they trained during the alignment stage.
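Here’s roughly what such a head could look like as a module. The paper only says a ranking head exists; its shape, the pooling choice, and the names below are my assumptions, not DeepSeek’s architecture.

```python
import torch
import torch.nn as nn

class RankingHead(nn.Module):
    """Hypothetical internal ranking head: maps the hidden state of a
    completed response to a scalar preference score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: [batch, seq_len, hidden]; score the final token.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)   # [batch]


def self_judge(head, hidden_a, hidden_b):
    """Given hidden states for two candidate responses to the same prompt,
    return 0 if the model prefers response A, 1 if it prefers B."""
    return (head(hidden_b) > head(hidden_a)).long()
```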

This is the biggest design decision in the paper, and also the most controversial. Dropping the reward model saves compute and complexity, but it risks over-confidence: if the model is its own judge, it might reinforce its own biases. The paper addresses this with a “KL-regularized DPO” that tightly controls how far the policy can drift from the original model. They also show that, empirically, the DPO-trained V4 matches or slightly outperforms a version trained with a separate reward model on the same preference data.
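For reference, here’s a minimal sketch of what a KL-regularized DPO objective might look like. Standard DPO already anchors the policy to a frozen reference model through beta; the extra drift penalty below is my interpretation of “tightly controls how far the policy can drift,” not the paper’s exact formula.

```python
import torch
import torch.nn.functional as F

def kl_regularized_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps,
                            beta=0.1, kl_coef=0.05):
    """Sketch of a KL-regularized DPO loss (my reading, not DeepSeek's code).

    Each input is a [batch] tensor of summed log-probs of the chosen/rejected
    responses under the trainable policy or the frozen reference model.
    """
    # Standard DPO term: prefer the chosen response relative to the reference.
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(margin).mean()

    # Extra anchor: penalize drift from the reference on either response.
    # (A squared log-ratio standing in for a true KL term in this sketch.)
    drift = ((policy_chosen_logps - ref_chosen_logps) ** 2
             + (policy_rejected_logps - ref_rejected_logps) ** 2).mean()

    return dpo + kl_coef * drift
```

The design question is just how tight that leash is: a large kl_coef keeps the judged-by-itself policy close to the SFT model, which is exactly the safeguard against the model reinforcing its own biases.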

Now, I’ve been tracking this trend for months. Many small teams are already moving away from reward models because they’re expensive to maintain and hard to scale. DeepSeek V4 is the first major production model to publicly confirm this shift. If their results hold, we’re going to see a lot fewer reward model papers in the next year. The whole “RLHF pipeline” might shrink from three stages (SFT, reward model training, PPO) to two stages (SFT, DPO). That’s a real simplification.


So What’s the Real Takeaway for Practitioners?

I’ve been building open-source AI tools for a while, and reading this paper felt like watching a team that actually ships share their battle scars. A few things stick out:

  • Stability over sparsity. The hybrid MoE design is a pragmatic move. Don’t chase the purest theoretical form if it causes training instability.
  • Data curation is a live process, not a one-time filter. Their difficulty-based reweighting is a great example of treating data as a dynamic resource.
  • Simplifying the alignment pipeline is possible. DPO with KL regularization might be good enough for most use cases. You don’t always need a separate reward model.

Is this paper perfect? No. It skips some implementation details (e.g., exact optimizer hyperparameters, batch scaling strategies) that would make reproduction easier. But that’s the norm for closed-source models. What it does give us is a clear signal of how one of the leading teams thinks about training at scale.

And that’s worth more than a hundred benchmark tables.

Go read the paper. Or at least the MoE and RL parts. That’s where the real juice is.