There’s more. To make its use of reinforcement learning as efficient as possible, DeepSeek has also developed a new algorithm called Group Relative Policy Optimization (GRPO). It first used GRPO a year ago to build a model called DeepSeekMath.
We’ll skip the details. You just need to know that reinforcement learning involves calculating a score to determine whether a potential move is good or bad. Many existing reinforcement-learning techniques require a whole separate model to make this calculation. In the case of large language models, that means a second model that could be as expensive to build and run as the first. Instead of using a second model to predict a score, GRPO just makes an educated guess. It’s cheap, but still accurate enough to work.
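To make that idea concrete, here is a minimal sketch in Python of the group-relative scoring trick (illustrative code with made-up names, not DeepSeek’s implementation): the model samples several answers to the same prompt, and each answer is judged against the average of its own group rather than against a separate learned model.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Score each sampled answer against its own group.

    The shortcut: instead of a learned critic (a second large model)
    estimating how good each answer is, the baseline is simply the
    average reward of a group of answers sampled for the same prompt.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    spread = rewards.std() + 1e-8  # avoid division by zero
    return (rewards - baseline) / spread

# Toy example: 4 answers sampled for one math prompt,
# rewarded 1.0 if the final answer checks out, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
# Answers that beat the group average get a positive advantage and are
# reinforced; the rest are pushed down. No critic model is needed.
```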
A common approach
DeepSeek’s use of reinforcement learning is the main innovation the company describes in its R1 paper. But DeepSeek is not the only firm experimenting with this technique. Two weeks before R1 dropped, a team at Microsoft Asia announced a model called rStar-Math, which was trained in a similar way. “It also gets huge jumps in performance,” explains Matt Zeiler, founder and CEO of the AI firm Clarifai.
AI2’s Tülu was also built using efficient reinforcement-learning techniques (but on top of, not instead of, human-led steps such as supervised fine-tuning and RLHF). And the US firm Hugging Face is racing to reproduce R1 with OpenR1, a clone of DeepSeek’s model that Hugging Face hopes will expose even more of the ingredients in R1’s special sauce.
What’s more, it’s an open secret that top firms such as OpenAI, Google DeepMind, and Anthropic may already be using their own versions of DeepSeek’s approach to train their next generation of models. “I’m sure they’re doing almost the exact same thing, but they’ll have their own flavor of it,” says Zeiler.
But DeepSeek has more than one trick up its sleeve. It trained its base model V3 to do something called multi-token prediction, where the model learns to predict a string of words at once rather than one at a time. This training is cheaper and turns out to boost accuracy as well. “If you think about how you speak, when you’re halfway through a sentence, you know what the rest of the sentence is going to be,” says Zeiler. “These models should be capable of that too.”
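As a rough sketch of the idea (toy PyTorch code, not DeepSeek’s actual architecture; the head count and dimensions are made up), multi-token prediction can be implemented by giving the model several small output heads, each trained to predict a token further into the future, and summing their losses:

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, n_future = 100, 32, 2  # predict 2 tokens ahead

# One small head per future position, on top of the model's shared hidden states.
heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)])
loss_fn = nn.CrossEntropyLoss()

def multi_token_loss(hidden, targets):
    """hidden: (batch, seq, hidden_dim) states from the trunk of the model.
    targets: (batch, seq) token ids. Head k is trained to predict the
    token k+1 positions ahead of each state."""
    total = 0.0
    for k, head in enumerate(heads):
        logits = head(hidden[:, : -(k + 1)])   # states that have a target k+1 steps ahead
        future = targets[:, k + 1 :]           # the tokens k+1 steps later
        total = total + loss_fn(logits.reshape(-1, vocab_size), future.reshape(-1))
    return total

hidden = torch.randn(4, 16, hidden_dim)
targets = torch.randint(0, vocab_size, (4, 16))
print(multi_token_loss(hidden, targets))
```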
It has also found cheaper ways to create large data sets. To train last year’s model, DeepSeekMath, it took a free data set called Common Crawl, a huge collection of documents scraped from the internet, and used an automated process to extract just the documents that included math problems. This was far cheaper than building a new data set of math problems by hand. It was also more effective: Common Crawl includes a lot more math than any other specialist math data set available.
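A toy version of that kind of automated filter might look like the following (an illustrative keyword heuristic only; the real pipeline described in the article would be far more sophisticated and would run at Common Crawl scale):

```python
import re

# Crude stand-in for an automated filter over a web crawl: keep only
# documents that look like they contain math problems.
MATH_HINTS = re.compile(
    r"(\\frac|\\sum|\bintegral\b|\bequation\b|\btheorem\b|\bprove that\b|\d+\s*[\+\-\*/=]\s*\d+)",
    re.IGNORECASE,
)

def looks_like_math(document: str, min_hits: int = 2) -> bool:
    """Keep a document if it contains enough math-like signals."""
    return len(MATH_HINTS.findall(document)) >= min_hits

corpus = [
    "Prove that the sum 1 + 2 + ... + n equals n(n+1)/2.",
    "Top ten travel destinations for the summer.",
]
math_docs = [doc for doc in corpus if looks_like_math(doc)]
print(math_docs)  # only the first document survives the filter
```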
And on the hardware side, DeepSeek has found new ways to squeeze more out of older chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half of their innovation comes from straight engineering, says Zeiler: “They definitely have very good GPU engineers on that team.”