Chinese artificial intelligence company Deepseek released Deepseek-V3, a new general-purpose large language model (LLM), on Dec. 24, 2024, followed on Jan. 20 by Deepseek-R1, a model built for complex reasoning tasks, along with open-source weights and a description of its training methods.
The models, which perform with accuracy similar to OpenAI’s models at a fraction of the training cost, have sent waves through the LLM community. But what makes them so much more efficient?
Kangwook Lee, an assistant professor in the University of Wisconsin-Madison’s Electrical and Computer Engineering Department, described Deepseek-R1’s performance as similar to that of OpenAI’s o1 model, the company’s newest LLM, which has more advanced reasoning ability than its earlier GPT-4o.
Like ChatGPT, Deepseek-V3 and Deepseek-R1 are very large models, with 671 billion total parameters. Only 37 billion of those parameters are activated per token, the smallest unit of data an AI model processes. Parameters and tokens are core concepts in transformer architectures, the deep learning networks that most AI models are built on.
“[Deepseek’s] model is pretty big, but they only activate a small portion of it [at test time],” Lee said. “So, the effective number of parameters being used versus the number of parameters they have is very different.”
Deepseek’s model increases efficiency by leaps and bounds
Some artificial intelligence experts believe Deepseek distilled from OpenAI, in other words, transferred knowledge from OpenAI’s older models into its own newer ones. But although Deepseek-R1 and OpenAI’s o1 model are both based on transformer architectures and use training methods like supervised fine-tuning and reinforcement learning, many of the innovations powering the two models are different.
Deepseek-V3 and Deepseek-R1 take a sparse mixture-of-experts (MoE) transformer approach rather than a dense one. Instead of using all of the model’s parameters to process each token, as the more common dense approach does, Deepseek’s models route each token to a small, specialized subset of the parameters, referred to as “experts.”
Lee likened the transformer to a circuit — the dense approach would use every component of the circuit when generating a token, whereas the sparse MoE approach would use only a small fraction of the circuit.
“During the generation time, basically, you have a single circuit… The same circuit is used to generate a single word, or token, and you keep doing it again and again,” Lee said. “Mixture-of-experts has some tag for some parts of the model or circuit, and it only enables a very small fraction of it every time you use it. There's a small model that decides which part you want to use, so there's routing inside the model: given the input, which subpart I need to use.”
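In code, that routing idea reduces to a few lines. Below is a minimal sketch in Python with NumPy; the dimensions, expert count and top-k value are illustrative placeholders, not Deepseek’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2  # illustrative sizes, not Deepseek's real ones

# Each "expert" is a small feed-forward block; here, just one weight matrix.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def moe_forward(token_vec):
    """Route one token through only TOP_K of the N_EXPERTS experts."""
    scores = token_vec @ router_w              # the routing model's scores
    top = np.argsort(scores)[-TOP_K:]          # activate only the best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # gate weights
    # The other N_EXPERTS - TOP_K experts are never computed for this token.
    return sum(g * (token_vec @ experts[e]) for g, e in zip(gates, top))

print(moe_forward(rng.standard_normal(D)).shape)  # (16,)
```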
Deepseek improved on earlier MoE designs by adding a weight, or bias, to experts that are selected less frequently, ensuring they get used in future steps and increasing the system’s efficiency.
“[MoE models] tend to see a collapse where they rely on a single expert,” Lee said. “They keep using the same sub-part again and again without using the rest of the model. [Deepseek] wanted to make use of all of the experts they have to encourage diversification and higher utilization.”
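A sketch of that balancing adjustment, continuing the toy router above: after each batch, under-used experts get their routing bias nudged up and over-used experts get it nudged down. The update rate and counting scheme are placeholders, not Deepseek’s published recipe.

```python
import numpy as np

N_EXPERTS, TOP_K, GAMMA = 8, 2, 0.01  # GAMMA: illustrative update rate
bias = np.zeros(N_EXPERTS)            # per-expert bias, used only for routing

def route(scores):
    """Pick TOP_K experts using biased scores, so neglected experts can win."""
    return np.argsort(scores + bias)[-TOP_K:]

def update_bias(counts):
    """Nudge under-used experts up and over-used experts down."""
    global bias
    bias += GAMMA * np.sign(counts.mean() - counts)

# Toy run: route a batch of random tokens, then take one balancing step.
rng = np.random.default_rng(1)
counts = np.zeros(N_EXPERTS)
for _ in range(256):
    counts[route(rng.standard_normal(N_EXPERTS))] += 1
update_bias(counts)
print(bias)  # experts picked less often now carry a positive routing bias
```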
Cross-node MoE training, common with very large models like Deepseek’s, houses different “experts” on different graphics processing units (GPUs). Although only a small number of experts process any single token, every expert must remain accessible for management purposes, according to Lee. The experts must communicate with each other across GPUs to produce a coherent output, which slows processing. Because of U.S.-imposed restrictions on selling H100 GPUs, the fastest available hardware, to countries including China, many observers assumed that non-Western companies lacked the processing power to train LLMs competitive with Western ones. Deepseek’s algorithm minimized communication between GPUs, enabling the company to train on inferior hardware with less than half the processing power.
Deepseek’s innovations jumpstart next ‘phase’ of AI
Another way that Deepseek maximized performance with limited resources was by using Multi-head Latent Attention (MLA), a strategy that compresses large vectors of data into smaller, more manageable dimensions to save memory. An attention mechanism in AI is a way of assigning different weights, or values, to specific parts of input data so that the model can focus on more important information. Essentially, the multi-head attention strategy allows the model to focus its attention on different parts of the input at once.
“[Deepseek] wanted to make it much faster,” Lee said. “These vectors are pretty big, and there are tons of them because you have a multi-head. Instead of 1000-dimensional vectors, [Deepseek] wanted to make them 50-dimensional. This kind of projection to the lower dimension without losing much information, and then going up to the original dimension after doing some processing, is not a very new technique.”
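Lee is describing a down-projection followed by an up-projection. Below is a bare-bones illustration of that compression step, using the dimensions from his example; it shows only the low-rank idea, not the full MLA attention math.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_LATENT = 1000, 50  # the dimensions from Lee's example

W_down = rng.standard_normal((D, D_LATENT)) / np.sqrt(D)       # compress
W_up = rng.standard_normal((D_LATENT, D)) / np.sqrt(D_LATENT)  # restore

x = rng.standard_normal(D)  # one attention vector
latent = x @ W_down         # only 50 numbers need to be kept in memory
restored = latent @ W_up    # expanded back up when attention needs it

print(latent.shape, restored.shape)  # (50,) (1000,)
```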
Most AI models are only taught to predict the next token, or word, given a string of data. That word is added to the previous input and used to predict the following token, and so on. But Deepseek-V3 was trained with Multi-Token Prediction, which teaches a model to predict multiple tokens at once, without feeding the first predicted token back into the input to generate the second.
“[Most models] only learn how to predict a single next word, and we never train the model to predict the next, next token,” Lee said. “But you can also train a model to predict not just the next token, but two next tokens, three next tokens or four next tokens. This idea has been floating around for about a year from the very small scale research side, showing that there are some benefits.”
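In terms of training data, the difference comes down to what the targets look like. A toy sketch, with an arbitrary prediction depth of two tokens:

```python
# Toy targets for next-token vs. multi-token prediction training.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Standard training: each position predicts only the single next token.
next_token = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Multi-token (depth 2): each position predicts the next two tokens at once,
# without feeding the first prediction back in to produce the second.
multi_token = [(tokens[:i], tokens[i:i + 2]) for i in range(1, len(tokens) - 1)]

print(next_token[1])   # (['the', 'cat'], 'sat')
print(multi_token[1])  # (['the', 'cat'], ['sat', 'on'])
```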
Deepseek primarily utilized a Floating-Point 8 (FP8) mixed precision training framework, as opposed to the more common FP16 framework. Essentially, FP8 mixed precision training let Deepseek represent numbers with fewer bits in the parts of the computation where the lower precision would not affect final accuracy, saving money on data processing.
“Typically we use 16 bits, or 32 bits, to represent a number. But [32 bits] is just too expensive. So, to squeeze more hardware out of it, people use 16 bit. That’s what the standard is. Mixed precision means sometimes you use eight bits, and sometimes you use 16 bits. So they use these amounts of bits, assigned to different components,” Lee said.
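A rough illustration of what fewer bits cost in precision: NumPy has no native 8-bit float type, so the FP8 step below is simulated by keeping about two significant digits, roughly the precision such a format carries.

```python
import numpy as np

x = np.float32(0.1234567)  # a value stored in 32 bits
y = np.float16(x)          # 16 bits: a small rounding error appears

def fake_fp8(v):
    """Crude stand-in for an 8-bit float: keep ~2 significant digits."""
    return float(f"{v:.2g}")

print(x, y, fake_fp8(x))   # precision drops as the bit budget shrinks
# Mixed precision keeps sensitive quantities, like accumulated sums, in
# 16 or 32 bits while doing the bulk of the arithmetic in fewer bits.
```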
There were differences between Deepseek and leading models in both pre-training and post-training, two separate stages of the AI training process. In pre-training, large amounts of data, like code, message board text, books and articles, are fed into the transformer model, which learns to generate similar data.
In post-training, the AI learns to generate specific answers to user queries. Deepseek-R1 used a post-training technique called the long Chain-of-Thought method, in which queries are answered through multiple steps, or chains, of logic that build toward a final solution. Deepseek-R1 was the first openly published large model to use this method and perform well on benchmark tests. It used two types of supervised fine-tuning after the reinforcement learning step to enhance the model, which is atypical: most models apply supervised fine-tuning before reinforcement learning.
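Schematically, a long Chain-of-Thought answer separates the visible reasoning steps from the final reply. The tags and wording below are invented for illustration, not Deepseek’s exact output format.

```python
# A schematic chain-of-thought response: intermediate logic steps build
# toward the final solution. Tags and wording here are illustrative only.
response = """<think>
Step 1: The train covers 120 miles in 2 hours, so its speed is 60 mph.
Step 2: At 60 mph for 5 hours, it covers 60 * 5 = 300 miles.
</think>
The train travels 300 miles."""

# Only the text after the reasoning block is presented as the answer.
print(response.split("</think>")[-1].strip())
```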
Lee was most impressed by the differences in pre-training, like using FP8 mixed-precision training, an MoE model, and MLA.
“All of the other players out there are using an almost identical solution in terms of architecture, training algorithms, everything,” Lee said. “They’re racing to see who's going to scale better, and they've been mostly focusing on how to make better data. For the pre-training part, most of them were doing the same thing. Each change [Deepseek] introduced is something that existed before, but they made use of these good ideas that were developed in the past but somehow faded away, and they found a really good combination to solve their practical challenge.”
In the past few months, among other research, Lee’s lab has been trying to recreate OpenAI’s o1 model on a small-scale computing system. But OpenAI never released open-source software for its models, complicating Lee’s research. Deepseek’s open-source code provided insights into the methods used to produce both working AI models.
“Back then they didn't say too much about the training recipes,” Lee said. “Reinforcement learning is one of the keywords they shared, but they didn't talk about the details, and there were four or five different speculations floating around. I picked one of those few speculations, and my lab students and I worked on reproducing what [OpenAI’s] o1 model did. It turns out that OpenAI used a different idea — it came out just before we submitted the paper. But now we want to see whether those two different methods can have a synergistic effect.”
Reinforcement learning is a tool common in post-training for AI models, in which the model learns from feedback on its own outputs, getting rewarded for good ones and penalized for bad ones. Lee described reinforcement learning as playing a board game with the AI model.
“The current state of the board is the input state,” Lee said. “Given the state, I take an action. Then it updates the state because the opponent will also play the game. Now that I see the new state, I take another action… This is repeated again and again in the game.”
When the game ends, the winner’s actions are treated as good actions. Winning sequences of moves are scored positively and losing ones negatively within the model. The model is incentivized to repeat positively scored actions and avoid negatively scored ones, much like positive and negative reinforcement in psychology. Every new “game” generates a new data set.
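Lee’s board-game framing can be compressed into a toy script: play many games with random moves, then score each move by whether its player went on to win. The game and scoring table below are invented for illustration; real systems train a neural network rather than filling in a lookup table.

```python
import random

random.seed(0)
prefs = {}  # (state, action) -> running score; a stand-in for a policy

for _ in range(1000):
    state, history, player = 0, [], +1
    while abs(state) < 3:                  # game ends when one side reaches 3
        action = random.choice([1, 2])     # each move pushes the state
        history.append((state, action, player))
        state += player * action
        player = -player                   # the opponent moves next
    winner = +1 if state >= 3 else -1
    for s, a, p in history:
        reward = 1 if p == winner else -1  # winner's moves scored as good
        prefs[(s, a)] = prefs.get((s, a), 0) + reward

# Moves with positive scores are the ones a model would learn to repeat.
print(sorted(prefs.items())[:4])
```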
Lee split the development of new inventions into phases: in the first, high-risk ideas are explored and one is selected; in the second, those ideas are improved upon. Deepseek’s R1 model seemed to signal a move to the second phase earlier than many researchers anticipated, according to Lee.
“I would say this is more like a natural transition between phase one and phase two,” Lee said. “Phase one is more developing new ideas, exploring crazy ideas, finding the pathway. But then, who’s going to drive faster?”
Deepseek’s arrival brought other new AI innovations in its wake, especially because releasing open-source model weights invited all developers to suggest improvements. New models include Chinese manufacturer Alibaba’s Qwen, which the company claims surpasses Deepseek-R1. OpenAI’s ChatGPT also added a Reason feature that closely resembles the Chain-of-Thought structure in Deepseek-R1. With Deepseek-V3 and Deepseek-R1, Deepseek upset the balance among current AI powerhouse companies and set a new precedent for AI training efficiency.