Preface#
In the past six months, a breakup sapped my desire to write. Later, I followed my research group into image recognition and algorithm analysis, autonomous-driving path planning, and intelligent vehicle control algorithms, which halted updates entirely. Thanks to the support of the Dean of Computer Science and our discussions of recently popular directions, I was invited to give a technical talk on large models, which reminded me to update this blog.
In this blog post, I will focus on three topics:
- From cloud computing power to edge computing power.
- Data privacy in the AI era.
- Computing power demands in the AI era.
The content covers the newly released QwQ, training methods for large language models, EXO-based sharing of computing power over the network, and global computing power statistics.
From Cloud Computing Power to Edge Computing Power#
Let's introduce this topic through an Apple product: Apple Intelligence.
In June 2024, at the Worldwide Developers Conference, Apple announced:
> The personal intelligent system Apple Intelligence introduces powerful generative models for iPhone, iPad, and Mac.
Voice assistants are the most typical application scenario for natural-language large models. After ChatGPT's explosive popularity, Huawei, Xiaomi, and many other companies began training and releasing their own voice assistants based on large language models, while Apple announced an integration with GPT. However, for policy reasons, Apple's AI features still cannot function normally in mainland China.
That changed in February of this year:

> Apple will integrate Alibaba's large language model and other AI services for iPhones and other devices in China. Apple announced the integration of Alibaba's AI large model for the Chinese versions of its products.
>
> CNBC
At the time, many people were puzzled. DeepSeek was at the peak of its popularity, so why did Apple choose to integrate Alibaba's Qwen series rather than the open-source DeepSeek? Apple's stock even fell when it made this decision, and NVIDIA's stock had already dropped sharply because of DeepSeek. Why would Apple choose this path?
Before answering that question, let's look at DeepSeek. It offers powerful performance, and its advantage lies in being open source and comparatively lightweight: just 671B parameters are enough to compete with the world's most advanced closed-source models. But how much GPU memory does deploying it take?
A whopping 700 GB! To deploy the DeepSeek-R1 671B version with normal output speed, we might need about 700 GB of GPU memory, roughly ten A100 graphics cards, costing around 2 million RMB.
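As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. The bytes-per-parameter and overhead figures are illustrative assumptions, and the estimate covers weights only, ignoring the KV cache and activations:

```python
# Rough VRAM estimate for serving a large model; weights only.
# bytes_per_param and overhead are illustrative assumptions.
def vram_estimate_gb(n_params_billion: float,
                     bytes_per_param: float = 1.0,   # FP8 quantization
                     overhead: float = 1.05) -> float:
    return n_params_billion * bytes_per_param * overhead

# DeepSeek-R1 671B in FP8: ~705 GB of weights alone,
# i.e. roughly ten 80 GB A100s before counting the KV cache.
print(vram_estimate_gb(671))        # ~704.6
# QwQ-32B quantized to 4 bits (0.5 bytes/param): ~17 GB,
# which fits on a single 24 GB RTX 4090.
print(vram_estimate_gb(32, 0.5))    # ~16.8
```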
Although DeepSeek greatly reduced the computing power required compared with the early GPT-3.5 and GPT-4, those models could only be used in the cloud, relying on OpenAI's massive compute clusters, which were out of reach for most companies. Fortunately, the revolution sparked by OpenAI's products allowed DeepSeek to be born, and thanks to DeepSeek's open-source nature, newer models could stand on the shoulders of giants, leading to the birth of QwQ!
On March 6, 2025, Alibaba's Qwen team released the open-source QwQ large model, with only 32B parameters!
The powerful QwQ, with its mere 32B parameters (small by large-model standards), is enough to rival the full DeepSeek-R1 671B, and by some measures even surpasses it! QwQ also answers why Apple chose Alibaba: a personal user needs only a single 4090 graphics card to run it, cutting the cost from about 2 million RMB to about 20,000. But is that the reality? The media have been hyping it lately, but what is the truth?
The Alibaba team described QwQ as follows:
> This is a model with 32 billion parameters whose performance can rival that of DeepSeek-R1, which has 671 billion parameters (of which 37 billion are activated).
>
> "QwQ-32B: Experience the Power of Reinforcement Learning"
Indeed, QwQ only compared its performance against DeepSeek-R1's 37 billion activated parameters... right?
Architecture#
One thing I did not mention at the technical talk is that QwQ's comparison is against DeepSeek-R1's 37 billion activated parameters, and this is a significant part of what I want to discuss: the architectures of DeepSeek and of QwQ, which builds on its approach. (Since two non-specialist teachers were invited to the talk and the audience asked us questions, I had to keep things simple and accessible, so many details were omitted there.)
QwQ can run on consumer-grade computing devices thanks to its Dense architecture, while DeepSeek uses a Mixture-of-Experts architecture, specifically a hybrid MoE. Judging by the results alone, the Dense architecture seems to perform better. In reality, that's not the case; we need to start with the architectural details.
What are the differences between these two architectures?
The Mixture of Experts (MoE) model is an architecture that divides the model into multiple expert sub-networks and dynamically selects the appropriate expert for computation based on the characteristics of the input data. Each "expert" has strong processing capabilities in a specific domain, and MoE intelligently selects the appropriate expert for computation based on task requirements. This mechanism significantly enhances the model's expressiveness and flexibility while ensuring a smaller computational overhead. Especially when facing large-scale datasets, the MoE model can avoid redundant computations by precisely selecting different experts to handle specific tasks, thereby effectively reducing resource consumption.
To put it simply, the MoE architecture specializes the AI's functions, delegating specific types of tasks to designated "expert" modules; this is also why DeepSeek activates only about 37 billion parameters during inference.
In contrast to the MoE model, the Dense model is a traditional deep neural network architecture. The design philosophy of the Dense model is very straightforward—every neuron (or computational unit) participates in every computation. Regardless of the difficulty of the task, every parameter in the Dense model is involved in each computation. This allows the Dense model to perform relatively stably when handling simpler tasks, but it struggles with complex problems.
Because the Dense model does not intelligently select suitable computational units the way MoE does, all of its parameters must be computed and updated at every training step, leading to enormous computational and storage demands. The Dense model's computational cost is therefore relatively high, and its efficiency drops significantly on large-scale datasets or complex tasks.
In simple terms, the Dense architecture is like throwing every parameter at every problem: great force, little precision.
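To make the contrast concrete, here is a toy PyTorch sketch of a top-k routed MoE layer, with invented sizes; real systems such as DeepSeek's hybrid MoE add shared experts, load-balancing losses, and expert parallelism. A Dense layer of the same total size would run all eight expert-sized FFNs for every token, whereas this layer activates only k of them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer with top-k routing."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)  # router: token -> expert scores
        self.k = k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.gate(x)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only k of n_experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```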
Earlier I noted that the Dense architecture seems superior judging by results, yet from an architectural perspective MoE should be better. So why does the Dense-architecture QwQ surpass the MoE-architecture DeepSeek?
Looking at hardware demands: the MoE architecture benefits from distributing tasks among its expert models, giving it powerful computational reasoning, but it places extremely high demands on hardware, requiring very strong parallel computing capability. The Dense architecture is the opposite: even running on CPU + system memory, its efficiency does not drop significantly.
MoE is like a pampered star student, while Dense is like an easy-to-raise child.
In summary, a Dense architecture lets small-parameter models improve quality, while MoE lets large-parameter models improve efficiency.
Beyond the small parameter count compensating for the Dense architecture's shortcomings, the more important factor behind QwQ is its training method.
Technical Details#
The title of the QwQ release page reads: Experience the Power of Reinforcement Learning.
The Qwen team explains: Recent research shows that reinforcement learning can significantly enhance the model's reasoning capabilities. For example, DeepSeek R1 achieves state-of-the-art performance by integrating cold start data and multi-stage training, enabling deep thinking and complex reasoning.
Since technical reports on QwQ are still scarce, and its training builds on DeepSeek's approach, we can draw on some of DeepSeek's training methods to interpret why QwQ performs so well.
The training of QwQ-32B is divided into three stages: pre-training, supervised fine-tuning, and reinforcement learning, with the reinforcement learning further divided into two key stages; this is what allows QwQ to surpass DeepSeek.
To understand why, let's take it step by step: what is a "cold start"? What is reinforcement learning? And what are the two key stages?
"Cold Start"
To help everyone better understand cold start, let me give a daily life example.
Big Data Recommendation Algorithm
Recommendation algorithms suggest videos, products, etc., based on each user's preferences.
However, newly registered accounts lack previously accumulated data, making it impossible for the system to make accurate recommendations.
This is "cold start."
When large models are in a cold start, they behave like a "child who knows nothing," making constant mistakes and generating a bunch of illogical answers, and may even fall into meaningless loops.
Using "cold start data," in the early stages of AI training, a small batch of high-quality reasoning data is used to fine-tune the model, akin to providing AI with a "beginner's guide."
DeepSeek optimized the cold start step as follows:
- Generating data from large models – researchers use few-shot prompting.
- Generating data from DeepSeek-R1-Zero – since R1-Zero already has some reasoning capability, researchers select readable reasoning outputs from it and reorganize them into cold start data.
- Manual screening and optimization – Some data is manually reviewed and optimized for clarity and intuitiveness in the reasoning process.
Ultimately, DeepSeek-R1 used thousands of cold start samples for initial fine-tuning before proceeding to reinforcement learning.
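As a minimal sketch of what this initial fine-tuning stage could look like, here is a cold-start supervised fine-tuning loop using Hugging Face Transformers. The checkpoint name, dataset file, field names, and hyperparameters are all illustrative assumptions, not DeepSeek's or Qwen's actual setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("base-model")   # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained("base-model")

# cold_start.jsonl: a few thousand curated, readable reasoning traces
data = load_dataset("json", data_files="cold_start.jsonl", split="train")

def tokenize(example):
    text = example["prompt"] + example["reasoning"] + example["answer"]
    return tokenizer(text, truncation=True, max_length=2048)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-cold-start",
                           num_train_epochs=2,
                           per_device_train_batch_size=4),
    train_dataset=data,
    # mlm=False gives the standard causal-LM objective (labels = input_ids)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # the fine-tuned checkpoint then seeds the RL stage
```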
Reinforcement Learning#
Reinforcement learning is a third paradigm of machine learning, standing alongside the two main others: supervised learning and unsupervised learning.
The characteristics of reinforcement learning can be summarized in four points:
- There is no supervisor, only a reward signal.
- Feedback is delayed rather than immediate.
- It is sequential in nature; time matters.
- The agent's behavior affects subsequent data.
Four Basic Elements
A reinforcement learning system generally includes four elements: policy, reward, value, and model (of the environment). Let's introduce these four elements one by one.
Policy
The policy defines the behavior the agent takes for a given state; in other words, it is a mapping from states to actions. In fact, the state includes both the environment state and the agent state. We summarize the characteristics of the policy as follows:
- The policy defines the agent's behavior.
- It is a mapping from states to actions.
- The policy itself can be a deterministic mapping or a probability distribution over actions.
Reward
The reward signal defines the goal of the reinforcement learning problem. At each time step, the scalar value sent by the environment to the reinforcement learning agent is the reward. We summarize the characteristics of the reward as follows:
- The reward is a scalar feedback signal.
- It represents how well the agent performed at a certain step.
- The agent's task is to maximize the total reward accumulated over a period.
Value
Value, or value function, is a very important concept in reinforcement learning. Unlike the immediacy of rewards, the value function measures long-term returns. We often say, "One must be grounded while also gazing at the stars." Evaluating the value function is akin to "gazing at the stars," judging the returns of current actions from a long-term perspective, rather than just focusing on immediate rewards. We summarize the characteristics of the value function as follows:
- The value function predicts future rewards.
- It can assess the goodness of a state.
- The calculation of the value function requires analyzing transitions between states.
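To make "long-term return" precise, here is the standard textbook definition (generic RL notation, not anything specific to QwQ): the value of a state under a policy $\pi$ is the expected discounted sum of future rewards,

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right], \qquad 0 \le \gamma < 1,
$$

where the discount factor $\gamma$ controls how far ahead the agent "gazes at the stars": values near 0 favor immediate rewards, values near 1 weigh long-term returns.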
Environment (Model)
The model is the agent's simulation of the external environment: given a state and an action, it lets us predict the next state and the corresponding reward. We summarize the model's characteristics as follows:
- The model can predict the environment's next performance.
- The performance can be reflected by the predicted state and reward.
*(Figure: reinforcement learning architecture)*
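The following toy Q-learning loop ties the four elements together in runnable Python. The 5-state corridor environment and all hyperparameters are invented for illustration: the Q-table plays the role of the value function, epsilon-greedy selection is the policy, the environment emits the reward, and the `step` function acts as the model.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4; reaching state 4 ends the episode

def step(state, action):       # model: (state, action) -> (next_state, reward)
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)   # reward only at the goal (delayed feedback)

Q = [[0.0, 0.0] for _ in range(N_STATES)]       # value estimates per (state, action)
alpha, gamma, eps = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != GOAL:
        # policy: mapping from state to action (here epsilon-greedy over Q)
        a = random.randrange(2) if random.random() < eps \
            else max((0, 1), key=lambda act: Q[s][act])
        s2, r = step(s, a)
        # value update: propagate the delayed reward back to earlier states
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)   # Q comes to favor action 1 ("move right") in every non-terminal state
```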
Two Key Stages#
First Stage of Reinforcement Learning
This stage focuses on enhancing mathematical and programming abilities. Starting from the cold-start checkpoint, it uses an outcome-based, reward-driven reinforcement learning scaling approach.
For mathematical problems, training uses a dedicated accuracy validator rather than a traditional reward model.
For programming tasks, a code execution server evaluates whether generated code passes predefined test cases.
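A minimal sketch of such outcome-based rewards might look like the following; both validators are illustrative stand-ins (with sandboxing omitted), not the Qwen team's actual implementation:

```python
import subprocess

def math_reward(model_answer: str, reference: str) -> float:
    """Accuracy validator: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(candidate_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Execution-based validator: fraction of test cases the code passes."""
    passed = 0
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", "-c", candidate_code],
                input=stdin_data, capture_output=True,
                text=True, timeout=5,          # real systems sandbox this step
            )
            passed += result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            pass
    return passed / len(test_cases) if test_cases else 0.0

# Example: reward a program that doubles an integer read from stdin.
print(code_reward("print(int(input()) * 2)", [("3", "6"), ("10", "20")]))  # 1.0
```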
Second Stage of Reinforcement Learning
This stage emphasizes enhancing general capabilities.
The model is trained with a general reward model and rule-based validators.
Even with a small number of training steps, this significantly improves instruction following, alignment with human preferences, and agent performance, enhancing general capabilities without significantly degrading the mathematical and programming abilities gained in the first stage.
These techniques give the QwQ model remarkable efficiency, rivaling large-parameter MoE models, and bring computing power from the cloud to the edge (local devices). But why bring cloud workloads back to the edge at all, pursuing efficient models with low compute demands and developing local computing modes?
Data Privacy in the AI Era#
Each of us leaves a large amount of data on the internet every day, and much of it touches on privacy. In the AI era, protecting personal data has become a significant challenge.
Apple has long focused on user data privacy, and its emphasis on protecting user data is such that even experts at Huawei have written thousands-of-word analyses of Apple's data privacy protection strategy.
Individual users cannot own large computing clusters locally, so for now the only way for them to use large-model features is to connect to the cloud, and that poses a significant challenge to data privacy.
For example, some regions have banned Tesla cars over concerns that road data would be processed overseas. This raises a series of sensitive issues: we worry that data fed into AI could be transmitted for illicit purposes and harm national interests. Protecting data security is therefore crucial.
Running large models on the edge can reduce costs and enhance privacy data protection, thus requiring models to be more efficient.
If a model is efficient enough, it can accomplish more and achieve higher precision on the edge under the same computing power.
One of the future directions for AI development should be higher efficiency and lower edge computing power demands, like QwQ.
Computing Power Demands in the AI Era#
Thanks to the development of AI technology, we can now run AI models locally and even compute with just a CPU + memory. The open-source project EXO shares computing power by distributing a model's layers across different devices over the network, further reducing the cost of running AI.
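The core idea can be sketched in a few lines: split a model's layers across networked devices in proportion to their memory, then pass activations along the pipeline. This is a conceptual illustration only, not EXO's actual API or protocol:

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    memory_gb: float

def partition_layers(n_layers: int, devices: list[Device]) -> dict[str, range]:
    """Assign contiguous layer ranges proportional to each device's memory."""
    total = sum(d.memory_gb for d in devices)
    assignment, start = {}, 0
    for i, d in enumerate(devices):
        # the last device takes the remainder to avoid rounding gaps
        count = n_layers - start if i == len(devices) - 1 \
            else round(n_layers * d.memory_gb / total)
        assignment[d.name] = range(start, start + count)
        start += count
    return assignment

# Example: a 32-layer model shared by a laptop, a desktop, and a phone.
devices = [Device("laptop", 16), Device("desktop", 24), Device("phone", 8)]
print(partition_layers(32, devices))
# {'laptop': range(0, 11), 'desktop': range(11, 27), 'phone': range(27, 32)}
```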
However, this only makes operation possible; it does not achieve efficient inference or AI training. Accomplishing those still requires extremely high computing power.
As mentioned earlier, no large model or training method escapes its dependence on "computing power." Whether it is the powerful DeepSeek or the efficient QwQ, each relies on compute resources, be it a 700 GB memory requirement, a consumer-grade 4090 graphics card, or even plain CPU computation.
Global Electricity Demand Data Comparison#
| Year | Global Electricity Demand | Major Changes and Trends |
| --- | --- | --- |
| 2019 | About 25,000 TWh | China's electricity demand grew about 5%, with total consumption around 7.28–7.41 trillion kWh. |
| 2024 | About 30,000 TWh | Growth mainly driven by data centers and AI training. Electricity consumption by Google's and Microsoft's data centers reached 24 TWh, double the 2019 level. |
Global AI Computing Power Demand Data Comparison#
| Year | Global AI Computing Power | Major Changes and Trends |
| --- | --- | --- |
| 2019 | About 10¹⁸ FLOP/s | AI compute mainly provided by traditional GPUs and TPUs; growth relatively steady, not yet in an explosive phase. |
| 2024 | About 10²¹ FLOP/s | AI compute has risen sharply: global machine-learning hardware performance grows about 43% per year, top hardware efficiency doubles every 1.9 years, Google holds the equivalent of over 1 million H100s, and global NVIDIA-based compute doubles on average every 10 months. |
Expected Funding Investments from Major Companies in 2025#
| Company | AI Computing Power (Equivalent H100) | Funding Expenditure | Major Changes and Trends |
| --- | --- | --- | --- |
| Google | Over 1 million | $75 billion (approx. 540 billion yuan) | For AI infrastructure. |
| Microsoft | 750,000–900,000 | $80 billion (approx. 576 billion yuan) | For AI data center construction, mainly training AI models and deploying AI applications. |
| Alibaba | 230,000 (NVIDIA GPUs) | 150 billion yuan | Mainly for AI and cloud computing infrastructure. |
| Tencent | 230,000 (NVIDIA GPUs) | 82 billion yuan | For intelligent computing centers, mainly GPU servers and network construction. |
It is evident that with the explosive development of AI, the global demand for computing power has reached unprecedented heights, and electricity demand has also increased significantly. Major manufacturers are racing to occupy the computing power high ground, highlighting the importance of computing power in the AI era.
At the same time, there has even been a notion that "computing power = national power"!
This article is synchronized and updated to xLog by Mix Space. The original link is https://fmcf.cc/posts/technology/AIComputingPower