Unveiling DeepSeek: A Story of More Extreme Chinese Technological Idealism

Author: Yu Lili; Source: An Yong Waves (“Undercurrent”)

Among China’s seven major large-model startups, DeepSeek is the most low-key, yet it always manages to be remembered in unexpected ways.

A year ago, the surprise came from the quantitative private fund giant behind it, High-Flyer (幻方, literally “Magic Square”), which was the only company outside the big tech firms to have stockpiled 10,000 A100 chips. A year later, the surprise was that DeepSeek turned out to be the trigger of China’s large-model price war.

In May, a month saturated with AI news, DeepSeek shot to fame. The reason was its release of an open-source model, DeepSeek V2, that offered unprecedented cost-effectiveness: the inference cost was cut to only 1 yuan per million tokens, roughly one-seventh of Llama3 70B and one-seventieth of GPT-4 Turbo.
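The ratios quoted above can be turned into a rough price comparison. The sketch below is only arithmetic on the article’s figures: the implied rival prices are derived from the “one-seventh” and “one-seventieth” ratios rather than taken from any price list, and the 50-million-token workload is a hypothetical example.

```python
# Figures quoted in the article: 1 yuan per million tokens for DeepSeek V2,
# about one-seventh of Llama3 70B and one-seventieth of GPT-4 Turbo.
DEEPSEEK_V2_YUAN_PER_M = 1.0
LLAMA3_70B_MULTIPLE = 7    # rival price as a multiple of DeepSeek V2's
GPT4_TURBO_MULTIPLE = 70


def implied_price(base_price: float, multiple: int) -> float:
    """Price per million tokens implied by a 'one-Nth' comparison."""
    return base_price * multiple


def cost_for_tokens(price_per_million: float, tokens: int) -> float:
    """Total cost in yuan for a given number of tokens."""
    return price_per_million * tokens / 1_000_000


llama_price = implied_price(DEEPSEEK_V2_YUAN_PER_M, LLAMA3_70B_MULTIPLE)
gpt4_price = implied_price(DEEPSEEK_V2_YUAN_PER_M, GPT4_TURBO_MULTIPLE)

# Hypothetical workload, chosen only for illustration.
WORKLOAD_TOKENS = 50_000_000

for name, price in [
    ("DeepSeek V2", DEEPSEEK_V2_YUAN_PER_M),
    ("Llama3 70B (implied)", llama_price),
    ("GPT-4 Turbo (implied)", gpt4_price),
]:
    print(f"{name}: {cost_for_tokens(price, WORKLOAD_TOKENS):.0f} yuan")
```

Under these assumptions, the same workload costs 50 yuan at DeepSeek V2’s price, against an implied 350 yuan and 3,500 yuan for the two comparison models.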

While DeepSeek was quickly dubbed the “Pinduoduo of the AI industry,” ByteDance, Tencent, Baidu, Alibaba, and other big tech companies could not hold back and cut their prices one after another. China’s large-model price war was thus set off.

**The smoke of that war obscures one fact: unlike many big tech companies that burn money on subsidies, DeepSeek is profitable.**

Behind this lies DeepSeek’s thorough innovation in model architecture. It proposed a new MLA (multi-head latent attention) architecture that cuts memory usage to 5%-13% of that of the commonly used MHA (multi-head attention) architecture. Its original DeepSeekMoE sparse structure likewise pushes the computational load to a minimum. Together, these are what ultimately bring the cost down.
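To give a feel for where a memory reduction of that order can come from, here is a back-of-the-envelope sketch. It compares the per-token, per-layer cache of standard multi-head attention (one key and one value vector per head) with caching a single compressed latent vector per token, which is the general idea behind latent attention. All dimensions below are hypothetical and are not DeepSeek V2’s actual configuration, and the sketch ignores every other detail of the real MLA design.

```python
def mha_cache_elems_per_token(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def latent_cache_elems_per_token(latent_dim: int) -> int:
    """A latent-style cache stores one compressed vector per token."""
    return latent_dim


def cache_ratio(n_heads: int, head_dim: int, latent_dim: int) -> float:
    """Latent cache size as a fraction of the MHA cache size."""
    latent = latent_cache_elems_per_token(latent_dim)
    return latent / mha_cache_elems_per_token(n_heads, head_dim)


# Hypothetical dimensions, chosen only for illustration.
N_HEADS, HEAD_DIM = 32, 128

for latent_dim in (512, 1024):
    ratio = cache_ratio(N_HEADS, HEAD_DIM, latent_dim)
    print(f"latent_dim={latent_dim}: {ratio:.2%} of the MHA cache")
```

With these made-up sizes the compressed cache is 6.25% and 12.5% of the MHA cache, which happens to fall within the 5%-13% range the article cites. The real ratio depends on the actual model dimensions.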

In Silicon Valley, DeepSeek is known as the “mysterious power from the East.” SemiAnalysis’ chief analyst believes that the DeepSeek V2 paper “may be the best paper of the year.” Former OpenAI employee Andrew Carr believes the paper is “full of astonishing wisdom” and has applied its training settings to his own model. Jack Clark, former policy director of OpenAI and co-founder of Anthropic, believes that DeepSeek “employs a group of highly sophisticated geniuses” and also believes that large models made in China “will become an undeniable force, just like drones and electric cars.”

Amid an AI wave driven mainly by Silicon Valley, this is rare. Several industry insiders told us that the strong response stems from innovation at the architectural level, something rarely attempted by Chinese large-model companies or even by open-source foundation models worldwide. One AI researcher said that in the many years since the Attention architecture was proposed, it has almost never been successfully modified, let alone validated at scale. “This is the kind of idea that gets killed at the decision stage, because most people lack confidence.”

Chinese large-model companies have rarely attempted architectural innovation, partly because of a widely held assumption that few have tried to challenge: the United States excels at 0-to-1 technological innovation, while China excels at 1-to-10 application innovation. It also looks very uneconomical. A new generation of models will arrive within months anyway, so Chinese companies need only follow and focus on applications. Innovating on model structure means there is no established path, many failures along the way, and heavy costs in time and money.

DeepSeek is clearly going against the current. Amid the chorus that large-model technology will inevitably converge and that following is the smarter shortcut, DeepSeek prizes what is accumulated along the “detour” and believes that Chinese large-model entrepreneurs can join the global tide of technological innovation, not just application innovation.

Many of DeepSeek’s choices set it apart. So far, among China’s seven large-model startups, it is the only one that has given up the “have it both ways” route, focusing on research and technology and building no consumer-facing (toC) applications. It is also the only one that has not fully weighed commercialization, has firmly chosen the open-source route, and has never even raised funding. As a result it is often left off the card table, yet at the other end, users in the community often spread the word about it on their own.

How exactly was DeepSeek created? We interviewed Liang Wenfeng, the founder of DeepSeek, who rarely appears in public.

This founder, born in the 1980s, has been quietly working on technology behind the scenes since the High-Flyer era. In the DeepSeek era he keeps the same low-key style and, like every researcher, spends his days “reading papers, writing code, and taking part in group discussions.”

Many quantitative fund founders have worked at overseas hedge funds and mostly come from physics, mathematics, and similar disciplines. Liang Wenfeng differs from them in that his background has always been local: in his early years he studied artificial intelligence in the Department of Electronic Engineering at Zhejiang University.

Several industry professionals and DeepSeek researchers told us that Liang Wenfeng is a rare figure in China’s AI industry, someone with “strong infrastructure engineering and model research abilities, as well as the ability to mobilize resources,” who “can make precise judgments from a high level and still outdo frontline researchers on details.” He has a “terrifying ability to learn” while being “nothing like a boss and much more like a geek.”

This is an especially rare interview. In it, this technological idealist offers a voice that is currently scarce in China’s technology community: he is one of the few who put a “sense of right and wrong” ahead of a “sense of gain and loss,” and he reminds us to recognize the inertia of the times and put “original innovation” on the agenda.

A year ago, when DeepSeek had just appeared, we interviewed Liang Wenfeng for the first time, in “Crazy High-Flyer: An Invisible AI Giant’s Road to Large Models.” If the line “Be sure to embrace ambition madly, and be madly sincere” was still just a fine slogan then, a year later it has become action.

The following is the interview.

How was the first shot of the price war fired?

**“Undercurrent”: After DeepSeek V2 was released, it quickly set off a fierce large-model price war, and some people call you the industry’s catfish.**

Liang Wenfeng: We did not set out to be a catfish; we just became one by accident.

**“Undercurrent”: Did this result surprise you?**

Liang Wenfeng: Very much. I did not expect everyone to be so sensitive to price. We simply do things at our own pace and set prices by accounting for cost. Our principle is neither to subsidize nor to make excessive profits. This price leaves a small profit above cost.

**“Undercurrent”: Five days later, Zhipu AI followed, and then ByteDance, Alibaba, Baidu, Tencent, and other big tech companies did too.**

Liang Wenfeng: What Zhipu AI cut was the price of an entry-level product; its models at our level are still very expensive. ByteDance was the first to truly follow. It cut its flagship model to the same price as ours, which prompted the other big companies to cut prices as well. Because the big companies’ model costs are much higher than ours, we did not expect anyone to do this at a loss, and in the end it turned into the money-burning subsidy logic of the Internet era.

**“Undercurrent”: From the outside, price cuts look very much like a grab for users, as price wars in the Internet era usually were.**

Liang Wenfeng: Grabbing users is not our main purpose. We cut prices partly because our costs came down as we explored the structure of the next-generation model, and partly because we believe that both APIs and AI itself should be affordable and accessible to everyone.

**“Undercurrent”: Before this, most Chinese companies simply copied the current generation’s Llama structure to build applications. Why did you start from the model structure?**

Liang Wenfeng: If the goal is to build applications, sticking with the Llama structure and shipping products quickly is a reasonable choice. But our destination is AGI, which means we need to research new model structures to achieve stronger model capabilities with limited resources. This is part of the basic research needed to scale up to larger models. Beyond model structure, we have done a lot of other research, including how to construct data and how to make the model more human-like, all of which is reflected in the models we have released. In addition, we estimate that the Llama structure is already two generations behind the advanced level abroad in training efficiency and inference cost.

**“Undercurrent”: Where does this gap mainly come from?**

Liang Wenfeng: First, there is a gap in training efficiency. We estimate that the best domestic level may trail the best level abroad by a factor of two in model structure and training dynamics, which alone means we need twice the computing power to reach the same result. There may also be a twofold gap in data efficiency, meaning we need twice the training data and computing power to reach the same result. Combined, that is four times the computing power. What we need to do is keep narrowing these gaps.
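The compounding in this estimate can be written out explicitly. The following sketch only restates the arithmetic in the answer above, under the assumption (implied by the “combined” figure) that the two gaps multiply rather than add; the function name is illustrative.

```python
def combined_compute_multiplier(structure_gap: float, data_gap: float) -> float:
    """Extra compute needed when independent efficiency gaps compound."""
    return structure_gap * data_gap


# Estimates quoted in the interview: a twofold gap on each axis.
STRUCTURE_AND_DYNAMICS_GAP = 2.0
DATA_EFFICIENCY_GAP = 2.0

total = combined_compute_multiplier(STRUCTURE_AND_DYNAMICS_GAP, DATA_EFFICIENCY_GAP)
print(f"Compute needed for the same result: {total:.0f}x")

# Closing one gap entirely would halve the total requirement.
after_closing_one = combined_compute_multiplier(1.0, DATA_EFFICIENCY_GAP)
print(f"After closing the structure gap: {after_closing_one:.0f}x")
```

Because the gaps multiply, narrowing either one pays off proportionally on the total.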

**“Undercurrent”: Most Chinese companies choose to do both models and applications. Why does DeepSeek currently choose to do only research and exploration?**

Liang Wenfeng: Because we believe the most important thing right now is to take part in the global wave of innovation. For many years, Chinese companies have been used to others doing the technological innovation while we take it and monetize it through applications. But that cannot be taken for granted. In this wave, our starting point is not to seize the chance to make a quick profit, but to reach the technological frontier and push the whole ecosystem forward.

**“Undercurrent”: The prevailing perception left by the Internet and mobile Internet eras is that the United States excels at technological innovation, while China is better at applications.**

Liang Wenfeng: We believe that as the economy develops, China should gradually become a contributor instead of always free-riding. In the IT wave of the past 30 years, we basically did not take part in real technological innovation. We have grown used to Moore’s Law falling from the sky, as if better hardware and software would simply appear after 18 months of lying at home. The Scaling Law is being treated the same way.

But in fact, this was created through the tireless efforts of generations in the Western-led technology community. Because we did not take part in that process, we overlooked its existence.

The real difference is not one or two years, but the difference between originality and imitation.

**“Undercurrent”: Why did DeepSeek V2 surprise so many people in Silicon Valley?**

Liang Wenfeng: Among the many innovations that happen in the United States every day, this is a very ordinary one. They are surprised because it is a Chinese company joining their game as a contributor of innovation. After all, most Chinese companies are used to following rather than innovating.

**“Undercurrent”: But in the Chinese context, this choice is something of a luxury. Large models are a heavy-investment game, and not every company has the capital to focus only on research and innovation rather than considering commercialization first.**

Liang Wenfeng: The cost of innovation is certainly not low, and the old habit of simply taking what others created had to do with national conditions at the time. But now, whether you look at the size of China’s economy or at the profits of companies like ByteDance and Tencent, neither is low by global standards. What we lack for innovation is definitely not capital. It is confidence, and the knowledge of how to organize a high density of talent to innovate effectively.

**“Undercurrent”: Why do Chinese companies, including big tech companies with plenty of money, so readily treat rapid commercialization as the top priority?**

Liang Wenfeng: For the past 30 years we have emphasized only making money and neglected innovation. Innovation is not driven purely by business; it also requires curiosity and a desire to create. We are merely bound by the inertia of the past, and that is temporary.

**“Undercurrent”: But you are a commercial organization, not a public-interest research institution. If you choose to innovate and then share the results through open source, where does your moat come from? Won’t innovations like the MLA architecture in May be copied by other companies very quickly?**

Liang Wenfeng: In the face of disruptive technology, the moat formed by closed source is short-lived. Even though OpenAI is closed source, it cannot stop others from catching up. So we anchor our value in the team. Our colleagues grow through this process and accumulate a great deal of know-how, forming an organization and culture capable of innovation. That is our moat.

Open-sourcing and publishing papers do not actually cost us anything. For technical people, being followed brings a great sense of achievement. In fact, open source is more a cultural act than a commercial one. To give is itself an additional honor, and a company that does so gains cultural appeal.

**“Undercurrent”: What do you think of market-first views such as Zhu Xiaohu’s?**

Liang Wenfeng: Zhu Xiaohu is self-consistent, but his approach suits companies that make quick money. If you look at the most profitable companies in the United States, they are all high-tech companies that built up deep foundations over a long time before breaking out.

**“Undercurrent”: But for large models, a technical lead alone can hardly become an absolute advantage. What is the bigger thing you are betting on?**

Liang Wenfeng: What we see is that Chinese AI cannot remain a follower forever. We often say there is a one- or two-year gap between Chinese AI and the United States, but the real gap is the difference between originality and imitation. If this does not change, China will always be a follower, so some exploration is unavoidable.

NVIDIA’s lead is not just the work of one company, but the result of the combined efforts of the entire Western technology community and industry. They can see the next generation of technology trends and have a roadmap in hand. The development of Chinese AI needs such an ecosystem too. Many domestic chips fail to take off precisely because they lack a supporting technology community and have only second-hand information. So China inevitably needs someone to stand at the technological frontier.

More investment does not necessarily lead to more innovation

**“Undercurrent”: DeepSeek now has something of early OpenAI’s idealism, and it is open source too. Will you go closed-source later? Both OpenAI and Mistral went through a shift from open source to closed source.**

Liang Wenfeng: We will not go closed-source. We believe that having a strong technological ecosystem first is more important.

**“Undercurrent”: Do you have financing plans? Media reports say High-Flyer plans to spin off DeepSeek and list it independently, and AI startups in Silicon Valley have all ended up tied to big tech companies in the end.**

Liang Wenfeng: We have no financing plans in the short term. The problem we face has never been money, but the embargo on high-end chips.

**“Undercurrent”: Many people believe that building AGI and doing quantitative investing are completely different things. Quantitative investing can be done quietly, but AGI may call for a higher profile and for alliances, which would let you invest more.**

Liang Wenfeng: More investment does not necessarily produce more innovation. Otherwise the big tech companies would have monopolized all innovation.

**“Undercurrent”: Is the reason you are not building applications right now that you lack the DNA for operations?**

Liang Wenfeng: We believe the current stage is an explosion of technological innovation, not of applications. In the long run, we hope to form an ecosystem in which the industry uses our technology and output directly. We would be responsible only for foundation models and frontier innovation, and other companies would build toB and toC businesses on top of DeepSeek. If a complete industrial chain forms upstream and downstream, there is no need for us to build applications ourselves. Of course, if needed, nothing prevents us from building applications, but research and technological innovation will always be our top priority.

**“Undercurrent”: But when choosing an API, why would someone choose DeepSeek rather than a big tech company?**

Liang Wenfeng: The world of the future will most likely be one of specialized division of labor. Foundation models require continuous innovation, and big tech companies have limits to their capabilities, so they may not be the best fit.

**“Undercurrent”: Can technology really open up a gap? You have also said that there are no absolute technical secrets.**

Liang Wenfeng: There are no secrets in technology, but replicating it takes time and money. In theory, NVIDIA’s graphics cards hold no technical secrets and are easy to copy, but assembling a team from scratch and then catching up with the next generation of technology takes time, so the actual moat is still wide.

**“Undercurrent”: After you cut prices, ByteDance was the first to follow, which suggests they do feel some kind of threat. What do you make of new ways for startups to compete with big companies?**

Liang Wenfeng: To be honest, we do not care much about this; it was just something we did along the way. Providing cloud services is not our main goal. Our goal is still to achieve AGI.

So far we have not seen any new approach, but the big companies do not have a clear advantage either. They have existing users, but their cash-flow businesses are also their burden, and that makes them liable to be disrupted at any time.

**“Undercurrent”: How do you see the endgame for the six large-model startups other than DeepSeek?**

Liang Wenfeng: Maybe 2 to 3 will survive. They are all still in the stage of burning money, so those with clear self-positioning and more refined operations have a better chance of surviving. Other companies may undergo a transformation. Valuable things will not disappear, but they will change in some way.

**“Undercurrent”: In the High-Flyer era, your attitude toward competition was described as “going your own way,” with little attention to comparisons with peers. What is the starting point of your thinking about competition?**

Liang Wenfeng: What I often think about is whether something makes society run more efficiently, and whether you can find a position you are good at within its industrial division of labor. As long as the end result makes society more efficient, it is justified. Much of what lies in between is transitional, and paying too much attention to it will only leave you dazzled.

A group of young people doing “mysterious” things

**“Undercurrent”: Jack Clark, former policy director of OpenAI and co-founder of Anthropic, believes that DeepSeek hired a group of “mysterious and talented geniuses.” What kind of people built DeepSeek V2?**

Liang Wenfeng: There are no mysterious geniuses here. They are recent graduates of top universities, PhD students in their fourth or fifth year doing internships, and young people who graduated only a few years ago.

**“Undercurrent”: Many large-model companies are bent on recruiting from overseas. Many people feel that the top 50 talents in this field may not be at Chinese companies. Where do your people come from?**

Liang Wenfeng: Nobody on the V2 model returned from overseas; they are all local. The top 50 talents may not be in China, but perhaps we can create such people ourselves.

**“Undercurrent”: How did the MLA innovation come about? I heard the idea started from a young researcher’s personal interest.**

Liang Wenfeng: After summarizing some mainstream changes in the Attention architecture, he had the idea to design an alternative solution. However, it was a long process from conception to implementation. We formed a team for this purpose and it took several months to get it up and running.

**“Undercurrent”: The emergence of this kind of divergent inspiration seems closely tied to your organization being entirely innovation-oriented. In the High-Flyer era, you rarely assigned goals or tasks from the top down. But AGI is frontier exploration full of uncertainty. Does it call for more management?**

Liang Wenfeng: DeepSeek is also entirely bottom-up. We generally do not assign work in advance, but rely on a natural division of labor. Everyone has their own growth experience and brings their own ideas, so there is no need to push them. During exploration, when someone runs into a problem, they pull in others to discuss it on their own. But when an idea shows potential, we do allocate resources from the top down.

**“Undercurrent”: I heard that DeepSeek is very flexible in how it allocates GPUs and people.**

Liang Wenfeng: There is no cap on the GPUs or people any of us can call on. If someone has an idea, they can use the training cluster’s GPUs at any time without approval. And because there are no hierarchies or cross-department barriers, anyone can flexibly bring in anyone else, as long as the other person is interested too.

**“Undercurrent”: A loose management style also depends on having selected a group of people driven by strong passion. I heard that you are good at spotting people through details, and can pick out people who excel on non-traditional criteria.**

Liang Wenfeng: Our criteria for selecting people have always been passion and curiosity, so many of them have had unusual experiences, which is very interesting. For many of them, the desire to do research far outweighs any concern for money.

**“Undercurrent”: Transformer was born in Google’s AI Lab, and ChatGPT was born at OpenAI. What do you think is the difference in the value of the innovation produced by a big company’s AI lab and by a startup?**

Liang Wenfeng: Google’s lab, OpenAI, and even the AI labs of China’s big companies are all very valuable. That OpenAI was the one to deliver in the end also involved historical chance.

**“Undercurrent”: Is innovation also largely a matter of chance? I noticed that the row of meeting rooms in the middle of your office area has doors on both sides that can be pushed open at will. Your colleagues say this is to leave room for chance. The birth of Transformer includes a story of someone who happened to walk by, overheard the discussion, joined in, and eventually helped turn it into a general framework.**

Liang Wenfeng: I think innovation is first and foremost a matter of belief. Why is Silicon Valley so innovative? First of all, because it dares to be. When ChatGPT came out, the whole country, from investors to big companies, lacked confidence in doing frontier innovation. Everyone felt the gap was too large and that it was better to stick to applications. But innovation requires confidence first, and that confidence is usually more evident in young people.

**“Undercurrent”: But you do not raise funds and rarely speak publicly, so your public presence is certainly smaller than that of companies active in fundraising. How do you make sure DeepSeek is the first choice for people who want to work on large models?**

Liang Wenfeng: Because we are doing the hardest thing. What attracts top talent most is the chance to solve the world’s hardest problems. In fact, top talent is underestimated in China. Because there is so little hardcore innovation at the level of society as a whole, these people have no chance to be recognized. We are doing the hardest thing, and that is what attracts them.

**“Undercurrent”: OpenAI’s recent release did not bring GPT-5 as many had expected, and many people feel the technology curve is clearly slowing. Many have also begun to question the Scaling Law. What do you think?**

Liang Wenfeng: We lean optimistic. The industry as a whole looks to be in line with expectations. OpenAI is not a god, and it cannot stay out in front forever.

**“Undercurrent”: How long do you think it will take to achieve AGI? Before releasing DeepSeek V2, you released code-generation and math models, and also switched from dense models to MoE. What are the waypoints on your AGI roadmap?**

Liang Wenfeng: It may take 2 years, 5 years, or 10 years, but it will be achieved in our lifetime. As for the roadmap, there is no unified view even within our company. But we have indeed bet on three directions: the first is mathematics and code, the second is multimodality, and the third is natural language itself. Mathematics and code are a natural testing ground for AGI, a bit like Go: a closed, verifiable system in which high intelligence might be reached through self-learning. On the other hand, multimodality, that is, learning by taking part in the real human world, may also be necessary for AGI. We remain open to all possibilities.

**“Undercurrent”: What do you think the endgame for large models looks like?**

Liang Wenfeng: There will be specialized companies providing basic models and basic services, with a long chain of professional division of labor. More people will meet the diverse needs of the whole society.

All routines are the product of the previous generation

**“Undercurrent”: Over the past year, China’s large-model startup scene has changed a great deal. For example, Wang Huiwen, who was very active at the start of last year, withdrew midway, and the companies that joined later have begun to diverge.**

Liang Wenfeng: Wang Huiwen took on all the losses himself and let everyone else walk away unharmed. He made the choice that was worst for himself and best for everyone else. He is a very decent person, and I admire that.

**“Undercurrent”: Where are you currently putting most of your energy?**

Liang Wenfeng: The main focus is on researching the next generation of large models. There are still many unresolved issues.

**“Undercurrent”: Several other large-model startups insist on doing both, since technology brings no permanent lead and it also matters to seize the time window and turn a technical advantage into products. Is DeepSeek daring to focus on model research because its model capabilities are not yet sufficient?**

Liang Wenfeng: All routines are products of the previous generation and may not hold in the future. Using the business logic of the Internet to discuss the future profit model of AI is like discussing General Electric and Coca-Cola when Ma Huateng was starting his business. It is quite likely a case of marking the boat to find the sword, that is, applying yesterday’s reference points to a situation that has moved on.

**“Undercurrent”: High-Flyer has had strong technology and innovation DNA, and its growth has been relatively smooth. Is that the reason for your optimism?**

Liang Wenfeng: To some extent, High-Flyer strengthened our confidence in technology-driven innovation, but it has not been a smooth road all the way. We went through a long period of accumulation. What outsiders see is only the part of High-Flyer after 2015, but in fact we have been at it for 16 years.

**“Undercurrent”: Back to original innovation. Now that the economy is heading into a downturn and capital is entering a cold cycle, will that hold original innovation back even more?**

Liang Wenfeng: I do not think so. The restructuring of China’s industries will rely more on hardcore technological innovation. Once many people realize that the quick money of the past probably came from the luck of the times, they will be more willing to buckle down and do real innovation.

**“Undercurrent”: So you are optimistic about this too?**

Liang Wenfeng: I grew up in the 1980s in a small city in Guangdong. My father was an elementary school teacher. In the 1990s there were many opportunities to make money in Guangdong, and quite a few parents came to our home, basically because they felt studying was useless. But looking back now, attitudes have changed. Money is no longer easy to make, and even the chance to drive a taxi may be gone. Things have changed within a single generation.

There will be more and more hardcore innovation in the future. It may not be easy to understand now because the entire social group needs to be educated by facts. When this society lets the hardcore innovators succeed, the collective thinking will change. We just need a lot of facts and a process.
