By QuietWord @Adobe Stock

Bryce Elder of the Financial Times reports that DeepSeek’s $5 million cost claim is disputed, as it likely excludes other expenses. The company’s V3 model uses a Mixture-of-Experts (MoE) approach to reduce training costs, but analysts argue the innovation isn’t revolutionary, with similar efficiencies seen in other models. Elder writes:

For anyone wanting to train an LLM on analyst responses to DeepSeek, the Temu of ChatGPTs, this post is a one-stop shop. We’ve grabbed all relevant sellside emails in our inbox and copy-pasted them with minimal intervention.

Backed by the High-Flyer VC fund, DeepSeek is a two-year-old, Hangzhou-based spinout of a Zhejiang University startup that trades equities using machine learning. Its stated goal is to make an artificial general intelligence for the fun of it, not for the money. There’s a good interview on ChinaTalk with founder Liang Wenfeng, and mainFT has this excellent overview from our colleagues Eleanor Olcott and Zijing Wu. […]

It’s Chinese, though, so people are suspicious. Here’s Citi’s Atif Malik:

“While DeepSeek’s achievement could be groundbreaking, we question the notion that its feats were done without the use of advanced GPUs to fine tune it and/or build the underlying LLMs the final model is based on through the Distillation technique. While the dominance of the US companies on the most advanced AI models could be potentially challenged, that said, we estimate that in an inevitably more restrictive environment, US’ access to more advanced chips is an advantage. Thus, we don’t expect leading AI companies would move away from more advanced GPUs which provide more attractive $/TFLOPs at scale. We see the recent AI capex announcements like Stargate as a nod to the need for advanced chips.”
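For context on the “Distillation technique” Malik refers to: distillation generally means training a smaller student model to match a larger teacher model’s output distribution rather than just the hard labels. Below is a minimal, generic sketch of that loss in PyTorch; the temperature, mixing weight, and toy tensors are illustrative assumptions, and nothing here reflects how DeepSeek actually trained its models.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (imitate the teacher) with ordinary cross-entropy.
    T (temperature) and alpha (mixing weight) are arbitrary illustrative values."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # standard T^2 scaling from Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy example: 4 samples, 10-class vocabulary
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)             # would come from the frozen larger model
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```

In an LLM setting the teacher’s logits would be per-token next-word distributions produced by a larger frozen model.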

Others, such as Bernstein’s Stacy A. Rasgon and team, also question the cost and efficiency estimates. The Bernstein team says today’s panic is about a “fundamental misunderstanding over the $5mn number” and the way in which DeepSeek has deployed smaller models distilled from the full-fat one, R1.

“It seems categorically false that ‘China duplicated OpenAI for $5M’ and we don’t think it really bears further discussion,” Bernstein says:

“Did DeepSeek really ‘build OpenAI for $5M?’ Of course not… There are actually two model families in discussion. The first family is DeepSeek-V3, a Mixture-of-Experts (MoE) large language model which, through a number of optimizations and clever techniques, can provide similar or better performance vs other large foundational models but requires a small fraction of the compute resources to train. DeepSeek actually used a cluster of 2048 NVIDIA H800 GPUs training for ~2 months (a total of ~2.7M GPU hours for pre-training and ~2.8M GPU hours including post-training). The oft-quoted ‘$5M’ number is calculated by assuming a $2/GPU-hour rental price for this infrastructure, which is fine, but not really what they did, and does not include all the other costs associated with prior research and experiments on architectures, algorithms, or data. The second family is DeepSeek R1, which uses Reinforcement Learning (RL) and other innovations applied to the V3 base model to greatly improve performance in reasoning, competing favorably with OpenAI’s o1 reasoning model and others (it is this model that seems to be causing most of the angst as a result). DeepSeek’s R1 paper did not quantify the additional resources that were required to develop the R1 model (presumably they were substantial as well).”
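Running Bernstein’s quoted numbers makes the point concrete: the headline figure is just a notional rental bill for the quoted GPU hours, nothing more. A quick back-of-the-envelope check (the $2/GPU-hour rate is the assumption cited in the note, not a measured cost):

```python
# Back-of-the-envelope check of the "$5M" figure using Bernstein's quoted numbers.
gpus = 2048                        # NVIDIA H800 cluster size cited for DeepSeek-V3
gpu_hours_pretrain = 2.7e6         # ~2.7M GPU hours for pre-training
gpu_hours_total = 2.8e6            # ~2.8M GPU hours including post-training
rental_rate = 2.0                  # assumed $/GPU-hour rental price

print(f"Pre-training only:   ${gpu_hours_pretrain * rental_rate / 1e6:.1f}M")      # ~$5.4M
print(f"Incl. post-training: ${gpu_hours_total * rental_rate / 1e6:.1f}M")          # ~$5.6M
print(f"Implied wall-clock:  {gpu_hours_total / gpus / 24:.0f} days on 2048 GPUs")  # ~57 days (~2 months)
```

None of that captures prior research and experimentation on architectures, algorithms, or data, nor the unquantified compute behind R1’s reinforcement-learning stage, which is the note’s point.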

To that end, investments are still accelerating. Right on top of all the DeepSeek newsflow last week, we got META substantially increasing its capex for the year. We got the Stargate announcement. And China announced a trillion-yuan (~$140B) AI spending plan. We are still going to need, and get, a lot of chips…
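As a closing technical aside on the Mixture-of-Experts design that Bernstein describes V3 as using: in an MoE layer, a learned router sends each token to only a few “expert” sub-networks, so most of the layer’s parameters sit idle on any given token, and that sparsity is where the training-compute savings come from. Below is a minimal, generic top-k MoE sketch in PyTorch; the layer sizes, expert count, and top-k value are arbitrary illustrative choices and bear no relation to DeepSeek-V3’s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts layer (illustrative only, not DeepSeek-V3's design)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(16, 64)                 # 16 tokens
print(TinyMoELayer()(x).shape)          # torch.Size([16, 64]); only 2 of 8 experts run per token
```

In this toy layer only 2 of the 8 experts run for each token, so per-token compute is roughly a quarter of a dense layer with the same total parameter count; the same principle, applied at far larger scale, underlies the efficiency claims for V3.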

Read more here.