DeepSeek V4 Released: 1.6T-Parameter Flagship Supports Roughly 1 Million Tokens of Context, Inference Compute Only 27% of V3.2

According to Beating Monitoring, DeepSeek has released an open-source preview of the V4 series under the MIT license, with weights now available on Hugging Face and ModelScope. The series includes two MoE models: V4-Pro, with 1.6 trillion total parameters and 49 billion parameters active per token, and V4-Flash, with 284 billion total parameters and 13 billion parameters active per token. Both support a context window of roughly 1 million tokens.
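For readers unfamiliar with the total-versus-active parameter distinction in MoE models, the toy sketch below illustrates top-k expert routing: every token passes through only a few experts, so the per-token "active" parameters are a small slice of the total. The expert count, dimensions, and top_k here are illustrative assumptions, not DeepSeek V4's actual configuration.

```python
# Minimal MoE sketch: per-token top-k routing over a pool of experts.
# Sizes and top_k are illustrative assumptions only.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed by only top_k experts, so the parameters
        # actually used per token are far fewer than the model's total.
        for t in range(x.shape[0]):
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[e](x[t])
        return out

x = torch.randn(4, 64)   # 4 tokens
y = TinyMoE()(x)         # only 2 of the 8 experts run for each token
```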

The release lists three architectural upgrades. A hybrid attention mechanism (compressed sparse attention, CSA, plus heavily compressed attention, HCA) sharply reduces long-context overhead: at a 1-million-token context, V4-Pro needs only 27% of V3.2's FLOPs per token, and its KV cache (the memory that stores historical keys and values during inference) is only 10% the size of V3.2's. A manifold-constrained hyper-connection scheme (mHC) replaces traditional residual connections, stabilizing cross-layer signal propagation. Training is accelerated with the Muon optimizer, and the pretraining corpus exceeds 32 trillion tokens.
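To make the KV-cache claim concrete, here is back-of-the-envelope arithmetic for cache size at a 1-million-token context. The layer count, KV-head count, head dimension, and byte width are assumptions chosen only for illustration; the ~10% figure comes from the article, where it is stated relative to V3.2 rather than to this generic dense baseline.

```python
# Back-of-the-envelope KV-cache arithmetic at a 1M-token context.
# All architectural numbers below are illustrative assumptions.
ctx_len      = 1_000_000   # tokens of context
n_layers     = 60          # assumed
n_kv_heads   = 8           # assumed
head_dim     = 128         # assumed
bytes_per_el = 1           # assumed (e.g. FP8 storage)

# Dense-style baseline: K and V cached for every token, layer, and KV head.
dense_cache = 2 * ctx_len * n_layers * n_kv_heads * head_dim * bytes_per_el
compressed  = 0.10 * dense_cache   # article: V4-Pro's cache ~10% of V3.2's

print(f"dense-style cache : {dense_cache / 2**30:.1f} GiB")   # ~114 GiB
print(f"compressed cache  : {compressed  / 2**30:.1f} GiB")   # ~11 GiB
```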

Post-training has two phases: domain-expert models are first trained separately with SFT and GRPO reinforcement learning, then unified into a single model via online distillation. V4-Pro-Max (the highest reasoning mode) is claimed to be the strongest open-source model currently available, with top-tier coding benchmark results and a significantly narrowed gap to closed-source frontier models on reasoning and agent tasks. V4-Flash-Max, given a sufficient reasoning budget, approaches Pro-level reasoning performance but is limited on pure knowledge and complex agent tasks by its parameter scale. Weights are stored in mixed FP4+FP8 precision.
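As a rough illustration of the second phase, the sketch below shows a generic online-distillation step, in which a student model is trained to match a teacher's token distribution on sequences the student itself generates. The generate() call, the KL objective, and the temperature are assumptions for the sketch; the article does not describe DeepSeek's actual distillation recipe for merging the domain experts.

```python
# Generic online-distillation step (illustrative only, not DeepSeek's recipe):
# the student generates on-policy sequences, the teacher scores them, and the
# student is updated to match the teacher's output distribution.
import torch
import torch.nn.functional as F

def online_distill_step(student, teacher, prompts, optimizer, temperature=1.0):
    with torch.no_grad():
        sequences = student.generate(prompts)      # assumed generation API
        teacher_logits = teacher(sequences)        # (batch, seq, vocab)
    student_logits = student(sequences)

    # KL(teacher || student) over the vocabulary at every position.
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```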
