SemiAnalysis GTC In-Depth Analysis: Behind Three New Systems, NVIDIA Is Redefining the Boundaries of AI Infrastructure
At GTC 2026, NVIDIA unveiled three new systems simultaneously—Groq LPX inference rack, Vera ETL256 CPU rack, and STX storage reference architecture—expanding its product portfolio from GPU compute cores to low-latency inference, CPU orchestration, and storage infrastructure, marking NVIDIA’s systematic redefinition of AI infrastructure boundaries.
The most attention-grabbing is the Groq LPX system: NVIDIA’s first product from the deal in which it paid $20 billion to license Groq’s intellectual property and absorb its core team, shipped less than four months after closing.
The LPX rack integrates Groq’s LP30 chips with NVIDIA GPUs and introduces “Attention and Feedforward Network Disaggregation” (AFD) technology, specifically optimizing decoding latency in high-interaction inference scenarios, opening previously unavailable optimization pathways for large-scale inference systems.
Meanwhile, Vera ETL256 packs 256 CPUs into a single liquid-cooled rack, achieving full interconnectivity within the rack via copper cabling topology, directly addressing the CPU supply bottleneck that becomes more prominent as AI scale expands; STX extends NVIDIA’s control from compute and networking layers to storage infrastructure through a standardized storage reference architecture.
SemiAnalysis believes these three systems signal a strategic shift: NVIDIA is no longer just a GPU supplier but evolving into a full-stack AI infrastructure platform provider. Its reach now covers inference optimization, CPU density, and storage orchestration—areas previously dominated by other vendors—deeply impacting the competitive landscape of the entire AI hardware supply chain.
LPX and LP30: Groq Architecture Fully Integrated into NVIDIA Inference Stack
The deal between NVIDIA and Groq is structured as an IP licensing and talent acquisition, not a traditional acquisition. NVIDIA thus almost immediately gained all of Groq’s IP and core team, and within less than four months, launched the LP30 chip and LPX system based on Groq’s third-generation LPU architecture.
LP30 is fabricated with Samsung’s SF4 process, equipped with 500MB on-chip SRAM, delivering 1.2 PFLOPS of FP8 compute, a significant upgrade from Groq’s first-generation LPU (230MB SRAM, 750 TFLOPS INT8). The performance boost is mainly driven by the process node migration from GF16 to SF4.
LP30 exists as a single monolithic die, eliminating the need for advanced packaging. Notably, the SF4 process does not compete for NVIDIA’s limited TSMC N3 capacity nor consume the scarce HBM resources, making the LPX system a true incremental capacity and revenue addition. According to SemiAnalysis, this is a differentiated advantage competitors cannot replicate.
Core Value and Inherent Limitations of LPU
The LPU architecture’s competitive edge lies in high-bandwidth SRAM and deterministic pipeline execution, letting it generate the first token at speeds GPUs cannot match in low-latency, single-user scenarios. The high-density SRAM comes with a capacity limitation, however: after the weights are loaded, little space remains, and as batch size grows the KV cache quickly saturates it, leaving overall throughput significantly below that of GPUs.
SemiAnalysis notes that standalone LPU systems are not cost-effective for large-scale token services, but in latency-sensitive scenarios, they can command significant premiums—forming the basis of LPU’s role in decoupled decoding systems.
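The capacity pressure described above can be made concrete with a back-of-envelope calculation. All model-side numbers below (a 70B-parameter model at FP8, 160 KB of KV cache per token, a 512-chip shard) are illustrative assumptions, not published LP30 figures; only the 500 MB of per-chip SRAM comes from the article.

```python
SRAM_PER_CHIP = 500e6      # bytes of on-chip SRAM per LP30 (from the article)
NUM_CHIPS = 512            # assumed shard width: 32 trays x 16 chips

WEIGHTS_BYTES = 70e9       # hypothetical 70B-param model at FP8 (1 byte/param)
KV_PER_TOKEN = 160e3       # hypothetical KV-cache bytes per token, all layers

weights_per_chip = WEIGHTS_BYTES / NUM_CHIPS     # ~137 MB of the 500 MB
free_sram = SRAM_PER_CHIP - weights_per_chip     # ~363 MB left for KV cache
kv_per_token_per_chip = KV_PER_TOKEN / NUM_CHIPS

max_tokens = free_sram / kv_per_token_per_chip   # total batch x context budget
max_batch = int(max_tokens / 128_000)            # concurrent 128K-token sequences
print(max_batch)  # -> 9: the cache saturates at single-digit batch sizes
```

Under these assumptions a full rack runs out of KV-cache room at roughly nine concurrent 128K-token sequences, whereas a GPU pool with hundreds of gigabytes of HBM sustains far larger batches, which is why standalone LPU throughput trails GPUs.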
AFD Technology: Role Division Between GPU and LPU
AFD splits the attention and feedforward-network (FFN) computations of large-model inference across different hardware. Attention involves dynamically loading the KV cache, making it naturally suited to GPUs; the FFN, being stateless and statically schedulable, aligns well with the LPU’s deterministic architecture.
Under this framework, GPUs handle attention calculations, leveraging full HBM capacity for KV cache, increasing the total number of tokens processed concurrently; LPUs handle FFN computations, exploiting their low-latency advantage. Token distribution and aggregation between GPU and LPU are managed via All-to-All collective communication, with ping-pong pipelining to hide communication latency.
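The overlap that ping-pong pipelining buys can be sketched with a toy timing model. The stage times below are invented for illustration; the point is only that, with two micro-batches alternating between GPU and LPU, steady-state cost is set by the slowest stage rather than the sum of all stages, hiding the All-to-All transfers.

```python
ATTN = 3.0   # GPU attention time per micro-batch, ms (assumed)
FFN  = 2.0   # LPU FFN time per micro-batch, ms (assumed)
COMM = 1.0   # All-to-All transfer time each way, ms (assumed)

def serial_ms(n_microbatches):
    """No overlap: every micro-batch pays attention + transfer + FFN + transfer."""
    return n_microbatches * (ATTN + COMM + FFN + COMM)

def ping_pong_ms(n_microbatches):
    """Overlapped: after the pipeline fills, micro-batches stream through at the
    rate of the slowest stage, so the communication legs are fully hidden."""
    fill = ATTN + COMM + FFN + COMM       # first micro-batch's end-to-end latency
    steady = max(ATTN, FFN, COMM)         # per-micro-batch cost thereafter
    return fill + (n_microbatches - 1) * steady

print(serial_ms(8), ping_pong_ms(8))  # -> 56.0 28.0
```

Even this crude model halves total time for eight micro-batches; the first micro-batch still pays the full latency, which is the pipeline-fill cost real systems also incur.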
Additionally, LPUs can be used for speculative decoding, hosting draft models or multi-token-prediction (MTP) layers to further reduce decoding latency, often increasing output tokens per decoding step by 1.5 to 2 times.
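The 1.5–2x figure is consistent with a simple acceptance model of speculative decoding. The sketch below is generic (it does not describe NVIDIA's implementation) and assumes each of k draft tokens is independently accepted with a fixed probability, with one target-model token always emitted per step.

```python
import random

def tokens_per_step(accept_prob=0.5, k=4, steps=10_000, seed=0):
    """Monte-Carlo estimate of tokens emitted per decoding step when a draft
    model proposes k tokens and verification stops at the first rejection.
    The accepted prefix plus one target-model token is kept each step."""
    rng = random.Random(seed)
    total = 0
    for _ in range(steps):
        accepted = 0
        while accepted < k and rng.random() < accept_prob:
            accepted += 1
        total += accepted + 1      # accepted draft tokens + 1 guaranteed token
    return total / steps

print(f"{tokens_per_step():.2f} tokens per step")  # ~1.94 vs. 1.0 without drafting
```

With a 50% per-token acceptance rate the expected yield is 1 + 0.5 + 0.25 + 0.125 + 0.0625 ≈ 1.94 tokens per step, squarely in the range the article cites; higher-quality draft models push it further.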
LPX Rack Architecture
The LPX rack consists of 32 1U LPU compute trays and 2 Spectrum-X switches. Each compute tray hosts 16 LP30 chips, 2 Altera FPGAs (NVIDIA calls these “Fabric Expansion Logic”), 1 Intel Granite Rapids host CPU, and 1 BlueField-4 front-end module.
The FPGAs serve several critical functions: converting the LPU’s C2C protocol to Ethernet for Spectrum-X network access, bridging PCIe between the LPUs and the host CPU, and providing up to 256GB of DDR5 memory per module for KV-cache storage. Aggregate bandwidth across the rack reaches approximately 640 TB/s.
LPU modules are mounted “face-to-face” on PCB sides, with 8 modules on top and bottom, minimizing interconnect length in X and Y directions. The 16 LPU modules within a node are interconnected via a mesh topology, with inter-node connections via copper backplane cables and cross-rack links through front panel OSFP ports.
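The rack-level numbers above imply some simple per-chip figures. The sketch below only divides the article's totals; the article does not specify how the 640 TB/s aggregate is measured (SRAM, fabric, or both), so the even per-chip split is an assumption.

```python
TRAYS = 32                    # 1U LPU compute trays per LPX rack (from the article)
LP30_PER_TRAY = 16
chips = TRAYS * LP30_PER_TRAY          # -> 512 LP30 chips per rack

AGG_BW = 640e12                        # B/s aggregate bandwidth (from the article)
per_chip_bw = AGG_BW / chips           # -> 1.25e12 B/s (1.25 TB/s) per chip, if
                                       #    the aggregate divides evenly (assumption)
print(chips, per_chip_bw)
```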
Vera ETL256: The Density Limit of 256 CPUs
As AI workloads increasingly demand data preprocessing, orchestration, and reinforcement learning validation, CPUs are becoming a new bottleneck limiting GPU utilization. Reinforcement learning scenarios are especially demanding—requiring CPUs to run simulations, execute code, and verify outputs in parallel. The rapid expansion of GPU scale outpaces CPU growth, necessitating larger CPU clusters to keep GPUs fully utilized.
NVIDIA’s solution is Vera ETL256, integrating 256 Vera CPUs into a single rack cooled by liquid cooling.
This system’s design follows the NVL compute rack concept: increasing compute density to the point where copper cabling can cover the entire rack, eliminating the need for optical transceivers at the backbone network level. The cost savings from copper cabling offset the additional expense of liquid cooling.
Specifically, the Vera ETL rack comprises 32 compute trays, with 16 on top and 16 on bottom, arranged symmetrically around four MGX ETL switches based on Spectrum-6. This layout deliberately compresses cable lengths between compute trays and backbone switches, ensuring all connections are within copper reach.
Each switch’s back-end ports carry the copper backbone traffic within the rack, while 32 front-facing OSFP ports connect to other nodes over fiber. The internal network uses Spectrum-X’s multi-plane topology, distributing 200 Gb/s channels across the four switches to achieve full Ethernet interconnection of all 256 CPUs within a single network layer; each compute tray hosts 8 Vera CPUs.
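The tray and link counts in this description are internally consistent, which a few lines of arithmetic confirm. The per-CPU bandwidth figure is an assumption: the article gives 200 Gb/s channels and four planes but does not state how many channels each CPU drives, so one channel per CPU per plane is assumed below.

```python
CPUS = 256
CPUS_PER_TRAY = 8
TRAYS = CPUS // CPUS_PER_TRAY        # -> 32 compute trays, matching the article
PLANES = 4                           # MGX ETL switches (Spectrum-6 based)
CHANNEL_GBPS = 200

# Assumption: each CPU drives one 200 Gb/s channel into each switch plane.
links_per_plane = CPUS               # copper backbone links terminating per switch
per_cpu_gbps = PLANES * CHANNEL_GBPS # -> 800 Gb/s aggregate per CPU (assumed)
print(TRAYS, links_per_plane, per_cpu_gbps)
```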
STX: NVIDIA’s Systematic Extension into Storage Layer
STX is a storage reference rack architecture announced at GTC 2026, complementing the earlier CMX context storage platform, forming NVIDIA’s comprehensive layout for penetrating storage infrastructure.
Building on CMX, STX further defines a reference architecture specifying the number of disks, Vera CPUs, BF-4 DPUs, CX-9 NICs, and Spectrum-X switches needed per cluster.
Each STX chassis contains 2 BF-4 units, 2 Vera CPUs, 4 CX-9 NICs, and 4 SOCAMM modules; a full rack of 16 chassis therefore totals 32 Vera CPUs, 64 CX-9 NICs, and 64 SOCAMMs.
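The chassis-to-rack totals quoted above check out; a tiny script makes the multiplication explicit (part names are taken directly from the article).

```python
CHASSIS_PER_RACK = 16
PER_CHASSIS = {"BF-4": 2, "Vera CPU": 2, "CX-9 NIC": 4, "SOCAMM": 4}

# Scale each per-chassis count up to the full 16-chassis rack.
rack_totals = {part: count * CHASSIS_PER_RACK for part, count in PER_CHASSIS.items()}
print(rack_totals)  # Vera CPU: 32, CX-9 NIC: 64, SOCAMM: 64, matching the article
```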
NVIDIA also named several major storage vendors—DDN, Dell Technologies, HPE, IBM, NetApp, Supermicro, and VAST Data—indicating these will support the STX standard, continuing its practice of industry endorsement to strengthen the reference architecture’s authority.
SemiAnalysis notes that the combination of BlueField-4, CMX, and STX signifies that, after establishing dominance in the compute (GPU) and networking (Spectrum-X and NVLink) layers, NVIDIA is systematically advancing into storage, software, and infrastructure operations layers.
These three new systems collectively broaden NVIDIA’s product moat, implying a growing market share in the AI infrastructure supply chain will increasingly concentrate around NVIDIA.
Risk Warning and Disclaimers
Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account individual users’ investment objectives, financial situations, or needs. Users should consider whether any opinions, views, or conclusions herein suit their particular circumstances. Invest at your own risk.