At NVIDIA GTC 2026, the most talked-about topic isn’t just the Vera Rubin NVL72, but the new inference paradigm formed by pairing it with Groq 3 LPX. AI infrastructure is shifting from a single, GPU-driven computing model to a heterogeneous architecture built around division of labor.
Groq 3 LPX is positioned as an accelerator specialized for low-latency inference, complementing the Rubin GPU. In traditional architectures, GPUs must handle both long-context input processing and token-by-token generation simultaneously. As model sizes and context lengths rapidly expand, this integrated design increasingly faces efficiency bottlenecks.
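The bottleneck described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only (the hidden size and data type are assumptions, not Rubin or LPX specifications): it compares the arithmetic intensity of a prefill matmul over many tokens with a single-token decode matmul, ignoring activation and KV-cache traffic for simplicity.

```python
# Back-of-the-envelope: why decode stresses memory while prefill stresses compute.
# Hidden size and precision are illustrative assumptions, not real chip specs.
D = 4096       # hidden size (illustrative)
BYTES = 2      # bytes per fp16 weight

def arithmetic_intensity(n_tokens):
    """FLOPs performed per byte of weights read for an (n, D) @ (D, D) matmul."""
    flops = 2 * n_tokens * D * D          # multiply-accumulate count
    weight_bytes = BYTES * D * D          # weight matrix streamed once
    return flops / weight_bytes

print(arithmetic_intensity(2048))  # prefill over 2048 tokens: 2048.0 FLOPs/byte
print(arithmetic_intensity(1))     # decode of 1 token:           1.0 FLOPs/byte
```

Prefill reuses each weight across every token in the batch, so GPUs run compute-bound; decode reads the same weights to produce a single token, so it is dominated by memory traffic, which is exactly the regime an SRAM-heavy accelerator targets.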
NVIDIA has therefore broken down the inference process, allowing the Rubin GPU to focus on high-throughput preprocessing and attention calculations, while LPX handles the decoding stage that requires the most immediate response, in particular feed-forward network (FFN) and MoE expert computations. This strategy drove NVIDIA’s acquisition of Groq last year for about $20 billion in cash. Groq’s core product is the LPU (Language Processing Unit), an architecture designed specifically for AI inference that offers ultra-low latency, stable response times, and high energy efficiency, making it particularly suitable for real-time conversations, voice assistants, and similar scenarios.
GPU Collaboration with LPU for Inference
This design, called “Disaggregated Inference,” splits the stages of inference that previously ran on a single processor, distributing them across a GPU and an LPU working in concert.
In practice, the model first builds context and KV cache on the GPU. During each token generation cycle, the GPU handles attention, then passes intermediate results to the LPX for FFN calculations, and finally the GPU combines the outputs. This division of labor allows each processing unit to handle what it does best, greatly improving overall efficiency.
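The per-token cycle above can be sketched as a toy pipeline. This is a minimal illustration, not NVIDIA’s implementation: the single-head attention, ReLU MLP, shapes, and function names are all invented for clarity, and the GPU/LPU split is simulated by ordinary function calls.

```python
# Toy sketch of one disaggregated decode step: attention on the "GPU" side
# (which owns the KV cache), FFN on the "LPU" side. Shapes are illustrative.
import numpy as np

D = 8                                        # hidden size (toy)
rng = np.random.default_rng(0)
W_up = rng.standard_normal((D, 4 * D))       # FFN expansion weights
W_down = rng.standard_normal((4 * D, D))     # FFN projection weights

def gpu_attention(x, kv_cache):
    """GPU side: append the new token to the KV cache and attend over it."""
    kv_cache.append(x)
    keys = np.stack(kv_cache)                # (seq_len, D)
    scores = keys @ x / np.sqrt(D)           # dot-product attention scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys                    # attention output, shape (D,)

def lpu_ffn(x):
    """LPU side: the feed-forward block, the latency-critical decode work."""
    return np.maximum(x @ W_up, 0.0) @ W_down

def decode_step(x, kv_cache):
    attn_out = gpu_attention(x, kv_cache)    # 1) GPU computes attention
    ffn_out = lpu_ffn(attn_out)              # 2) intermediate handed to LPU
    return attn_out + ffn_out                # 3) GPU combines (residual add)

kv_cache = []
token = rng.standard_normal(D)
for _ in range(3):                           # generate three tokens
    token = decode_step(token, kv_cache)
print(token.shape)  # (8,)
```

In a real system the handoff at step 2 is a chip-to-chip transfer rather than a function call, which is why the interconnect bandwidth between the GPU and LPX matters so much for this design.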
NVIDIA’s acquisition of Groq integrates its LPU into LPX
The core of LPX lies in its LPU architecture. Unlike GPUs that rely on dynamic scheduling and high-bandwidth external memory, LPUs emphasize predictability through a design that allows the compiler to directly control computation and data flow, reducing latency variability. Its SRAM-first architecture keeps critical data on-chip as much as possible, minimizing memory access uncertainties and stabilizing token generation times. This feature is crucial for real-time interactive AI applications, where latency directly impacts user experience.
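The contrast with dynamic scheduling can be illustrated with a toy model of compile-time scheduling. Everything here is hypothetical (the operation names and cycle counts are invented): the point is only that when every operation has a fixed, known cost, total latency is a single exact number rather than a distribution.

```python
# Toy model of a statically scheduled program: each op has a fixed cycle
# cost known to the compiler, so end-to-end latency is exact by construction.
# Op names and cycle counts are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    cycles: int  # fixed cost: no cache misses, no dynamic scheduling

# A "compiled" program is just a static sequence of ops.
program = [
    Op("load_weights_sram", 4),
    Op("matmul", 16),
    Op("relu", 1),
    Op("matmul", 16),
    Op("store", 2),
]

def compile_time_latency(prog):
    """Total latency, computable before the program ever runs."""
    return sum(op.cycles for op in prog)

print(compile_time_latency(program))  # 39
```

On a GPU, by contrast, cache behavior and dynamic warp scheduling make per-token latency vary from run to run, which is the variability the LPU design aims to eliminate.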
LPX Rack Specifications Revealed: Comprising 256 LPUs
In terms of hardware scale, an LPX rack consists of 256 LPUs, offering extremely high on-chip memory bandwidth and inter-chip communication capabilities optimized for low-latency inference. Compared to the Rubin GPU’s high FLOPS and large memory capacity, LPX functions more like a dedicated “last mile” engine, responsible for converting model outputs into real-time usable results.
This article, NVIDIA GTC 2026 | Analyzing NVIDIA’s $10 billion Groq Acquisition Strategy and How LPX Will Change Inference Processes, first appeared on Chain News ABMedia.