a16z Long Article: The Next Frontier of AI Is Not in Language, but in the Physical World—The Triple Flywheel of Robots, Autonomous Science, and Brain-Computer Interfaces
Author: Oliver Hsu (a16z)
Translation: Deep潮 TechFlow
Deep潮 Guide: This article by a16z researcher Oliver Hsu is the most systematic “Physical AI” investment map to appear since 2026. His judgment: the language/code mainline will keep scaling, but the next generation of genuinely disruptive capabilities is emerging in three adjacent fields—general-purpose robotics, autonomous science (“AI scientists”), and new human-machine interfaces such as brain-computer interfaces. The author dissects the five underlying capabilities supporting them and argues that these three fronts will form a mutually reinforcing structural flywheel. For anyone seeking to understand the investment logic of physical AI, this is currently the most comprehensive framework available.
Today’s dominant AI paradigm revolves around language and code organization. The scaling laws of large language models have been clearly mapped out, and the business flywheel of data, compute, and algorithm improvements is spinning, with each capability leap bringing significant and mostly visible returns. This paradigm justifies the capital and attention it attracts.
But another set of adjacent fields has already made substantial progress during their incubation period. These include VLA (Vision-Language-Action models), WAM (World Action Models), and other general robotics routes, as well as physical and scientific reasoning centered around “AI scientists,” and new interfaces that reshape human-computer interaction using AI advances (including brain-machine interfaces and neurotechnology). Beyond the technology itself, these directions are beginning to attract talent, capital, and founders. The fundamental primitives for extending AI into the physical world are maturing simultaneously, and the progress over the past 18 months indicates these fields will soon enter their respective scaling phases.
In any given technological paradigm, the areas with the greatest delta between current capability and medium-term potential tend to share two features: first, they can benefit from the same scaling dividends driving the current frontier; second, they sit just one step away from the mainstream paradigm—close enough to inherit its infrastructure and research momentum, yet distant enough to require substantial additional work. That gap plays a dual role: it forms a natural moat against fast followers, and it defines a sparser, less crowded problem space, raising the likelihood of emergent new capabilities—precisely because the shortcuts have not yet been explored.
Figure caption: Illustration of the relationship between the current AI paradigm (language/code) and adjacent frontier systems.
Today, three fields fit this description: robot learning, autonomous science (especially in materials and the life sciences), and new human-machine interfaces (including brain-computer interfaces, silent speech, neuro-wearables, and new sensory channels such as digital olfaction). They are not fully independent; they belong to the same overarching set of “frontier systems in the physical world.” They share a set of foundational primitives: learned representations of physical dynamics, architectures for embodied action, infrastructure for simulation and synthetic data, expanding sensory channels, and closed-loop agent orchestration. They reinforce one another through cross-domain feedback. And they are the most likely to produce qualitative breakthroughs—capabilities that emerge from the interaction of model scale, physical grounding, and new data modalities.
This article will outline the core primitives supporting these systems, explain why these three fields represent frontier opportunities, and propose that their mutual reinforcement forms a structural flywheel, propelling AI into the physical world.
Five Underlying Primitives
Before diving into specific applications, it’s essential to understand the shared technical foundations of these frontier systems. Advancing AI into the physical realm relies on five main primitives. These technologies are not exclusive to any single application domain; they are components—building blocks—that enable the creation of systems that extend AI into the physical world. Their synchronized maturation is what makes this moment particularly special.
Figure caption: The five foundational primitives supporting physical AI.
Primitive 1: Learning Representations of Physical Dynamics
The most fundamental primitive is the ability to learn a compressed, generalizable representation of physical behaviors—how objects move, deform, collide, and respond to forces. Without this layer, every physical AI system would have to learn physical laws from scratch, which is prohibitively costly.
Several architectural approaches are converging toward this goal from different directions. VLA models approach from the top: starting with pre-trained vision-language models—these models already possess semantic understanding of objects, spatial relations, and language—and adding an action decoder to output motion control commands. The key is that the enormous cost of learning “seeing” and “understanding the world” can be amortized via internet-scale pretraining on images and text. Projects like Physical Intelligence’s π₀, Google DeepMind’s Gemini Robotics, and NVIDIA’s GR00T N1 are validating this architecture at increasing scales.
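The VLA pattern described above can be sketched in a few lines. Everything here is invented for illustration (the shapes, the random "pretrained" weights, the class names); the point is the division of labor: a frozen backbone amortizes perception, and only a small action head is trained for control.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenVLMBackbone:
    """Stand-in for a pretrained vision-language model: maps an image
    and an instruction to a fused embedding. Weights are random here;
    in a real VLA they come from internet-scale pretraining."""
    def __init__(self, img_dim=768, txt_dim=512, emb_dim=256):
        self.W_img = rng.standard_normal((img_dim, emb_dim)) / np.sqrt(img_dim)
        self.W_txt = rng.standard_normal((txt_dim, emb_dim)) / np.sqrt(txt_dim)

    def encode(self, image_feat, text_feat):
        return np.tanh(image_feat @ self.W_img + text_feat @ self.W_txt)

class ActionDecoder:
    """The only newly trained part: a small head mapping the fused
    embedding to a chunk of continuous joint commands."""
    def __init__(self, emb_dim=256, horizon=16, action_dim=7):
        self.horizon, self.action_dim = horizon, action_dim
        self.W = rng.standard_normal((emb_dim, horizon * action_dim)) * 0.01

    def decode(self, emb):
        return (emb @ self.W).reshape(self.horizon, self.action_dim)

backbone = FrozenVLMBackbone()
decoder = ActionDecoder()

image = rng.standard_normal(768)   # placeholder camera features
text = rng.standard_normal(512)    # placeholder instruction embedding
actions = decoder.decode(backbone.encode(image, text))
print(actions.shape)               # (16, 7): 16 timesteps of 7-DoF commands
```

The design choice the sketch captures is the amortization argument in the text: the expensive part (the backbone) is reused across tasks, so the marginal cost of a new skill is only the action head.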
WAM models approach from the bottom: video diffusion transformers pretrained on internet-scale video inherit rich priors about physics—how objects fall, occlude, and interact under force—and couple those priors with action generation. NVIDIA’s DreamZero demonstrates zero-shot generalization to new tasks and environments with little adaptation data, enabling cross-embodiment transfer from human demonstrations and meaningful real-world generalization.
A third route, perhaps the most telling about where the field is headed, skips VLM and video-diffusion pretraining altogether. Generalist’s GEN-1 is trained from scratch as an embodied foundation model on over 500k hours of real physical-interaction data, collected mainly via low-cost wearable devices worn by humans performing everyday tasks. It is neither a standard VLA (there is no fine-tuned vision-language backbone) nor a WAM. It is a dedicated physical-interaction foundation model, learning not from internet images, text, or video, but from the statistical regularities of human-object contact.
Companies like World Labs working on spatial intelligence find this primitive valuable because it addresses a common shortcoming of VLA, WAM, and native embodied models: none explicitly model the 3D structure of the scene. VLA inherits 2D visual features from image-text pretraining; WAM learns dynamics from 2D projections of 3D scenes in videos; models trained on wearable sensor data capture forces and kinematics but not scene geometry. Spatial intelligence models can fill this gap—learning to reconstruct and generate complete 3D physical environments and reason about geometry, lighting, occlusion, object relations, and spatial layout.
The convergence of these routes is itself the point. Whether the representation is inherited from a VLM, learned through video pretraining, or built natively from physical interaction data, the underlying primitive is the same: a compressed, transferable model of physical behavior. The data flywheel these representations enable is enormous and largely untapped—beyond internet videos and robot trajectories, it includes the vast corpus of human bodily experience now being captured at scale through wearables. The same representation can serve a robot learning to fold towels, an autonomous lab predicting reaction outcomes, or a neural decoder interpreting motor-cortex signals.
Primitive 2: Embodied Action Architectures
Having a physical representation alone isn’t enough. To translate “understanding” into reliable physical actions, architectures must address several interconnected challenges: mapping high-level intent to continuous motion commands, maintaining consistency over long action sequences, operating under real-time latency constraints, and improving through experience.
Hierarchical dual-system architectures have become the standard for complex embodied tasks: a slow but powerful vision-language model handles scene understanding and task reasoning (System 2), paired with a fast, lightweight visuomotor policy for real-time control (System 1). NVIDIA’s GR00T N1, Gemini Robotics, and Figure’s Helix all adopt variants of this approach, resolving the fundamental tension between rich reasoning and millisecond-level control. Generalist takes a different route, employing “harmonic reasoning” to think and act simultaneously.
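The dual-system split can be illustrated with a toy control loop. The rates, dynamics, and function bodies are placeholders; what matters is the structure: an expensive planner refreshed occasionally, and a cheap policy run every tick.

```python
import numpy as np

rng = np.random.default_rng(1)

def system2_plan(observation):
    """Slow path (stand-in for a VLM): produces a latent goal.
    In real systems this may take hundreds of milliseconds."""
    return observation.mean() * np.ones(3)

def system1_act(observation, latent_goal):
    """Fast path: lightweight policy, cheap enough for high-rate control."""
    return 0.1 * (latent_goal - observation[:3])

PLAN_EVERY = 50                # re-plan once per 50 control ticks
state = rng.standard_normal(8)
latent_goal = None
trajectory = []
for tick in range(200):
    if tick % PLAN_EVERY == 0:                  # slow loop: 4 re-plans total
        latent_goal = system2_plan(state)
    action = system1_act(state, latent_goal)    # fast loop: every tick
    state[:3] += action                         # toy dynamics
    trajectory.append(action.copy())

print(len(trajectory))  # 200 fast-loop actions from only 4 slow re-plans
```

The asymmetry in call counts (200 policy steps versus 4 planner calls) is exactly the "rich reasoning vs. millisecond control" trade the text describes.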
Action generation mechanisms are evolving rapidly. The π₀ approach, based on flow matching and diffusion, has become mainstream for producing smooth, high-frequency continuous motions, replacing discrete tokenization borrowed from language modeling. These methods treat action generation as a denoising process akin to image synthesis, producing trajectories that are physically smoother and more robust to error accumulation, outperforming autoregressive token prediction.
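A minimal worked example of the flow-matching idea: a sample starts as noise and is carried to a target trajectory by integrating a velocity field along a straight-line probability path. Here the field is an oracle computed from a known target; in a real system a network predicts it from the observation embedding.

```python
import numpy as np

rng = np.random.default_rng(2)

HORIZON, DOF = 16, 7
# A smooth "expert" action trajectory standing in for training data.
target = np.cumsum(rng.standard_normal((HORIZON, DOF)) * 0.05, axis=0)

def velocity(x, t):
    """Oracle vector field for the straight-line path x_t = (1-t)*noise + t*target:
    v(x, t) = (target - x) / (1 - t). A trained model predicts this quantity."""
    return (target - x) / (1.0 - t)

x = rng.standard_normal((HORIZON, DOF))   # start from pure noise
steps = 10
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps        # Euler step toward the data

print(np.abs(x - target).max())           # ~0: the sample lands on the target
```

Because the whole trajectory chunk is denoised jointly, consecutive actions stay mutually consistent, which is the smoothness advantage over predicting one discrete token at a time.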
The most critical architectural advance may be extending reinforcement learning (RL) to pretrained VLA models—creating foundation models trained on demonstration data that continue to improve through autonomous practice, much as humans refine skills via repetition and self-correction. Physical Intelligence’s π*₀.₆ exemplifies this at scale. Its RECAP method (RL with Experience and Corrections via Advantage-conditioned Policies) addresses the credit-assignment problem in long sequences: if a robot slightly misgrips an espresso machine’s handle, the failure may only manifest several steps later. Imitation learning alone cannot attribute that failure to the earlier action; RL can. RECAP trains a value function that estimates success probability from any intermediate state and guides the VLA toward high-advantage actions. The key is integrating heterogeneous data—demonstrations, autonomous policy experience, and remote expert corrections—within a unified training pipeline.
This approach bodes well for RL’s future in action domains. π*₀.₆ can reliably fold 50 unseen clothing items, assemble boxes, and make espresso in real homes for hours without human intervention. On the hardest tasks, RECAP doubles throughput over pure imitation baselines and halves failure rates. It also demonstrates that post-training RL produces qualitative behavioral improvements—smoother recovery motions, more efficient grasping, and adaptive error correction not present in demonstration data.
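Physical Intelligence has published only a high-level description of RECAP, so the following is a generic sketch of value-guided, advantage-based action selection with toy dynamics and an invented value function; it is not the actual method.

```python
import numpy as np

rng = np.random.default_rng(3)
GOAL = np.zeros(3)

def value(state):
    """Stand-in for a learned value function: estimated probability of
    eventual task success from this state (here, a toy function of
    distance to a goal pose)."""
    return float(np.exp(-np.linalg.norm(state - GOAL)))

def propose_actions(k=8):
    """Stand-in for the policy proposing k candidate action chunks."""
    return rng.standard_normal((k, 3)) * 0.5

start = np.array([1.0, -0.5, 0.8])
state = start.copy()
for _ in range(20):
    baseline = value(state)
    candidates = propose_actions()
    # Advantage of a candidate = value after taking it minus the current
    # value, under toy deterministic dynamics next_state = state + action.
    advantages = [value(state + a) - baseline for a in candidates]
    best = int(np.argmax(advantages))
    if advantages[best] > 0:      # only commit to improving actions
        state = state + candidates[best]

print(value(state) >= value(start))  # True: estimated success never decreases
```

The sketch shows why a value function solves the credit-assignment problem in the espresso example: a bad grip lowers the estimated success probability immediately, even though the observable failure arrives steps later.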
These gains suggest that the compute-driven scaling that carried language models from GPT-2 to GPT-4 is beginning to operate in embodied domains—though at an earlier point on the curve, and under continuous, high-dimensional action spaces and unforgiving physical constraints.
Primitive 3: Simulation and Synthetic Data Infrastructure
In language, data issues are solved by the internet: trillions of tokens of naturally generated, freely available text. In the physical world, the problem is magnified many times over—this is now widely recognized. The most direct signal is the rapid rise of startups providing physical-world data solutions. Collecting real robot trajectories is costly, risky at scale, and limited in diversity. A language model can learn from billions of dialogues; a robot (for now) cannot have billions of physical interactions.
Simulation and synthetic data generation form the foundational infrastructure to address this constraint. Their maturation is a key reason why physical AI is accelerating now rather than five years ago.
Modern simulation stacks combine physics-based engines, photorealistic ray-traced rendering, procedural environment generation, and world models that generate photo-realistic videos from simulated inputs—bridging the sim-to-real gap. The pipeline starts with neural reconstruction of real environments (possible with just a smartphone), then creates physically accurate 3D assets, and finally produces large-scale annotated synthetic datasets.
Improvements in simulation infrastructure are changing the economic assumptions behind physical AI. If the bottleneck shifts from “collecting real data” to “designing diverse virtual environments,” costs plummet. Simulation scales with compute, not physical hardware or labor. This transformation in the economics of training physical AI systems is akin to how internet-scale text data revolutionized language models—investment in simulation infrastructure becomes a powerful lever for the entire ecosystem.
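The economics argument can be made concrete with a toy domain-randomization generator: each synthetic scene is a draw over physics and appearance parameters, so dataset size scales with compute alone. All parameter names and ranges here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def randomize_scene():
    """Draw one synthetic training scene: randomized physics and
    appearance knobs a simulator would vary so that policies do not
    overfit to a single environment."""
    return {
        "friction": rng.uniform(0.2, 1.2),
        "object_mass_kg": rng.lognormal(mean=-1.0, sigma=0.5),
        "light_intensity": rng.uniform(0.3, 1.0),
        "camera_jitter_rad": rng.normal(0.0, 0.02, size=3),
        "texture_id": int(rng.integers(0, 1000)),
    }

def generate_dataset(n):
    """Scaling is a function of compute, not hardware or labor:
    10x more scenes is just 10x more draws."""
    return [randomize_scene() for _ in range(n)]

data = generate_dataset(10_000)
print(len(data))  # 10000 scenes, generated in seconds
```

Compare this with collecting ten thousand real robot trajectories, which requires fleets, operators, and weeks; that cost asymmetry is the lever described above.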
But simulation isn’t just for robotics primitives. The same infrastructure supports autonomous science (digital twins of lab equipment, simulation environments for hypothesis testing), new interfaces (training BCI decoders in simulated neural environments, synthetic sensory data for calibration), and other AI-physical interactions. Simulation is the universal data engine for physical AI.
Primitive 4: Expanding Sensory Channels
Physical signals are far richer than vision and language. Tactile sensing conveys material properties, grip stability, and contact geometry—information invisible to cameras. Neural signals encode movement intent, cognitive states, and perceptual experience at bandwidths far beyond current human-machine interfaces. Subvocal muscle activity encodes speech intent before vocalization. The fourth primitive is AI’s rapid expansion into these previously inaccessible sensory modalities—driven not only by research but by an entire ecosystem building consumer devices, software, and infrastructure.
Figure caption: Expanding AI sensory channels—from AR, EMG to brain-machine interfaces.
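The claim that tactile sensing carries information invisible to cameras can be made concrete with a toy example; every number and threshold here is invented.

```python
import numpy as np

# Two grasps that look identical to a camera but differ in their
# high-frequency tactile "micro-slip" signal.
def vision_features(grasp):
    # Identical for both grasps: vision cannot tell them apart.
    return np.array([1.0, 0.2])

def tactile_features(grasp):
    # Shear-vibration energy: near zero for a stable grasp,
    # elevated when the object is slipping.
    return np.array([0.01 if grasp == "stable" else 0.35])

def grip_is_stable(grasp, use_tactile):
    feats = vision_features(grasp)
    if use_tactile:
        feats = np.concatenate([feats, tactile_features(grasp)])
        return bool(feats[-1] < 0.1)   # slip-energy threshold
    return True                        # vision alone sees no difference

print(grip_is_stable("slipping", use_tactile=False))  # True (misses the slip)
print(grip_is_stable("slipping", use_tactile=True))   # False (tactile catches it)
```

The point is not the threshold but the channel: without the tactile modality, no amount of modeling on the vision features can recover the slip signal, because it simply is not present in them.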
The most immediate indicator is the emergence of new device categories. AR headsets have significantly improved in recent years (with companies deploying them in consumer and industrial applications); voice-first wearables are enabling language AI to access richer physical context—truly following users into physical environments. Long-term, neural interfaces may unlock more complete interaction modalities. The shift in computational paradigms enabled by AI creates opportunities for substantial upgrades in human-machine interaction; companies like Sesame are developing new modalities and devices for this purpose.
Voice, the more mainstream modality, is also paving the way for emerging interaction methods. Products like Wispr Flow are promoting voice as a primary input (given its high information density and naturalness), warming the market for silent-speech interfaces. Silent-speech devices use sensors to capture tongue and vocal-tract movements, enabling recognition of unvoiced speech—a higher-bandwidth human-machine modality.
Brain-computer interfaces (both invasive and non-invasive) represent a deeper frontier, with a growing ecosystem of commercial efforts. A field that was once purely academic is now showing signals across clinical validation, regulatory approval, platform integration, and institutional funding. Neuralink has implanted devices in multiple patients, with its surgical robots and decoding algorithms iterating; Synchron’s intravascular Stentrode lets paralyzed users control digital and physical environments; Echo Neurotechnologies is developing a BCI for speech restoration based on high-resolution cortical decoding; and new startups like Nudge are attracting talent and capital to build novel neural interfaces and brain-interaction platforms. Research milestones include the BISC chip demonstrating 65,536-electrode wireless neural recording and BrainGate decoding inner speech directly from motor cortex.
The common thread across AR glasses, AI wearables, silent-speech devices, and invasive BCIs is not just that they are all interfaces, but that they form a spectrum of increasing bandwidth between human physical experience and AI systems—each point on it feeding the primitives behind the three main fields discussed below. A robot trained on millions of high-quality first-person videos from AR glasses learns different priors than one trained solely on teleoperation data; a lab AI responding to subvocal commands differs fundamentally from a keyboard-controlled system; a neural decoder trained on high-density BCI data produces motor plans impossible to obtain through any other channel.
The diffusion of these devices expands the effective data manifold for training frontier physical AI systems—and much of this expansion is driven by well-funded consumer companies, not just academia, meaning the data flywheel can accelerate alongside market adoption.
Primitive 5: Closed-Loop Agent Systems
The final primitive is more architectural: it refers to systems that integrate perception, reasoning, and action orchestration into continuous, autonomous, closed-loop operation over long durations.
In language models, this has manifested as the rise of agent systems—multi-step reasoning chains, tool use, self-correction—pushing models from single-turn Q&A tools to autonomous problem solvers. In the physical world, a similar transformation is underway, but with much higher demands. A language agent can backtrack at no cost; a physical agent that spills a reagent cannot undo it.
Physical agent systems are distinguished by three features. First, they must embed in experiments or operations as a closed loop—directly interfacing with raw instrument data streams, physical sensors, and execution primitives, grounding reasoning in physical reality rather than textual descriptions of it. Second, they require long-term persistence—memory, traceability, safety monitoring, recovery—linking many operational cycles rather than treating each task as an isolated episode. Third, they must adapt in closed loop—revising strategies based on physical outcomes, not just textual feedback.
This primitive fuses independent capabilities—world models, reliable action architectures, rich sensor suites—into fully autonomous physical systems. It is an integration layer; its maturity is a prerequisite for the three application domains to exist as deployable, real-world systems rather than isolated research demos.
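A skeleton of the three features in code: grounded sensing, a persistent trace with a safety monitor, and outcome-driven adaptation. The sensor model, safety envelope, and adaptation rule are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)

class PhysicalAgent:
    """Skeleton of a closed-loop physical agent: sense raw signals,
    act under a safety envelope, keep a full trace, and revise its
    strategy from measured outcomes."""
    def __init__(self):
        self.memory = []      # persistence: full trace of every cycle
        self.gain = 0.5       # strategy parameter, revised from outcomes

    def sense(self):
        # Feature 1: ground in raw sensor data, not text descriptions.
        return rng.standard_normal(3) * 0.1

    def safe(self, action):
        # Feature 2 includes a safety monitor: reject commands outside
        # the actuator envelope instead of executing blindly.
        return bool(np.all(np.abs(action) < 1.0))

    def step(self, target):
        reading = self.sense()
        action = self.gain * (target - reading)
        if not self.safe(action):
            action = np.clip(action, -1.0, 1.0)
        error = float(np.linalg.norm(target - reading))
        # Feature 3: adapt the strategy from the physical outcome.
        self.gain = min(1.0, self.gain * (1.05 if error > 0.5 else 0.95))
        self.memory.append({"error": error, "gain": self.gain})
        return action

agent = PhysicalAgent()
for _ in range(100):
    agent.step(target=np.array([0.4, -0.2, 0.1]))
print(len(agent.memory))  # 100 traced cycles
```

The irreversibility point in the text shows up as the safety check running before execution, not after: a physical agent cannot backtrack, so the monitor must gate the action itself.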
Three Domains
The primitives above are universal enablers; they do not specify where the most important applications will emerge. Many fields involve physical action, measurement, or perception. The distinction between “frontier systems” and “incremental improvements” lies in how strongly model capabilities and scaling infrastructure compound—not just better performance, but the emergence of capabilities that were previously impossible.
Robotics, AI-driven science, and new human-machine interfaces are the three fields with the strongest compounding effects. Each assembles the primitives in its own way, each is constrained by the current limits of those primitives, and each produces structured physical data as a byproduct—data that feeds back into improving the primitives, creating a feedback loop that accelerates the whole system. They are not the only frontier AI domains, but they are the most densely coupled to physical reality and the furthest from the current language/code paradigm—and thus the most fertile ground for new capabilities—while remaining complementary enough to the mainstream to benefit from its dividends.
Robotics
Robotics is the most literal embodiment of physical AI: an AI system that perceives, reasons, and exerts physical actions in the material world. It also serves as a stress test for each primitive.
Consider what it takes for a general-purpose robot to fold a towel. It needs learned representations of deformable materials under force—a physical prior beyond language pretraining. It needs an action architecture capable of translating high-level commands into continuous control at over 20Hz. It requires simulation-generated training data, since millions of real towel-folding demonstrations are unavailable. Tactile feedback is necessary to detect slips and adjust grip force, as vision alone cannot reliably distinguish a stable from a failing grasp. It needs a closed-loop controller that can recognize errors and recover, rather than blindly executing memorized trajectories.
Figure caption: Simultaneous invocation of the five primitives in robotic tasks.
This is why robotics is a frontier system, not just a mature engineering discipline with better tools. These primitives are not just improvements on existing robotic capabilities; they unlock classes of operations, motions, and interactions that were impossible outside narrow industrial settings.
Significant progress has been made in recent years—something we’ve covered before. The first wave of VLA models proved that foundation-model control is feasible. Architectural advances have connected high-level reasoning with low-level control. On-device inference is becoming practical, and cross-embodiment transfer means a single model can adapt to new robots with limited data. The remaining core challenge is scaling reliability: a roughly 95% per-step success rate compounds to about 60% over a 10-step task, which is insufficient for deployment; industrial standards demand far higher. RL-based post-training holds great promise for reaching the robustness thresholds real-world use requires.
These advances impact market structure. For decades, the value of robotics has been concentrated in hardware—mechanical systems remain central. But as learning strategies become more standardized, value shifts toward models, training infrastructure, and data flywheels. Robotics also feeds back into the primitives: each real-world trajectory improves world models; deployment failures reveal simulation gaps; testing new embodiments broadens the physical experience base for pretraining. Robots are both the most demanding consumers of primitives and a key source of feedback signals for their improvement.
Autonomous Science
If robots test primitives through “real-time physical actions,” autonomous science tests a different aspect—multi-step reasoning about complex causal physical systems over hours or days, interpreting results, contextualizing, and revising strategies.
Figure caption: How autonomous science (AI scientist) integrates the five primitives.
AI-driven science is the most thorough application of these primitives. An SDL (self-driving lab) must learn physical and chemical dynamics to predict experimental outcomes; it needs embodied actions to pipette, position samples, operate instruments; simulation for pre-screening experiments and optimizing scarce instrument time; expanded sensing—spectroscopy, chromatography, mass spectrometry, and emerging chemical and biological sensors—to characterize results. It demands closed-loop agent orchestration capable of maintaining multi-round “hypothesize-experiment-analyze-revise” workflows autonomously, with traceability, safety, and adaptive strategies.
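The hypothesize-experiment-analyze-revise workflow can be sketched as a toy closed-loop optimizer. The "experiment" here is a synthetic noisy yield curve and the revision rule is a simple shrinking search window; real SDLs use far richer surrogate models and schedulers.

```python
import numpy as np

rng = np.random.default_rng(6)

def run_experiment(temperature):
    """Stand-in for a physical experiment: noisy measured yield of a
    reaction as a function of temperature, with a true optimum at 340 K.
    (Entirely synthetic; a real SDL would drive actual instruments.)"""
    true_yield = np.exp(-((temperature - 340.0) / 25.0) ** 2)
    return true_yield + rng.normal(0, 0.02)

# Closed loop: hypothesize a condition, run the experiment, log and
# analyze the result, then revise the search strategy around the best.
lo, hi = 250.0, 450.0
history = []
for _ in range(12):
    temperature = rng.uniform(lo, hi)           # hypothesize
    measured = run_experiment(temperature)      # experiment
    history.append((temperature, measured))     # analyze + trace
    best_t, _ = max(history, key=lambda p: p[1])
    width = (hi - lo) * 0.8                     # revise: shrink the window
    lo, hi = best_t - width / 2, best_t + width / 2

best_t, best_y = max(history, key=lambda p: p[1])
print(f"best condition found so far: {best_t:.1f} K")
```

Note how the loop's byproduct is exactly the data engine described below: `history` is a structured, causal, empirically validated record of condition-outcome pairs, usable as training signal beyond the experiment itself.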
No other domain calls on these primitives as deeply—which is why autonomous science is a frontier “system,” not just better lab automation. Companies like Periodic Labs and Medra combine scientific reasoning with physical validation, iterating rapidly while generating experimental training data.
The value is intuitive: traditional materials discovery takes years; AI can accelerate this process dramatically. The key constraints are shifting from hypothesis generation (well-supported by foundational models) to manufacturing and validation (requiring physical instruments, robots, closed-loop optimization). SDL aims at this bottleneck.
Another defining feature of autonomous science—applicable across all physical systems—is its role as a data engine. Each experiment produces not just a scientific result but a physically grounded, validated training signal. A measurement of polymer crystallization under specific conditions enriches the world model; a verified synthesis route becomes training data for physical reasoning; a failed experiment reveals model inaccuracies. Data from real experiments are structured, causal, empirically validated—precisely what physical reasoning models need but cannot get from other sources. Autonomous science directly converts physical reality into structured knowledge, advancing the entire physical AI ecosystem.
New Interfaces
Robots extend AI into physical actions; autonomous science extends AI into physical research. New interfaces connect AI directly with human perception, sensory experience, and bodily signals—covering AR glasses, EMG wearables, and implanted brain-machine interfaces. The common function is expanding the bandwidth and modalities of human-AI channels—and generating direct human-world interaction data for building physical AI.
Figure caption: Spectrum of new interfaces—from AR glasses to brain-machine interfaces.
The challenge and the potential both lie in the gap from the current paradigm. Language models understand these modalities conceptually but have no native familiarity with silent-speech motion patterns, olfactory receptor geometries, or the temporal dynamics of EMG signals. Decoding these signals requires learning from the expanding sensory channels themselves. Many modalities lack internet-scale pretraining corpora; data often come only from the interfaces themselves, meaning the system and its training data must co-evolve—a dynamic with no direct analogue in language AI.
Primitive 5: Closed-Loop Intelligent Agent Systems
The last primitive is more architectural: systems that integrate perception, reasoning, and action orchestration into continuous, autonomous, closed-loop operation over extended periods.
In language models, this has manifested as agent systems—multi-step reasoning, tool use, self-correction—pushing models from single-turn Q&A to autonomous problem-solving. In the physical realm, a similar shift is happening, but with much higher stakes. A language agent can backtrack at no cost; a physical agent that spills a reagent cannot undo it.
Physical agent systems are characterized by three features: first, they must embed in experiments or operations with a closed loop—directly interfacing with raw instrument data streams, physical sensors, and execution primitives, grounding reasoning in physical reality rather than textual descriptions. Second, they require long-term persistence—memory, traceability, safety, recovery—linking multiple operational cycles rather than treating each task as isolated. Third, they must adapt in a closed loop—revising strategies based on physical outcomes, not just textual feedback.
This primitive fuses independent capabilities—world models, reliable action architectures, rich sensor suites—into fully autonomous physical systems. It is an integration layer; its maturity is a prerequisite for the three application domains to exist as deployable, real-world systems rather than isolated research demos.
Three Domains
The primitives above are universal enablers; they do not specify where the most critical applications will emerge. Many fields involve physical action, measurement, or perception. The distinction between “frontier systems” and “incremental improvements” depends on the degree of compound interest from model capabilities and scaling infrastructure—not just performance, but the emergence of capabilities previously impossible.
Robotics, AI-driven science, and new human-machine interfaces are the three fields with the strongest compound effects. Each assembles the primitives in unique ways, each constrained by the current limits of these primitives, and each produces structured physical data as a byproduct—data that feeds back into improving the primitives, creating a feedback loop that accelerates the entire system. They are not the only frontier AI domains, but they are the most densely interacting with physical reality, the furthest from the current language/code paradigm, and thus the most fertile for new capabilities—while also being highly complementary and able to benefit from its dividends.
Robotics
Robotics is the most literal physical AI: an AI system that perceives, reasons, and exerts physical actions in the material world. It also serves as a stress test for each primitive.
Consider what it takes for a general-purpose robot to fold a towel. It needs learned representations of deformable materials under force—a physical prior beyond language pretraining. It needs an action architecture capable of translating high-level commands into continuous control at over 20Hz. It requires simulation-generated training data, since millions of real towel-folding demonstrations are unavailable. Tactile feedback is necessary to detect slips and adjust grip force, as vision alone cannot reliably distinguish a stable from a failing grasp. It needs a closed-loop controller that can recognize errors and recover, rather than blindly executing memorized trajectories.
Figure caption: Simultaneous invocation of the five primitives in robotic tasks.
This is why robotics is a frontier system, not just a mature engineering discipline with better tools. These primitives are not just improvements on existing robotic capabilities; they unlock classes of operations, motions, and interactions that were impossible outside narrow industrial settings.
Significant progress has been made in recent years, as we have covered before. The first wave of VLA models demonstrated that foundation-model control is feasible. Architectural advances now connect high-level reasoning with low-level control. On-device inference is becoming practical, and cross-embodiment transfer means a single model can adapt to new robots with limited data. The remaining core challenge is scaling reliability: current success rates of around 95% on 10-step tasks are insufficient for deployment, and industrial standards demand far higher. RL-based post-training holds great promise for reaching the robustness thresholds that real-world use requires.
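The arithmetic behind why multi-step reliability is so demanding can be made concrete. Under the illustrative assumption that each step of a task fails independently (an assumption the article does not state), per-step success compounds multiplicatively:

```python
# Why multi-step reliability is so demanding: under the illustrative
# assumption that each step fails independently, per-step success
# compounds multiplicatively over the length of the task.
def task_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

# A 95% per-step rate leaves only ~60% task-level success over 10 steps;
# industrial-grade reliability requires near-perfect per-step success.
print(round(task_success(0.95, 10), 3))   # 0.599
print(round(task_success(0.999, 10), 3))  # 0.99
```

This is why a system that looks impressive in demos can still be far from deployable: closing the last few percent of per-step reliability is worth more than any other improvement.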
These advances impact market structure. For decades, the value of robotics has been concentrated in hardware—mechanical systems remain central. But as learning strategies become more standardized, value shifts toward models, training infrastructure, and data flywheels. Robotics also feeds back into the primitives: each real-world trajectory improves world models; deployment failures reveal simulation gaps; testing new embodiments broadens the physical experience base for pretraining. Robots are both the most demanding consumers of primitives and a key source of feedback signals for their improvement.
Autonomous Science
If robots test primitives through “real-time physical actions,” autonomous science tests a different aspect—multi-step reasoning about complex causal physical systems over hours or days, interpreting results, contextualizing, and revising strategies.
Figure caption: How autonomous science (AI scientist) integrates the five primitives.
AI-driven science is the most thorough application of these primitives. An SDL (self-driving lab) must learn physical and chemical dynamics to predict experimental outcomes; it needs embodied actions to pipette, position samples, operate instruments; simulation for pre-screening experiments and optimizing scarce instrument time; expanded sensing—spectroscopy, chromatography, mass spectrometry, and emerging chemical and biological sensors—to characterize results. It demands closed-loop agent orchestration capable of maintaining multi-round “hypothesize-experiment-analyze-revise” workflows autonomously, with traceability, safety, and adaptive strategies.
No other domain exercises these primitives as deeply, which is why autonomous science is a "systems" frontier rather than just better lab automation. Companies like Periodic Labs and Medra combine scientific reasoning with physical validation, iterating rapidly and generating experimental training data along the way.
The value is intuitive: traditional materials discovery takes years; AI can accelerate this process dramatically. The key constraints are shifting from hypothesis generation (well-supported by foundational models) to manufacturing and validation (requiring physical instruments, robots, closed-loop optimization). SDL aims at this bottleneck.
Another defining feature of autonomous science—applicable across all physical systems—is its role as a data engine. Each experiment produces not just a scientific result but a physically grounded, validated training signal. A measurement of polymer crystallization under specific conditions enriches the world model; a verified synthesis route becomes training data for physical reasoning; a failed experiment reveals model inaccuracies. Data from real experiments are structured, causal, empirically validated—precisely what physical reasoning models need but cannot get from other sources. Autonomous science directly converts physical reality into structured knowledge, advancing the entire physical AI ecosystem.
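The "hypothesize-experiment-analyze-revise" cycle described above can be sketched in a few lines. Every name here (the proposal heuristic, the experiment stand-in, the logged history) is a hypothetical illustration, not a real SDL API:

```python
import random

random.seed(0)  # reproducible sketch

# Minimal sketch of a "hypothesize-experiment-analyze-revise" loop for a
# self-driving lab. All interfaces below are hypothetical stand-ins.

def run_experiment(x):
    # Stand-in for a physical experiment: a noisy objective peaking at x = 0.7.
    return -(x - 0.7) ** 2 + random.gauss(0.0, 0.01)

def propose(history):
    # Stand-in hypothesis generator: perturb the best condition seen so far.
    if not history:
        return random.uniform(0.0, 1.0)
    best_x, _ = max(history, key=lambda h: h[1])
    return best_x + random.uniform(-0.1, 0.1)

def autonomous_loop(budget=30):
    history = []                       # traceability: every run is logged
    for _ in range(budget):
        x = propose(history)           # hypothesize
        y = run_experiment(x)          # experiment
        history.append((x, y))         # analyze and record
    return max(history, key=lambda h: h[1])  # best condition found

best_x, best_y = autonomous_loop()
```

Real SDLs replace the proposal heuristic with Bayesian optimization or a reasoning model, and the stand-in objective with actual instruments, but the loop structure, and the logged history that becomes training data, is the same.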
New Interfaces
Robots extend AI into physical actions; autonomous science extends AI into physical research. New interfaces connect AI directly with human perception, sensory experience, and bodily signals—covering AR glasses, EMG wearables, and implanted brain-machine interfaces. The common function is expanding the bandwidth and modalities of human-AI channels—and generating direct human-world interaction data for building physical AI.
Figure caption: Spectrum of new interfaces—from AR glasses to brain-machine interfaces.
The challenge and potential lie in the gap from current paradigms. Language models conceptually understand these modalities but are not inherently familiar with silent speech motion patterns, olfactory receptor geometries, or the temporal dynamics of EMG signals. Decoding these signals requires learning from the expanding sensory channels themselves. Many modalities lack internet-scale pretraining corpora; data often only come from the interfaces, meaning the system and its training data co-evolve—without a direct analogy in language AI.
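To make the "temporal dynamics" point concrete, a standard first stage in decoding signals like surface EMG is windowed feature extraction over the raw stream. The sketch below uses a synthetic signal and generic names; a real pipeline would add filtering and per-user normalization:

```python
import math

# Windowed RMS: a standard first-stage feature for surface-EMG decoding.
# The signal below is synthetic; real pipelines add band-pass filtering
# and per-user normalization before any learned decoder sees the data.

def window_rms(signal, window, step):
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        feats.append(math.sqrt(sum(v * v for v in chunk) / window))
    return feats

# Quiet baseline, a burst of muscle activity, then baseline again.
signal = [0.01] * 100 + [0.5 * math.sin(0.9 * i) for i in range(100)] + [0.01] * 100
feats = window_rms(signal, window=50, step=25)
# The RMS envelope rises sharply over the burst; a downstream gesture or
# silent-speech decoder would consume feature sequences like this.
```

The learned decoder that sits on top of such features is exactly the part that lacks an internet-scale corpus, which is why the device and its training data must co-evolve.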
Recent progress is exemplified by the rapid rise of consumer AI wearables. AR glasses have significantly improved in recent years (with companies deploying them in consumer and industrial contexts); voice-first wearables are enabling language AI to access richer physical context—truly following users into environments. Long-term, neural interfaces may unlock more complete interaction modalities. The computational shift driven by AI creates opportunities for substantial human-AI interaction upgrades; companies like Sesame are developing new modalities and devices for this purpose.
Voice, as a more mainstream modality, also benefits new interaction forms. Products like Wispr Flow promote voice as the primary input, thanks to its high information density and naturalness, and silent-speech interfaces push this market further: they use sensors to capture tongue and vocal-fold movements, enabling speech recognition without audible sound, a higher-density human-machine interface.
Brain-machine interfaces (both invasive and non-invasive) are a deeper frontier, with a growing ecosystem of startups. Once a purely academic field, it is now showing signals at the intersection of clinical validation, regulation, platform integration, and funding. Neuralink has implanted devices in multiple patients, and its surgical robots and decoding algorithms are iterating. Synchron's intravascular Stentrode lets paralyzed users control digital and physical environments. Echo Neurotechnologies is developing high-resolution cortical speech decoding. New startups like Nudge are attracting talent and capital for novel neural interfaces and brain-interaction platforms. Milestones include BISC chips with 65,536 electrodes demonstrating wireless neural recording, and BrainGate decoding inner speech directly from motor cortex.
The common thread across AR glasses, AI wearables, silent-speech devices, and BCIs is not just "they are all interfaces," but that they form a spectrum of increasing bandwidth connecting human physical experience and AI systems, and each point along it feeds the primitive advances behind the three main fields discussed here. A robot trained on millions of high-quality first-person videos from AR glasses learns different priors than one trained solely on teleoperation data; a lab AI that responds to silent-speech commands differs fundamentally from a keyboard-controlled system; a neural decoder trained on high-density BCI data produces motion plans impossible to obtain through other channels.
The proliferation of these devices expands the data manifold for training frontier physical AI systems—and much of this expansion is driven by well-funded consumer companies, not just academia, meaning the data flywheel can accelerate alongside market adoption.
Primitive 5: Closed-Loop Intelligent Agent Systems
The last primitive is more architectural: systems that integrate perception, reasoning, and action orchestration into continuous, autonomous, closed-loop operation over extended periods.
In language models, this has manifested as agent systems—multi-step reasoning, tool use, self-correction—pushing models from single-turn Q&A to autonomous problem-solving. In the physical realm, a similar shift is happening, but with much higher stakes. A language agent can backtrack at no cost; a physical agent that spills a reagent cannot undo it.
Physical agent systems are characterized by three features. First, they must be embedded in the experiment or operation with the loop closed: interfacing directly with raw instrument data streams, physical sensors, and execution primitives, so that reasoning is grounded in physical reality rather than textual descriptions of it. Second, they require long-term persistence (memory, traceability, safety, recovery), linking multiple operational cycles rather than treating each task as isolated. Third, they must adapt within the loop, revising strategies based on physical outcomes rather than textual feedback alone.
This primitive fuses independent capabilities—world models, reliable action architectures, rich sensor suites—into fully autonomous physical systems. It is an integration layer; its maturity is a prerequisite for the three application domains to exist as deployable, real-world systems rather than isolated research demos.
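One cycle of such a system can be sketched to show how the three features fit together: grounding in sensor feedback, persistent logging, and strategy revision, plus the irreversibility constraint that separates physical agents from language agents. All names here are illustrative, not a real agent framework:

```python
# Sketch of one cycle of a closed-loop physical agent. Unlike a language
# agent, it cannot backtrack, so irreversible actions are gated behind an
# explicit precondition check. All names here are illustrative.

IRREVERSIBLE = {"dispense_reagent", "heat_sample"}

def step(state, action, execute, verify):
    if action in IRREVERSIBLE and not state.get("preconditions_ok"):
        return state, "aborted"                  # no free undo in the physical world
    execute(action)                              # act on the physical system
    ok = verify()                                # read sensors, check the outcome
    state.setdefault("log", []).append((action, ok))  # persistence / traceability
    if not ok:
        state["strategy"] = "revise"             # adapt from physical feedback
    return state, "ok" if ok else "recovering"
```

The interesting engineering lives in `verify` (fusing heterogeneous sensor streams into a pass/fail judgment) and in how the revised strategy is produced, which is where world models and reasoning re-enter the loop.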