8 Levels of Agent Engineering
Author: Bassim Eledath
Translation: Baoyu
AI’s programming capabilities are outpacing our ability to harness them. That’s why relentless efforts to boost SWE-bench scores haven’t translated into the productivity metrics that engineering leadership actually cares about. The Anthropic team launched Cowork in just 10 days, while another team using the same model couldn’t even produce a proof of concept. The difference is that one team has bridged the gap between capability and practice, and the other hasn’t.
This gap won’t disappear overnight; it will narrow level by level. There are eight levels in total. Most readers of this article will have already passed the first few, but you should be eager to reach the next one, because each level-up brings a big jump in output, and every improvement in model capability amplifies these gains further.
Another reason to care is the team effect: your output depends more on your teammates’ levels than you might think. Suppose you’re a Level 7 expert, with an AI agent quietly working through several PRs for you overnight. If your repository requires a colleague’s approval to merge, and that colleague is still at Level 2, manually reviewing every PR, your throughput is bottlenecked by theirs. Helping teammates level up benefits you as well.
Through conversations with many teams and individuals about their AI-assisted programming practices, here is the observed progression path (the order isn’t strictly fixed):
8 Levels of AI Engineering
Level 1 & 2: Tab Completion & AI IDEs
I’ll briefly cover these two levels mainly for completeness. Feel free to skim.
Tab completion is the starting point. GitHub Copilot kicked off this movement—press Tab to auto-complete code. Many may have forgotten this stage, and newcomers might skip it altogether. It’s more suited for experienced developers who can set up the code skeleton first, then let AI fill in the details.
Dedicated AI IDEs like Cursor have changed the game by connecting chat and codebases, making cross-file editing much easier. But the ceiling is always context. Models can only help with what they see, and frustratingly, they either lack the right context or see too much irrelevant info.
Most people at this level are experimenting with plan-based modes: transforming rough ideas into structured step-by-step plans for the LLM, iterating on these plans, then executing. This works well at this stage and is a reasonable way to maintain control. But in higher levels, reliance on plan modes diminishes.
Level 3: Context Engineering
Now we enter the interesting part. Context Engineering is the buzzword of 2025. It became a concept because models can finally reliably follow a reasonable number of instructions with just the right context. Noisy or insufficient context is equally bad, so the core work is to increase the information density per token. “Each token must fight for its position in the prompt”—that was the mantra.
The same information with fewer tokens—information density is king (Source: humanlayer/12-factor-agents).
In practice, context engineering covers more than most realize. It includes system prompts and rule files (.cursorrules, CLAUDE.md). It involves how you describe tools, as models read these descriptions to decide which tools to invoke. It includes managing dialogue history to prevent long-running agents from losing track after many turns. It also involves deciding which tools to expose each turn, since too many options can overwhelm the model—just like humans.
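One of these concerns, managing dialogue history, can be sketched in a few lines. This is a toy illustration, assuming a crude whitespace-based token count rather than a real tokenizer, and not how any particular agent framework implements it:

```python
# Toy sketch: keep dialogue history inside a token budget by dropping
# the oldest turns first, so long-running agents don't lose the thread.

def estimate_tokens(text: str) -> int:
    # Crude proxy; a real system would use the model's tokenizer.
    return len(text.split())

def trim_history(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined size fits the budget."""
    kept: list[str] = []
    total = 0
    for turn in reversed(turns):           # walk from newest to oldest
        cost = estimate_tokens(turn)
        if total + cost > budget:
            break                          # older turns no longer fit
        kept.append(turn)
        total += cost
    return list(reversed(kept))            # restore chronological order

history = ["user: set up repo", "agent: done", "user: add tests please now"]
print(trim_history(history, budget=8))
```

Real systems layer summarization on top (compressing dropped turns into a short recap) rather than discarding them outright, but the budget-driven shape is the same.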
Today, the term “context engineering” is less common. The focus has shifted toward models that tolerate noisier contexts and can reason in more chaotic scenarios (larger context windows help too). But context consumption remains critical. The following scenarios still pose bottlenecks:
Smaller models are more sensitive to context. Voice applications often use smaller models, and context size affects response latency.
Token consumption is high. Protocols like Playwright MCP and image inputs eat tokens quickly, pushing you into context compaction in Claude Code earlier than expected.
Agents integrated with dozens of tools spend more tokens parsing tool definitions than actually doing work.
More broadly, context engineering hasn’t disappeared—it’s evolved. The focus has shifted from filtering bad context to ensuring the right context appears at the right time. This shift paves the way for Level 4.
Level 4: Compound Engineering
Context engineering improves the current session. Compound Engineering (coined by Kieran Klaassen) improves future sessions. This realization was a turning point for me and many others—it made us see that “programming by feel” is far more than just prototyping.
It’s a cycle of “plan, delegate, evaluate, consolidate.” You plan tasks, provide enough context for the LLM to succeed, delegate, evaluate outputs, and crucially, consolidate lessons learned: what works, what doesn’t, what patterns to follow next time.
The magic is in the “consolidate” step. LLMs are stateless. If it reintroduces a dependency you explicitly removed yesterday, it will do so again tomorrow—unless you tell it not to. The most common solution is updating your CLAUDE.md (or equivalent rules file) to embed lessons into future sessions. But beware: stuffing too many instructions into rules can backfire (more instructions ≠ better). A better approach is creating an environment where the LLM can easily discover useful context—like maintaining an up-to-date docs/ folder (covered in Level 7).
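As a concrete sketch of the consolidate step, a small helper could append deduplicated lessons to a CLAUDE.md-style rules file so future sessions inherit them. This is a hypothetical illustration of the workflow, not a prescribed tool:

```python
# Hypothetical "consolidate" helper: record a lesson in the rules file
# unless it is already there, so the file does not bloat over time.
from pathlib import Path

def consolidate_lesson(rules_file: Path, lesson: str) -> bool:
    """Append a lesson as a bullet; return True only if it was written."""
    existing = rules_file.read_text() if rules_file.exists() else ""
    if lesson in existing:
        return False                       # already recorded; skip
    with rules_file.open("a") as f:
        f.write(f"- {lesson}\n")
    return True

rules = Path("CLAUDE.md")
consolidate_lesson(rules, "Do not reintroduce the dependency removed yesterday.")
```

The dedup check is the point: it is a tiny guard against the "more instructions" failure mode the paragraph above warns about.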
Practitioners of compound engineering are highly sensitive to the context fed to the LLM. When errors occur, their instinct isn’t to blame the model but to check if the context is missing something. This intuition makes Levels 5-8 possible.
Level 5: MCP & Skills
Levels 3 & 4 address context issues. Level 5 tackles capability issues. MCPs and custom skills let your LLM access databases, APIs, CI pipelines, design systems, and tools like Playwright for browser testing or Slack for notifications. The model no longer just reasons about your codebase; it can act on it directly.
There’s plenty of quality material on MCPs and skills, so I won’t elaborate here. But some examples from my experience: our team shares a PR review skill, iterating on it collectively (still ongoing). It conditionally spawns sub-agents based on PR type—checking database security, analyzing complexity for redundancy, monitoring prompt health to ensure standards, running linters like Ruff.
Why invest so much in review skills? Because when the agent starts producing PRs in bulk, manual review becomes a bottleneck, not a quality gate. Latent Space makes a compelling case: traditional code review is dead. Automated, consistent, skill-driven review is the future.
For MCPs, I use Braintrust MCP to let the LLM query evaluation logs and make direct edits. I use DeepWiki MCP to access documentation from any open-source repo without manually pulling docs into context.
When multiple team members develop their own skills, it’s worth consolidating them into a shared registry. Block has a great article on this: they built an internal skills marketplace with over 100 skills, curated for specific roles and teams. Skills are treated like code: pull requests, reviews, version history.
A notable trend: LLMs increasingly use CLI tools instead of MCPs (and it seems every company is releasing its own: Google Workspace CLI, Braintrust upcoming). The reason is token efficiency. MCP servers inject full tool definitions into context each round, whether used or not. CLI, on the other hand, runs targeted commands, with only relevant output entering the context window. I prefer agent-browser over Playwright MCP for this reason.
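The token-efficiency argument can be illustrated with a toy sketch. `run_cli` here is a generic subprocess wrapper written for illustration, not any specific vendor’s CLI:

```python
# Why CLIs can be cheaper than MCP servers: an MCP server injects every
# tool definition into context each turn, whether or not the tool is used.
# A CLI call only adds the stdout of the one command you actually ran.
import subprocess

def run_cli(args: list[str]) -> str:
    """Run a targeted command; only its output enters the context window."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Context cost is proportional to what you asked for, not to the size of
# the tool catalog:
print(run_cli(["echo", "only this output reaches the model"]))
```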
Pause here. Levels 3-5 are the foundation for everything that follows. LLMs excel at some tasks and struggle with others. Developing intuition about these boundaries is essential before layering more automation. If your context is noisy, prompts are insufficient or inaccurate, or tool descriptions vague, then Levels 6-8 will only magnify these issues.
Level 6: Harness Engineering
The rocket truly takes off.
Context engineering focuses on what the model sees. Harness engineering focuses on building the entire environment (tools, infrastructure, feedback loops) that lets the agent work reliably without your intervention. It’s not just an editor; it’s a complete feedback system.
OpenAI’s Codex toolchain—a comprehensive observability system—lets agents query, relate, and reason about their outputs (Source: OpenAI).
OpenAI’s Codex team integrated Chrome DevTools, observability tools, and browser navigation into the agent runtime, enabling screenshots, UI automation, log queries, and self-verification. Give a prompt, and the agent can reproduce bugs, record videos, and implement fixes. It then validates by manipulating the app, submits PRs, responds to reviews, and merges—only involving humans when necessary. The agent isn’t just coding; it sees what the code does and iterates—like a human.
My team builds voice and chat agents for technical troubleshooting, so I created a CLI tool called Converse that lets any LLM chat with our backend. The LLM modifies code, tests against the live system via Converse, and iterates. Sometimes this self-improvement loop runs for hours. It’s especially powerful when results are verifiable: for example, checking that conversations follow a prescribed flow, or that specific tools are invoked in certain cases (such as handing off to a human).
The core concept supporting all this is backpressure—an automated feedback mechanism (type systems, tests, linters, pre-commit hooks) that allows the agent to detect and correct errors without human intervention. If you want autonomy, you need backpressure; otherwise, you get a production machine that just produces garbage. This extends to security: Vercel’s CTO points out that agents, generated code, and your keys should be in different trust domains, because a prompt injection attack in logs could trick the agent into stealing credentials—if everything shares the same security context. Security boundaries are backpressure: they constrain what the agent can do if it goes out of control, not just what it should do.
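Backpressure can be sketched as a loop. Here `attempt_fix` is a hypothetical stand-in for one agent edit cycle and `run_checks` for the automated gates (tests, linters, type checkers); neither name comes from the article:

```python
# Minimal backpressure loop: the agent keeps iterating until automated
# checks pass, with no human in the loop until the budget is exhausted.
from typing import Callable

def run_with_backpressure(
    attempt_fix: Callable[[str], None],        # one agent edit, given feedback
    run_checks: Callable[[], tuple[bool, str]],# (passed?, failure output)
    max_iters: int = 5,
) -> bool:
    """Feed check failures back to the agent until green or out of budget."""
    feedback = ""
    for _ in range(max_iters):
        attempt_fix(feedback)                  # agent edits code
        ok, feedback = run_checks()            # e.g. tests / linter / types
        if ok:
            return True                        # backpressure satisfied
    return False                               # escalate to a human
```

The security boundaries mentioned above fit the same signature: a sandbox that rejects an action is just another check whose failure output flows back into the loop.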
Two principles clarify this:
Design for throughput, not perfection. Expecting every submission to be perfect leads to endless loops of bugs and overwrites. It’s better to tolerate small, non-blocking errors and do a final quality check before release—just like with human colleagues.
Constraints are better than instructions. Step-by-step prompts (“do A, then B, then C”) are becoming outdated. From experience, defining boundaries is more effective than a checklist, because the agent will fixate on the list and ignore everything outside it. Better prompts are: “This is the result I want; keep doing it until all tests pass.”
The other half of harness engineering is ensuring the agent can navigate the codebase autonomously. OpenAI’s approach: keep AGENTS.md under 100 lines, as an index pointing to other structured docs, and keep those docs current via CI rather than relying on quick, ad-hoc manual updates.
Once you’ve built all this, a natural question arises: if the agent can verify its work, navigate the repo freely, and correct errors without you—why are you still sitting there?
A reminder: for those at Levels 1-4, what follows might sound like science fiction. Bookmark it and come back later.
Level 7: Background Agents
Hot take: planning mode is dying.
Boris Cherny, creator of Claude Code, says 80% of tasks still start with planning. But with each new model generation, the success rate after planning is climbing. I believe we’re approaching a tipping point: planning as a separate manual step will gradually vanish. Not because planning isn’t important, but because models are smart enough to plan themselves. But only if you’ve done Levels 3-6 work. If your context is clean, constraints clear, tool descriptions complete, and feedback loops closed, the model can reliably plan without your review. If not, you still need to oversee the plan.
To clarify: planning as a general practice isn’t disappearing; it’s changing form. For beginners, planning remains the right entry point (see Levels 1 & 2). But for Level 7 complexity, “planning” becomes more exploratory—probing codebases, prototyping in worktrees, exploring solutions. Increasingly, background agents handle this exploration for you.
This is crucial because it unlocks background agents. If an agent can generate reliable plans and execute them without your sign-off, it can run asynchronously while you do other things. A key shift—from “I switch tabs” to “work progresses without me.”
The Ralph loop is a popular entry point: an autonomous loop that repeatedly runs a coding CLI until every item in the PRD is completed, with each iteration starting a new instance with fresh context. In my experience, running Ralph loops well is tricky; any vague or inaccurate description in the PRD can backfire. It’s a bit too fire-and-forget.
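The loop structure can be sketched as follows. `run_fresh_session` is a hypothetical hook; in practice it might shell out to a coding CLI, but nothing here is a specific vendor command:

```python
# Ralph-style loop sketch: each iteration launches a *fresh* agent session
# (no carried-over context) against the remaining PRD items.
from typing import Callable

def ralph_loop(
    prd_items: list[str],
    is_done: Callable[[str], bool],           # check an item off the PRD
    run_fresh_session: Callable[[str], None], # hypothetical: one clean run
    max_iters: int = 20,
) -> bool:
    for _ in range(max_iters):
        remaining = [item for item in prd_items if not is_done(item)]
        if not remaining:
            return True                       # PRD complete
        # New instance, clean context: only the next open item is passed in.
        run_fresh_session(remaining[0])
    return False                              # budget exhausted; human review
```

The fragility the paragraph above describes lives in `is_done`: if the PRD’s completion criteria are vague, the loop either spins forever or declares victory early.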
You can run multiple Ralph loops in parallel, but the more you start, the more time you spend coordinating, scheduling, checking outputs, and pushing progress. You’re no longer coding—you’re managing. You need an orchestrator to handle scheduling, so you can focus on intent, not logistics.
Dispatch runs 3 models in parallel, launching 5 workers—keeping your session lean, with agents doing the work.
Recently, I’ve been heavily using Dispatch, a Claude Code skill I built that turns your session into a command center. You stay in a clean session, while workers in isolated contexts handle heavy lifting. The scheduler plans, delegates, tracks. Your main window is reserved for orchestration. When a worker stalls, it raises clarifying questions instead of silently failing.
Dispatch runs locally, ideal for rapid development with tight feedback, easier debugging, no infrastructure overhead. Ramp’s Inspect is complementary—designed for longer, more autonomous runs: each agent session runs in a cloud sandbox VM with a full dev environment. A PM spots a UI bug, flags it in Slack, and Inspect takes over when you close your laptop. The tradeoff is operational complexity (infrastructure, snapshots, security), but you get scale and reproducibility unmatched by local agents. I recommend using both—local and cloud.
At this level, a surprisingly powerful pattern is using different models for different tasks. The best engineering teams aren’t clones of each other. They have diverse thinking styles, backgrounds, strengths. The same applies to LLMs: models trained differently have distinct personalities. I often assign Opus for implementation, Gemini for exploration, Codex for review—collectively outperforming any single model working alone. Think of it as collective intelligence applied to code.
It’s also critical to decouple implementers from reviewers. I’ve learned this the hard way: if the same model instance both implements and reviews its work, it becomes biased—ignoring issues, claiming tasks are done when they’re not. Not malicious, just a bias. Use a different model (or a different instance with review prompts) for review. It greatly improves signal quality.
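That separation can be sketched with `implementer` and `reviewer` as hypothetical wrappers around two different models (the names and signature are assumptions for illustration, not a real SDK):

```python
# Decoupling implementer from reviewer: route each patch to a *different*
# model for approval so the implementer never grades its own work.
from typing import Callable, Optional

def implement_and_review(
    task: str,
    implementer: Callable[[str], str],   # e.g. backed by one model
    reviewer: Callable[[str], bool],     # e.g. backed by a different model
    max_rounds: int = 3,
) -> Optional[str]:
    for _ in range(max_rounds):
        patch = implementer(task)
        if reviewer(patch):              # independent approval
            return patch
        # Fold the rejection back into the task for the next attempt.
        task = f"{task}\nPrevious attempt rejected: {patch!r}"
    return None                          # no approved patch; escalate
```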
Background agents also open the door for CI + AI integration. Once agents can run autonomously, they can trigger existing infrastructure: a documentation bot regenerates docs after each merge and submits PRs to update CLAUDE.md; a security bot scans PRs and submits fixes; a dependency bot not only flags issues but upgrades packages and runs tests. Good context, continuous lessons, powerful tools, automated feedback—everything runs autonomously now.
Level 8: Autonomous Agent Teams
No one has fully mastered this level yet, though some are pushing toward it. It’s the cutting edge.
At Level 7, you have an orchestrator LLM distributing tasks to worker LLMs in a hub-and-spoke pattern. Level 8 removes that bottleneck. Agents coordinate directly, claiming tasks, sharing discoveries, resolving dependencies, and handling conflicts, without a central orchestrator.
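The coordination primitive underneath this, peers atomically claiming tasks from a shared board, can be sketched as follows. A thread lock stands in for whatever shared store (database row, file lease) a real agent team would use:

```python
# Level 8 sketch: no hub assigns work; each peer pulls its own next task
# from a shared board, with atomic claiming to prevent double work.
import threading
from typing import Optional

class SharedBoard:
    """Agents atomically claim open tasks; no central orchestrator."""
    def __init__(self, tasks: list[str]):
        self._open = list(tasks)
        self._lock = threading.Lock()
        self.claimed: dict[str, str] = {}   # task -> agent id

    def claim(self, agent_id: str) -> Optional[str]:
        with self._lock:                    # prevents two peers taking one task
            if not self._open:
                return None
            task = self._open.pop(0)
            self.claimed[task] = agent_id
            return task

board = SharedBoard(["fix parser", "write tests", "update docs"])
print(board.claim("agent-1"))   # → "fix parser"
print(board.claim("agent-2"))   # → "write tests"
```

The hard parts the rest of this section describes (conflicts, regressions, timidity) live outside this primitive, which is exactly why the claiming step is the easy bit.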
Claude Code’s experimental Agent Teams feature is an early implementation: multiple instances working in parallel on a shared codebase, communicating directly within their contexts. Anthropic built a Linux compiler from scratch using 16 parallel agents. Cursor ran hundreds of concurrent agents for weeks, building a browser from zero, migrating codebases from Solid to React.
But there are issues. Without hierarchy, agents become timid, stuck in place, making no progress. Anthropic’s agents kept breaking existing features until a CI pipeline was added to prevent regressions. Everyone experimenting at this level agrees: multi-agent coordination is very hard, and no one has found the optimal solution.
Honestly, I don’t think models are ready for this level of autonomy for most tasks. Even if they’re smart enough, building compilers or browsers is too slow, token-expensive, and economically unviable (impressive but far from mature). For most of our daily work, Level 7 is the real lever. I wouldn’t be surprised if Level 8 eventually becomes mainstream, but for now, I focus on Level 7—unless you’re Cursor, where breaking through is your business.
Level ?: The Next Step
The inevitable “what’s next” question.
Once you master orchestrating agent teams smoothly, the interaction shouldn’t be limited to text. Voice-to-voice (or maybe thought-to-thought?) interactions with programming agents—dialogue-based Claude Code—are the natural next step. Watching your app, describing a series of changes aloud, and seeing them happen before your eyes.
A group is chasing perfect one-shot generation: say what you want, and AI delivers it flawlessly in one go. But that assumes we always know exactly what we want. We don’t. We never do. Software development has always been iterative, and I believe it always will be. But it will become much easier, far beyond plain text, and much faster.
So: what level are you at, and what are you doing to reach the next one? How do you usually start a programming task with AI?