Anatomy of an AI agent
Thinking through what today’s agents get right and why real autonomy is still hard
For a while now, the dominant conversation in AI has been about agents. There’s ongoing debate about whether AI agents are actually delivering value and making people more productive, or whether the promised transformation of entire industries is unfolding slower than expected and starting to disappoint. Joining that conversation, I wanted to step back and reflect on the agents I find myself addicted to, the ones that actually make me more productive. For me, that’s coding agents and research agents, the two I use day to day. (I’m excluding general AI tools like ChatGPT or transcript summarizers here, since they don’t operate autonomously.)
To be specific, I’m using “agent” to mean a system that operates across multiple steps, choosing actions based on intermediate feedback and not just generating a one-off response. These systems maintain some form of goal or task structure, invoke tools or actions, and adapt their behavior as they go. That rules out one-shot chatbots, and even most function-calling assistants.
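To make that definition concrete, here’s a minimal sketch of the loop I have in mind. Everything in it (call_llm, the TOOLS registry) is a hypothetical stand-in rather than any particular vendor’s API; the point is the structure: act, observe, feed the observation back, repeat.

```python
# Minimal agent loop: the model picks an action, the action is executed,
# and the observation feeds back into the next decision.
# call_llm and TOOLS are hypothetical placeholders, not a real API.
import json

def call_llm(messages):
    """Stand-in for any chat model that returns the next action as JSON."""
    raise NotImplementedError  # hypothetical; swap in a real model call

TOOLS = {
    "search": lambda query: f"results for {query}",   # hypothetical tool
    "write_file": lambda path, text: "ok",            # hypothetical tool
}

def run_agent(goal, max_steps=10):
    messages = [
        {"role": "system", "content": "Return the next action as JSON."},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        action = json.loads(call_llm(messages))        # e.g. {"tool": "search", "args": {...}}
        if action.get("tool") == "finish":
            return action.get("answer")
        observation = TOOLS[action["tool"]](**action["args"])
        # Intermediate feedback flows back into context; this loop is what
        # separates an agent from a one-off completion.
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None  # ran out of steps without finishing
```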
What underlying design patterns make coding and research agents stand out as the most widely adopted examples of agentic software?
Good coding agents (like Windsurf Cascade, Cursor Agent, Claude Code) track changes across files, understand the structure of a codebase, parse intent from natural language, and surface outputs clearly for approval. Research agents (like ChatGPT Deep Research) pull from diverse sources, filter relevance, synthesize across formats, and present structured responses with citations. But beyond features, the real reason I return to them is that they can plan, adapt, and follow through across multiple steps without needing constant re-prompting.
Here’s an attempt to generalize the features of truly helpful agents:
Extensive contextual awareness
Coding agents can reference the full codebase, track recent edits, and understand the intent behind prompts. Research agents can identify and synthesize information across vast, heterogeneous sources. In both cases, the agent's ability to operate with a rich understanding of state is essential to usefulness.
Workflow-native interfaces
The best agents integrate seamlessly into existing tools. With coding agents like Windsurf or Cursor, the interaction layer is your IDE - enhanced, not replaced. Prompts are issued from a sidebar, changes appear inline, and control remains with the user. Similarly, research agents often extend familiar chat interfaces, making them easy to adopt and quick to trust.
Human-in-the-loop feedback
Trust comes from transparency. Effective coding agents surface changes clearly and allow for user approval before execution, preserving both confidence and competence. Research agents vary here, with some offering context gathering before initiating tasks, though they often lack mid-process checkpoints. Still, the best ones make their reasoning traceable through citations and structured output.
Multimodal reasoning
Top-tier agents can interpret and reason across multiple data types, not just text. Coding agents work with source code, logs, error traces. Research agents increasingly handle PDFs, spreadsheets, web content, and visual media. This broad input capability dramatically expands the range of questions they can tackle and the depth of their analysis.
The features I’ve outlined (contextual awareness, seamless interfaces, transparency, and multimodal reasoning) are what make these agents genuinely helpful in my own workflows. I’ve focused on coding and research agents not because they’re the only success stories, but because they’re the ones I use regularly and that seem to have found broader traction. There are certainly other domains where agents are making real progress, often in more specialized or enterprise-specific contexts, but these two illustrate what’s possible when the pieces come together.
That said, these environments are relatively structured, the tools are known, and the user goals are often well-scoped. Once you step outside those conditions into workflows with fuzzier goals, messier data, or higher risk of failure, then these same design patterns can start to break down.
So what’s actually holding agents back from working more broadly? In my view, the bottlenecks fall into two categories: execution and cognition.
Execution is the more tractable challenge, not because it’s easy but because it’s engineerable. It’s about giving agents the right tools, data, and interfaces so they can actually do things. This includes:
Enabling actions (APIs, environments, toolchains)
Maintaining memory or state across tasks
Operating agents in the messy real world
These are mostly systems engineering problems. But they’re not trivial and they directly constrain what agents can actually do.
Cognition is the harder part because it touches the model’s core limitations. LLMs generate plausible next steps based on text prediction, not on a model of the world or a simulation of outcomes. They don’t reason in the deliberative, goal-oriented sense that agents require. So everything from planning to error correction becomes an approximation layered on top of a stateless engine.
Let’s go over the main bottlenecks of each.
Execution bottlenecks
Tool Use & Orchestration
For an agent to be useful, it needs to take real actions and not just generate text. That might mean executing code, calling APIs, scraping data, writing to a database, or triggering downstream workflows.
Today’s systems typically manage this through function-calling APIs and orchestration layers. Examples include OpenAI’s tool use APIs, AutoGen-style multi-turn loops, or Toolformer-like prompting where the model decides when and how to invoke external tools. Historically, this has required bespoke APIs, custom tool wrappers, and tightly coupled orchestration code. But that’s starting to change with the introduction of Model Context Protocol (MCP), a new standard that gives AI agents a consistent way to connect with tools, services, and data, no matter where they live or how they’re built. MCP aims to simplify and modularize agent architecture, making it easier to plug tools into agent workflows without reinventing the interface each time.
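As a rough illustration of the function-calling pattern (not MCP itself, and not any specific provider’s API), the shape is usually a set of JSON schemas describing the tools, a model call that picks one, and a thin dispatch layer that executes it. All names below are hypothetical:

```python
# Function-calling in outline: schemas describe the tools, the model picks
# one, an orchestration layer dispatches it. Names here are illustrative.
TOOL_SCHEMAS = [
    {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the analytics DB",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def run_sql(query: str) -> str:
    return "[]"                      # hypothetical implementation

DISPATCH = {"run_sql": run_sql}

def choose_tool(task: str, schemas: list) -> dict:
    """Stand-in for a model call returning {'name': ..., 'arguments': {...}}."""
    raise NotImplementedError

def execute(task: str) -> str:
    call = choose_tool(task, TOOL_SCHEMAS)
    return DISPATCH[call["name"]](**call["arguments"])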
What’s Still Hard
Even with MCP streamlining how tools are connected, execution remains a fragile part of agent workflows. A single tool failure due to a flaky API, malformed input, or unexpected response can still derail the entire task. While retries and basic error handling are common, most agents struggle to adapt their broader plan when a tool fails. Backtracking, revising subgoals, or rethinking strategy in response to execution failures remains difficult, especially outside of narrow, well-scripted domains.
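To make the gap concrete, here’s a sketch of what retries typically look like next to the plan-revision step most agents are missing. revise_plan stands in for a hypothetical model call; nothing here is a standard library or framework API.

```python
# Retries are the easy part; revising the plan when a tool keeps failing is
# the part most agents lack. revise_plan is a hypothetical model call.
import time

def call_tool_with_retries(tool, args, retries=3, backoff=1.0):
    last_error = None
    for attempt in range(retries):
        try:
            return tool(**args)
        except Exception as err:              # flaky API, malformed input, bad response
            last_error = err
            time.sleep(backoff * (2 ** attempt))
    raise last_error

def run_step(step, plan, revise_plan):
    try:
        return plan, call_tool_with_retries(step["tool"], step["args"])
    except Exception as err:
        # Instead of pressing on or hallucinating success, hand the failure
        # back to the planner and ask for an alternative subgoal.
        new_plan = revise_plan(plan, failed_step=step, error=str(err))
        return new_plan, None
```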
Security is another important constraint. Giving agents the ability to act on real systems (accessing files, triggering deployments, moving money, etc.) introduces real-world risks. While efforts around sandboxing, fine-grained permissions, and output validation are advancing (e.g., tools like ReAct-style action plans, browser-based agents with guardrails, or permissions-aware runtimes), robust standards are still emerging. As agents grow more autonomous, managing trust boundaries becomes increasingly critical.
The last point here is that while current agents can select from a set of tools, orchestrating them in unfamiliar environments remains a frontier. Most agents operate in pre-defined contexts with known tools and expected inputs. Generalized tool use where agents learn to explore new APIs, adapt to changing environments, or compose novel workflows on the fly is still early-stage. It’s one thing to decide when to call a tool; it’s another to figure out how to use a new one, or why it’s appropriate in a novel situation.
Context Handling & Memory: Beyond Just Bigger Windows
An agent needs to remember what it’s doing. That means tracking goals, subgoals, prior actions, and environmental state, and importantly: not just what it tried, but why.
Today’s dominant memory pattern relies on vector databases and semantic embeddings. Agents store snapshots of prior interactions like tool calls, logs, and code diffs (in the case of coding agents) in external stores, then retrieve relevant items to inject back into the prompt. This helps with recall and continuity, but it’s not true memory. It’s retrieval without understanding or internal state.
Because large language models are stateless, agents rely on external memory to maintain continuity. Short-term memory (recent outputs, tool calls, goals) is often injected into the prompt; long-term memory (plans, decisions, and episodic traces) is stored in vector databases or serialized logs. Frameworks like Windsurf, AutoGen, or LangGraph manage this through structured traces using formats like JSON or DAGs. Emerging standards like MCP are beginning to formalize these traces, offering a consistent way for agents to log, retrieve, and reason over prior steps. But memory isn’t just infrastructure, it shapes behavior. Fragile, bloated, or poorly structured memory can silently degrade reasoning performance.
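As a rough sketch of that retrieve-and-inject pattern, assuming embed stands in for whatever embedding model you use and the store is just an in-memory list rather than a real vector database:

```python
# Retrieve-and-inject memory: prior steps are embedded, the most similar
# ones are pulled back into the prompt. embed() is a stand-in.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model; assumed to return a unit vector."""
    raise NotImplementedError

class TraceMemory:
    def __init__(self):
        self.entries = []                 # list of (text, vector)

    def add(self, text: str):
        self.entries.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 5):
        q = embed(query)
        scored = [(float(np.dot(q, v)), t) for t, v in self.entries]
        return [t for _, t in sorted(scored, reverse=True)[:k]]

def build_prompt(goal: str, memory: TraceMemory) -> str:
    recalled = memory.retrieve(goal)
    # Retrieval without understanding: the model sees these snippets,
    # but nothing guarantees it integrates them into its plan.
    return "Relevant history:\n" + "\n".join(recalled) + f"\n\nTask: {goal}"
```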
What’s Still Hard
The core challenge isn’t actually storing memory, but deciding what to store. Agents lack an innate sense of salience. They can’t tell which past steps will matter later, leading to two failure modes: saving too much irrelevant detail (slowing down or confusing the system), or forgetting crucial context too early.
Humans intuitively prioritize memory by importance, novelty, or emotional weight. Agents don’t. Most systems use blunt heuristics, like “keep the last N steps” or “store everything,” but these are brittle and context-agnostic.
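For a sense of how blunt these heuristics are, “keep the last N steps” really is about this simple:

```python
# "Keep the last N steps": cheap, context-agnostic, and indifferent to
# whether the step that falls off was the one that mattered.
from collections import deque

class LastNMemory:
    def __init__(self, n: int = 20):
        self.steps = deque(maxlen=n)      # older steps silently fall off the end

    def add(self, step: str):
        self.steps.append(step)

    def as_context(self) -> str:
        return "\n".join(self.steps)
```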
Related to this is the concept of graceful forgetting. Human memory decays for a reason: it compresses and abstracts as it goes. Agents don’t do this. Logs and embeddings accumulate indefinitely unless manually pruned. The result is bloated memory, slower retrieval, and noisier relevance scoring. In practice, this often leads to resets or manual cleanup, breaking the illusion of autonomous continuity.
Even when relevant memory is retrieved, agents often fail at context stitching: integrating past information into present reasoning. Retrieval isn’t integration. Agents frequently treat recalled events as static reference, rather than using them to update beliefs or modify plans. This leads to common failure modes like repeated actions, ignored failures, or incoherent sequences.
While memory systems live squarely in the execution stack, their limitations often expose deeper reasoning gaps. Salience, abstraction, and belief updates aren’t just system design questions but fundamental to how agents think. This is where execution and cognition begin to blur.
Cost, Latency, and Runtime Realities
Agents run as loops rather than single API calls. Real agents involve multiple steps, tools, and decision points over extended timeframes. This makes cost, latency, and runtime complexity major bottlenecks.
You can mitigate this with model cascading, letting smaller, cheaper models handle routine tasks while reserving large models for harder reasoning. Result caching helps avoid recomputation, and asynchronous pipelines allow agents to manage long-running tasks without blocking the entire system.
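Here’s a minimal sketch of cascading plus caching, assuming a cheap model that can report some confidence signal and an expensive model to escalate to; both calls are placeholders, not a specific provider’s API.

```python
# Model cascading with a result cache: try the cheap model first,
# escalate only when its confidence signal is low.
from functools import lru_cache

def cheap_model(prompt: str) -> tuple[str, float]:
    """Stand-in returning (answer, confidence in [0, 1])."""
    raise NotImplementedError

def expensive_model(prompt: str) -> str:
    raise NotImplementedError

@lru_cache(maxsize=1024)              # avoid recomputing identical sub-tasks
def answer(prompt: str, threshold: float = 0.8) -> str:
    draft, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return draft
    return expensive_model(prompt)    # escalate only when needed
```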
What’s Still Hard
There’s a tradeoff between interactivity and runtime. Real agents need to be continuous and responsive, not batch jobs that run overnight. But keeping agents alive across long time horizons while managing cost and latency is still a hard problem, especially for consumer use cases where margins are tight.
Cognition bottlenecks
Reasoning & Planning: The Limits of Next-Token Prediction
Generating plausible next tokens is not the same as planning. For agents to function in dynamic environments, they need to break tasks into subgoals, choose tools, decide when to stop, and adapt when things don’t go as expected. This requires a form of reasoning that goes beyond fluent language generation.
Several approaches have emerged to help LLM-based agents approximate this.
Chain-of-Thought prompting encourages models to "think aloud" and decompose problems step-by-step. ReAct-style loops add interaction: the model reasons, takes an action, observes the result, and reflects before continuing. Other systems incorporate external planners, like symbolic search, decision graphs, or hardcoded policy trees, to structure multi-step behavior.
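A bare-bones version of that ReAct-style loop looks something like the sketch below, with call_llm and the tool functions as placeholders: the transcript accumulates Thought / Action / Observation lines, and each observation feeds the next step.

```python
# ReAct in outline: reason, act, observe, repeat. call_llm and the tools
# are hypothetical stand-ins, not a specific framework's API.
def call_llm(transcript: str) -> str:
    """Stand-in; expected to return e.g. 'Thought: ...' followed by 'Action: search[query]'."""
    raise NotImplementedError

def parse_action(step: str):
    # naive parse of a line like 'Action: search[query]'
    line = next(l for l in step.splitlines() if l.startswith("Action:"))
    tool, arg = line[len("Action: "):].rstrip("]").split("[", 1)
    return tool, arg

def react(question: str, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        tool, arg = parse_action(step)
        if tool == "finish":
            return arg
        observation = tools[tool](arg)        # act, then observe
        transcript += f"Observation: {observation}\n"
    return "gave up"
```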
What’s Still Hard
These methods work well for constrained environments, but they expose a core limitation: most agents do not maintain an internal model of the world. Language models generate each step based on local context, not on a persistent state that evolves with their actions. They don’t naturally track what’s changed or why, and there’s no built-in mechanism for causal reasoning or state updates.
This leads to brittle behavior. Agents may repeat steps they’ve already taken, ignore contradictory information, or fail to notice when a plan has gone off track. In more complex settings, where earlier actions constrain future options, this absence of statefulness becomes a fundamental bottleneck.
Several research directions are exploring how to address this. Simulation-based agents aim to build persistent internal worlds that evolve over time. Memory-augmented or recurrent architectures attempt to embed state directly into the model’s computation. These are still early-stage, but they represent promising moves beyond the stateless transformer loop.
A related issue is error recovery. Today’s agents rarely recognize when they’ve made a mistake. They tend to hallucinate plausible next steps rather than backtrack or revise a failing plan. This is partly architectural: there’s no introspective loop or self-monitoring mechanism. Some work is trying to reintroduce structured introspection, using self-reflection prompts or critic models that evaluate intermediate reasoning steps. Others explore scaffolding techniques that help agents monitor their own reasoning paths, though these remain brittle and highly context-sensitive.
Another challenge arises at the very start of the task: goal translation. Users often provide underspecified instructions, like “find the best option,” “analyze this dataset,” or “optimize this flow,” without detailing constraints, trade-offs, or concrete objectives. Agents struggle to convert these into structured plans. This leads to brittle pipelines that succeed only when the user phrases the task just right.
There’s early progress here too. Some systems use interactive clarification to refine the task before execution. Others experiment with semi-structured goals, where the user provides partial constraints and the agent fills in the rest. A related line of work involves plan synthesis models, where the agent first produces a structured plan like an action graph or goal tree before attempting execution. This separation of planning and acting remains a promising but underdeveloped area.
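A sketch of that separation: the agent first emits a structured plan (here a small dependency graph of steps) that can be checked before anything executes. synthesize_plan is a hypothetical model call, and the validation is just a topological sort.

```python
# Plan-then-act: synthesize a structured plan, validate it, and only then
# execute. synthesize_plan is a hypothetical model call.
from dataclasses import dataclass, field

@dataclass
class Step:
    id: str
    description: str
    depends_on: list[str] = field(default_factory=list)

def synthesize_plan(goal: str) -> list[Step]:
    """Stand-in for a model call that returns steps as an action graph."""
    raise NotImplementedError

def execution_order(plan: list[Step]) -> list[Step]:
    # simple topological sort so dependencies run first
    done, ordered = set(), []
    remaining = {s.id: s for s in plan}
    while remaining:
        ready = [s for s in remaining.values() if all(d in done for d in s.depends_on)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable plan")  # catch this before acting
        for s in ready:
            ordered.append(s)
            done.add(s.id)
            del remaining[s.id]
    return ordered
```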
Confidence Calibration & Knowing When to Stop
Agents need to know when they're unsure, when to ask for help, and when to stop. Without calibrated confidence, even sophisticated agents can spiral: pressing forward with incorrect assumptions, compounding errors, or returning answers that sound plausible but are deeply wrong. There’s also a social dimension: users don’t just need correct answers, they need to trust that the agent knows when to act, when to wait, and when to ask. This becomes especially important in collaborative workflows where misplaced initiative or overreach can feel intrusive or disruptive.
Several techniques have emerged to address this. Some systems use self-critique loops, where an agent revisits its own output to check for errors or inconsistencies. Others use uncertainty proxies like token-level entropy or log-probabilities to estimate confidence. Human-in-the-loop checkpoints are also common in production systems: agents pause at decision boundaries, letting users review before taking action.
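As one example of an uncertainty proxy paired with a human checkpoint, here’s a sketch that uses average token log-probability as a crude confidence signal. The generation call is a placeholder and the threshold is arbitrary; as the next section argues, likelihood is not the same thing as grounded belief.

```python
# Log-probability proxy with a human checkpoint: if the model's average
# token log-probability is low, pause for review instead of acting.
# generate_with_logprobs is a placeholder for any API that returns
# per-token log-probabilities.
import math

def generate_with_logprobs(prompt: str) -> tuple[str, list[float]]:
    """Stand-in returning (text, per-token log-probabilities)."""
    raise NotImplementedError

def act_or_escalate(prompt: str, threshold: float = -1.0):
    text, logprobs = generate_with_logprobs(prompt)
    avg_logprob = sum(logprobs) / len(logprobs)
    confidence = math.exp(avg_logprob)          # crude proxy, not calibrated
    if avg_logprob < threshold:
        return {"status": "needs_review", "draft": text, "confidence": confidence}
    return {"status": "proceed", "answer": text, "confidence": confidence}
```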
What’s Still Hard
The underlying problem remains: language models are not calibrated. They generate outputs token-by-token based on likelihood, not grounded beliefs. A confident-sounding answer doesn’t imply correctness, and there’s no native mechanism for tracking what the model “knows” or where it’s extrapolating. This makes it difficult for agents to decide when to back off, when to escalate, or when to retry with a new plan.
Even when models are wrong, they often can’t explain why. There’s no internal attribution, no pointer to the faulty assumption, hallucinated fact, or misapplied rule. This limits both debuggability and corrective behavior: the agent can’t adjust based on where things went wrong, only that something failed.
Research is exploring paths forward. Some systems use critique models to flag errors in reasoning chains. Others experiment with meta-level scaffolding, where the agent explicitly reflects on its reasoning process before committing to an answer. But these are early efforts, and in practice, confidence calibration remains a core limitation for agents operating outside tightly constrained domains.
Beyond planning and error correction, several deeper constraints still hold back general-purpose autonomy. These issues are harder to surface in everyday workflows, but they quietly limit how adaptive, safe, and robust agents can be in new or evolving environments.
Here’s a brief overview of some of the most important ones.
Autonomy and initiative
Most agents today follow predefined workflows. They don't initiate action unless prompted, and rarely generate subgoals independently. Some systems are experimenting with limited autonomy, like suggesting next steps or identifying blockers, but initiating action safely and responsibly remains difficult, especially in open-ended settings where the agent’s scope isn't clearly bounded.
Meta-learning and task transfer
Meta-learning is the ability to improve across tasks, not just within them. Today’s agents typically start from scratch each time, not retaining lessons learned or refining their reasoning strategies over episodes. Transferring skills from one domain to another, or adapting to unfamiliar tasks, is still an open research area.
Exploration vs. exploitation
This presents another tradeoff. Agents often repeat known successful strategies, even when better alternatives exist. Safe exploration (trying new approaches without unintended consequences) is especially hard in high-stakes or cost-sensitive domains. And without a sense of epistemic uncertainty, agents can’t reason about what they don’t know, limiting curiosity or information-seeking behavior.
Alignment and goal drift
There’s also the issue of goal drift, where an agent’s behavior diverges subtly from its intended objective. This can happen when goals are underspecified or when proxies (like test coverage or speed) become targets themselves, a classic case of Goodhart’s Law. In systems like coding agents, this manifests when models optimize for passing tests rather than writing maintainable or readable code.
End-to-end reinforcement learning
Finally, there's growing interest in end-to-end reinforcement learning for agents, which involves training models to reason and act within a unified feedback loop, rather than separating language generation from control logic. Approaches like Decision Transformers and memory-augmented RL are promising, but most production agents today still rely on hand-rolled orchestration to manage tool use and planning. Closing this gap remains an active area of research.
Closing: The Road Ahead
We’ve looked at the kinds of agents I actively seek out, and what makes them useful. We’ve also unpacked some of the technical bottlenecks that need to be solved for agentic software to work more broadly.
For builders, the advice is simple: start small, but not too small. Look for workflows with clear friction points where autonomy could make a real difference - not just scripting tasks, but introducing flexibility, iteration, and decision-making. That’s how coding agents emerged: not as one big leap, but as a series of targeted tools that slowly earned trust.
At the same time, not everything needs to be an agent. Just because you can wrap a task in an agent loop doesn’t mean it benefits from reasoning, planning, or exploration. Use agents when autonomy matters.
That said, the case for autonomy grows as we zoom out to larger problem spaces. As agents move from narrow tasks to broader responsibilities, the impact becomes more significant. That future isn’t about replacing everything with agents, but about knowing when the complexity of the task actually calls for one. There’s a growing wave of people tackling these challenges from all angles and it’s an exciting time to be thinking about what agents can and should do. Of course, there are relevant areas I didn’t go into: multi-agent swarms, embodied systems, alignment and safety, and more. For another time!
If you’ve made it this far, feel free to reach out - especially if you’re building agentic systems for real-world problems, or thinking through these challenges in practice.