Six months ago I started using AI coding agents as a core part of my development workflow, not as a novelty or a side experiment. I write code with them, review code they produce, and run them in CI/CD pipelines. After building production systems for over eighteen years, I’ve developed a fairly reliable sense for which tools earn their place and which ones create more problems than they solve.
AI coding agents have earned their place. But the way they deliver value is almost nothing like the marketing suggests, and the failure modes are more interesting than the success stories.
## Context Engineering Is the Real Skill
The single biggest lesson I’ve learned is that the quality of an agent’s output is directly proportional to the quality of the context you give it. This has been called “context engineering,” and it’s a more accurate description of the skill than “prompt engineering.” A well-structured prompt sent into a project with no architectural guidance, no documented conventions, and no test suite will produce mediocre code. A simple prompt sent into a project with a solid CLAUDE.md, clear directory structure, and comprehensive tests will produce code that genuinely looks like a team member wrote it.
In practice, this means I spend more time on project documentation than I ever did before — not documentation for humans, but documentation that serves both humans and agents. A CLAUDE.md file at the project root that describes architectural decisions, naming conventions, and deployment patterns. Agent-specific configuration files that define what tools are available and what constraints apply. Test suites that act as executable specifications.
Here’s what a minimal but effective project configuration looks like:
```markdown
# CLAUDE.md

Built with Hugo. Run `hugo server` for local development.

Site content is in data/experience.yaml, data/skills.yaml, and content/_index.md.

Deploy with ./deploy.sh (builds Hugo then syncs public/ to S3).

## Conventions

- TypeScript strict mode, no `any` types
- Tests colocated with source files (*.test.ts)
- Infrastructure defined in terraform/ using reusable modules
- All API endpoints require authentication via Cognito
```
This isn’t revolutionary. It’s the same information you’d put in a good README. The difference is that an agent actually reads it, consistently, and applies it on every task. Most developers skim a README once and never look at it again; the agent treats it as binding context.
## Start Sequential, Graduate to Autonomous
There’s a strong temptation to jump straight to fully autonomous workflows — let the agent take a Jira ticket, write the code, open the PR, done. I tried this early on and the results were inconsistent enough that I backed off quickly.
The pattern that works is a deliberate escalation. I start every new type of task interactively: I watch the agent explore the codebase, I review its plan before it writes code, and I course-correct when it heads in the wrong direction. Once I’ve seen it handle a category of work reliably — say, adding a new API endpoint that follows established patterns — I’ll let it run with less supervision. Only after I understand the failure modes do I consider running it headless in CI.
This maps to three levels in practice:
- **Interactive with plan review:** New feature work, unfamiliar codebases, anything touching authentication or data models. I use plan mode to separate exploration from execution.
- **Supervised parallel:** Routine tasks like writing tests for existing code, fixing lint violations, updating dependencies. I’ll kick off multiple agent sessions and review the results.
- **Headless in CI:** Strictly bounded tasks with deterministic validation — generating API documentation from OpenAPI specs, running migration checks, creating boilerplate from templates.
The headless tier is narrower than you’d expect. I run agents in GitHub Actions for documentation generation and for automated code review comments, but I don’t let them merge anything or modify infrastructure definitions without human sign-off.
```yaml
# .github/workflows/agent-review.yml
name: AI Code Review
on: [pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent review
        run: |
          claude --print \
            --allowedTools "Read,Grep,Glob" \
            "Review this PR for architectural issues, \
            security concerns, and convention violations. \
            Reference CLAUDE.md for project conventions."
      # Agent can comment but never approve or merge
```
Notice the tool restrictions: the agent can read code but can’t edit it or run arbitrary commands. This is the guardrail that makes headless operation viable. An agent with unrestricted permissions in a CI pipeline is an incident waiting to happen — AWS learned this the hard way when an AI coding tool deleted and recreated an environment during what should have been a routine change.
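The same restriction can be applied outside CI through a project-level permission policy. Claude Code reads rules from a `.claude/settings.json` file; the rule syntax below reflects my understanding of its current schema and is worth verifying against the documentation for your version:

```jsonc
// .claude/settings.json — illustrative permission policy (schema assumed).
// Allow read-only exploration plus the test runner; deny destructive
// commands and any edits to infrastructure definitions.
{
  "permissions": {
    "allow": ["Read", "Grep", "Glob", "Bash(npm test)"],
    "deny": ["Bash(rm *)", "Bash(terraform apply*)", "Edit(terraform/**)"]
  }
}
```

Checking the policy into the repository means every agent session starts from the same guardrails, rather than depending on each developer remembering to pass flags.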
## The Correction Tax Is Real
A Fastly survey from mid-2025 found that senior engineers ship nearly 2.5x more AI-generated code than junior engineers, but almost 30% of seniors reported that fixing AI output consumed most of the time they’d saved. This matches my experience exactly.
The issue isn’t that agents write bad code. Most of the time, the code is syntactically correct and functionally reasonable. The problem is architectural drift — small decisions that are locally correct but globally wrong. An agent will happily add a new utility function when a similar one exists two directories away. It’ll introduce a caching layer that conflicts with an existing invalidation strategy. It’ll choose a library that duplicates a dependency you’re already using for the same purpose.
This is where senior engineering judgment becomes more valuable, not less. The bottleneck in software development has shifted from writing code to evaluating code. I spend less time typing and more time reading, reviewing, and making architectural decisions. The agents are fast at producing output; the human work is deciding whether that output belongs in the system.
My mitigation strategy is simple: strong conventions, comprehensive tests, and small scopes.
- **Strong conventions** mean the agent has less room to improvise. If every service follows the same directory structure and the same patterns for error handling, the agent converges on consistent output.
- **Comprehensive tests** catch functional regressions immediately. I run the test suite after every agent-generated change, no exceptions. If the tests don’t cover it, the agent’s change doesn’t ship.
- **Small scopes** limit blast radius. I’d rather have the agent make five focused changes than one sweeping refactor. Each change is easier to review and easier to revert.
## AI Amplifies Your Engineering Discipline
This is the most important and least discussed aspect of working with coding agents: they amplify whatever engineering discipline you already have. Teams with good test coverage, clear architectural boundaries, and mature CI/CD pipelines get enormous value from agents. Teams without those foundations just generate technical debt faster.
I’ve seen this play out across several projects. In a well-structured TypeScript codebase with strict typing, comprehensive tests, and clear module boundaries, an agent can produce production-quality code with minimal corrections. In a legacy PHP monolith with no tests and implicit conventions, the same agent produces code that works in isolation but creates integration headaches.
This means the real preparation for AI-assisted development isn’t learning prompt techniques — it’s doing the engineering fundamentals you should have been doing all along. Writing tests. Documenting architectural decisions. Maintaining clean module boundaries. Keeping your CI/CD pipeline fast and reliable. These practices have always been valuable; they’re now essential because they directly determine how much value you extract from your tools.
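Much of the “strict typing” fundamental reduces, in practice, to a handful of compiler flags. A typical strict `tsconfig.json` might look like this (the flag set is illustrative; `strict: true` alone already enables most of the individual strictness checks):

```json
{
  "compilerOptions": {
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "noImplicitOverride": true,
    "noUnusedLocals": true
  }
}
```

Flags like these are cheap to turn on and pay for themselves twice: they catch human mistakes, and they force agent-generated code to state its assumptions in types the reviewer can check.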
## Where I Draw the Line
There are categories of work I won’t delegate to an agent, regardless of how capable the tools get:
- **Infrastructure changes that touch production data:** Terraform plans that modify databases, IAM policies that affect production access, anything involving encryption keys. The cost of a mistake is too high and the agent lacks the operational context to understand blast radius.
- **Security-critical code paths:** Authentication flows, authorization checks, input validation at system boundaries. These require understanding threat models that agents don’t have.
- **Architectural decisions:** Choosing between an event-driven and a request-response architecture, deciding whether to split a service, evaluating build-vs-buy trade-offs. These decisions require business context, operational history, and judgment about future requirements that no agent possesses today.
What agents are excellent at is the implementation work that follows these decisions. Once I’ve decided on the architecture, defined the interfaces, and written the key tests, an agent can fill in the implementation details faster and more consistently than I can type them. That’s genuinely valuable, and it’s changed how I allocate my time — more thinking, less typing, better outcomes.
## The Honest Assessment
AI coding agents are the most significant productivity tool I’ve adopted since infrastructure-as-code. But the productivity gain isn’t “write code 10x faster.” It’s more like “spend your engineering time on the problems that actually require engineering judgment, and let the agent handle the mechanical parts.” The gain is qualitative as much as it’s quantitative.
The teams that will get the most from these tools are the ones that were already doing good engineering. The teams that struggle will be the ones hoping AI will compensate for missing fundamentals. It won’t. It’ll just help you build the wrong thing faster, and that’s a more expensive failure mode than building it slowly.