Spec-Driven AI Development

A Practical Framework for Building with AI Agents

Nate Hearns — 2026

Act I

The Arc of Computation

Every tool externalizes thought

The Abacus (~3000 BC)

Mesopotamia. Externalized arithmetic into physical beads.

The first "memory" outside the human skull.

What it externalized: data — numbers moved from your head to a physical object you could see, share, and verify.

Quipus (~2600 BC)

Knotted strings with base-10 encoding — census, taxes, inventories across 3,000 miles of the Andes.

"Early counting machines." Three knot types encoded values from 1 to 10,000+. The arrangement was precise enough to administer an empire without a written language.

What it externalized: records — complex state, encoded in fiber, portable across a continent.

al-Khwarizmi (~800 AD)

The word "algorithm" comes from his name. Working in Baghdad's House of Wisdom, he wrote systematic, step-by-step procedures that anyone could follow to get the right answer.

Before him, math was case-by-case. He formalized the idea that process itself can be written down and handed to someone else to execute.

What it externalized: procedure — the prerequisite for every programmable machine that followed.

The Pattern

Every civilization independently invented ways to extend the mind.

| Era | Invention | Externalized |
|----------|--------------|--------------|
| ~3000 BC | Abacus | Data — numbers out of heads, into beads |
| ~2600 BC | Quipus | Records — state encoded in fiber across a continent |
| ~800 AD | al-Khwarizmi | Process — repeatable procedures, not case-by-case |

Every breakthrough is about externalizing thought — getting ideas out of heads and into durable, shareable, repeatable form.

Jacquard Loom (1804)

Punched cards control a weaving pattern. The first programmable machine.

The pattern is the program. Change the cards, change the output. No redesign, no rebuilding — just a different sequence of instructions.

What it proved: machines can follow abstract instructions, not just brute force.

Babbage & Lovelace (1837)

A general-purpose mechanical computer. Never built — but fully designed.

Ada Lovelace writes the first algorithm and asks the question that still haunts us: "Can it think?"

What it proved: a single machine could be universal — not locked to one task.

ENIAC (1945)

30 tons. 18,000 vacuum tubes. One of the first electronic general-purpose computers.

Computed artillery trajectories in seconds that took humans weeks. The theoretical became physical.

The arc: al-Khwarizmi wrote procedures for humans to follow. The Jacquard loom followed instructions in cards. ENIAC followed instructions in electrons. Same idea. 5,000× faster.

Act II

The Step Function

Mid-to-late 2025 and what changed

What Happened in Mid-to-Late 2025

Frontier models crossed the agentic threshold:

  • Claude Opus 4, GPT-4.1, Gemini 2.5 — models that can reason, plan, and self-correct across multi-step tasks
  • Terminal agents (Claude Code, Codex CLI, OpenCode) — fully autonomous: read your repo, edit files, run tests, fix failures, commit. Loops that work.
  • 1M+ token context — an agent that holds an entire codebase in working memory

This wasn't a point on a smooth curve.
It was a phase transition.

The tools went from "impressive demo" to "I shipped production code with this today."

Before & After

| | Before mid-2025 | After |
|---|---|---|
| AI coding | Autocomplete on steroids. You drive. | Autonomous agent. You steer. |
| Context | Paste a file, get a suggestion | Agent reads your repo, tests, specs, history |
| Iteration | Human copies output, runs manually | Agent runs tests, reads errors, fixes, loops |
| Dead-end cost | Hours to days | Minutes |
| Who writes code | You, with suggestions | Agent, with your judgment |

The skill shifted from writing code to directing agents. From typing to thinking.

The Most Powerful Productivity Tool Ever Created

What a single developer can do today that was impossible 18 months ago:

Solo → Team
One person with AI agents produces the output of a 5-10 person team. Not quality shortcuts — actual tested, documented code.

Idea → Product
Full-stack applications in hours, not quarters. The gap between "what if" and "here it is" collapsed.

Explore → Decide
Try 5 architectural approaches in an afternoon. Build each one. Benchmark. Pick the winner.

Learn → Build
An unfamiliar framework isn't a 3-month learning curve. It's an afternoon of collaboration.

The Coordination Tax

Big organizations pay a tax on every feature:

| | Large org | Small team + agents |
|---|---|---|
| Decision | Committee, alignment meeting, RFC | You decide. Now. |
| Implementation | Sprint planning, ticket grooming, handoffs | Agent starts in 30 seconds |
| Iteration | PR review queue, staging, release train | Test, fix, ship. Same afternoon. |
| Pivoting | Quarterly roadmap negotiation | "Actually, let's try this instead" |

A 500-person company with 6-month release cycles cannot compete on speed with a focused individual who has a clear vision and AI agents.

The bottleneck was never typing. It was coordination.
AI didn't remove the bottleneck — it made the bottleneck irrelevant by making teams of one viable.

But Agents Are Powerful and Directionless

"Build me a dashboard"
    → Agent writes 2,000 lines of code
    → None of it matches what you needed
    → You spend longer fixing than you saved

Prompt and a prayer works for prototypes. It doesn't work for products.

The tools are here. The question is: how do you direct them?

The Starting Point

Write a spec first. Let the agent implement it. Verify against the spec.

GitHub's Spec Kit popularized this as a pipeline:

Constitution → Specify → Plan → Tasks → Implement

Each stage feeds the next. The agent gets more context at every step.

This is a great starting point. We adopted it. Then we kept going.

The gap: Spec Kit's pipeline flows one direction — from spec to code. What happens when the spec is wrong? When implementation reveals edge cases? When an agent reverts a decision it doesn't know was intentional?

We need more than a pipeline. We need layers.

Live Demo

Let's build something.

Try It

Constitution → Specify → Plan → Tasks → Implement

We'll set up a spec-driven project from scratch and walk through the pipeline live.

Act III

Structure for Agents

A practical system for directing AI

Seven Layers

Each layer answers a different question.

| Layer | Question |
|---|---|
| Vision & Principles | What do we believe? |
| Specs | What are we building? |
| Use Cases | For whom? |
| Tasks | How do we deliver? |
| Decision Records | Why this way? |
| Tests & Benchmarks | Does it work? |
| Traceability | Did we build it? |

You don't need all seven on day one. Start with three.

Layer 1: Vision & Principles

"What do we believe?"

The constitution every agent reads before touching code.

## Principles
- Declarative over imperative — define what, not how
- Simplicity over cleverness — a newcomer should still understand it six months from now
- Validate at boundaries — trust internal code, check external input

## Authority Hierarchy
tests/CI > current code > current docs > old docs

## Anti-Patterns
- Shipping without updating the docs that describe what you shipped
- Hardcoding values that should be configurable
- Changing code without understanding why it was built that way

Goes in your CLAUDE.md or project root. Agents read it on every task.

Layer 2: Specs

"What are we building?"

Technical and product specifications with section anchors that tasks link to.

# Authentication API                    ← spec file

## §1 Token Format
JWT with RS256. Access tokens expire in 15 minutes.
Refresh tokens expire in 30 days.

## §2 Login Flow
POST /auth/login → { access_token, refresh_token }
Rate limited to 5 attempts per minute per IP.

## §3 Token Refresh
POST /auth/refresh → { access_token }
Revoke refresh token on password change.

Spec-anchored: specs evolve with implementation. Updates during coding are encouraged, not a sign of failure.

Layer 3: Use Cases

"For whom?"

Who you're building for, what they need, and what's missing.

# use_cases/field-inspector.md
persona: Field Inspector
scenario: Offline inspection with photo capture and sync

requires:
  features: [offline-mode, photo-capture, background-sync]
  apis: [inspection-api, asset-api]
  integrations: [camera, gps]

readiness: 67%    # 2 of 3 features implemented
gap: offline-mode  # blocks deployment

Use cases are cross-cutting — they span multiple features and reveal what's actually missing.
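One way a readiness score like the 67% above could be derived — a sketch only, not folia's actual tooling; the field names mirror the example file:

```python
def readiness(required: list[str], implemented: set[str]) -> tuple[int, list[str]]:
    """Percent of required features implemented, plus the gap list."""
    done = [f for f in required if f in implemented]
    gaps = [f for f in required if f not in implemented]
    pct = round(100 * len(done) / len(required)) if required else 100
    return pct, gaps

# Field Inspector: 2 of 3 required features exist → 67%, gap: offline-mode
pct, gaps = readiness(
    ["offline-mode", "photo-capture", "background-sync"],
    {"photo-capture", "background-sync"},
)
```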

Layer 4: Tasks

"How do we deliver?"

Units of work for agents, linked to spec sections.

# Task: Implement Token Refresh
links: [auth-api.md §3]
priority: P1 (MVP)

## User Stories
- US1: As a user, I stay logged in across sessions
  - T001: POST /auth/refresh endpoint
  - T002: Revoke on password change
  - T003: Refresh token rotation

## Context Packet
Goal: Implement §3 of the auth spec
Constraints: Must be backwards-compatible with existing tokens
Done when: All 3 tasks pass, spec §3 fully covered

These carry context an agent needs to make good decisions.

Layer 5: Decision Records

"Why was it built this way?"

Agents that don't know the reasoning behind a decision will revert it.

decisions/
  0001-use-sqlite-for-metadata.md     ← still active
  0002-original-auth-approach.md      ← superseded by 0005
  0005-jwt-with-refresh-tokens.md     ← current

One file per decision. Permanent. Supersedable. Each captures: context, options considered, decision, consequences.

Decision records are institutional memory for agents that have none.
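A record can be as small as this — a sketch following the common ADR shape, with illustrative headings, using the 0005 decision from the listing above:

```markdown
# 0005: JWT with refresh tokens

status: active (supersedes 0002)

## Context
Sessions must survive server restarts; sticky sessions were ruled out.

## Options considered
1. Server-side sessions in a shared store
2. JWT access tokens with revocable refresh tokens  ← chosen

## Decision
RS256-signed JWTs. Short-lived access tokens; refresh tokens revoked
on password change (see auth spec §3).

## Consequences
Stateless verification, but revocation now requires a refresh-token store.
```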

Layer 6: Tests & Benchmarks

"Does it work? How well?"

Specs define what should exist. Tests prove it exists.

tests > code > docs
  • Tests say X, docs say Y → fix the docs
  • Code does X, spec says Y → fix the spec (usually)
  • New work contradicts a decision record → supersede it explicitly

Tests are the strongest signal — pass/fail is unambiguous. Specs can be ambiguous. Docs can be stale. A failing test is a fact.

An agent with a good test suite and a spec will produce correct code. An agent with neither will produce plausible garbage.
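The rate limit in spec §2 ("5 attempts per minute per IP") is a good example of an ambiguous-sounding sentence that becomes an unambiguous fact once it's a test. The limiter below is a self-contained illustration, not code from the talk:

```python
from collections import defaultdict, deque

WINDOW = 60.0   # seconds — "per minute" from auth spec §2
LIMIT = 5       # attempts per window per IP

class LoginRateLimiter:
    """Sliding-window limiter: at most LIMIT attempts per WINDOW per IP."""

    def __init__(self) -> None:
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        while hits and now - hits[0] >= WINDOW:  # drop attempts outside the window
            hits.popleft()
        if len(hits) >= LIMIT:
            return False
        hits.append(now)
        return True
```

A failing test against this class settles any argument with the docs instantly.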

Layer 7: Traceability

"Did we actually build what the spec says?"

$ specgraph coverage
┌───────────────┬──────────┬──────────┐
│ Spec          │ Sections │ Covered  │
├───────────────┼──────────┼──────────┤
│ auth-api      │ 6        │ 6 (100%) │
│ user-profiles │ 4        │ 3 (75%)  │
│ notifications │ 8        │ 5 (63%)  │
└───────────────┴──────────┴──────────┘

Without traceability, agents close tickets and move on. With it, you know whether the spec was actually implemented.

Connect specs → tasks → code so nothing drifts.
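The core of such a coverage check is simple: a spec section counts as covered when some task links to its anchor (e.g. `auth-api.md §3`, as in the Layer 4 example). A sketch of the idea — not specgraph's actual algorithm:

```python
import re

# Matches spec anchors like "auth-api.md §3" inside task files.
ANCHOR = re.compile(r"(?P<spec>[\w-]+)\.md\s+§(?P<section>\d+)")

def coverage(spec_sections: dict[str, int],
             task_texts: list[str]) -> dict[str, tuple[int, int]]:
    """Map each spec to (sections linked from tasks, total sections)."""
    linked: dict[str, set[int]] = {name: set() for name in spec_sections}
    for text in task_texts:
        for m in ANCHOR.finditer(text):
            if m["spec"] in linked:
                linked[m["spec"]].add(int(m["section"]))
    return {name: (len(linked[name]), total)
            for name, total in spec_sections.items()}

report = coverage(
    {"auth-api": 6, "user-profiles": 4},
    ["links: [auth-api.md §1]", "links: [auth-api.md §2, auth-api.md §3]"],
)
# auth-api has 3 of 6 sections linked; user-profiles has none.
```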

The Slime Mold Method

Physarum polycephalum — a single-celled organism with no brain — solves mazes and recreates rail networks. Not by planning. By exploring everything, then pruning to what works.

The seven layers are the food sources. Agents are the tendrils. Tests are the chemical signal. You are the organism — sensing, directing, deciding what to keep.

Place the food. Let agents explore. Remove the cruft. Harden what survives.

In Practice

Phase 0 — Place the food (human, ~30 min)
  ├── Write spec: what the feature does, API shape, edge cases
  ├── Write failing tests: the unambiguous signal
  └── Decision record: why this approach, not the alternatives

Phase 1 — Explore (agent-heavy, ~2 hours)
  ├── Agent reads spec + tests + decisions
  ├── Generates code → runs tests → reads failures → fixes → loops
  ├── Tries approach A → 7/10 tests pass
  ├── Tries approach B → 10/10 ✓
  └── Approach A deleted. No mourning.

Phase 2 — Harden (human-heavy, ~2 hours)
  ├── Review surviving code with fresh eyes
  ├── Error handling, reconnection logic, edge cases
  ├── Update spec where implementation revealed gaps
  └── Ship. Close ticket. Commit.

Half a day. Feature done. Spec accurate. Tests passing. Foundation solid.
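The Phase 1 loop above has a simple skeleton: run the tests, feed the failures to the agent, repeat, and prune the approach if it never converges. A sketch, with both callables standing in for real tooling (a test runner, an agent CLI) — neither is an actual API:

```python
from typing import Callable

def explore(run_tests: Callable[[], tuple[bool, str]],
            propose_fix: Callable[[str], None],
            max_iters: int = 5) -> bool:
    """Phase 1 skeleton: test → read failures → fix → loop.

    Returns True if the approach converges (all tests pass),
    False if it should be deleted. No mourning.
    """
    for _ in range(max_iters):
        passed, output = run_tests()
        if passed:
            return True          # approach survives → Phase 2 hardening
        propose_fix(output)      # agent reads the failures and edits code
    return False                 # prune this approach, try another
```

The key property: the loop's exit condition is the test suite, not the agent's opinion of its own work.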

"Compound interest is the eighth wonder of the world.

He who understands it, earns it.
He who doesn't, pays it."

— attributed to Albert Einstein

Compounding Applied to Code

The Einstein quote is about money. The principle is universal.

Linear vs compound output over time

Every spec, every decision record, every test — it doesn't just solve today's problem. It makes every future agent session faster and more accurate.

That's not just a workflow. That's a moat.

Act IV

The Human at the Center

"A Bicycle for the Mind"

In 1980, Steve Jobs read a study comparing locomotion efficiency across species.

The condor was most efficient. Humans were unremarkable — middle of the pack.

But a human on a bicycle blew every species away.

"The computer is the most remarkable tool we've ever built...
it's the equivalent of a bicycle for our minds."

— Steve Jobs, 1990

From Bicycle to Rocket

From bicycle for the mind to rocket-powered supercar

If the personal computer was a bicycle for the mind,
AI agents are the rocket-powered supercar.

Same principle: a human is still driving. The vehicle changed.
The destination is still yours to choose.

Implementation Is Cheap. Judgment Is Scarce.

Coding is now orchestration. The hard part isn't writing code — it's knowing what to build and why.

The scarce skills:

  • Taste — recognizing good from good-enough from wrong
  • Intuition — sensing something is off before you can say why
  • Empathy — knowing who this is for and what they need
  • Accountability — someone has to own the outcome

Each agent sees its task. You see the system.
The human is the connective tissue.

Technology as Medium

"Technology should be a medium for users' intentions,
not a delivery mechanism for someone else's."

— folia core philosophy

This is the line that separates tools from traps.

Delivery Mechanism

Platform decides what you see. Algorithm chooses your feed. AI generates "the answer" and you consume it. You are the product.

Medium for Intentions

You decide the question. AI amplifies your ability to explore it. Every result is inspectable. You are the author.

"The best way to predict the future is to invent it."

— Alan Kay, 1971

Kay said this while building the Dynabook — a vision of a personal computer for children, decades before laptops existed.
He didn't predict the iPad. He invented the idea that demanded it.

Build Like It Matters

Because it does

What I'm Asking You to Build

We're at an inflection point. The tools are here. The question is what we do with them.

Build tools, not traps. Technology that amplifies what people can do, not what you can extract from them.

Let AI do what AI is good at. Let humans do what humans are good at. AI generates, explores, and iterates. Humans decide, judge, and take responsibility.

Make it inspectable. Show your work on every result. No black boxes. No magic.

Let it compound. Write specs. Record decisions. Build tests. Every piece of structured thought makes every future session better.

This is the eighth wonder applied to craft.

"Compound interest is the eighth wonder of the world."

Your specs, tests, and decisions earn interest every session.

"The best way to predict the future is to invent it."

You have the most powerful invention tools in human history.

"Technology should be a medium for users' intentions,
not a delivery mechanism for someone else's."

Build tools that make people more capable, not more dependent.

Thank You

Nate Hearns
talks.folia.sh

github.com/nthh/opcom
github.com/nthh/specgraph