Spec-Driven Development in Practice

A Knowledge Architecture for Building With AI Agents

February 2026

The Problem

AI coding agents are powerful but directionless.

"Build me a dashboard"
    → Agent writes 2,000 lines of code
    → None of it matches what you needed
    → You spend longer fixing than you saved

Vibe coding works for prototypes. It doesn't work for products.

The Idea

"Specifications don't serve code — code serves specifications."

Write a spec first. Let the agent implement it. Verify against the spec.

This is Spec-Driven Development. GitHub's Spec Kit popularized the term. The core loop is simple:

Specify → Plan → Tasks → Code

We adopted this. Then we kept going.

The Questions

Every layer in a system answers a different question.

Most SDD workflows answer three: what, how, and done.

After 6 months in production, we found ourselves asking more:

"What should we build?"        → answered
"How should we build it?"      → answered
"Is it done?"                  → answered

"For whom?"                    → ???
"When?"                        → ???
"Why this way and not that?"   → ???
"Who's right when they disagree?" → ???
"How well does it actually work?" → ???

The Knowledge Graph

Nine layers, nine questions

The Full Stack

 ┌────────────────────────────────────────────────┐
 │  STRATEGY        "For whom?"                   │
 │  Market data, EV scoring, use case ranking     │
 ├────────────────────────────────────────────────┤
 │  CONSTITUTION    "What do we believe?"         │
 │  Principles, constraints, anti-patterns        │
 ├────────────────────────────────────────────────┤
 │  ROADMAP         "When?"                       │
 │  Milestones, deadlines, sequencing             │
 ├────────────────────────────────────────────────┤
 │  SPECS           "What?"                       │
 │  Feature definitions, API contracts            │
 ├────────────────────────────────────────────────┤
 │  ADRs            "Why this way?"               │
 │  Decision records, permanent, supersedable     │
 ├────────────────────────────────────────────────┤
 │  TICKETS         "How?"                        │
 │  User stories, priorities, context packets     │
 ├────────────────────────────────────────────────┤
 │  CODE + TESTS    "Does it work?"               │
 │  Implementation, unit tests, E2E tests         │
 ├────────────────────────────────────────────────┤
 │  VALIDATION      "Is it ready?"                │
 │  Use case readiness, traceability audits       │
 ├────────────────────────────────────────────────┤
 │  BENCHMARKS      "How well?"                   │
 │  Real data vs reference tools, accuracy metrics│
 └────────────────────────────────────────────────┘

The Layers

Layer 1: Strategy

"For whom?"

Before writing any spec, score the opportunity:

EV = (Market Score × Platform Fit) / Dev Cost

Factor               Weight  What It Measures
Market Size          25%     TAM for the vertical
Competitive Density  20%     Greenfield scores high
Funding Momentum     20%     VC activity = market timing
Buyer Accessibility  20%     Sales cycle, SMB vs enterprise
Pricing Headroom     15%     ACV potential

Result: A ranked list of use cases. Specs get written top-down.
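The scoring above can be sketched in a few lines. The factor names and weights come from the table; the use cases, factor scores, platform-fit values, and dev costs below are invented for illustration.

```python
# Weights from the EV table above (sum to 1.0).
WEIGHTS = {
    "market_size": 0.25,
    "competitive_density": 0.20,
    "funding_momentum": 0.20,
    "buyer_accessibility": 0.20,
    "pricing_headroom": 0.15,
}

def market_score(factors: dict) -> float:
    """Weighted sum of 0-10 factor scores."""
    return sum(WEIGHTS[name] * score for name, score in factors.items())

def ev(factors: dict, platform_fit: float, dev_cost: float) -> float:
    """EV = (Market Score × Platform Fit) / Dev Cost."""
    return market_score(factors) * platform_fit / dev_cost

# Hypothetical use cases: (factor scores, platform fit, dev cost).
use_cases = {
    "carbon-mrv": ({"market_size": 8, "competitive_density": 6,
                    "funding_momentum": 9, "buyer_accessibility": 5,
                    "pricing_headroom": 7}, 0.9, 3.0),
    "renewable-siting": ({"market_size": 7, "competitive_density": 8,
                          "funding_momentum": 6, "buyer_accessibility": 7,
                          "pricing_headroom": 6}, 0.8, 2.0),
}

# Specs get written top-down from this ranking.
ranked = sorted(use_cases, key=lambda uc: ev(*use_cases[uc]), reverse=True)
```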

Layer 2: Constitution

"What do we believe?"

Principles that constrain all specs and decisions:

  • Declarative over imperative
  • Local-first, cloud-scalable
  • Provenance is first-class
  • AI-native (agents are first-class users)

Spec contradicts principle?  → Fix the spec.
Code contradicts principle?  → Fix the code.
Principle is wrong?          → Update it explicitly, not silently.

Layer 3: Authority Hierarchy

"Who's right when sources disagree?"

After 50 PRs, your spec says one thing, your code does another, and your tests assert a third.

tests/CI  >  current code  >  current docs  >  old docs/lore
  • Tests say X, docs say Y → fix the docs
  • Code does X, spec says Y → fix the spec (usually)
  • New work contradicts an ADR → supersede the ADR explicitly

Tests are the ultimate source of truth. Not specs, not docs.

Layer 4: Spec-Anchored Development

"Specs evolve with implementation"

The naive model: write spec → implement → done.

Reality: write spec → implement → learn → update spec → implement...

Spec (v1) → Implementation → Learning → Spec (v2)
     ↑                                       │
     └───────────────────────────────────────┘

Spec updates during implementation are encouraged, not a sign of failure.
The goal is accurate documentation, not predicting everything upfront.

Layer 5: Decision Records (ADRs)

"Why was it built this way?"

One file per architectural decision. Permanent. Supersedable.

docs/decisions/
  0001-use-sqlite-for-metadata.md     ← still active
  0002-original-auth-approach.md      ← superseded by 0005
  0005-jwt-with-refresh-tokens.md     ← current

Key properties:

  • Permanent — superseded, never edited
  • Cross-referenced — [[adr:0005]] syntax everywhere
  • Constraining — agents must check ADRs before proposing new approaches
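As a sketch of how the cross-reference syntax can be kept honest, a small scanner might flag links that point at superseded decisions. The [[adr:NNNN]] syntax comes from above; the supersession mapping and document text are assumed inputs.

```python
import re

# Matches the [[adr:NNNN]] cross-reference syntax described above.
ADR_REF = re.compile(r"\[\[adr:(\d{4})\]\]")

def stale_refs(text: str, superseded: dict) -> list:
    """Return (old_id, current_id) pairs for references to superseded ADRs."""
    return [(ref, superseded[ref]) for ref in ADR_REF.findall(text)
            if ref in superseded]

# 0002-original-auth-approach was superseded by 0005 (per the listing above).
superseded = {"0002": "0005"}
doc = "Auth follows [[adr:0002]]; metadata lives in SQLite per [[adr:0001]]."
print(stale_refs(doc, superseded))   # [('0002', '0005')]
```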

Layer 6: Triage Protocol

"How does work enter the system?"

Every piece of work hits a decision tree:

Work arrives →
├─ "What is X?"        → Research mode (read docs, don't create)
├─ "Why is X this way?" → Archaeology mode (ADRs, git blame)
├─ Trivial fix          → Just do it
└─ Non-trivial →
    1. INTAKE:     Find the spec
    2. CONSTRAIN:  Check ADRs for limits
    3. SCOPE:      Create/find ticket
    4. EXECUTE:    Code with tight loops
    5. RECONCILE:  Update specs, create ADRs, close ticket

Without this, agents skip straight to step 4 — and create drift.

Layer 7: Traceability

"Did we actually build what the spec says?"

A CLI that audits spec-to-code coverage:

$ trk coverage
┌──────────────────┬──────────┬───────────┐
│ Spec             │ Sections │ Covered   │
├──────────────────┼──────────┼───────────┤
│ artifact-catalog │ 12       │ 12 (100%) │
│ layer-spec       │ 8        │ 6 (75%)   │
│ queue-system     │ 15       │ 9 (60%)   │
└──────────────────┴──────────┴───────────┘
$ trk audit
⚠  3 orphan code files (no ticket link)
⚠  2 spec sections with no implementation
✓  14 tickets properly closed with code + test links
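A minimal sketch of the coverage idea behind the audit, assuming a mapping from code files to the spec sections they implement (all section and file names below are hypothetical):

```python
def coverage(spec_sections, links):
    """links: mapping of code file -> set of spec sections it implements."""
    covered = set().union(*links.values()) if links else set()
    done = [s for s in spec_sections if s in covered]
    return len(done), len(spec_sections)

# Hypothetical spec sections and file-to-section links.
sections = ["ingest", "catalog", "query", "export"]
links = {
    "src/ingest.py": {"ingest"},
    "src/catalog.py": {"catalog", "query"},
}
done, total = coverage(sections, links)
print(f"{done}/{total} sections covered ({100 * done // total}%)")  # 3/4 sections covered (75%)
```

Orphan detection is the inverse walk: files whose section sets are empty have no ticket or spec link.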

Layer 8: Use Case Validation

"Is it ready for real users?"

Each use case declares what it requires:

requires:
  ops: [ndvi, zonal_stats, temporal_composite]
  connectors: [stac, gee]
  datasources: [sentinel-2, admin-boundaries]
  views: [map, timeseries]

The system checks readiness automatically:

$ trk uc show UC-002
UC-002: Carbon MRV
  ops:         3/3 ✓
  connectors:  2/2 ✓
  datasources: 1/2 ✗ (missing: admin-boundaries)
  views:       2/2 ✓
  Readiness:   87%
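The readiness check can be sketched as a diff between required and available capabilities. The requirement lists come from the UC-002 example above; the available-capability sets are assumptions.

```python
def readiness(requires: dict, available: dict):
    """Compare required vs available capabilities per category."""
    report, have, need = {}, 0, 0
    for kind, wanted in requires.items():
        missing = [w for w in wanted if w not in available.get(kind, set())]
        report[kind] = (len(wanted) - len(missing), len(wanted), missing)
        have += len(wanted) - len(missing)
        need += len(wanted)
    return report, round(100 * have / need)

requires = {
    "ops": ["ndvi", "zonal_stats", "temporal_composite"],
    "connectors": ["stac", "gee"],
    "datasources": ["sentinel-2", "admin-boundaries"],
    "views": ["map", "timeseries"],
}
# Assumed registry of what the system currently provides.
available = {
    "ops": {"ndvi", "zonal_stats", "temporal_composite"},
    "connectors": {"stac", "gee"},
    "datasources": {"sentinel-2"},
    "views": {"map", "timeseries"},
}
report, pct = readiness(requires, available)
```

This unweighted ratio gives 89% for the example inputs; the 87% shown above suggests the real tool weights categories differently.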

Layer 9: Benchmarks

The missing question: "How well?"

Tests vs Benchmarks

Tests tell you the code runs. They don't tell you it's correct on real data.

Unit test:  slope(flat_array) == 0.0          ✓  (synthetic)
Benchmark:  slope(USGS 3DEP 10m) ≈ GDAL ±0.5° ?  (real data)

              Tests                      Benchmarks
Data          Synthetic (numpy arrays)   Real (USGS, Sentinel, NLCD)
Reference     Expected values in code    Output from GDAL, QGIS, GEE
Result        Pass / fail                Metrics (RMSE, correlation, %)
Speed         Seconds (every commit)     Minutes (pre-milestone)
Catches       Regressions, math errors   Methodology drift, real-terrain edge cases

Op Benchmarks

"Our output matches the reference tool"

# benchmarks/ops/terrain_slope.yaml
op: terrain_slope
data:
  source: usgs-3dep/10m
  bbox: [-111.8, 40.5, -111.6, 40.7]   # Wasatch Range
reference:
  tool: gdal
  command: "gdaldem slope input.tif ref.tif -alg Horn"
metrics:
  - rmse: { max: 0.5 }           # degrees
  - correlation: { min: 0.999 }
  - max_abs_error: { max: 2.0 }

Does our slope operation produce the same output as GDAL on real terrain?

Workflow Benchmarks

"Our pipeline reproduces a known GIS workflow"

# benchmarks/workflows/renewable_siting.yaml
use_case: UC-004
reference:
  description: "Manual QGIS: slope + aspect + road proximity → weighted overlay"
pipeline:
  steps:
    - op: terrain_slope
    - op: reclassify
      params: { mapping: { "0-5": 5, "5-15": 3, "15-30": 1, "30+": 0 } }
    - op: weighted_overlay
      params: { weights: { slope: 0.4, aspect: 0.3, roads: 0.3 } }
metrics:
  - classification_agreement: { min: 0.85 }
  - top_sites_overlap: { min: 0.90 }

Can we reproduce a manual QGIS analysis programmatically?
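The two workflow metrics can be sketched like this. The metric names come from the benchmark YAML; the class grids and scores below are invented for illustration.

```python
def classification_agreement(ours, ref):
    """Fraction of cells assigned the same suitability class."""
    same = sum(1 for a, b in zip(ours, ref) if a == b)
    return same / len(ref)

def top_sites_overlap(ours, ref, k):
    """Overlap fraction of the top-k highest-scoring cells."""
    def top(scores):
        return set(sorted(range(len(scores)),
                          key=scores.__getitem__, reverse=True)[:k])
    return len(top(ours) & top(ref)) / k

# Invented suitability classes for an 8-cell grid.
ours_classes = [5, 3, 3, 1, 0, 5, 3, 1]
ref_classes  = [5, 3, 1, 1, 0, 5, 3, 1]
print(classification_agreement(ours_classes, ref_classes))  # 0.875
```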

Benchmarks + Use Case Readiness

Use case readiness now includes benchmark status:

# docs/use_cases/renewable-energy.md
requires:
  ops: [geo:slope, geo:weighted_overlay, ...]
  benchmarks:                          # ← new
    - bench-terrain-slope
    - bench-renewable-siting
$ trk uc show UC-004
UC-004: Renewable Energy Siting
  ops:         4/4 ✓
  connectors:  2/2 ✓
  benchmarks:  1/2 ✗ (bench-renewable-siting: agreement 78%, need 85%)
  Readiness:   83%

"Do the ops exist?" is necessary. "Do they produce correct results?" is sufficient.

Putting It Together

The Knowledge Graph

Each layer answers a different question:

Strategy       →  "For whom?"
Constitution   →  "What do we believe?"
Roadmap        →  "When?"
Specs          →  "What?"
ADRs           →  "Why this way?"
Tickets        →  "How?"
Code + Tests   →  "Does it work?"
Validation     →  "Is it ready?"
Benchmarks     →  "How well?"

Miss a layer and you get a predictable failure mode:
no strategy = build the wrong thing, no ADRs = repeat mistakes,
no benchmarks = ship code that passes tests but produces wrong answers.

The Flow

Strategy (for whom?)
    ↓ prioritizes
Constitution (principles)
    ↓ constrains
Roadmap (when?)
    ↓ sequences
Specs (what?)                   ←──── Spec-Anchored:
    ↓ decides                         specs update as we learn
ADRs (why this way?)
    ↓ implements
Tickets (how?)
    ↓ tracks
Code + Tests (does it work?)
    ↓ validates
Use Cases (is it ready?)
    ↓ measures
Benchmarks (how well?)
    ↓ informs
Strategy (next quarter)         ←──── Full loop

Why AI Agents Need This

This isn't just methodology for humans.

AI agents need these layers more than humans do, because agents:

  • Don't have institutional memory → need ADRs
  • Can't judge market timing → need strategy
  • Don't know when sources conflict → need authority hierarchy
  • Will happily close a ticket without checking the spec → need traceability
  • Don't update docs unless told to → need the reconcile step
  • Can't tell if results are correct on real data → need benchmarks

SDD for agents isn't optional. It's the difference between useful and dangerous.

Getting Started

You don't need all nine layers on day one.

Start with three:

1. Authority hierarchy — decide who wins when sources conflict

tests > code > docs

2. Spec-anchored development — update specs during implementation

3. Decision records — one file per choice, permanent, supersedable

Then add layers as your project grows:
strategy when you have competing priorities, benchmarks when correctness matters, traceability when agents are closing tickets.

Thank You

"Technology should be a medium for users' intentions,
not a delivery mechanism for someone else's."