Declarative Geospatial Infrastructure

for Reproducible Environmental Analysis

Nate Hearns — 2026

The Architecture

Five layers. Each solves a distinct problem.

┌─────────────────────────────────────────────────────┐
│  1. DATA SOURCES — concepts, crosswalks, identity   │
├─────────────────────────────────────────────────────┤
│  2. TRANSFORMS — operations, domains, caching       │
├─────────────────────────────────────────────────────┤
│  3. COMPUTE — local-first, coarse-to-fine           │
├─────────────────────────────────────────────────────┤
│  4. DEPLOYMENT — cloud-native, shareable, citable   │
├─────────────────────────────────────────────────────┤
│  5. AGENTS — epistemic reasoning, provenance        │
└─────────────────────────────────────────────────────┘

Let's walk through each.

1. Data Sources

Concepts, crosswalks, and dataset identity

The Data Problem

The same DEM is referenced 15 different ways across 15 pipelines.
When the source updates, 14 of them break.

Two datasets call the same variable different names, use different units, classify land cover with different schemas.

How do you build cross-dataset analysis without a translation layer?

Concepts, Not URLs

Instead of hardcoding a dataset, reference a concept:

elevation:
  ref: "#terrain/dem"

The platform resolves it to the best available dataset for your area of interest:

#terrain/dem in Utah     → USGS 3DEP 1/3 arc-second (10m)
#terrain/dem in Ghana    → Copernicus GLO-30 (30m)
#terrain/dem in the Alps → EU-DEM v1.1 (25m)

One spec. Multiple geographies. No hardcoded URIs.

The concept registry maps to external vocabularies (CF conventions, GCMD, INSPIRE) so the same concept is discoverable across systems.
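The resolution logic can be sketched as a registry lookup plus a coverage filter. This is a minimal illustration, not the platform's API: the `CANDIDATES` structure, the bounding-box predicates, and the `resolve` helper are all assumptions.

```python
# Hypothetical concept registry: each candidate dataset declares a
# coverage predicate and a ground resolution; resolution prefers finer data.
CANDIDATES = {
    "#terrain/dem": [
        {"id": "@usgs/3dep/10m", "res_m": 10,
         # crude CONUS bounding box, illustrative only
         "covers": lambda lon, lat: -125 < lon < -66 and 24 < lat < 50},
        {"id": "@copernicus/glo-30", "res_m": 30,
         "covers": lambda lon, lat: True},  # global fallback
    ]
}

def resolve(concept, lon, lat):
    """Return the finest-resolution dataset covering the point."""
    matches = [d for d in CANDIDATES[concept] if d["covers"](lon, lat)]
    return min(matches, key=lambda d: d["res_m"])["id"]
```

A point in Utah resolves to 3DEP; a point in Ghana falls through to the global Copernicus layer, exactly as in the table above.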

Crosswalks

Datasets disagree on names, units, and classification systems.

ERA5 calls it t2m. CF conventions call it air_temperature. CMIP6 calls it tas.
Same variable. Three names. Different units.

Crosswalks are mapping files that translate between vocabularies:

# crosswalks/era5-to-cf.yaml
source: era5
target: cf-conventions
mappings:
  - source_field: t2m
    target: air_temperature
    predicate: exact          # SKOS: exactMatch
    confidence: 0.95

  - source_field: tp
    target: precipitation_amount
    predicate: close          # SKOS: closeMatch
    confidence: 0.8
    requires_unit_conversion: "m → kg m⁻²"

Not binary match/no-match. Graded confidence — some mappings are uncertain, and the system says so. Crosswalks feed operations like harmonize and reclassify — they're configuration, not code.
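Applying a crosswalk amounts to renaming fields, converting units, and skipping mappings below a confidence floor. A sketch under assumptions: the in-memory crosswalk shape mirrors the YAML above, and the m-to-kg m⁻² conversion (multiply by 1000, water density) stands in for real unit tooling.

```python
# In-memory form of crosswalks/era5-to-cf.yaml (illustrative).
CROSSWALK = [
    {"source_field": "t2m", "target": "air_temperature",
     "predicate": "exact", "confidence": 0.95},
    {"source_field": "tp", "target": "precipitation_amount",
     "predicate": "close", "confidence": 0.8,
     "convert": lambda m: m * 1000.0},  # m of water -> kg m^-2
]

def harmonize(record, crosswalk, min_confidence=0.7):
    """Rename fields and convert units; drop low-confidence mappings."""
    out = {}
    for m in crosswalk:
        if m["source_field"] in record and m["confidence"] >= min_confidence:
            value = record[m["source_field"]]
            out[m["target"]] = m.get("convert", lambda v: v)(value)
    return out
```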

Dataset Identity

Data should be namespaced by who produced it, not who hosts it:

@esa/sentinel-2/L2A          ← ESA produced this
@usgs/3dep/10m                ← USGS produced this
@jsmith/flood-risk-malawi     ← a researcher's output

Aggregators (Earth Search, Planetary Computer) are access endpoints,
not identity. One dataset, multiple mirrors.

# The catalog record
id: "@esa/sentinel-2/L2A"
access_endpoints:
  - url: "https://earth-search.aws.element84.com/v1"
    provider: element84
  - url: "https://planetarycomputer.microsoft.com/api/stac/v1"
    provider: microsoft

Origin-based naming means provenance starts at the source.
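Separating identity from access can be sketched as a failover loop over mirrors. The `probe` predicate is injected so the sketch stays offline; in practice it would be an HTTP health check. `endpoint_for` is a hypothetical helper, not platform API.

```python
# One dataset id, several mirrors; identity never changes when a mirror does.
CATALOG = {
    "@esa/sentinel-2/L2A": [
        "https://earth-search.aws.element84.com/v1",
        "https://planetarycomputer.microsoft.com/api/stac/v1",
    ]
}

def endpoint_for(dataset_id, probe):
    """Return the first reachable mirror for a dataset id."""
    for url in CATALOG[dataset_id]:
        if probe(url):
            return url
    raise LookupError(f"no reachable mirror for {dataset_id}")
```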

2. Transforms & Analyses

Operations, domains, caching, and the multiverse

Operations as a Vocabulary

An operation is an abstract interface. Backends are implementations.

# operation definition: slope
name: slope
domain: geo/terrain
inputs:
  elevation: { type: raster }
outputs:
  slope: { type: raster }
params:
  algorithm:
    type: select
    options: [horn, zevenbergen]
    default: horn
  units:
    type: select
    options: [degrees, percent]
    default: degrees

Same op: slope runs on GDAL, Google Earth Engine, or a GPU backend.
The operation is the vocabulary. The backend is the implementation.
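The op/backend split can be sketched as a dispatch registry keyed by (operation, backend). Everything here is illustrative: the registry, the backend name, and the single-cell slope formula (arctangent of the gradient magnitude) are stand-ins, not the real implementations.

```python
import math

REGISTRY = {}  # (op name, backend name) -> callable

def backend(op, name):
    """Decorator registering a backend implementation for an operation."""
    def register(fn):
        REGISTRY[(op, name)] = fn
        return fn
    return register

@backend("slope", "reference")
def slope_single_cell(dz_dx, dz_dy, units="degrees"):
    # Slope from gradient components for one cell (toy stand-in for GDAL/GEE).
    rise = math.hypot(dz_dx, dz_dy)
    return math.degrees(math.atan(rise)) if units == "degrees" else rise * 100

def run(op, backend_name, **kwargs):
    return REGISTRY[(op, backend_name)](**kwargs)
```

Swapping GDAL for a GPU backend means registering a second callable under the same op name; callers never change.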

Domains

Operations are organized by domain — pluggable capability sets:

Domain         Examples
geo/terrain    slope, aspect, hillshade, curvature, viewshed, TRI
geo/raster     calc, clip, mosaic, reproject, resample, reclassify, normalize
geo/imagery    radiometric indices (NDVI, BSI), calibration, pansharpening
geo/analysis   zonal stats, point sample, change detection, weighted overlay
geo/hydrology  catchment network, tiered network, upstream trace
tabular        filter, select, sort, join, union, aggregate
temporal       resample, rolling, align, interpolate, period stats

Users extend the registry. New operations = new vocabulary.
The community grows the language, not just consumes it.

Layers as a DAG

Every layer declares what it needs, not how to get it:

layers:
  canopy:
    ref: "#forest/canopy-cover"

  fire-fuel-load:
    compute:
      op: reclassify
      inputs:
        raster: { layer: canopy }        # ← dependency
      params:
        breaks: [0, 15, 40, 70, 100]
        labels: [1, 2, 3, 4]

  fire-risk:
    compute:
      op: weighted_overlay
      inputs:
        fuel: { layer: fire-fuel-load }  # ← dependency
        spread: { layer: spread-model }  # ← dependency
      params:
        weights: [0.5, 0.5]

The workspace spec is the DAG. Dependencies are explicit. No hidden state.

Caching Across Transformations

If two layers reference the same input with the same operation and the same parameters, the result is computed once.

canopy (source)
  ├──→ fire-fuel-load (op: reclassify, breaks=[0,15,40,70,100])
  │       ↓
  │    fire-risk (op: weighted_overlay, weights=[0.5,0.5])
  │
  └──→ deforestation-alerts (op: change_detection, threshold=-0.3)

canopy is fetched and processed once. Both branches consume the cached result.

When a source updates, the system knows exactly which downstream layers are invalidated — because the DAG is explicit.
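One way to get both behaviors (dedup and precise invalidation) is to derive the cache key from the operation, its parameters, and the keys of its inputs. A minimal sketch; the key shape and truncation are assumptions, not the actual cache design.

```python
import hashlib
import json

def cache_key(op, params, input_keys):
    """Deterministic key: same op + params + inputs -> same key.

    Because input keys feed the hash, a new source version changes
    every downstream key, which is exactly the invalidation set.
    """
    payload = json.dumps(
        {"op": op, "params": params, "inputs": sorted(input_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```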

Multi-Step Pipelines

Chain operations. Mix registered ops with inline SQL or Python:

flood-exposure:
  compute:
    steps:
      - op: hydrology_catchment_network
        inputs: { dem: { layer: elevation } }
        params: { threshold: 500 }
        as: flow_accumulation

      - op: raster_calc
        inputs: { slope: { layer: slope } }
        params:
          expression: "log2(flow_accumulation + 1) * (1 - slope / 90.0)"
        as: raw_exposure

      - op: raster_normalize
        input: raw_exposure

      - engine: sql
        query: |
          SELECT b1 * (population / max(population)) as weighted_exposure
          FROM :prev
          JOIN :population ON spatial_match
        inputs:
          population: { layer: population }

Every step is named, inspectable, and cacheable independently.

Parameters as First-Class Citizens

What if every analytical decision were a parameter?

vulnerability:
  compute:
    op: weighted_overlay
    inputs:
      capacity: { layer: capacity-index }
      sensitivity: { layer: sensitivity-index }
      exposure: { layer: exposure-index }
    params:
      weights:
        type: array
        default: [0.40, 0.20, 0.40]     # Malcomb's original
      normalization:
        type: select
        options: [min_max, z_score, rank, percentile]
        default: min_max
      aggregation:
        type: select
        options: [additive, multiplicative, geometric_mean]
        default: additive

Demo

data.folia.sh/@kedron/malcomb-vulnerability

Malcomb et al. (2014) — reproduced, parameterized, inspectable

  • Adjust indicator weights with sliders
  • Toggle normalization method
  • Switch spatial aggregation unit
  • See the vulnerability map update in real time
  • "View the spec" → full YAML behind every result

What the original study reported as one map is actually a space of 200+ maps.

3. Compute

Local-first. Coarse-to-fine.

Local-First Computation

Not everything needs a cloud cluster.

┌──────────────────────────────────────────────────────┐
│              BROWSER (instant, free)                 │
│  DuckDB-WASM: SQL on Parquet/GeoParquet, <100MB      │
│  Client-side rendering: PMTiles, COG range requests  │
│  Lightweight raster ops: NDVI, reclassify, normalize │
├──────────────────────────────────────────────────────┤
│              LOCAL (seconds, free)                   │
│  DuckDB native: SQL on larger datasets, <10GB        │
│  GDAL/rasterio: terrain ops, reprojection, mosaics   │
│  Full Python: custom scripts, ML inference           │
├──────────────────────────────────────────────────────┤
│              CLOUD (minutes, metered)                │
│  K8s batch: continental-scale, fan-out/reduce        │
│  Multi-GB imagery: temporal composites, ML training  │
│  Long-running: change detection over archives        │
└──────────────────────────────────────────────────────┘

Same spec. The system picks the tier based on data size, operation complexity, and engine type.

The Tabular Stack

Parquet + DuckDB = the modern analytical backbone.

cocoa-yield:
  compute:
    engine: sql
    query: |
      SELECT
        district,
        COUNT(*) as cocoa_pixels,
        ROUND(COUNT(*) * 0.09, 1) as area_ha,   -- 30m pixel = 0.09 ha
        ROUND(AVG(ndvi), 3) as mean_ndvi,
        CASE
          WHEN AVG(ndvi) > 0.6 THEN 'healthy'
          WHEN AVG(ndvi) > 0.4 THEN 'moderate'
          ELSE 'stressed'
        END as health_status
      FROM crop_classification
      JOIN districts ON ST_Contains(districts.geom, point)
      WHERE crop_type = 'cocoa'
      GROUP BY district
    inputs:
      crop_classification: { layer: crop-type }
      districts: { layer: admin/districts }

<50MB? Runs in your browser via DuckDB-WASM. No server round-trip.

The Geospatial Stack

Cloud-native formats enable range-request access — read only what you need:

Format                         Type        Access Pattern
COG (Cloud-Optimized GeoTIFF)  raster      HTTP range requests for tiles/overviews
GeoParquet                     vector      Column pruning, row-group filtering
PMTiles                        tiles       Single-file tile archive, offset-based
Zarr                           n-d arrays  Chunk-addressable, S3-native

A 2GB raster: the browser reads only the tiles visible at your zoom level.
The user sees instant response. The network transfers kilobytes, not gigabytes.

This is what makes local-first geospatial possible.
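The tile-selection arithmetic behind that claim is simple. A sketch assuming fixed 256-pixel tiles and a pixel-space viewport; real COG/PMTiles readers also walk overview levels and internal offset tables.

```python
def visible_tiles(x_min, y_min, x_max, y_max, tile_size=256):
    """(col, row) indices of tiles intersecting a pixel-space viewport.

    Only these tiles are fetched via range requests; the rest of the
    raster never crosses the network.
    """
    cols = range(x_min // tile_size, x_max // tile_size + 1)
    rows = range(y_min // tile_size, y_max // tile_size + 1)
    return [(c, r) for r in rows for c in cols]
```

A 512x512 viewport touches four tiles, roughly a few hundred kilobytes, regardless of how large the underlying raster is.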

Coarse-to-Fine Analysis

Don't process 10 billion pixels to answer a question about 16 regions.

Start coarse. Dig deeper where it matters.

Step 1: H3 resolution 4 (~1,770 km² per cell)
        → 45 cells cover Ghana
        → Answer in <1 second
        → "Western Region has highest fire risk"

Step 2: H3 resolution 7 (~5 km² per cell)
        → 500 cells in Western Region only
        → Answer in ~5 seconds
        → "Sefwi Wiawso district is the hotspot"

Step 3: Full resolution (30m pixels)
        → Only for Sefwi Wiawso district
        → Answer in ~30 seconds
        → Parcel-level risk assessment

Spatial partitioning (H3, quadkey, slippy tiles) makes this mechanical, not manual.
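The three-step drill-down above is one loop: aggregate on a coarse grid, keep the highest-risk cell, recurse at a finer cell size inside it. The grid math below is schematic (integer-division binning instead of H3), purely to show the control flow.

```python
def drill_down(points, levels):
    """points: [(x, y, risk)]; levels: coarse-to-fine cell sizes.

    At each level, bin the surviving points into grid cells, pick the
    cell with the highest mean risk, and zoom into it.
    """
    region = points
    trail = []
    for cell in levels:
        buckets = {}
        for x, y, risk in region:
            buckets.setdefault((x // cell, y // cell), []).append((x, y, risk))
        hot = max(buckets,
                  key=lambda k: sum(p[2] for p in buckets[k]) / len(buckets[k]))
        region = buckets[hot]
        trail.append((cell, hot, len(region)))
    return trail
```

Each level only ever processes the points inside the previous level's hotspot, which is why the full-resolution pass touches one district, not the whole country.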

Fan-Out / Reduce

When you do need full resolution at scale — partition and parallelize:

crop-classification:
  compute:
    op: classify
    params:
      model: crop_classifier_v3
      classes: [cocoa, palm, rubber, food_crop, bare, water]
    each:
      source:
        type: temporal_windows
        scheme: monthly
        range: [2025-06, 2026-02]       # 9 months
      key: window_start
    reduce:
      mode: aggregate
      engine: sql
      query: |
        SELECT pixel_x, pixel_y,
          MODE(predicted_class) as crop_type
        FROM all_monthly_results
        GROUP BY pixel_x, pixel_y

9 parallel tasks (one per month). DuckDB assembles the result.
The spec reads like a description, not a distributed systems tutorial.
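The `each`/`reduce` shape can be sketched in a few lines: map a classifier over the windows in parallel, then reduce with a per-pixel majority vote. `classify_month` is a stand-in for the real model; the reduce mirrors the `MODE(predicted_class)` query above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fan_out_reduce(months, classify_month):
    """Fan out one task per month, reduce by per-pixel majority vote."""
    with ThreadPoolExecutor() as pool:
        monthly = list(pool.map(classify_month, months))  # parallel fan-out
    votes = {}
    for result in monthly:            # each result: {(x, y): class}
        for xy, cls in result.items():
            votes.setdefault(xy, []).append(cls)
    # Majority vote per pixel, the MODE() step of the reduce.
    return {xy: Counter(v).most_common(1)[0][0] for xy, v in votes.items()}
```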

4. Deployment

Cloud-native formats, shareable URLs, citable research

Publishing Research Outputs

Analysis results should be as accessible as the source data.

data.folia.sh/@kedron/malcomb-vulnerability.parquet   # GeoParquet
data.folia.sh/@kedron/malcomb-vulnerability.pmtiles   # vector tiles
data.folia.sh/@kedron/malcomb-vulnerability           # landing page

Same artifact. Extension determines format. No API keys, no auth, no SDKs.

Version-pinned for citability:

data.folia.sh/@kedron/malcomb-vulnerability@v3.parquet   # immutable

@v3 is content-addressed — cached forever. The URL is the citation.
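Content addressing is what makes that promise safe to keep. A sketch: the address is a hash of the bytes, so identical content always yields the identical address, and any change yields a new one. The `sha256:` prefix is an illustrative convention, not the platform's actual scheme.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an immutable address from content; same bytes, same address."""
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

Because the address can never point at different bytes, caches and citations downstream never need revalidation.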

Cloud-Native All the Way Down

Every published artifact includes a manifest with multiple representations:

{
  "artifact": "@kedron/malcomb-vulnerability",
  "version": 3,
  "representations": [
    { "format": "geoparquet", "role": "source", "size": "45MB" },
    { "format": "pmtiles", "role": "tiles", "size": "32MB" },
    { "format": "h3_geoparquet", "role": "index", "size": "8MB" }
  ],
  "provenance": {
    "operation": "weighted_overlay",
    "params": { "weights": [0.40, 0.20, 0.40], "normalization": "min_max" },
    "inputs": ["capacity-index@v2", "sensitivity-index@v1", "exposure-index@v3"]
  }
}

One publish. Three formats. Full provenance. No manual conversion.

What This Enables

A researcher in Accra publishes their Ghana cocoa analysis:

data.folia.sh/@cersgis/ghana-cocoa-eudr@v1.parquet

A compliance officer in Brussels:

  1. Opens the landing page — sees the methodology, parameters, provenance
  2. Downloads the GeoParquet — loads directly into DuckDB or QGIS
  3. Views the tiles — browses the map in any PMTiles-compatible viewer
  4. Clones the spec — runs the exact same analysis on updated data

No email. No data request forms. No "supplementary materials upon request."

The URL is the output. The spec is the methodology. The provenance is the proof.

5. Agentic Development

Epistemic reasoning and the machine's judgment

Why Does the Machine Do What It Does?

AI can compose analytical pipelines. The question is: should we trust the composition?

User: "Show me cocoa districts with high fire risk
       and post-2020 forest loss"

Agent:
  → searches catalog for forest cover, fire risk inputs
  → selects Hansen GFC lossyear (30m, 2001-2024)
  → selects Copernicus GLO-30 for elevation
  → composes weighted overlay → zonal stats → SQL join
  → returns result

But why those datasets? Why that overlay weighting? Why Hansen and not GLAD?

An agent that can't explain its choices is a black box.
A black box that produces vulnerability maps is dangerous.

Epistemic Justification

Every agent action should include a justification — not just a result:

{
  "result": { "districts": [...] },
  "risk_level": "UNCERTAIN",
  "justification": "Hansen GFC chosen over GLAD due to annual
    temporal resolution (lossyear) required for EUDR Dec 2020
    cutoff. Copernicus GLO-30 selected for Ghana (no 3DEP
    coverage). Weights [0.35, 0.40, 0.25] follow Malcomb et al.
    but sensitivity to weighting is high — consider multiverse.",
  "inspect": {
    "spec": "full YAML specification",
    "lineage": "provenance chain from pixel to claim"
  }
}

The machine says what it did, why, and what it's uncertain about.

Risk Classification

Not all operations need the same level of scrutiny:

Risk Level  Trigger                                      Response
INSTANT     Cached artifact, simple lookup               Return result
SAFE        Small AOI, known operation                   Cost estimate
UNCERTAIN   Multi-dataset fusion, method choice matters  Epistemic justification required
EXPENSIVE   Continental-scale, hours of compute          Cost + impact disclosure

A slope calculation is SAFE — there's one right answer.
A vulnerability index is UNCERTAIN — the answer depends on choices.

The system should tell you which category you're in before you run.
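The triage above could be a handful of rules evaluated before anything runs. All thresholds here are invented for illustration; a real system would tune them per operation and backend.

```python
def classify_risk(cached, aoi_km2, n_datasets, method_sensitive):
    """Pre-flight triage: how much scrutiny does this request need?"""
    if cached:
        return "INSTANT"                 # artifact already exists
    if aoi_km2 > 1_000_000:
        return "EXPENSIVE"               # continental-scale compute
    if n_datasets > 1 or method_sensitive:
        return "UNCERTAIN"               # justification required
    return "SAFE"                        # one right answer
```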

Agents + Multiverse Analysis

The real power: agents can systematically explore the decision space.

User: "Run a multiverse analysis on the Malcomb vulnerability
       index. Vary weights, normalization, and aggregation."

Agent:
  → identifies 3 parameterizable decisions
  → generates 36 specification combinations
  → fans out: 36 parallel compute tasks
  → reduces: specification curve visualization
  → reports: "12 of 16 districts are consistently vulnerable
     across all specifications. 4 districts are sensitive
     to normalization choice — recommend reporting these
     with explicit uncertainty bounds."

The agent doesn't hide the forking paths. It maps the entire garden.
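Generating the 36 specifications is a cross-product over the parameter options. The normalization and aggregation options come from the spec above; the three candidate weight sets are illustrative stand-ins for whatever the agent chooses to vary.

```python
from itertools import product

WEIGHTS = [[0.40, 0.20, 0.40], [0.33, 0.33, 0.34], [0.35, 0.40, 0.25]]
NORMALIZATIONS = ["min_max", "z_score", "rank", "percentile"]
AGGREGATIONS = ["additive", "multiplicative", "geometric_mean"]

def multiverse_specs():
    """Every combination of analytical choices: 3 x 4 x 3 = 36 specs."""
    return [{"weights": w, "normalization": n, "aggregation": a}
            for w, n, a in product(WEIGHTS, NORMALIZATIONS, AGGREGATIONS)]
```

Each spec then fans out as one compute task, and the reduce step assembles the specification curve.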

Putting It Together

Data Sources        Concepts (#terrain/dem) resolve by geography
     ↓              Crosswalks harmonize across naming systems
Transforms          Registered operations with explicit parameters
     ↓              Caching across shared intermediaries
Compute             Local-first: browser → local → cloud
     ↓              Coarse-to-fine: H3 drill-down, fan-out/reduce
Deployment          Cloud-native formats, URL = citation
     ↓              One publish → GeoParquet + PMTiles + H3 index
Agents              Epistemic justification on every decision
                    Multiverse exploration across the parameter space

The Declarative Spec

The spec is the connective tissue between all five layers:

# The spec IS the methodology
name: malcomb-vulnerability
description: "Reproduction of Malcomb et al. 2014, parameterized"

layers:
  # 1. DATA SOURCES — concepts, crosswalks
  elevation:
    ref: "#terrain/dem"

  # 2. TRANSFORMS — operations, parameters
  vulnerability:
    compute:
      op: weighted_overlay
      params:
        weights: [0.40, 0.20, 0.40]

  # 3. COMPUTE — runs locally or cloud
  # (determined by data size, not by the spec)

  # 4. DEPLOYMENT — publish with full provenance
  # data.folia.sh/@kedron/malcomb-vulnerability@v3

  # 5. AGENTS — can read, modify, and explain the spec
  # "View the spec" on every result

Principles

  1. Technology should be a medium for users' intentions, not a delivery mechanism for someone else's

  2. Every abstraction can be inspected — "View the spec" on every result

  3. Augment human reasoning, not replace it — AI proposes, humans refine

  4. Extensible by its users — the catalog is grown by users, not just consumed

  5. Multiple viewpoints are first-class citizens — the same terrain looks different to a forecaster, a farmer, a biologist

Thank You

Demo: data.folia.sh/@kedron/malcomb-vulnerability

Adjust weights. Change normalization. See the multiverse.

References:

  • Steegen et al. (2016) "Increasing Transparency Through a Multiverse Analysis"
  • Simonsohn et al. (2020) "Specification Curve Analysis" — Nature Human Behaviour
  • Gelman & Loken (2013) "The Garden of Forking Paths"
  • Kedron et al. (2024) "A Framework for Moving Beyond Computational Reproducibility"
  • Malcomb et al. (2014) "Vulnerability Modeling for sub-Saharan Africa"
  • HEGSRR Malcomb reproduction: github.com/HEGSRR/RPr-Malcomb-2014

Nate Hearns — nate@folia.sh