Open Source · Python 3.11+ · MIT License

A maturity of AI agents.

Not just a collective noun — a design principle.

Armature is a YAML-first multi-agent workflow harness. Define researcher, worker, and judge agents. Execute them as a DAG. Then let the system study its own traces and rewrite its own specification — every run, every time.

Just as a murder of crows is more dangerous than one,
a maturity of agents is smarter than before.

Why "a maturity"?

Birds flock. Geese gaggle.
Crows murder.

AI agents mature.

Every collective noun for animals captures something true about how they move and behave together. A murder of crows isn't just a group — it names the coordinated, intelligent behavior that makes them formidable as a collective. We chose maturity deliberately.

Armature's agents don't just coordinate to complete a task. After every run, the system collects execution traces, runs the DiagnosticAnalyzer against them, and uses the SpecRefiner to rewrite the YAML sections that underperformed. The next run is better. Your workflow doesn't just run. It matures.

“A murder of crows is more dangerous than one.
A maturity of agents is smarter than before.”

How It Works

Three steps. One cycle.
Spec. Execute. Improve.

Armature isn't a run-once tool. It's a loop. Each step feeds the next, and the next run is smarter than the last.

SPECIFY

Write a YAML spec. That's it.

Define your agents by role, model tier, and dependencies. Armature validates the DAG before the first run — catching cycles, missing deps, and misconfigured stages. No framework to learn. No graph API to wire.

▸role: researcher | worker | judge | orchestrator

▸tier: small | medium | large (maps to your model config)

▸depends_on: [list of upstream stage IDs]

▸output_mode: text | guided_json with schema validation
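The pre-run validation described above (cycles, missing dependencies) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not Armature's actual validator: the function name `validate_dag` and the plain-dict stage shape are assumptions, while `id` and `depends_on` mirror the spec fields listed above.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(stages: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the DAG looks valid.

    Hypothetical helper: each stage is a dict with an "id" and an
    optional "depends_on" list, mirroring the YAML fields.
    """
    problems: list[str] = []
    known: set[str] = set()
    for stage in stages:
        if stage["id"] in known:
            problems.append(f"duplicate stage id: {stage['id']!r}")
        known.add(stage["id"])

    graph: dict[str, set[str]] = {}
    for stage in stages:
        deps = stage.get("depends_on", [])
        for dep in deps:
            if dep not in known:
                problems.append(
                    f"stage {stage['id']!r} depends on unknown stage {dep!r}"
                )
        # Only keep edges to known stages so cycle detection stays meaningful.
        graph[stage["id"]] = {dep for dep in deps if dep in known}

    try:
        # static_order() raises CycleError if the graph has a cycle.
        tuple(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        problems.append("cycle detected: " + " -> ".join(exc.args[1]))
    return problems
```

Returning a list of problems instead of raising on the first one lets a validator report every issue in a spec at once.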

EXECUTE

DAG execution. Context flows automatically.

Independent stages run in parallel. Dependent stages wait for their inputs. Every stage receives the full accumulated context from all upstream stages — no wiring, no passing variables by hand. One shared dict, built up as the workflow runs.

▸Parallel fan-out for independent branches

▸Context dict accumulates all upstream outputs

▸guided_json with automatic tier escalation on failure

▸Checkpoint & resume — survive crashes mid-workflow
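As a rough illustration of parallel fan-out with an accumulating context dict, here is a minimal asyncio sketch. It is an assumption-laden simplification, not Armature's executor: `run_dag` and the `run_stage` callback are hypothetical names, and stages run in waves (every stage that is ready runs concurrently, then the next wave starts), which is less aggressive than a fully event-driven scheduler.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_dag(stages, run_stage):
    """Run stages in dependency order and return the accumulated context.

    run_stage is a hypothetical async callback: (stage, context) -> output.
    """
    by_id = {stage["id"]: stage for stage in stages}
    sorter = TopologicalSorter(
        {stage["id"]: set(stage.get("depends_on", [])) for stage in stages}
    )
    sorter.prepare()

    context = {}  # shared dict: stage id -> output
    while sorter.is_active():
        ready = sorter.get_ready()
        # Independent stages in this wave run concurrently; each one sees
        # a snapshot of everything produced by earlier waves.
        outputs = await asyncio.gather(
            *(run_stage(by_id[sid], dict(context)) for sid in ready)
        )
        for sid, output in zip(ready, outputs):
            context[sid] = output
            sorter.done(sid)
    return context
```

Passing each stage a copy of the context keeps concurrent stages from mutating shared state; only the scheduler writes to the shared dict.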

IMPROVE

The workflow rewrites itself.

Every run generates a trace. The SelfImproveRunner computes IHR across all stages, identifies which ones drag the score down, and rewrites targeted YAML sections. Add --auto-improve to any run and Armature applies safe fixes automatically — or stages structural rewrites for human review. The next run is better. Verifiably.

▸IHR = 0.35×valid_rate + 0.25×success_rate + 0.20×avg_quorum + 0.10×latency_score + 0.10×happy_path_rate

▸DiagnosticAnalyzer identifies the lowest-scoring stages

▸SpecRefiner rewrites only the underperforming YAML sections

▸Prediction-verification: fixes are confirmed or flagged each cycle

Agent Roles

Four roles. Every agent has one.

Roles aren't a label — they determine execution order, context access, and contribution to the self-improvement health score. A well-designed maturity uses all four.

◎

Researcher

Gathers.

The information foundation. Researchers query tools, read context, search external sources, and build the knowledge base that downstream agents draw from. They run first — and in parallel when independent.

Common uses

Market signal aggregation

Competitor analysis

Evidence synthesis across sources

Tool call fan-out

◈

Worker

Transforms.

The production engine. Workers synthesize research into drafts, summaries, reports, code, or structured data. They consume upstream researcher output and produce the artifacts that judges and downstream workers will evaluate.

Common uses

Draft generation

Data transformation

Code synthesis

Report writing

◉

Judge

Evaluates.

The quality gate. Judges score output quality, validate against criteria, flag hallucinations, and decide whether a result meets the bar. Only judges contribute to the quorum score in the IHR — they are the accountability layer.

Common uses

Output quality scoring (0–10)

Hallucination detection

Criteria validation

Structured pass/fail decisions

⬢

Orchestrator

Coordinates.

The control plane. Orchestrators manage multi-stage execution pipelines, route work to specialized subteams, handle branching and conditional logic, and ensure all dependencies are satisfied before downstream stages run.

Common uses

Multi-stage routing

Parallel fan-out coordination

Conditional branching

Error recovery & retry logic

The Differentiator

Static orchestration is table stakes.
Armature learns.

AWS AgentCore, LangGraph, and CrewAI let you build agent workflows. Armature does that too — and then automatically improves them across runs using the Implicit Harness Rating loop.

Implicit Harness Rating (IHR)

IHR = 0.35 × valid_rate + 0.25 × success_rate + 0.20 × avg_quorum + 0.10 × latency_score + 0.10 × happy_path_rate

Scored from 0 to 1.0 per run. SpecRefiner targets the stages that pull the overall IHR down.
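The formula is a plain weighted sum, so it is easy to reproduce by hand. The sketch below uses the weights stated above; the function name `compute_ihr` and the assumption that every component is already normalized to the 0 to 1 range are illustrative, not taken from Armature's source.

```python
# Weights as stated in the IHR formula; they sum to 1.0.
IHR_WEIGHTS = {
    "valid_rate": 0.35,
    "success_rate": 0.25,
    "avg_quorum": 0.20,
    "latency_score": 0.10,
    "happy_path_rate": 0.10,
}


def compute_ihr(metrics: dict[str, float]) -> float:
    """Weighted sum of the five components (assumed to lie in [0, 1])."""
    for name in IHR_WEIGHTS:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {metrics[name]}")
    return sum(weight * metrics[name] for name, weight in IHR_WEIGHTS.items())


# Made-up example values, chosen only to show the arithmetic:
# 0.35*1.0 + 0.25*1.0 + 0.20*0.87 + 0.10*0.8 + 0.10*0.9 = 0.944
example = {
    "valid_rate": 1.0,
    "success_rate": 1.0,
    "avg_quorum": 0.87,
    "latency_score": 0.8,
    "happy_path_rate": 0.9,
}
print(round(compute_ihr(example), 3))  # 0.944
```

Because valid_rate and success_rate carry 60% of the weight between them, a stage that emits invalid output hurts the score far more than a slow one.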

Run & Trace

Every workflow run generates a structured trace — inputs, outputs, scores, latencies, and errors per stage.

Diagnose

DiagnosticAnalyzer computes IHR and identifies stages with the lowest per-metric contribution.

Rewrite

SpecRefiner (an LLM) receives the underperforming stage spec and rewrites the system prompt, output schema, or parameters.

Verify

The next run's IHR is compared to predictions. SpecRefiner tracks which fixes held and which missed — so it improves its own rewrites too.

Prediction-verification closes the loop: SpecRefiner declares what it expects each rewrite to fix, and the subsequent run shows which of those predictions held and which missed.
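One way to picture prediction-verification: each rewrite carries a falsifiable claim about a metric, and the next run's numbers either meet it or not. The sketch below assumes a prediction is a mapping from metric name to the minimum gain expected; that shape, and the helper name `verify_predictions`, are hypothetical and chosen for illustration.

```python
def verify_predictions(
    predicted_fixes: dict[str, float],
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 1e-9,
) -> tuple[list[str], list[str]]:
    """Split predictions into (verified, missed).

    predicted_fixes maps a metric name to the minimum improvement the
    rewrite claimed it would deliver (an assumed format).
    """
    verified: list[str] = []
    missed: list[str] = []
    for metric, min_gain in predicted_fixes.items():
        gain = after[metric] - before[metric]
        # The tolerance absorbs floating-point noise in the subtraction.
        if gain + tolerance >= min_gain:
            verified.append(metric)
        else:
            missed.append(metric)
    return verified, missed
```

A prediction that misses is still useful: it tells the refiner which kind of rewrite did not deliver the gain it claimed.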

Auto self-improvement — zero manual steps

armature run my-workflow.yaml --auto-improve

Add --auto-improve to any run. When IHR drops below 0.75, Armature automatically calls SpecRefiner after execution — rewriting prompts, relaxing schemas, rebalancing model tiers, or tuning retry limits. Safe changes apply immediately; structural rewrites stage to {spec}.pending.yaml for human review.
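The gating logic described here can be sketched as follows. Only the 0.75 threshold and the split between safe and structural changes come from the text above; the `Change` dataclass, the function name, and the reading of `{spec}.pending.yaml` as "spec file stem plus `.pending.yaml`" are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

IHR_THRESHOLD = 0.75  # stated trigger for --auto-improve


@dataclass
class Change:
    description: str
    structural: bool  # True means the change needs human review


def plan_auto_improve(ihr: float, changes: list[Change], spec_path: str):
    """Return (apply_now, staged, pending_path) for a finished run."""
    if ihr >= IHR_THRESHOLD:
        return [], [], None  # healthy run: nothing to do

    apply_now = [c for c in changes if not c.structural]
    staged = [c for c in changes if c.structural]
    spec = Path(spec_path)
    # Assumed naming: my-workflow.yaml -> my-workflow.pending.yaml
    pending = spec.with_name(spec.stem + ".pending.yaml") if staged else None
    return apply_now, staged, pending
```

Keeping structural rewrites in a separate pending file means the live spec only ever changes in ways that are cheap to undo.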

What's New in v0.2.0

Long-running workflows.
Production hardened.

v0.2.0 adds mission anchoring, cross-run continuation, triggers, streaming responses, and governance — making Armature workflows production-grade for services, scheduled jobs, and interactive systems.

🎯

mission: field

Anchors all LLM stages to a stated goal. Prevents drift in long workflows — the system automatically validates that each stage output moves toward the mission.

↻

continuation: block

Cross-run context carry-forward. Start where you left off. Enables incremental research, weekly briefings, or ongoing analysis without resetting state.

⏰

armature watch + triggers:

Cron and webhook daemon. Schedule workflows (for example, daily at 6 a.m.), react to events, or expose HTTP endpoints that kick off runs. Turn any workflow into a service.

↙

response_stage: true

SSE token streaming. Converts any workflow into a streaming API. Perfect for chatbots, real-time UIs, and interactive agents.

▶️

armature replay <run_id>

Replay any run with cached LLM responses. Debug via transcript, understand exactly why a result was produced, or rerun with different safety rules.

⚠️

Static risk scoring

armature validate now scores each spec as LOW, MEDIUM, HIGH, or CRITICAL before any run. Catch dangerous specs early with governance layers, tool bans, and safety drift detection.

Research Foundation

Ten sources. One framework.
All implemented.

Armature isn't invented from first principles. It's a synthesis of the best current academic thinking on agent harness design, published between March 2025 and May 2026, plus Microsoft's Agent Governance Toolkit, ActiveGraph's event-sourced execution model, and Veldt Labs' KYA trust layer. Every source contributed concrete, implemented capabilities.

Mature has two meanings here. The agents grow smarter every run — and the harness itself matures alongside the field, tracking the latest research as it ships.

01 · Mar 2026

arXiv:2603.25723↗

Natural-Language Agent Harnesses

Tsinghua University

Workflows defined in structured natural language outperform equivalent code-based harnesses — and can be reasoned about and rewritten by an optimizer.

▸YAML spec format & DAG executor

▸Four role types (researcher/worker/judge/orchestrator)

▸IHR quality metric & parallel fan-out

02 · Mar 2026

arXiv:2603.28052↗

Meta-Harness: Automated Optimization

Stanford University

Giving a frontier model access to full execution traces — not just pass/fail scores — enables causal reasoning about why runs failed and how to fix them.

▸`armature optimize` command

▸A/B spec testing by IHR

▸Multi-iteration optimizer with proposal history

03 · Feb 2026

arXiv:2603.03329↗

AutoHarness: LLM-Synthesized Harnesses

LLMs can generate, run, evaluate, and refine their own harness specs — producing systems that outperform larger models running without a harness.

▸`armature new` spec wizard

▸NL → YAML synthesis loop

▸Prompt bootstrapping from trace examples

04 · Mar 2025

arXiv:2503.18666↗

AgentSpec: Runtime Safety Enforcement

Safety constraints should be declarative rules co-located with the workflow spec — not hardcoded logic — so they can be audited, reasoned about, and generated by LLMs.

▸Declarative `safety_rules` YAML DSL

▸Pre/post-stage and pre/post-tool hooks

▸`ToolBlocked` non-retryable exception
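To make the idea of a non-retryable `ToolBlocked` concrete, here is a minimal sketch of a pre-tool hook combined with a retry loop. The rule shape (`{"action": "block", "tool": ...}`) and the helper names are assumptions; only the exception name and its non-retryable behavior come from the list above.

```python
class ToolBlocked(Exception):
    """Raised when a safety rule denies a tool call; never retried."""


def check_pre_tool(tool: str, rules: list[dict]) -> None:
    """Pre-tool hook: raise ToolBlocked if any block rule names this tool."""
    for rule in rules:
        if rule["action"] == "block" and rule["tool"] == tool:
            raise ToolBlocked(f"tool {tool!r} is blocked by a safety rule")


def call_with_retry(tool: str, fn, rules: list[dict], max_attempts: int = 3):
    """Call fn(), retrying transient errors but never a safety block."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    check_pre_tool(tool, rules)  # a block here skips the tool entirely
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn()
        except ToolBlocked:
            raise  # a block raised mid-call is also final
        except Exception as exc:
            last_error = exc
    raise last_error
```

Treating the block as a distinct exception type is what lets the retry loop tell "try again" apart from "never do this".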

05 · May 2026

arXiv:2605.09998↗

Continual Harness: Reset-Free Self-Improvement

Agentic systems can improve continuously — without human intervention or new training runs — using a two-loop design: in-run adaptation and cross-run spec refinement.

▸`post_run` in-run refiner stage

▸`armature improve` outer self-improvement loop

▸Trace export for SFT/DPO fine-tuning

06 · Apr 2026

arXiv:2604.25850↗

AHE: Observability-Driven Automatic Evolution

Every improvement proposal must declare what it predicts it will fix — and the next cycle must verify those predictions. "Did the score go up?" is not enough.

▸Prediction-verification loop per improvement cycle

▸`predicted_fixes` / `verified_fixes` tracking

▸Falsifiable contracts on every spec revision

07 · May 2026

arXiv:2605.26112↗

From Model Scaling to System Scaling

Three system-level failure modes that model size alone cannot fix: stale memory reaching LLMs without warning, context values flowing without provenance, and tool side effects going unverified.

▸Memory staleness detection + `_stale_memory_keys` injection

▸Context provenance tracking per trace key

▸Post-condition verification for tool side effects

▸Drift score + component governance classification
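Staleness detection can be pictured as a timestamp check that annotates the context before it reaches a model. The sketch below is hypothetical: only the `_stale_memory_keys` key name comes from the list above, while the function name, the per-key write timestamps, and the single `max_age_s` threshold are assumptions.

```python
def inject_stale_keys(
    context: dict,
    written_at: dict[str, float],
    now: float,
    max_age_s: float,
) -> dict:
    """Return a copy of context with a `_stale_memory_keys` list added.

    written_at maps a context key to the time (in seconds) it was last
    written; keys older than max_age_s are reported as stale.
    """
    stale = sorted(
        key
        for key, timestamp in written_at.items()
        if key in context and now - timestamp > max_age_s
    )
    # The original context is left untouched; the stage gets the copy.
    return {**context, "_stale_memory_keys": stale}
```

Surfacing the stale keys in the context itself lets a stage prompt warn the model instead of silently passing old values along.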

AGT · 2025

Agent Governance Toolkit

Microsoft

Production agents require auditable governance primitives baked into the execution layer — not bolted on as policy checks. Reversibility, trace integrity, and fail-closed safety modes belong in the harness spec itself.

▸Reversibility classification on every tool (FULL / PARTIAL / NONE)

▸SHA-256 trace input hashing + policy version fingerprint

▸`require_approval` gate on the tool-call path

▸`safety_mode: strict` — fail-closed, deny on no-match

AG · May 2026

arXiv:2605.21997↗

ActiveGraph: Event-Sourced Agents

Yohei Nakajima

Append-only event logs make agent runs reproducible and auditable. Content-addressed LLM caching turns expensive re-runs into instant cache hits — enabling replay, debugging, and future fork-and-diff without paying LLM costs.

▸Content-addressed LLM response cache (`--no-cache` to opt out)

▸`armature replay <run_id>` — stage-by-stage audit from TraceStore

▸Trace-triggered behaviors (`BehaviorRule`) with IHR feedback built-in

▸`--auto-improve`: after each run, auto-applies spec improvements when IHR drops below 0.75
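Content-addressed caching, mentioned in the first item above, boils down to hashing a canonical form of the request and using the digest as the lookup key. The following is a minimal in-memory sketch; `cache_key`, `ResponseCache`, and the request fields are illustrative names, not Armature's API.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], params: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no whitespace differences either
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory stand-in for a persistent response store."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_call(self, model, messages, params, call):
        key = cache_key(model, messages, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = call(model, messages, params)  # the only paid path
        self._store[key] = response
        return response
```

Because the key depends only on the request content, replaying a run with identical inputs hits the cache on every stage and costs nothing.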

KYA · May 2026

arXiv:2605.25376↗

KYA: Trust Layer for Autonomous Systems

Veldt Labs

Governance must operate before execution, not only at runtime. A risk score computed from the agent's definition — its tools, governance mode, and safety rules — tells you how dangerous a workflow is before it runs. And safety rules must only tighten: an allow rule that contradicts a block rule is a misconfiguration, not a feature.

▸Static spec risk score [0–100] surfaced by `armature validate` (LOW/MEDIUM/HIGH/CRITICAL)

▸Rogue signal counter — incremented on every tool block, shown in run summary

▸Only-tighten rule validation — `CONFLICTING_SAFETY_RULES` when allow loosens a block
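The only-tighten check can be expressed as a simple set lookup: any allow rule that names a tool already covered by a block rule is reported. This sketch assumes rules are dicts with `action` and `tool` keys, which is a hypothetical shape; the `CONFLICTING_SAFETY_RULES` code is the one named above.

```python
def find_conflicts(rules: list[dict]) -> list[dict]:
    """Report allow rules that would loosen an existing block rule."""
    blocked = {rule["tool"] for rule in rules if rule["action"] == "block"}
    return [
        {"code": "CONFLICTING_SAFETY_RULES", "tool": rule["tool"]}
        for rule in rules
        if rule["action"] == "allow" and rule["tool"] in blocked
    ]
```

The check is order-independent on purpose: it flags the conflict whether the allow rule appears before or after the block it contradicts.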

The core finding shared across these sources: the harness is more important than the model. Armature ships the harness, and it is production-grade, self-improving, and open source.

Quick Start

From zero to running
in minutes.

Write a YAML spec, point Armature at it, and watch your maturity of agents get to work.

market-briefing.yaml

name: market-briefing
model_tiers:
  small: {provider: anthropic, model: claude-haiku-4-5-20251001}
  large: {provider: anthropic, model: claude-sonnet-4-6}

stages:
  - id: researcher
    role: researcher
    tier: small
    system: |
      Gather and summarize key signals on the given topic.
      Focus on recent developments, key players, and trends.

  - id: analyst
    role: worker
    tier: small
    depends_on: [researcher]
    system: |
      From the research, identify the top 3 opportunities.
      Quantify each with available evidence.

  - id: editor
    role: judge
    tier: large
    depends_on: [analyst]
    system: |
      Review the analysis. Score quality 0–10.
      Flag any gaps or unsupported claims.

terminal

$ armature run market-briefing.yaml \
    --topic "AI in healthcare diagnostics"
✓ DAG validated (3 stages, no cycles)
◌ researcher running...
✓ researcher done (1.4s)
◌ analyst running...
✓ analyst done (2.2s)
◌ editor running...
✓ editor done (0.9s, score=8.7/10)
✓ Complete in 4.5s · IHR=0.91
→ .armature/traces/run-20260517.json

Install with pip install armature, then set ANTHROPIC_API_KEY and run.

Built with Armature

Reference implementations.
Production-ready.

Open-source projects that showcase what Armature can do — security scanning, automated research, and more.

🔐

Argus

Security & Quality Scanner

Combines 7 industry tools (gitleaks, semgrep, gosec, etc.) with LLM code review in a single command. Scans repositories in 3–8 minutes. A production reference implementation of Armature.

View on GitHub →

📊

Research-Analyst

Automated Research Briefings

Deep-dive research powered by web search, source extraction, and synthesis — declared as a YAML workflow. Supports continuation for weekly incremental briefings and scheduled runs.

View on GitHub →

Open Source

Built to be shared.

Armature is free, MIT licensed, and built in the open. Fork it, extend it, build on it. Contributions welcome — especially new role types, tool integrations, and self-improvement strategies.

Python 3.11+ runtime · MIT license · LiteLLM provider layer · 1,286+ tests passing

Armature is an open-source project from ElfTech — we build autonomous AI systems for operational work. Explore our full platform at elftech.com.