Whitepaper
Everyone can write code — who's proving it works? Manufacturing factories have Six Sigma. What do automated software factories have?
A single AI coding agent produces correct code roughly 61% of the time. Run N agents in parallel, and the probability that at least one succeeds climbs fast. But without verification, you cannot identify which one. That is the gap.
P(at least 1 correct | N) = 1 − (1 − p)^N

With base success probability p = 0.61 and N = 4 agents: 1 − (0.39)^4 ≈ 97.7%.

[Chart: probability by agent count, p = 0.61]
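The formula is easy to verify numerically. This sketch assumes independent attempts with a fixed per-agent success rate — the model behind the figure above:

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent agents succeeds,
    given per-agent success probability p."""
    return 1 - (1 - p) ** n

# With p = 0.61, the odds climb fast with agent count:
for n in (1, 2, 4, 8):
    print(f"N={n}: {p_at_least_one(0.61, n):.1%}")
# N=1: 61.0%
# N=2: 84.8%
# N=4: 97.7%
# N=8: 99.9%
```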
The problem is not generation. The problem is identification.
More agents only helps if you can verify which output is correct. Without verification, best-of-N is just expensive guessing.
Evolution produces robust solutions without central planning. Hyper applies the same principle: generate variation, apply selection pressure, retain only the fittest.
| Evolution | Hyper | Role |
|---|---|---|
| Organisms | Agents | Competing for the same niche |
| Environment | Gates | Selective pressure |
| Selection | Tournament | Variation + selection |
| Fossil record | Evidence trail | Every generation logged |
| Reproduction | Merge | Winners reproduce into main |
16 organisms. 5 generations. Each row applies selection pressure. Only survivors advance.
- Generation 0: Spawned
- Generation 1: Hard gates (build, test, lint, typecheck)
- Generation 2: Advisory gates (visual, policy, diff-review)
- Generation 3: Council review (AI judge panel)
- Generation 4: Evidence fusion (Dempster-Shafer ranking)
16 enter. 1 survives.
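The final generation names Dempster-Shafer ranking. As a generic illustration — the actual gates, frame of discernment, and mass assignments Hyper uses are not specified here, so the inputs below are invented — Dempster's rule of combination over a two-hypothesis frame looks like this:

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination. Mass functions map frozensets of
    hypotheses to belief mass; conflicting mass is normalized away."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # evidence pointing at disjoint hypotheses
    # Assumes conflict < 1; total conflict would make fusion undefined.
    return {k: v / (1 - conflict) for k, v in combined.items()}

OK, BAD = frozenset({"ok"}), frozenset({"bad"})
ANY = OK | BAD  # mass on the full frame = uncertainty

# Hypothetical gate outputs: a strong test signal and a weaker, partly
# conflicting advisory review.
tests  = {OK: 0.9, ANY: 0.1}
review = {OK: 0.6, BAD: 0.1, ANY: 0.3}

fused = dempster_combine(tests, review)
print(f"ok: {fused[OK]:.3f}, bad: {fused[BAD]:.3f}, uncertain: {fused[ANY]:.3f}")
# ok: 0.956, bad: 0.011, uncertain: 0.033
```

Fusing the two sources concentrates belief on "ok" while preserving a small, explicit residue of doubt — which is what makes the fused score usable as a ranking signal.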
Multi-agent is table stakes. Verification is the moat.
| Capability | Hyper | Gastown | Blackbox | Factory.ai | OpenClaw |
|---|---|---|---|---|---|
| Architecture paradigm | Competition: same task, best wins | Delegation: different tasks, manual merge | Multi-model orchestration: AI judge picks best output | Agent-native droids: delegator + specialized droids | Gateway + sub-agents: custom skills for each task |
| Multi-agent | Tournament (N compete, 1 wins) | Work distribution (tasks farmed to subagents) | Parallel execution (multi-model, AI judge) | Droid delegation (sub-droids in parallel) | Agent teams (custom skills required) |
| Independent verification | 11 gates | ✕ | AI judge | Test hooks + code review | ✕ |
| Evidence fusion | Dempster-Shafer | ✕ | ✕ | ✕ | ✕ |
| Fail-closed | ✓ | ✕ | ✕ | Test + coverage gates | ✕ |
| Evidence trail | Full | ✕ | Execution logs | Audit logs | Session transcripts |
| Visual regression | ✓ | ✕ | ✕ | ✕ | ✕ |
| Merge confidence | Mathematical | ✕ | ✕ | ✕ | ✕ |
| Cost per verified merge | $0.47 | ✕ | ✕ | ✕ | ✕ |
Every competitor distributes work across agents — the same paradigm as hiring more developers. Hyper runs a tournament: N agents independently attempt the same specification, each sampling a different region of the model's latent space, and verification gates select the winner. This changes the math: instead of hoping a single agent gets it right, you raise the probability that at least one does. That is the difference between parallelization and selection.
Before each tournament, the decomposer re-evaluates the codebase and generates fresh specs. Afterward, it tracks what failed and applies anti-fixation: it refuses to retry semantically similar approaches, routing instead to entirely different parts of the vision. Self-healing by design: the factory never gets stuck; it adapts.
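One way to picture the anti-fixation check is a similarity guard over embeddings of failed approaches. The vectors, threshold, and function names below are illustrative assumptions, not Hyper's implementation:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_fixated(candidate: list, failed: list, threshold: float = 0.85) -> bool:
    """Reject a retry whose approach embedding sits too close to any
    previously failed approach (hypothetical threshold)."""
    return any(cosine(candidate, v) >= threshold for v in failed)

failed_approaches = [[0.9, 0.1, 0.0]]            # embedding of a failed attempt
print(is_fixated([0.88, 0.12, 0.01], failed_approaches))  # near-duplicate -> True
print(is_fixated([0.10, 0.20, 0.95], failed_approaches))  # different region -> False
```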
When 90% AI adoption correlates with 9% more bugs and 154% larger PRs (Google DORA 2025), the only safe factory is one that verifies everything.
Compound reliability over hundreds of autonomous merges. Without independent gates, reliability collapses exponentially. With them, it holds.
The naive approach
If each merge has 94.2% reliability, after 100 merges:
0.942^100 ≈ 0.25% chance of zero defects
With fail-closed gates
R_system = 1 − (1 − R_gate)^G

With 11 gates at 95% each: 1 − (0.05)^11 ≈ 1 − 5×10⁻¹⁵ ≈ 1.0
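Both calculations take a few lines to reproduce, assuming independent merges and independent gate failures:

```python
def naive_reliability(r_merge: float, merges: int) -> float:
    """Probability of zero defects when every merge must be correct on its own."""
    return r_merge ** merges

def gated_reliability(r_gate: float, gates: int) -> float:
    """Probability that at least one of G independent gates catches a defect."""
    return 1 - (1 - r_gate) ** gates

print(f"naive, 100 merges @ 94.2%: {naive_reliability(0.942, 100):.2%}")
print(f"11 gates @ 95% each:       {gated_reliability(0.95, 11):.15f}")
```

The naive figure comes out near 0.25%: compounding even a high per-merge reliability across hundreds of merges drives the zero-defect probability toward zero, while layered independent gates drive defect escape toward zero instead.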
System reliability over autonomous merge count
A fully autonomous code factory that decomposes vision into specs, runs tournaments, merges winners, and repeats. No human in the loop unless the evidence demands it.
1. Signal: bug report, feature request, vision statement
2. Spec: LLM decomposes it into an actionable spec
3. Confidence: calibrated confidence score
4. Route: dark / cautious / hold
5. Tournament: N agents compete in isolation
6. Verify: 11 gates, evidence fusion
7. Merge: winner merged, losers discarded
8. Repeat: next spec, forever
The only safe autonomous factory is one that trusts nothing and verifies everything.
Every claim in this paper is backed by real gate logs, diff artifacts, and evidence trails. Hyper does not ask you to trust it. It asks you to verify.