Whitepaper
Generation is getting commoditized. The durable value is knowing whether the code actually works. This is the thesis.
As models improve, redundant generation wastes money. Verification is the durable bottleneck.
P(at least 1 correct | N) = 1 − (1 − p)^N
[Interactive figure: P(at least 1 correct) by agent count N; slider for base success probability p, shown at p = 0.70]
When p = 0.70, running N = 3 costs 3× as much for a 27.3-percentage-point improvement (0.700 → 0.973). The marginal return on each additional attempt collapses exponentially.
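The collapse is easy to reproduce directly from the formula above. A minimal sketch (the values of p and N are illustrative):

```python
def p_at_least_one(p: float, n: int) -> float:
    """P(at least one of n independent attempts succeeds) = 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

p = 0.70
for n in range(1, 6):
    total = p_at_least_one(p, n)
    marginal = total - p_at_least_one(p, n - 1)
    print(f"N={n}: P(success)={total:.3f}  marginal gain={marginal:.3f}")
```

Each extra attempt adds only p · (1 − p)^(N−1), so the marginal gain shrinks by a factor of (1 − p) per attempt while cost grows linearly.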
Best-of-N fixes the generation problem. Verification fixes the trust problem. As models improve, the generation problem shrinks but the trust problem grows.
Generation is getting commoditized. Sonnet, GPT, Gemini — they all write good code. Next year they'll all write better code. Switching costs approach zero. That's a commodity.
Verification is the opposite. It's project-specific. It requires understanding architecture, conventions, test infrastructure, deployment constraints. You can't verify a React app the same way you verify a distributed database.
The Harness Signal
OpenAI's team built a million-line codebase with zero human-written code. Three engineers. ~1,500 PRs. Their biggest bottleneck? Human QA capacity. They built verification internally because there's nothing to buy.
Verify
Point it at any change. 11 verification gates. Dempster-Shafer confidence with full evidence trail. The primary command.
Council
Not pick-the-winner. Genuine synthesis. Reads N implementations, produces one change combining the best architecture, tests, error handling from each. Value increases as models improve — the inverse of best-of-N economics.
Factory
The autonomous loop. Give it a vision and a duration. It decomposes the vision into specs, runs tournaments, verifies every change, and ships what passes. What doesn't passes lands in the hold queue for human review.
11 gates. 7 hard, 3 binding empirical, 1 advisory. Every gate produces evidence or blocks the merge. No LLM opinions.
Hard Gates (deterministic, must pass)
| Gate | Detects | How |
|---|---|---|
| install | Dependency issues | Package manager install (npm/yarn/pnpm/bun) |
| build | Compilation errors | Run detected build command |
| lint | Style violations | Run detected linter |
| typecheck | Type errors | Run type checker |
| test | Functional regressions | Run detected test suite + quality checks |
| policy | Constraint violations | Diff size, forbidden paths, renames |
| visual | UI regressions | Playwright screenshots + vision LLM judge |
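A hard gate reduces to "run a real command, record the exit code and logs as evidence, emit a belief mass." A minimal sketch, assuming a shape like the following (the gate name, mass values, and evidence fields are illustrative, not Hyper's actual implementation):

```python
import subprocess
import sys

def run_hard_gate(name: str, argv: list[str], cwd: str = ".") -> dict:
    """Run one deterministic gate. The verdict comes from the exit code
    of a real process, never from an LLM opinion."""
    proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=600)
    passed = proc.returncode == 0
    return {
        "gate": name,
        "passed": passed,
        "evidence": {
            "command": argv,
            "exit_code": proc.returncode,
            "stdout_tail": proc.stdout[-2000:],
            "stderr_tail": proc.stderr[-2000:],
        },
        # Hard gates are near-deterministic, so little mass stays uncommitted.
        # These numbers are hypothetical calibration values.
        "mass": {"correct": 0.90, "incorrect": 0.05, "uncommitted": 0.05}
        if passed
        else {"correct": 0.05, "incorrect": 0.90, "uncommitted": 0.05},
    }

# Stand-in for a real typecheck/build/test command:
result = run_hard_gate("typecheck", [sys.executable, "-c", "print('ok')"])
```

The key property: the evidence record is self-contained, so a failed merge can be audited without re-running anything.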
Binding Empirical Gates (LLM generates, reality executes)
| Gate | Proves | Method |
|---|---|---|
| spec_tests | Spec compliance | LLM generates tests from spec, executes empirically |
| test_adequacy | Test thoroughness | Mutation testing — kill rate proves tests catch bugs |
| runtime_behavioral | Live behavior | Starts dev server, browser agent verifies conditions |
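The test_adequacy gate's principle, mutation testing, is simple to sketch: inject a deliberate bug, and count what fraction of mutants the test suite kills. The toy mutator, `clamp` function, and tests below are illustrative only; a real mutation tool generates hundreds of variants per file:

```python
def mutate(source: str) -> str:
    """Flip one comparison operator to create a deliberate bug (a 'mutant')."""
    if "<=" in source:
        return source.replace("<=", ">", 1)
    return source.replace("==", "!=", 1)

def kill_rate(source: str, tests: list, n_mutants: int = 1) -> float:
    """Fraction of mutants that at least one test catches. A high kill rate
    is empirical proof the tests detect bugs, not just exercise lines."""
    killed = 0
    for _ in range(n_mutants):
        env: dict = {}
        exec(mutate(source), env)  # define the mutated function
        if any(not test(env["clamp"]) for test in tests):
            killed += 1
    return killed / n_mutants

SOURCE = "def clamp(x, hi):\n    return x if x <= hi else hi\n"
tests = [lambda f: f(5, 10) == 5, lambda f: f(15, 10) == 10]
rate = kill_rate(SOURCE, tests)  # 1.0: every mutant is caught
```

A suite with 100% line coverage can still have a low kill rate, which is why the gate measures kill rate rather than coverage.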
Advisory Gate (diagnostic, non-blocking)
| Gate | Evaluates | Method |
|---|---|---|
| error_diagnosis | Failure classification | Pure regex — environment, flaky test, or real bug |
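Because the advisory gate is pure regex, it can be sketched in a few lines. The patterns and labels below are illustrative stand-ins, not Hyper's actual rule set:

```python
import re

# Ordered rules: first match wins. Anything unmatched is treated as a real bug.
FAILURE_PATTERNS = [
    ("environment", re.compile(r"ENOENT|command not found|ECONNREFUSED", re.I)),
    ("flaky_test", re.compile(r"timed? ?out|socket hang ?up|port \d+ in use", re.I)),
]

def diagnose(log: str) -> str:
    """Classify a failure log into environment issue, flaky test, or real bug."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log):
            return label
    return "real_bug"

diagnose("npm ERR! ENOENT: no such file or directory")
```

Being deterministic, the classifier is cheap enough to run on every failure and never blocks a merge on its own.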
Each gate emits a belief mass: correct, incorrect, uncommitted. These combine via Dempster-Shafer fusion into composite confidence. High agreement strengthens belief. Conflicting signals raise uncertainty. The math prevents any single gate from dominating. Result routes to three tiers: dark (auto-merge), cautious (auto-merge with logging), hold (human review).
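The fusion step above is standard Dempster's rule over the frame {correct, incorrect}, with "uncommitted" as mass on the whole frame. A minimal sketch (the per-gate masses and tier thresholds are hypothetical, not Hyper's calibration):

```python
from functools import reduce

def ds_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination for two belief masses over
    {correct, incorrect}, where 'uncommitted' is mass on the full frame."""
    conflict = m1["correct"] * m2["incorrect"] + m1["incorrect"] * m2["correct"]
    if conflict >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict  # conflicting mass is discarded and renormalized
    return {
        "correct": (m1["correct"] * m2["correct"]
                    + m1["correct"] * m2["uncommitted"]
                    + m1["uncommitted"] * m2["correct"]) / norm,
        "incorrect": (m1["incorrect"] * m2["incorrect"]
                      + m1["incorrect"] * m2["uncommitted"]
                      + m1["uncommitted"] * m2["incorrect"]) / norm,
        "uncommitted": m1["uncommitted"] * m2["uncommitted"] / norm,
    }

gate_masses = [  # hypothetical outputs from three gates
    {"correct": 0.80, "incorrect": 0.10, "uncommitted": 0.10},
    {"correct": 0.70, "incorrect": 0.10, "uncommitted": 0.20},
    {"correct": 0.60, "incorrect": 0.20, "uncommitted": 0.20},
]
fused = reduce(ds_combine, gate_masses)
tier = ("dark" if fused["correct"] >= 0.95
        else "cautious" if fused["correct"] >= 0.80
        else "hold")
```

Note how three individually modest gates fuse to high confidence when they agree, while any strongly conflicting gate would inflate `conflict` and push mass toward uncertainty instead.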
| | Code Generation | CI/CD | Code Review | Hyper |
|---|---|---|---|---|
| Who | Copilot, Cursor, Codex, Claude Code, Devin | GitHub Actions, BuildKite, CircleCI | CodeRabbit, SonarQube, Codacy | — |
| Strengths | Writes code fast | Runs predefined pipelines | Reviews code quality | Full verification pipeline |
| Gap | No independent verification | Doesn’t understand specs | Suggestions, not verdicts | — |
| Result | “Trust me, it works” | “Tests pass” (if configured) | “Consider refactoring” | “Here’s the evidence” |
There is no standalone product that takes a change + a spec, runs the full verification pipeline, produces a confidence score with evidence, and decides ship-or-iterate. Every serious AI engineering team builds it internally.
| | Generation | Verification |
|---|---|---|
| Competition | Every AI lab + every coding tool | Nobody standalone |
| Moat | None — swap models freely | Project-specific knowledge |
| Pricing pressure | Extreme (commodity) | Low (unique value) |
| Value trajectory | Decreasing (models improve) | Increasing (more generated code to verify) |
| Switching cost | Zero | High (learned project context) |
Verification compounds. Every run teaches the system more about the project. Failure patterns, flaky tests, architecture violations. Generation doesn't compound — each run is independent. Verification gets better the longer it runs.
Every claim in this paper is backed by real gate logs, diff artifacts, and evidence trails. Hyper does not ask you to trust it. It asks you to verify.