Whitepaper

Verification Is the Product

Generation is getting commoditized. The durable value is knowing whether the code actually works. This is the thesis.

The Marginal Value of N is Collapsing

As models improve, redundant generation wastes money. Verification is the durable bottleneck.

P(at least 1 correct | N) = 1 − (1 − p)^N

~/hyper/marginal-value --interactive

P(at least 1 correct) by agent count, at base success probability p = 0.70:

N | P(at least 1 correct)
1 | 70.0%
2 | 91.0%
3 | 97.3%
5 | 99.8%

When p = 0.70, running N=3 costs 3× for a 27.3-percentage-point improvement over a single attempt. The marginal return on each additional attempt collapses exponentially.
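The numbers above fall directly out of the formula; a minimal sketch that reproduces them:

```python
def p_at_least_one(p: float, n: int) -> float:
    """P(at least one of n independent attempts succeeds) = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

for n in (1, 2, 3, 5):
    print(f"N={n}: {p_at_least_one(0.70, n):.1%}")
# N=1: 70.0%   N=2: 91.0%   N=3: 97.3%   N=5: 99.8%
```

Note the independence assumption: correlated failure modes across attempts make the real curve flatter than this.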

Best-of-N fixes the generation problem. Verification fixes the trust problem. As models improve, the generation problem shrinks but the trust problem grows.

The Agent That Verifies Is More Valuable Than The Agent That Generates

~/hyper/thesis --verification-first

Generation is getting commoditized. Sonnet, GPT, Gemini — they all write good code. Next year they'll all write better code. Switching costs approach zero. That's a commodity.

Verification is the opposite. It's project-specific. It requires understanding architecture, conventions, test infrastructure, deployment constraints. You can't verify a React app the same way you verify a distributed database.

The Harness Signal

OpenAI's team built a million-line codebase with zero human-written code. Three engineers. ~1,500 PRs. Their biggest bottleneck? Human QA capacity. They built verification internally because there's nothing to buy.

Verify. Council. Factory.

Verify

Point it at any change. 11 verification gates. Dempster-Shafer confidence with a full evidence trail. The primary command.

Council

Not pick-the-winner. Genuine synthesis. It reads N implementations and produces one change combining the best architecture, tests, and error handling from each. Its value increases as models improve — the inverse of best-of-N economics.

Factory

The autonomous loop. Give it a vision and a duration. It decomposes the vision into specs, runs tournaments, verifies every change, and ships what passes; what doesn’t goes to the hold queue.

The Trust Pipeline

11 gates. 7 hard, 3 binding empirical, 1 advisory. Every gate produces evidence or blocks the merge. No LLM opinions.

~/hyper/gates --list

Hard Gates (deterministic, must pass)

Gate | Detects | How
install | Dependency issues | Package manager install (npm/yarn/pnpm/bun)
build | Compilation errors | Run detected build command
lint | Style violations | Run detected linter
typecheck | Type errors | Run type checker
test | Functional regressions | Run detected test suite + quality checks
policy | Constraint violations | Diff size, forbidden paths, renames
visual | UI regressions | Playwright screenshots + vision LLM judge
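The hard gates reduce to a simple contract: each is a command that must exit 0, and any failure blocks the merge. A hypothetical sketch of such a runner; the command strings and gate names here are illustrative defaults for a Node project, not Hyper's actual configuration:

```python
import subprocess

# Illustrative gate commands (assumed, not Hyper's real detection logic).
HARD_GATES = {
    "install": "npm install",
    "build": "npm run build",
    "lint": "npm run lint",
    "typecheck": "npx tsc --noEmit",
    "test": "npm test",
}

def run_hard_gates(gates: dict[str, str]) -> dict[str, dict]:
    """Run each gate command in order, recording evidence for every run.
    Stops at the first failure, since hard gates are blocking."""
    evidence = {}
    for name, cmd in gates.items():
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        evidence[name] = {
            "passed": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
        if proc.returncode != 0:
            break  # a failed hard gate blocks the merge; skip the rest
    return evidence
```

The point of capturing stdout/stderr per gate is the evidence trail: a verdict is only as useful as the log that justifies it.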

Binding Empirical Gates (LLM generates, reality executes)

Gate | Proves | Method
spec_tests | Spec compliance | LLM generates tests from spec, executes empirically
test_adequacy | Test thoroughness | Mutation testing — kill rate proves tests catch bugs
runtime_behavioral | Live behavior | Starts dev server, browser agent verifies conditions
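The metric behind the test_adequacy gate is the mutation kill rate: seed the code with small deliberate bugs (mutants) and measure what fraction the suite catches. A minimal sketch of the metric itself (the mutant-generation step is what a mutation-testing tool provides):

```python
def mutation_kill_rate(mutant_results: list[bool]) -> float:
    """Fraction of injected mutants 'killed', i.e. mutants on which the
    test suite failed. True = the suite caught that seeded bug."""
    return sum(mutant_results) / len(mutant_results)

# 8 of 10 seeded bugs caught -> 0.8 kill rate. A suite that passes on
# every mutant proves nothing about its ability to catch regressions.
```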

Advisory Gate (diagnostic, non-blocking)

Gate | Evaluates | Method
error_diagnosis | Failure classification | Pure regex — environment, flaky test, or real bug

Each gate emits a belief mass: correct, incorrect, uncommitted. These combine via Dempster-Shafer fusion into a composite confidence. High agreement strengthens belief; conflicting signals raise uncertainty. The math prevents any single gate from dominating. The result routes to one of three tiers: dark (auto-merge), cautious (auto-merge with logging), or hold (human review).
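The fusion step can be sketched with Dempster's rule of combination over the frame {correct, incorrect}, treating "uncommitted" as mass on the whole frame. The tier thresholds below are illustrative assumptions, not Hyper's actual values:

```python
def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule over frame {correct, incorrect}; 'uncommitted'
    is mass assigned to the whole frame (ignorance)."""
    # Conflict: one gate says correct while the other says incorrect.
    k = m1["correct"] * m2["incorrect"] + m1["incorrect"] * m2["correct"]
    norm = 1.0 - k  # renormalize after discarding the conflicting mass
    return {
        "correct": (m1["correct"] * m2["correct"]
                    + m1["correct"] * m2["uncommitted"]
                    + m1["uncommitted"] * m2["correct"]) / norm,
        "incorrect": (m1["incorrect"] * m2["incorrect"]
                      + m1["incorrect"] * m2["uncommitted"]
                      + m1["uncommitted"] * m2["incorrect"]) / norm,
        "uncommitted": m1["uncommitted"] * m2["uncommitted"] / norm,
    }

def route(belief: dict) -> str:
    """Illustrative tier thresholds (assumed, not Hyper's real ones)."""
    if belief["correct"] >= 0.95:
        return "dark"       # auto-merge
    if belief["correct"] >= 0.80:
        return "cautious"   # auto-merge with logging
    return "hold"           # human review

# Two agreeing gates strengthen belief: 0.70 each fuses to ~0.90.
gate = {"correct": 0.70, "incorrect": 0.10, "uncommitted": 0.20}
fused = combine(gate, gate)
```

Because fusion multiplies masses and renormalizes, no single gate's verdict can force the composite on its own, which is the property the pipeline relies on.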

Nobody Is Building This

~/hyper/landscape --compare
 | Code Generation | CI/CD | Code Review | Hyper
Who | Copilot, Cursor, Codex, Claude Code, Devin | GitHub Actions, BuildKite, CircleCI | CodeRabbit, SonarQube, Codacy |
Strengths | Writes code fast | Runs predefined pipelines | Reviews code quality | Full verification pipeline
Gap | No independent verification | Doesn’t understand specs | Suggestions, not verdicts |
Result | “Trust me, it works” | “Tests pass” (if configured) | “Consider refactoring” | “Here’s the evidence”

There is no standalone product that takes a change + a spec, runs the full verification pipeline, produces a confidence score with evidence, and decides ship-or-iterate. Every serious AI engineering team builds it internally.

Generation Is Volume. Verification Is Margin.

~/hyper/economics --compare
 | Generation | Verification
Competition | Every AI lab + every coding tool | Nobody standalone
Moat | None — swap models freely | Project-specific knowledge
Pricing pressure | Extreme (commodity) | Low (unique value)
Value trajectory | Decreasing (models improve) | Increasing (velocity increases)
Switching cost | Zero | High (learned project context)

Verification compounds: every run teaches the system more about the project’s failure patterns, flaky tests, and architecture violations. Generation doesn’t compound; each run is independent. Verification gets better the longer it runs.

See the evidence.

Every claim in this paper is backed by real gate logs, diff artifacts, and evidence trails. Hyper does not ask you to trust it. It asks you to verify.