# Execution-Based Verification — Definition of Done

**Why this exists.** We build with Claude Code. When Claude *audits by reading the code*, it confidently says "works / implemented / available" — then manual testing fails. Reading code is a **document review**, not a test. An LLM reports plausible-looking code as "done" without ever starting the app, clicking the button, or hitting the database. This standard closes that gap: **a claim only counts when the running system was actually exercised and the real evidence is shown.**

The rule in one line: **Editable judgment, un-editable facts. "Looks correct" is not done. "Here is the actual run output" is done.**

---

## 1. The core rule

> No feature, fix, or audit result may be reported as **working / done / available / passing** based on reading code, reasoning about code, or summarizing what the code *should* do. It may only be reported as such when an **executed test ran against the running system** and the **actual output is shown** (command output, HTTP response, DB rows, screenshot, logs). If it wasn't run, it isn't done — say **"unverified — not executed."**

## 2. What does NOT count as evidence (auto-fail the claim)

- "I reviewed the code and it looks correct / handles this / is implemented."
- "The endpoint/function exists, so it works."
- "This should return… / this would handle…" (hypothetical, not observed).
- A *summary* of what a test would show, instead of the test's real output.
- Build gates described as passing without the actual `tsc` / `eslint` / `vitest` output pasted.
- Any "100% passed" with no per-test detail or artifacts (this is the auto-pass trap).

## 3. Definition of Done — checklist (every build/fix)

A change is **DONE** only when all that apply are true **and the evidence is pasted in**:

- [ ] **It runs.** App/service starts; the changed path is exercised end-to-end, not mocked away.
- [ ] **Happy path proven** with real output (HTTP response body / UI screenshot / returned value).
- [ ] **Negative path proven** — bad input, missing field, wrong type → correct rejection shown.
- [ ] **Permission path proven** — wrong-role / unauthenticated user is actually denied (real 401/403 shown).
- [ ] **Data proven** — DB/state **before and after** shown for any write; the row really changed.
- [ ] **Failure is real** — a forced failure produces *fail*, not a silent pass. (Prove the test can fail.)
- [ ] **Build gates pass with output pasted** — `tsc --noEmit`, `eslint`, `vitest run` actual results.
- [ ] **Evidence captured** — logs / screenshots / response bodies attached; if none exist, it didn't happen.
- [ ] **Acceptance criteria stated up front** and each one checked against real output (not vibes).

If any box can't be checked with real output → status is **`unverified — not executed`**, never "done."

## 4. Evidence required, by claim type

| Claim | Acceptable evidence (must be real, not described) |
|---|---|
| "API endpoint works" | Actual request + **actual response body + status code**; a negative request too |
| "UI feature works" | **Screenshot / Playwright run** of the real click-through, incl. an error state |
| "It saves / updates data" | **DB query before and after** showing the row changed; not "the code inserts it" |
| "Permissions enforced" | Real **401/403** for wrong-role user; real 200 for allowed user |
| "Tests pass" | Pasted **`vitest`/jest run output** with counts; plus proof a test *can* fail |
| "Types/lint clean" | Pasted **`tsc --noEmit --strict`** and **`eslint`** output, zero errors |
| "Migration applied" | Actual migration run log + post-state schema/rows |
| "It's fast / scalable" | Actual measured numbers; otherwise **`unverified`** |

## 5. Paste-in rule for Claude Code (drop into `CLAUDE.md`)

```text
## Verification rule (non-negotiable)
You may NOT report anything as working, done, available, implemented, fixed, or passing
based on reading or reasoning about code. Reading code is a document review, not a test.

A claim of "works/done/passing" is ONLY valid when you have ACTUALLY RUN it and you PASTE
the real output: command output, HTTP request+response, DB rows before/after, screenshot,
or test-run logs. No summaries of what it "would" show. No "looks correct".

For every build/fix, before saying done, RUN and PASTE evidence for:
  happy path, negative path (bad input), permission-denied path, data before/after,
  and a forced failure proving the test CAN fail.
Then run and paste: tsc --noEmit --strict, eslint, vitest run.

If you did not execute it, say exactly: "unverified — not executed" and list what to run.
Never write "100% passed" without per-test detail and artifacts. Never turn an unknown
into a pass. Treat any AI/code-reading conclusion as ADVISORY until real output confirms it.
```

## 6. When AUDITING existing code (same standard)

An "audit" that reads the repo and reports green is the **same failure mode** that bites us.
When auditing, classify every item as one of:

- **VERIFIED** — executed, evidence pasted.
- **UNVERIFIED** — only read/reasoned; **not** proven (default for anything not run).
- **CONFLICT** — code contradicts the claim (e.g., hardcoded value, stub, `details: []`, ignored param).

Report counts honestly (VERIFIED / UNVERIFIED / CONFLICT). "Available" without execution = **UNVERIFIED**, full stop. Watch specifically for stubs that fake success: hardcoded pass/`100%`, empty result arrays, ignored signature/permission params, `null` provenance, status set without doing the work.

## 7. Why this works

LLMs are confident pattern-matchers: they'll call plausible code "done" because it *reads* done. Execution is ground truth — the app either returned the right thing or it didn't. By making **real run output the only currency**, "looks correct" stops counting, and the "Claude says it works but it doesn't" gap closes. This is the same discipline the Verixa self-validation engine (URS-39) is built on — *execute and capture evidence, never assert* — applied to everyday Claude Code work.

---

*One-page standard. Adopt as a `CLAUDE.md` rule (§5) + PR checklist (§3). Definition of Done = green executed evidence, not an AI review.*
