📊 Agent 4
Codebase + Product Critic
The Critic runs immediately after the Builder, before security hardening and QA.
It judges whether the built product should exist, whether it is positioned well, and what must
change before a production launch. The 12-Dimension Product Scorecard is its primary judgment framework.
⚠️
Truth over comfort
The Critic does not provide false reassurance. If the product has fundamental problems — unclear value proposition, broken monetization, or uncompetitive positioning — the Critic says so directly with evidence.
12-Dimension Product Scorecard
The scorecard is completed in Phase 0B, immediately after the Phase 0 market intelligence pre-scan. Every dimension receives an integer score from 1 to 5. A score of 1 on any dimension triggers an automatic HOLD.
| # | Dimension | What it measures |
|---|---|---|
| D1 | First-Run Experience | Can a new user understand the product and complete their first action within 60 seconds, with no documentation? |
| D2 | Activation Funnel Clarity | Is the path from landing/signup to first delivered value obvious and frictionless? |
| D3 | Core Loop Retention | Does the product give users a compelling reason to return tomorrow? |
| D4 | Error Recovery UX | When something fails, can the user recover without leaving the product or contacting support? |
| D5 | Feature Discoverability | Can users find secondary features without being told they exist? |
| D6 | Monetization Fit | Does the paywall hit at the right moment? Is the free tier compelling but limited enough to convert? |
| D7 | Competitive Differentiation | Does the product do one thing demonstrably better than named competitors, visible on the first screen? |
| D8 | Onboarding Completeness | Does every declared user type have a path to their first success? |
| D9 | Empty State Quality | What does the app show before any data exists? Does it guide the user forward? |
| D10 | Mobile / Responsive Execution | Was the mobile experience designed deliberately or merely adapted from desktop? Does every critical flow work at a 390px viewport width? |
| D11 | Trust Signal Quality | Does the product look production-ready? Are there obvious "made by AI" tells? |
| D12 | Spec Fidelity | What was in the spec that didn't ship? What shipped that wasn't in the spec? |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent — clearly better than most products in this category |
| 4 | Good — solid, a few rough edges |
| 3 | Adequate — functional but unremarkable |
| 2 | Weak — material problems that reduce the product's viability |
| 1 | Broken / Missing — this dimension is absent or severely defective |
Scorecard Verdict
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 12-Dimension Product Scorecard — [Project Name]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
D1 First-Run Experience 4/5 — clear onboarding, first action in ~45s
D2 Activation Funnel Clarity 4/5 — single CTA, minimal friction
D3 Core Loop Retention 3/5 — some reason to return, not reinforced
D4 Error Recovery UX 3/5 — most errors informative, some dead ends
D5 Feature Discoverability 3/5 — core features accessible, secondary hidden
D6 Monetization Fit 4/5 — paywall at value moment, free tier compelling
D7 Competitive Differentiation 3/5 — differentiation exists but not prominent
D8 Onboarding Completeness 4/5 — primary user well-served, admin gaps
D9 Empty State Quality 2/5 — raw "no data" messages on 3 screens
D10 Mobile/Responsive Execution 3/5 — usable but clearly secondary
D11 Trust Signal Quality 4/5 — polished, consistent, no obvious AI tells
D12 Spec Fidelity 5/5 — all MVP features shipped
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mean score: 3.5/5
Dimensions at risk (score ≤ 2): D9 Empty State Quality
Auto-HOLD triggers (score = 1): none
Scorecard verdict: CONDITIONAL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
| Condition | Verdict |
|---|---|
| Mean ≥ 4.0, no dimension ≤ 2 | SHIP_READY |
| Mean ≥ 3.0, no dimension = 1, SHIP_READY conditions not met | CONDITIONAL |
| Any dimension = 1, or mean < 3.0 | HOLD |
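The verdict rules above can be sketched as a small function. This is a minimal illustration, not the Critic's actual implementation: the function name `scorecard_verdict`, the returned dictionary keys, and the use of dimension IDs such as `"D9"` as keys are all assumptions made for the example. It checks HOLD first, so a high mean with any dimension at 2 falls through to CONDITIONAL.

```python
from statistics import mean


def scorecard_verdict(scores: dict[str, int]) -> dict:
    """Apply the verdict table to a mapping of dimension ID -> score (1-5).

    Hypothetical helper: names and return shape are illustrative only.
    """
    if len(scores) != 12:
        raise ValueError("expected scores for all 12 dimensions")
    for dim, score in scores.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim}: score must be an integer from 1 to 5")

    avg = mean(scores.values())
    at_risk = [d for d, s in scores.items() if s <= 2]    # score <= 2
    auto_hold = [d for d, s in scores.items() if s == 1]  # score == 1

    # HOLD is checked first, then SHIP_READY; everything left has
    # mean >= 3.0 and no dimension at 1, which is CONDITIONAL.
    if auto_hold or avg < 3.0:
        verdict = "HOLD"
    elif avg >= 4.0 and not at_risk:
        verdict = "SHIP_READY"
    else:
        verdict = "CONDITIONAL"

    return {
        "mean": round(avg, 1),  # rounded for display; the verdict uses the raw mean
        "at_risk": at_risk,
        "auto_hold": auto_hold,
        "verdict": verdict,
    }
```

Fed the twelve scores from the sample scorecard (4, 4, 3, 3, 3, 4, 3, 4, 2, 3, 4, 5 for D1 through D12), this returns a mean of 3.5, `["D9"]` as the only at-risk dimension, no auto-HOLD triggers, and the verdict `CONDITIONAL`.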
Market Intelligence Tools
The Critic uses live market data during Phase 0 to ground critique in external reality, not just artifact analysis:
| Tool | Used for |
|---|---|
| Brave Search MCP | Verify stated differentiation against actual search results; surface competitors |
| Firecrawl MCP | Scrape competitor pricing pages and feature tables for direct comparison |
| Playwright MCP | Load the built app; take first-impression screenshots at desktop and mobile |
| Reddit API | Search product-category subreddits for pain points and competitor mentions |
| Hacker News API | Search "Ask HN" threads for comparable products; surface community perception |
| GitHub API | Check star counts and contributor velocity for open-source competitors |
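As a rough illustration of the last two rows, the sketch below builds request URLs and extracts star counts using only the Python standard library. It rests on assumptions: it uses the Algolia-hosted HN Search API (with the `ask_hn` tag) for thread search and the GitHub REST endpoint `GET /repos/{owner}/{repo}`, and the helper names are hypothetical. How the Critic actually wires these tools is not specified here. The `pushed_at` timestamp is only a coarse activity signal; measuring contributor velocity would need further endpoints that are not shown.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

HN_SEARCH = "https://hn.algolia.com/api/v1/search"  # assumed search backend
GITHUB_API = "https://api.github.com"


def ask_hn_search_url(query: str) -> str:
    """Build a search URL restricted to Ask HN stories."""
    return f"{HN_SEARCH}?{urlencode({'query': query, 'tags': 'ask_hn'})}"


def github_repo_url(owner: str, repo: str) -> str:
    """Build the repository metadata URL for an open-source competitor."""
    return f"{GITHUB_API}/repos/{quote(owner)}/{quote(repo)}"


def summarize_repo(payload: dict) -> dict:
    """Keep only the fields relevant to a competitor comparison."""
    return {
        "name": payload["full_name"],
        "stars": payload["stargazers_count"],
        "last_push": payload["pushed_at"],
    }


def fetch_json(url: str) -> dict:
    """Fetch and decode a JSON response (unauthenticated, rate-limited)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

A call such as `summarize_repo(fetch_json(github_repo_url(owner, repo)))` would then yield one comparison row per competitor, where `owner` and `repo` come from the competitor list gathered earlier in Phase 0.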
Review Workflow
| Phase | Description |
|---|---|
| 0 | Orientation — stack, platforms, users, monetization, design contract. Run market intelligence pre-scan. |
| 0B | Product Scorecard — complete 12-Dimension Scorecard with evidence per dimension. |
| 1 | System Map — architecture, data flow, major modules, state ownership, dependency risks. |
| 2 | Assumption Stress Test — for each core assumption: how it fails, cost of failure, platform affected. |
| 3A–L | Deep Review — Correctness, Security, Performance, Maintainability, DX, Observability, Testing, Product/GTM, Platform Fit, Cross-Platform, Monetization, UI/UX. |
| 4 | Prioritization — Top 10 Fix-First, Stop-Doing, Keep-Doing lists. Every finding includes severity, effort, and acceptance criteria. |
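The Phase 4 requirement that every finding carries severity, effort, and acceptance criteria can be sketched as a small data structure plus a ranking helper. Everything here is an assumption for illustration: the `Finding` class, the severity levels, the S/M/L effort sizes, and the ordering rule. The ordering puts findings on the lowest-scoring scorecard dimension first, which is one reading of "ordered by scorecard dimension impact" in the report structure, then breaks ties by severity and effort.

```python
from dataclasses import dataclass

# Assumed vocabularies; the source does not define these levels.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}
EFFORT_RANK = {"S": 0, "M": 1, "L": 2}


@dataclass(frozen=True)
class Finding:
    title: str
    dimension: str  # scorecard dimension ID the finding affects, e.g. "D9"
    severity: str
    effort: str
    acceptance_criteria: str

    def __post_init__(self) -> None:
        if self.severity not in SEVERITY_RANK:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.effort not in EFFORT_RANK:
            raise ValueError(f"unknown effort: {self.effort!r}")
        if not self.acceptance_criteria.strip():
            raise ValueError("every finding needs acceptance criteria")


def fix_first(findings: list[Finding], scores: dict[str, int], limit: int = 10) -> list[Finding]:
    """Return the top `limit` findings: weakest dimension first,
    then most severe, then cheapest to fix."""
    def sort_key(f: Finding) -> tuple[int, int, int]:
        return (scores[f.dimension], SEVERITY_RANK[f.severity], EFFORT_RANK[f.effort])

    return sorted(findings, key=sort_key)[:limit]
```

Rejecting a finding without acceptance criteria at construction time keeps incomplete entries out of the Fix-First list instead of catching them at report review.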
Final Verdict Vocabulary
| Verdict | Meaning |
|---|---|
| SHIP [platform] | Genuinely ready for production launch on this platform |
| CONDITIONAL | Launch requires contained fixes; documented conditions must be met first |
| HOLD [platform] | Platform implementation is not ready for production |
| RESTYLE | UI trust is below launch quality; visual redesign required |
| REDESIGN | Foundation or product architecture is wrong; rebuild required |
| KILL | The product concept should not proceed as built |
| PAUSE | Shipping would be irresponsible; specific risks must be mitigated first |
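One way to keep verdict strings consistent across a report is to model the vocabulary as an enumeration. The sketch below is hypothetical: the `Verdict` enum and `format_verdict` helper are illustrative names, and it assumes that `[platform]` in the table is a placeholder filled with a platform name, required only for the two verdicts that show it.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SHIP = "SHIP"
    CONDITIONAL = "CONDITIONAL"
    HOLD = "HOLD"
    RESTYLE = "RESTYLE"
    REDESIGN = "REDESIGN"
    KILL = "KILL"
    PAUSE = "PAUSE"


# Verdicts the table writes with a platform suffix.
PLATFORM_SCOPED = {Verdict.SHIP, Verdict.HOLD}


def format_verdict(verdict: Verdict, platform: Optional[str] = None) -> str:
    """Render a verdict; the platform is appended only for SHIP and HOLD."""
    if verdict in PLATFORM_SCOPED:
        if not platform:
            raise ValueError(f"{verdict.value} requires a platform")
        return f"{verdict.value} {platform}"
    return verdict.value
```

For example, `format_verdict(Verdict.HOLD, "ios")` produces `HOLD ios`, while `format_verdict(Verdict.KILL)` produces `KILL`.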
CRITICISM.md Report Structure
- Executive Summary (scorecard verdict as first line)
- 12-Dimension Product Scorecard (complete table)
- Assessment per Platform
- Critical Flaws
- Risks
- Top 10 Fix-First List (ordered by scorecard dimension impact)
- Stop-Doing List
- Keep-Doing List (evidence required)
- Product & Growth Gaps
- Platform-Specific Critique
- Cross-Platform Consistency Assessment
- Monetization Architecture Assessment
- UI/UX Design Contract Assessment
- What You're Avoiding
- Product Improvement Examples (3–10)
- Stress Question & Answer
- Assumptions Stress Test
- Review Category Summary (A–L)
- Per-Platform Verdict
- Overall Verdict (references scorecard verdict)
- One Next Action (targets lowest-scoring dimension)
- Audit Checklist
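A generated CRITICISM.md can be checked against this structure with a short script. This is a minimal sketch under stated assumptions: it presumes each section appears as a Markdown heading line starting with `#`, matches titles by case-insensitive prefix so trailing notes such as "(A–L)" are tolerated, and uses the hypothetical names `REQUIRED_SECTIONS` and `missing_sections`. It does not verify section order or content.

```python
REQUIRED_SECTIONS = [
    "Executive Summary",
    "12-Dimension Product Scorecard",
    "Assessment per Platform",
    "Critical Flaws",
    "Risks",
    "Top 10 Fix-First List",
    "Stop-Doing List",
    "Keep-Doing List",
    "Product & Growth Gaps",
    "Platform-Specific Critique",
    "Cross-Platform Consistency Assessment",
    "Monetization Architecture Assessment",
    "UI/UX Design Contract Assessment",
    "What You're Avoiding",
    "Product Improvement Examples",
    "Stress Question & Answer",
    "Assumptions Stress Test",
    "Review Category Summary",
    "Per-Platform Verdict",
    "Overall Verdict",
    "One Next Action",
    "Audit Checklist",
]


def missing_sections(report_text: str) -> list[str]:
    """Return required section titles with no matching heading line."""
    headings = [
        line.lstrip("#").strip().lower()
        for line in report_text.splitlines()
        if line.startswith("#")  # assumed Markdown heading convention
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]
```

An empty list means every section in the structure has a heading; any titles returned point to sections the report still needs.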