📊 Agent 4

Codebase + Product Critic

The Critic runs immediately after the Builder, before security hardening and QA. It judges whether the built product should exist, whether it is positioned well, and what must change before launch. The 12-Dimension Product Scorecard is its primary judgment framework.

⚠️

Truth over comfort

The Critic does not provide false reassurance. If the product has fundamental problems — unclear value proposition, broken monetization, or uncompetitive positioning — the Critic says so directly with evidence.

12-Dimension Product Scorecard

The scorecard is completed in Phase 0B, immediately after the Phase 0 market intelligence pre-scan. Every dimension is scored from 1 to 5. A score of 1 on any dimension triggers an automatic HOLD.

#	Dimension	What it measures
D1	First-Run Experience	Can a new user understand the product and complete their first action within 60 seconds, with no documentation?
D2	Activation Funnel Clarity	Is the path from landing/signup to first delivered value obvious and frictionless?
D3	Core Loop Retention	Does the product give users a compelling reason to return tomorrow?
D4	Error Recovery UX	When something fails, can the user recover without leaving the product or contacting support?
D5	Feature Discoverability	Can users find secondary features without being told they exist?
D6	Monetization Fit	Does the paywall hit at the right moment? Is the free tier compelling but limited enough to convert?
D7	Competitive Differentiation	Does the product do one thing demonstrably better than named competitors, visible on the first screen?
D8	Onboarding Completeness	Does every declared user type have a path to their first success?
D9	Empty State Quality	What does the app show before any data exists? Does it guide the user forward?
D10	Mobile / Responsive Execution	Was the mobile experience designed deliberately or merely adapted from desktop? Does every critical flow work at a 390px viewport width?
D11	Trust Signal Quality	Does the product look production-ready? Are there obvious "made by AI" tells?
D12	Spec Fidelity	What was in the spec that didn't ship? What shipped that wasn't in the spec?

Scoring Scale

Score	Meaning
5	Excellent — clearly better than most products in this category
4	Good — solid, a few rough edges
3	Adequate — functional but unremarkable
2	Weak — material problems that reduce the product's viability
1	Broken / Missing — this dimension is absent or severely defective

Scorecard Verdict

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 12-Dimension Product Scorecard — [Project Name]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
D1  First-Run Experience         4/5 — clear onboarding, first action in ~45s
D2  Activation Funnel Clarity    4/5 — single CTA, minimal friction
D3  Core Loop Retention          3/5 — some reason to return, not reinforced
D4  Error Recovery UX            3/5 — most errors informative, some dead ends
D5  Feature Discoverability      3/5 — core features accessible, secondary hidden
D6  Monetization Fit             4/5 — paywall at value moment, free tier compelling
D7  Competitive Differentiation  3/5 — differentiation exists but not prominent
D8  Onboarding Completeness      4/5 — primary user well-served, admin gaps
D9  Empty State Quality          2/5 — raw "no data" messages on 3 screens
D10 Mobile/Responsive Execution  3/5 — usable but clearly secondary
D11 Trust Signal Quality         4/5 — polished, consistent, no obvious AI tells
D12 Spec Fidelity                5/5 — all MVP features shipped
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Mean score: 3.5/5
Dimensions at risk (score ≤ 2): D9 Empty State Quality
Auto-HOLD triggers (score = 1): none

Scorecard verdict: CONDITIONAL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Condition	Verdict
Mean ≥ 4.0, no dimension ≤ 2	SHIP_READY
Mean ≥ 3.0, no dimension = 1, SHIP_READY conditions not met	CONDITIONAL
Any dimension = 1, or mean < 3.0	HOLD
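
The verdict rules above can be sketched in Python as follows. This is a minimal illustration, not the Critic's actual implementation: the function name `scorecard_verdict` and the keys of the returned dictionary are hypothetical, and the rules are assumed to be checked in the order HOLD, SHIP_READY, CONDITIONAL.

```python
from statistics import mean

# Scores copied from the sample scorecard in this section.
SAMPLE_SCORES = {
    "D1": 4, "D2": 4, "D3": 3, "D4": 3, "D5": 3, "D6": 4,
    "D7": 3, "D8": 4, "D9": 2, "D10": 3, "D11": 4, "D12": 5,
}


def scorecard_verdict(scores: dict[str, int]) -> dict:
    """Apply the scorecard verdict table to a mapping of dimension -> score."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    for dimension, score in scores.items():
        # Every dimension is scored on the 1-5 scale.
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dimension}: score must be an integer from 1 to 5")

    average = mean(scores.values())
    auto_hold = [d for d, s in scores.items() if s == 1]  # score = 1
    at_risk = [d for d, s in scores.items() if s <= 2]    # score <= 2

    if auto_hold or average < 3.0:
        verdict = "HOLD"
    elif average >= 4.0 and not at_risk:
        verdict = "SHIP_READY"
    else:
        # Mean >= 3.0 and no dimension scored 1, but SHIP_READY not met.
        verdict = "CONDITIONAL"

    return {
        "mean": average,
        "at_risk": at_risk,
        "auto_hold": auto_hold,
        "verdict": verdict,
    }


result = scorecard_verdict(SAMPLE_SCORES)
print(result["verdict"])
```

Applied to the sample scorecard, this sketch computes a mean of 3.5, flags D9 as at risk, finds no auto-HOLD triggers, and returns CONDITIONAL, matching the example output above.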

Market Intelligence Tools

The Critic uses live market data during Phase 0 to ground critique in external reality, not just artifact analysis:

Tool	Used for
Brave Search MCP	Verify stated differentiation against actual search results; surface competitors
Firecrawl MCP	Scrape competitor pricing pages and feature tables for direct comparison
Playwright MCP	Load the built app; take first-impression screenshots at desktop and mobile
Reddit API	Search product-category subreddits for pain points and competitor mentions
Hacker News API	Search "Ask HN" threads for comparable products; surface community perception
GitHub API	Check star counts and contributor velocity for open-source competitors
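
As a rough sketch of how the GitHub and Hacker News lookups might be prepared, the Python below builds request URLs and reduces a repository payload to a few comparison fields. It assumes the public GitHub REST endpoint `GET /repos/{owner}/{repo}`, whose response includes `full_name`, `stargazers_count`, and `pushed_at`, and the Algolia-hosted Hacker News Search API with its `ask_hn` tag. The source does not say which Hacker News endpoint the Critic calls, and every function name here is hypothetical. No network request is made; fetching is left to the caller.

```python
from urllib.parse import quote, urlencode

GITHUB_API = "https://api.github.com"
HN_SEARCH = "https://hn.algolia.com/api/v1/search"  # assumed HN search endpoint


def github_repo_url(owner: str, repo: str) -> str:
    """Build the URL for the repository metadata endpoint."""
    return f"{GITHUB_API}/repos/{quote(owner)}/{quote(repo)}"


def ask_hn_search_url(query: str) -> str:
    """Build a search URL restricted to "Ask HN" stories."""
    return f"{HN_SEARCH}?{urlencode({'query': query, 'tags': 'ask_hn'})}"


def summarize_repo(payload: dict) -> dict:
    """Keep only the fields used for a competitor comparison.

    `pushed_at` is only a coarse recency signal; it is not a full
    measure of contributor velocity.
    """
    return {
        "name": payload["full_name"],
        "stars": payload["stargazers_count"],
        "last_push": payload["pushed_at"],
    }


print(github_repo_url("example-org", "example-repo"))
print(ask_hn_search_url("invoice tool"))
```

Keeping URL construction and payload parsing separate from the HTTP call makes both easy to check offline against recorded responses.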

Review Workflow

Phase	Description
0	Orientation — stack, platforms, users, monetization, design contract. Run market intelligence pre-scan.
0B	Product Scorecard — complete 12-Dimension Scorecard with evidence per dimension.
1	System Map — architecture, data flow, major modules, state ownership, dependency risks.
2	Assumption Stress Test — for each core assumption: how it fails, cost of failure, platform affected.
3A–L	Deep Review — Correctness, Security, Performance, Maintainability, DX, Observability, Testing, Product/GTM, Platform Fit, Cross-Platform, Monetization, UI/UX.
4	Prioritization — Top 10 Fix-First, Stop-Doing, Keep-Doing lists. Every finding includes severity, effort, and acceptance criteria.
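
One way to represent findings and derive the Top 10 Fix-First list is sketched below. The `Finding` class, the severity and effort labels, and the sort order (lowest-scoring dimension first, then severity, then effort) are assumptions for illustration. The source only requires that every finding carry severity, effort, and acceptance criteria.

```python
from dataclasses import dataclass

# Hypothetical label sets; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}
EFFORT_RANK = {"S": 0, "M": 1, "L": 2}


@dataclass(frozen=True)
class Finding:
    title: str
    dimension: str            # scorecard dimension the finding affects, e.g. "D9"
    severity: str             # one of SEVERITY_RANK
    effort: str               # one of EFFORT_RANK
    acceptance_criteria: str  # how to tell the fix is done

    def __post_init__(self) -> None:
        if self.severity not in SEVERITY_RANK:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.effort not in EFFORT_RANK:
            raise ValueError(f"unknown effort: {self.effort!r}")
        if not self.acceptance_criteria.strip():
            raise ValueError("acceptance criteria are required")


def fix_first(findings: list[Finding], scores: dict[str, int], limit: int = 10) -> list[Finding]:
    """Order findings by weakest dimension, then severity, then smallest effort."""
    def sort_key(finding: Finding) -> tuple[int, int, int]:
        return (
            scores[finding.dimension],
            SEVERITY_RANK[finding.severity],
            EFFORT_RANK[finding.effort],
        )

    return sorted(findings, key=sort_key)[:limit]
```

Validating in `__post_init__` means a finding without acceptance criteria cannot be constructed at all, which enforces the "every finding includes" rule at the data level.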

Final Verdict Vocabulary

Verdict	Meaning
SHIP [platform]	Genuinely ready for production launch on this platform
CONDITIONAL	Launch requires contained fixes; documented conditions must be met first
HOLD [platform]	Platform implementation is not ready for production
RESTYLE	UI trust is below launch quality; visual redesign required
REDESIGN	Foundation or product architecture is wrong; rebuild required
KILL	The product concept should not proceed as built
PAUSE	Shipping would be irresponsible; specific risks must be mitigated first

CRITICISM.md Report Structure

Executive Summary (scorecard verdict as first line)
12-Dimension Product Scorecard (complete table)
Assessment per Platform
Critical Flaws
Risks
Top 10 Fix-First List (ordered by scorecard dimension impact)
Stop-Doing List
Keep-Doing List (evidence required)
Product & Growth Gaps
Platform-Specific Critique
Cross-Platform Consistency Assessment
Monetization Architecture Assessment
UI/UX Design Contract Assessment
What You're Avoiding
Product Improvement Examples (3–10)
Stress Question & Answer
Assumptions Stress Test
Review Category Summary (A–L)
Per-Platform Verdict
Overall Verdict (references scorecard verdict)
One Next Action (targets lowest-scoring dimension)
Audit Checklist
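
The report structure above lends itself to a simple completeness check. The sketch below assumes each section appears in CRITICISM.md as a Markdown heading whose text matches the section name without its parenthetical note; the heading level, the `Scorecard verdict:` line format, and the function names are all hypothetical.

```python
SECTIONS = [
    "Executive Summary",
    "12-Dimension Product Scorecard",
    "Assessment per Platform",
    "Critical Flaws",
    "Risks",
    "Top 10 Fix-First List",
    "Stop-Doing List",
    "Keep-Doing List",
    "Product & Growth Gaps",
    "Platform-Specific Critique",
    "Cross-Platform Consistency Assessment",
    "Monetization Architecture Assessment",
    "UI/UX Design Contract Assessment",
    "What You're Avoiding",
    "Product Improvement Examples",
    "Stress Question & Answer",
    "Assumptions Stress Test",
    "Review Category Summary",
    "Per-Platform Verdict",
    "Overall Verdict",
    "One Next Action",
    "Audit Checklist",
]


def render_skeleton(scorecard_verdict: str) -> str:
    """Produce an empty report with every section heading in order."""
    lines = ["# CRITICISM.md", ""]
    for title in SECTIONS:
        lines += [f"## {title}", ""]
        if title == "Executive Summary":
            # The scorecard verdict is the first line of the summary.
            lines += [f"Scorecard verdict: {scorecard_verdict}", ""]
    return "\n".join(lines)


def missing_sections(report_text: str) -> list[str]:
    """Return the required sections that have no matching heading."""
    headings = {
        line.lstrip("#").strip()
        for line in report_text.splitlines()
        if line.startswith("#")
    }
    return [title for title in SECTIONS if title not in headings]


skeleton = render_skeleton("CONDITIONAL")
print(missing_sections(skeleton))
```

A check like `missing_sections` could run before the report is handed to the next agent, so that an incomplete report is caught early rather than discovered during review.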