Judge calibration

Axis	Judge vs hand
Strongest agreement
palette.warmth	1.00 · exact
surface.ground	1.00 · cream ✓
emphasis.mechanism	1.00 · underline ✓
Weakest agreement
surface.radius	0.50 · pred 12–16px / actual 8–10px
type.pairing	0.00 · pred single-family / actual serif/sans
voice.archetype	pending · awaits ANTHROPIC_API_KEY

Axis	Judge vs hand
Strongest agreement
surface.ground	1.00 · white ✓
palette.contrast	1.00 · high ✓
surface.borders	~0.92 · hairline (25-brand)
Weakest agreement
imagery.strategy	pending · n=1 only (Acne)
motion.budget	pending · Playwright required
voice.archetype	pending · per-brand tau to come

Axis	Judge vs hand
Strongest agreement
type.pairing	1.00 · sans/sans ✓
palette.contrast	1.00 · high ✓
surface.ground	1.00 · white ✓
Weakest agreement
palette.warmth	0.00 · accent recovered, hue-bucket wrong
palette.saturation	0.50 · ordinal map needs recalibration
motion.budget	pending · pervasive vs static-CSS abstention

Axis	Judge vs hand
Strongest agreement
surface.dark_mode	1.00 · dark-first ✓
surface.borders	~0.92 · hairline (25-brand)
palette.contrast	1.00 · high ✓
Weakest agreement
voice.sentence_length	~0.56 · sparse <main> on SPA shell
motion.character	pending · scroll-recording analysis
voice.archetype	pending · per-brand tau to come

Axis	Judge vs hand
Strongest agreement
type.tracking	1.00 · loose, em-median exact
imagery.strategy	1.00 · photographic ✓ (only n=1 hit)
surface.ground	1.00 · white ✓
Weakest agreement
emphasis.mechanism	0.00 · pred underline / actual scale (widget CSS swamp)
motion.budget	abstain · 56 Shopify @keyframes vs catalog "none"
palette.warmth / saturation	0.00 · no :root vars, fallback poisoned

Methodology, in three lines

Per-axis agreement is the round-trip score: fetch a brand's homepage, project raw CSS/HTML signals into grammar axes, then compare against the hand-graded catalog value. Ordinal axes use normalized distance; nominal axes use binary match.
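The two scoring rules can be sketched as follows. This is a minimal illustration, not the code in tools/spike_roundtrip.py: the function names and the three-step example scale are hypothetical, and it assumes "normalized distance" means one minus the step distance divided by the largest possible distance on the scale.

```python
from typing import Sequence


def nominal_agreement(predicted: str, actual: str) -> float:
    """Binary match for nominal axes: 1.0 if the labels agree, else 0.0."""
    return 1.0 if predicted == actual else 0.0


def ordinal_agreement(predicted: str, actual: str, scale: Sequence[str]) -> float:
    """Agreement for ordinal axes as 1 - (step distance / max distance).

    Assumption: `scale` lists the axis levels in order, and distance is
    normalized by the number of steps between the two ends of the scale.
    """
    if len(scale) < 2:
        raise ValueError("an ordinal scale needs at least two levels")
    distance = abs(scale.index(predicted) - scale.index(actual))
    return 1.0 - distance / (len(scale) - 1)


# Hypothetical three-level scale, for illustration only.
example_scale = ["low", "medium", "high"]

print(nominal_agreement("cream", "cream"))                  # exact nominal match
print(ordinal_agreement("medium", "high", example_scale))   # one step apart
print(ordinal_agreement("low", "high", example_scale))      # opposite ends
```

Under these assumptions, a one-step miss on a three-level scale scores 0.50, which matches the shape of the 0.50 entries in the per-brand tables above.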

Golden Set: anthropic · linear · stripe · acne-studios · antimetal — 5 brands, fully hand-graded.
Test set: 25 catalog brands with ≥8 graded axes — held out for fidelity measurement.
Target bands (RFC-001 §V): det ≥ 0.85 · mix ≥ 0.70 · model ≥ 0.60
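A score can be checked against its target band along these lines. This is a sketch under stated assumptions: it takes det, mix, and model to be the extraction type of an axis, and the helper name is hypothetical. Only the three thresholds come from the line above.

```python
from typing import Optional

# Thresholds from the target bands line (RFC-001 §V).
TARGET_BANDS = {"det": 0.85, "mix": 0.70, "model": 0.60}


def is_at_target(extraction: str, agreement: Optional[float]) -> Optional[bool]:
    """Return True/False for a measured axis, or None when it is pending.

    Assumption: `extraction` is one of "det", "mix", "model", and a pending
    axis is represented by agreement=None rather than a guessed score.
    """
    if agreement is None:
        return None  # pending: the extractor abstained, nothing to compare
    return agreement >= TARGET_BANDS[extraction]


# Illustrative inputs; the extraction type of each axis is assumed here.
print(is_at_target("det", 0.92))    # above the det band
print(is_at_target("model", 0.56))  # below the model band
print(is_at_target("mix", None))    # pending axis
```

Returning None for a pending axis keeps abstentions separate from misses, in line with treating pending axes as non-predictions.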

Full per-axis agreement

Reference material — the 30 axes across 8 categories used by the judge. Agreement is the round-trip score from tools/spike_roundtrip.py on the 25-brand test set; bold scores are at-target. Axes marked pending are honest non-predictions — the extractor abstains rather than guesses. For per-brand judge-vs-hand-grade comparisons, see the Golden Set above.

Axis	Label	Scale	Extraction	Agreement	Status

at-target: meets RFC §V band · in-flight: close, scheduled fix · blocked: named external blocker · pending: not yet measured

Reproduce this

Every figure on this page is generated by code you can run. No closed evaluation set, no secret prompts.

Read the eval pipeline spec — what is being measured, and why, in detail.
Re-run round-trip on the 25-brand set: python tools/spike_roundtrip.py --weights w-2026-05-001
Inspect per-brand breakdown and known blockers in the calibration note.

RFC-001: The Open Evaluation Pipeline for Creative AI Grammar v2 · Calibration Note (May 2026)
schema/grammar-v2.json: full 30-axis specification
api/weights.json: current weights (w-2026-05-001)

How tasteHQ's judge is calibrated

The Golden Set — five hand-graded anchors.

Anthropic

Apple

Stripe

Linear

Acne Studios
