calibration · grammar v2.0.0

How tasteHQ's judge is calibrated

We hand-grade a Golden Set of five brands, then measure how often the automated judge agrees with those grades, per axis and per brand. Every number on this page is reproducible from open code.

The Golden Set — five hand-graded anchors.

These five brands define what graded means on tasteHQ. Every other brand's judge score is calibrated against the per-axis agreement rate published below.

Anthropic

Reference

Word-level underlines replace color as the sole emphasis mechanism in display-scale headlines.

Axis Judge vs hand
Strongest agreement
palette.warmth 1.00 · exact
surface.ground 1.00 · cream ✓
emphasis.mechanism 1.00 · underline ✓
Weakest agreement
surface.radius 0.50 · pred 12–16px / actual 8–10px
type.pairing 0.00 · pred single-family / actual serif/sans
voice.archetype pending · awaits ANTHROPIC_API_KEY

Apple

Reference

Full-viewport hero: single product photo centered on white, one-line SF Pro Display name, nothing else above the fold — the page is a cinematic catalog, not a website.

Axis Judge vs hand
Strongest agreement
surface.ground 1.00 · white ✓
palette.contrast 1.00 · high ✓
surface.borders ~0.92 · hairline (25-brand)
Weakest agreement
imagery.strategy pending · n=1 only (Acne)
motion.budget pending · Playwright required
voice.archetype pending · per-brand tau to come

Stripe

Reference

Animated gradient mesh aurora runs behind the hero — brand colors signal financial infrastructure as a living, breathing system.

Axis Judge vs hand
Strongest agreement
type.pairing 1.00 · sans/sans ✓
palette.contrast 1.00 · high ✓
surface.ground 1.00 · white ✓
Weakest agreement
palette.warmth 0.00 · accent recovered, hue-bucket wrong
palette.saturation 0.50 · ordinal map needs recalibration
motion.budget pending · pervasive vs static-CSS abstention

Linear

Reference

Every action has a keyboard shortcut shown inline as a kbd-styled badge — the keyboard is treated as the primary input device, not an accelerator for power users.

Axis Judge vs hand
Strongest agreement
surface.dark_mode 1.00 · dark-first ✓
surface.borders ~0.92 · hairline (25-brand)
palette.contrast 1.00 · high ✓
Weakest agreement
voice.sentence_length ~0.56 · sparse <main> on SPA shell
motion.character pending · scroll-recording analysis
voice.archetype pending · per-brand tau to come

Acne Studios

Reference

Helvetica Monospaced Pro at 9px, 0.033em tracking, caps throughout — one face, one size, one weight across the entire UI.

Axis Judge vs hand
Strongest agreement
type.tracking 1.00 · loose, em-median exact
imagery.strategy 1.00 · photographic ✓ (only n=1 hit)
surface.ground 1.00 · white ✓
Weakest agreement
emphasis.mechanism 0.00 · pred underline / actual scale (widget CSS swamp)
motion.budget abstain · 56 Shopify @keyframes vs catalog "none"
palette.warmth / saturation 0.00 · no :root vars, fallback poisoned

Cosign spec

Every brand entry on tasteHQ ships with canonical: false in /.well-known/taste/<slug> — we mark ourselves a mirror. The brand's own homepage is the source of truth.

When a brand publishes the same shape at <brand>.com/.well-known/taste/<slug> with canonical: true, tasteHQ's mirror auto-marks itself derivative and links upstream. No vendor lock-in. Read the spec at /cosign.
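The mirror rule above can be sketched as a small pure function. This is only an illustration under stated assumptions: `canonical` is the one field named on this page, while the `derivative` and `upstream` keys, the function names, and the `https://` scheme are hypothetical and may not match the actual spec at /cosign.

```python
from typing import Optional


def upstream_url(brand_domain: str, slug: str) -> str:
    # Path shape is from this page; the https scheme is an assumption.
    return f"https://{brand_domain}/.well-known/taste/{slug}"


def resolve_mirror(mirror: dict, upstream: Optional[dict],
                   brand_domain: str, slug: str) -> dict:
    """Return the mirror entry, marked derivative if the brand cosigned.

    `upstream` is the document the brand publishes at its own domain,
    or None if nothing was found there.
    """
    resolved = dict(mirror)
    resolved["canonical"] = False  # the mirror never claims to be canonical
    if upstream is not None and upstream.get("canonical") is True:
        # Hypothetical field names: "derivative" and "upstream".
        resolved["derivative"] = True
        resolved["upstream"] = upstream_url(brand_domain, slug)
    else:
        resolved["derivative"] = False
    return resolved
```

With this sketch, an upstream document that lacks `canonical: true` leaves the mirror unchanged apart from an explicit `derivative: False`.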

F-score · Golden Set
0.839
5-brand reference set, round-trip agreement on scored axes.
Mean fidelity · 25-brand test
0.743
Held-out catalog subset (≥8 graded axes, Surge 3b · 2026-05-17).
Axes at target
9/17 scored
Meeting the per-method RFC-001 §V threshold.

Methodology, in three lines

Per-axis agreement is the round-trip score: fetch a brand's homepage, project raw CSS/HTML signals into grammar axes, then compare against the hand-graded catalog value. Ordinal axes use normalized distance; nominal axes use binary match.
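A minimal sketch of the two scoring rules follows. It assumes that "normalized distance" means the rank distance divided by the largest possible distance on the scale, with agreement reported as one minus that value. The function names and the three-level example scale are hypothetical; they are not taken from tools/spike_roundtrip.py.

```python
from typing import Optional, Sequence


def nominal_agreement(pred: Optional[str], actual: str) -> Optional[float]:
    """Binary match for nominal axes; None means the extractor abstained."""
    if pred is None:
        return None
    return 1.0 if pred == actual else 0.0


def ordinal_agreement(pred: Optional[str], actual: str,
                      scale: Sequence[str]) -> Optional[float]:
    """1 minus rank distance normalized by the scale's maximum distance."""
    if pred is None:
        return None
    if len(scale) < 2:
        raise ValueError("an ordinal scale needs at least two levels")
    levels = list(scale)
    distance = abs(levels.index(pred) - levels.index(actual))
    return 1.0 - distance / (len(levels) - 1)


# Hypothetical three-level scale: one step off scores 0.5.
saturation_scale = ["muted", "medium", "vivid"]
print(ordinal_agreement("medium", "vivid", saturation_scale))  # 0.5
print(nominal_agreement("underline", "scale"))                 # 0.0
```

Under these assumptions a one-step miss on a three-level ordinal axis scores 0.50, while any miss on a nominal axis scores 0.00. Abstentions return `None` so they can be reported as pending instead of being counted as disagreement.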

Golden Set
anthropic · linear · stripe · acne-studios · antimetal — 5 brands, fully hand-graded.
Test set
25 catalog brands with ≥8 graded axes — held out for fidelity measurement.
Target bands (RFC-001 §V)
det ≥ 0.85 · mix ≥ 0.70 · model ≥ 0.60
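The target bands can be applied as a simple threshold check. The sketch below takes the three method keys and thresholds from this page; the function names, the axis names, and the pairing of axes to methods are hypothetical, and pending axes are assumed to be excluded from the "scored" count.

```python
from typing import Dict, Optional, Tuple

# Thresholds as published in the target bands above.
TARGET_BANDS = {"det": 0.85, "mix": 0.70, "model": 0.60}


def at_target(method: str, score: Optional[float]) -> Optional[bool]:
    """True if the score meets its method's band; None if not yet measured."""
    if score is None:
        return None
    return score >= TARGET_BANDS[method]


def axes_at_target(
    axes: Dict[str, Tuple[str, Optional[float]]]
) -> Tuple[int, int]:
    """Return (axes at target, axes scored), skipping pending axes."""
    verdicts = [at_target(method, score) for method, score in axes.values()]
    scored = [v for v in verdicts if v is not None]
    return sum(scored), len(scored)


# Hypothetical axes and method assignments, for illustration only.
example = {
    "axis_a": ("det", 0.92),
    "axis_b": ("mix", 0.56),
    "axis_c": ("model", None),  # pending: extractor abstained
}
print(axes_at_target(example))  # (1, 2)
```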

Full per-axis agreement

Reference material — the 30 axes across 8 categories used by the judge. Agreement is the round-trip score from tools/spike_roundtrip.py on the 25-brand test set; bold scores are at-target. Axes marked pending are honest non-predictions — the extractor abstains rather than guesses. For per-brand judge-vs-hand-grade comparisons, see the Golden Set above.

Axis · Label · Scale · Extraction · Agreement · Status
Status legend: at-target (meets RFC §V band) · in-flight (close, scheduled fix) · blocked (named external blocker) · pending (not yet measured)

Reproduce this

Every figure on this page is generated by code you can run. No closed evaluation set, no secret prompts.

  1. Read the eval pipeline spec — what is being measured, and why, in detail.
  2. Re-run round-trip on the 25-brand set: python tools/spike_roundtrip.py --weights w-2026-05-001
  3. Inspect per-brand breakdown and known blockers in the calibration note.