How judgment actually works
The evaluation loop in one page. What gets measured deterministically, what gets handed to an LLM, how a verdict is reached, and where the judge is honestly allowed to be wrong.
§Thesis
Generation is cheap; evaluation is the moat.
The 2026 generation of models all produce technically acceptable output. What they cannot reliably produce is differentiated output — work anchored to a specific brand’s point of view. tasteHQ judges output against a brand’s stored grammar, per axis, with a deliberate split: deterministic where the answer is reproducible (palette, type weight, density, radius) and LLM where the question is genuinely subjective (voice archetype, emphasis device, motion character). The split is the bet.
§The loop
A single round-trip from input to verdict. Implemented in api/score.py on top of
tools/extract_features.py and tools/grade_grammar.py.
INPUT
  URL (or rendered HTML) + target brand | brief
      │
      ▼
L1 · DETERMINISTIC EXTRACTION
  palette via k-means over rendered pixels
  CSS tokens via regex (radius, borders, tracking, weights)
  density via DOM counters (sections, gaps, element counts)
  voice signals via text-stats (sentence length, hedging frequency)
      │
      ▼
L2 · LLM JUDGE (subjective axes only)
  voice.archetype · emphasis.mechanism · motion.character
  surface.texture · imagery.strategy
      │
      ▼
L3 · PER-AXIS SCORE
  each axis a: c_a = clip( 1 − |b_a − o_a| / span, 0, 1 )
  weighted by catalog_prior from /api/weights.json
      │
      ▼
VERDICT
  pass (≥80) · revise (55–79) · reject (<55)
  + strongest axis, weakest axis, next_action
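The L3 scoring step and the verdict bands can be sketched in a few lines. This is a minimal illustration, not the code in api/score.py. It assumes that b is the brand grammar's value for the axis, o is the value observed in the output, and span is the width of the axis range; those readings of the symbols, and every function name below, are hypothetical.

```python
from typing import Dict, Tuple


def axis_score(brand_value: float, observed_value: float, span: float) -> float:
    """Per-axis score: clip(1 - |b - o| / span, 0, 1)."""
    if span <= 0:
        raise ValueError("span must be positive")
    raw = 1.0 - abs(brand_value - observed_value) / span
    return max(0.0, min(1.0, raw))  # clip into [0, 1]


def verdict(total: float) -> str:
    """Map a 0-100 total onto the published bands.

    The bands are stated for whole numbers; treating a fractional
    total such as 79.5 as 'revise' is an assumption of this sketch.
    """
    if total >= 80:
        return "pass"
    if total >= 55:
        return "revise"
    return "reject"


def strongest_and_weakest(scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the axis names with the highest and lowest score."""
    return max(scores, key=scores.get), min(scores, key=scores.get)
```

For example, a heading weight of 600 judged against a brand value of 700 over an assumed span of 800 gives `axis_score(700, 600, 800) == 0.875`.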
Every score carries an audit block (grammar version, judge model, weights version,
input fingerprint, timestamp) so any verdict is reproducible and contestable.
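One way to picture the audit block is as a small immutable record attached to each score. The sketch below uses the five fields named above; the class name, the field spellings, and the choice of SHA-256 over the rendered HTML as the fingerprint are assumptions for illustration, not the published format.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditBlock:
    grammar_version: str
    judge_model: str
    weights_version: str
    input_fingerprint: str
    timestamp: str


def fingerprint(rendered_html: str) -> str:
    """Stable hash of the input (assumed here to be SHA-256 of the HTML)."""
    return hashlib.sha256(rendered_html.encode("utf-8")).hexdigest()


def make_audit(rendered_html: str, grammar_version: str,
               judge_model: str, weights_version: str) -> AuditBlock:
    """Build the audit record that travels with a score."""
    return AuditBlock(
        grammar_version=grammar_version,
        judge_model=judge_model,
        weights_version=weights_version,
        input_fingerprint=fingerprint(rendered_html),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```

Because the fingerprint depends only on the input, two runs over the same page can be matched by hash even when their timestamps differ.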
§What’s deterministic, what’s LLM
The split is published per axis in schema/grammar-v2.json
under the extraction field. Reproducibility lives on the left; taste lives on the right.
| Method | Axes | Why |
|---|---|---|
| deterministic | surface.* · palette.* · type.heading_weight · type.tracking · whitespace.section_gap · voice.hedging · motion.budget | Hex values, font weights, DOM counts. Same input → same output, every run. No prompts, no temperature. |
| model-graded | voice.archetype · emphasis.mechanism · motion.character · surface.texture · imagery.strategy | Genuinely categorical taste calls (“is this voice engineer or concierge?”). An LLM with the grammar in context returns a value; absence of an API key returns n/a rather than a guess. |
| mixed | type.pairing · emphasis.cta_treatment · whitespace.discipline · voice.formality | Deterministic signal where available; model fallback when DOM evidence is ambiguous. |
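The three methods in the table amount to a small dispatch rule. The sketch below is an illustration under assumptions: the function name, the callables standing in for the extractor and the judge, and the convention that the extractor returns `None` when DOM evidence is ambiguous are all hypothetical.

```python
from typing import Callable, Optional

NA = "n/a"


def resolve_axis(method: str,
                 deterministic: Callable[[], Optional[str]],
                 judge: Callable[[], str],
                 api_key: Optional[str]) -> str:
    """Pick a value for one axis according to its extraction method."""
    if method == "deterministic":
        value = deterministic()
        if value is None:
            raise ValueError("deterministic axis produced no value")
        return value
    if method == "model-graded":
        # No key means no judge call: report n/a instead of guessing.
        return judge() if api_key else NA
    if method == "mixed":
        value = deterministic()
        if value is not None:
            return value  # deterministic signal was available
        return judge() if api_key else NA  # ambiguous evidence: model fallback
    raise ValueError(f"unknown extraction method: {method}")
```

Passing the extractor and judge as callables keeps the model call lazy, so a deterministic or keyless run never touches the LLM.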
§The Golden Set
Five hand-graded reference brands anchor the calibration chain. They were chosen for legibility, not popularity: maximal axis coverage and minimal interpretive ambiguity, across surface, palette, type, voice, and motion.
- anthropic
- apple
- stripe
- linear
- acne-studios
Per-axis agreement between the judge and the hand grades is published live at /eval.
Where the judge currently disagrees materially with hand grades — a small slice of axes, mostly
around voice.archetype and surface.texture — those rows are flagged on
the calibration page rather than averaged away.
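A rough sketch of what "flagged rather than averaged away" could look like for categorical axes follows. The data shape (brand → axis → value), the function names, and the placeholder brands and values in the usage note are made up for illustration; they are not real grades and this is not the code behind /eval.

```python
from typing import Dict, List, Tuple

Grades = Dict[str, Dict[str, str]]  # brand -> axis -> graded value


def per_axis_agreement(hand: Grades, judge: Grades) -> Dict[str, float]:
    """Fraction of brands on which the judge matches the hand grade, per axis."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for brand, axes in hand.items():
        for axis, hand_value in axes.items():
            judged = judge.get(brand, {}).get(axis)
            if judged is None:
                continue  # axis not judged for this brand
            totals[axis] = totals.get(axis, 0) + 1
            hits[axis] = hits.get(axis, 0) + (1 if judged == hand_value else 0)
    return {axis: hits[axis] / totals[axis] for axis in totals}


def flagged_rows(hand: Grades, judge: Grades) -> List[Tuple[str, str]]:
    """Every (brand, axis) pair where the judge and the hand grade differ."""
    rows = []
    for brand, axes in hand.items():
        for axis, hand_value in axes.items():
            judged = judge.get(brand, {}).get(axis)
            if judged is not None and judged != hand_value:
                rows.append((brand, axis))
    return sorted(rows)
```

With two placeholder brands that agree on `surface.texture` but split on `voice.archetype`, the agreement map reads `1.0` and `0.5`, and the single disagreeing row is still listed by name.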
§Tier promotion
Tier is a function of grammar coverage and validation status, not editorial opinion. The pipeline
promotes entries deterministically; reference-tier is reserved for hand-graded anchors.
… (tools/taste_gate.py, 0 critical fails) → reference (hand-graded only)
A pipeline-extracted entry enters as community. To be promoted, it runs through the
severity-weighted checklist in taste_gate.py, and critical items must be
at zero fails. Entries that drop below threshold are demoted to community or unrated
rather than silently retained. See TIER-RUBRIC.md for the full gate.
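The gate logic can be pictured roughly as below. Only the "zero critical fails" rule comes from the description above; the severity names other than critical, the weights, the threshold, and all identifiers are hypothetical stand-ins, and TIER-RUBRIC.md remains the authority on the real checklist.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity weights; the real rubric may differ.
SEVERITY_WEIGHTS = {"critical": 5, "major": 2, "minor": 1}


@dataclass
class CheckResult:
    name: str
    severity: str  # one of the SEVERITY_WEIGHTS keys
    passed: bool


def critical_fails(results: List[CheckResult]) -> int:
    """Count failed checks marked critical."""
    return sum(1 for r in results if r.severity == "critical" and not r.passed)


def passes_gate(results: List[CheckResult], max_weighted_fails: int = 3) -> bool:
    """True when there are no critical fails and the weighted fail total is within budget."""
    weighted = sum(SEVERITY_WEIGHTS[r.severity] for r in results if not r.passed)
    return critical_fails(results) == 0 and weighted <= max_weighted_fails
```

An entry that stops passing on a later run would be the demotion case: the same function returns `False` and the entry moves down a tier.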
§Failure modes we surface
The judge can be wrong. Per-axis output exists so you can see exactly where. Four classes of failure are listed openly, not hidden behind an average.
- CSS-in-JS opacity. Palette and token extraction degrade on heavily runtime-styled sites where the rendered DOM contains few authored class names and most colors come from inline style attributes injected by the framework.
- missing API key → n/a. The LLM-graded axes (voice.archetype, motion.character, etc.) require a model API key. Without one, those axes return n/a and are excluded from the weighted score rather than guessed.
- judge disagreement is visible. When the judge disagrees with a hand grade, the per-axis row on /eval shows it. There is no single fidelity number that absorbs the disagreement; you can always drill to the axis that broke.
- tasteHQ is a mirror, not the source. Canonical state for a brand lives at the brand’s own domain under /.well-known/taste. tasteHQ aggregates and judges; it does not own the truth. A brand can republish a corrected grammar at any time and tasteHQ re-mirrors.
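To make the n/a behaviour concrete, here is a minimal sketch of a weighted total that skips ungraded axes. It assumes that "excluded" means the remaining weights are renormalised over the graded axes only, and that n/a is represented as `None`; both are assumptions, as is the function name.

```python
from typing import Dict, Optional


def weighted_total(scores: Dict[str, Optional[float]],
                   weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean of per-axis scores on a 0-100 scale, ignoring n/a axes.

    Returns None when no axis could be graded at all.
    """
    graded = {axis: s for axis, s in scores.items() if s is not None}
    total_weight = sum(weights[axis] for axis in graded)
    if total_weight == 0:
        return None
    weighted_sum = sum(weights[axis] * s for axis, s in graded.items())
    return 100.0 * weighted_sum / total_weight
```

With `type.heading_weight` at 1.0, `type.tracking` at 0.5, equal weights, and `voice.archetype` left as `None`, the total is 75.0: the missing axis neither helps nor hurts.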
§The protocol
Grammar v2 is an open spec. Anyone implementing it can host their own canonical entry at
/.well-known/taste. tasteHQ’s job is to
aggregate those entries and judge against them — not to be the registry of record. The full
schema, conformance levels, and badge live at /spec.
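For an implementer, locating a brand's canonical entry comes down to building one URL. The helper below is a hypothetical sketch: the function name is invented, and forcing HTTPS and discarding any path or query on the input are assumptions rather than requirements taken from the spec.

```python
from urllib.parse import urlsplit, urlunsplit

WELL_KNOWN_PATH = "/.well-known/taste"


def canonical_grammar_url(brand_domain: str) -> str:
    """Build the canonical grammar URL for a brand's own domain."""
    candidate = brand_domain if "://" in brand_domain else f"https://{brand_domain}"
    parts = urlsplit(candidate)
    if not parts.netloc:
        raise ValueError(f"not a usable domain: {brand_domain!r}")
    # Keep only the host; scheme is forced to https in this sketch.
    return urlunsplit(("https", parts.netloc, WELL_KNOWN_PATH, "", ""))
```

For example, `canonical_grammar_url("example.com")` returns `https://example.com/.well-known/taste`.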