tasteHQ · the 60-second explanation

How judgment actually works

The evaluation loop in one page. What gets measured deterministically, what gets handed to an LLM, how a verdict is reached, and where the judge is honestly allowed to be wrong.

Reading time: ~3 min
Grammar: v2.0.0
Calibration: /eval
Full spec: RFC-001
Endpoint: POST /api/score

§Thesis

Generation is cheap; evaluation is the moat.

Models of the 2026 generation all produce technically acceptable output. What they cannot reliably produce is differentiated output: work anchored to a specific brand’s point of view. tasteHQ judges output against a brand’s stored grammar, per axis, with a deliberate split: deterministic where the answer is reproducible (palette, type weight, density, radius) and LLM where the question is genuinely subjective (voice archetype, emphasis device, motion character). The split is the bet.

§The loop

A single round-trip from input to verdict. Implemented in api/score.py on top of tools/extract_features.py and tools/grade_grammar.py.

  INPUT     URL (or rendered HTML)  +  target brand  |  brief
              │
              ▼
  L1 · DETERMINISTIC EXTRACTION
     palette via k-means over rendered pixels
     CSS tokens via regex (radius, borders, tracking, weights)
     density via DOM counters (sections, gaps, element counts)
     voice signals via text-stats (sentence length, hedging frequency)
              │
              ▼
  L2 · LLM JUDGE (subjective axes only)
     voice.archetype · emphasis.mechanism · motion.character
     surface.texture · imagery.strategy
              │
              ▼
  L3 · PER-AXIS SCORE
     each axis a: c_a = clip( 1 − |b_a − o_a| / span, 0, 1 )
     weighted by catalog_prior from /api/weights.json
              │
              ▼
  VERDICT   pass (≥80) · revise (55–79) · reject (<55)
            + strongest axis, weakest axis, next_action

Every score carries an audit block (grammar version, judge model, weights version, input fingerprint, timestamp) so any verdict is reproducible and contestable.
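
The L3 and VERDICT steps of the loop above can be sketched in a few lines. This is a minimal illustration, not the contents of api/score.py: the function names are hypothetical, the score is assumed to be on a 0–100 scale, and the "revise" band is assumed to cover everything from 55 up to (but not including) 80.

```python
from __future__ import annotations


def closeness(brand_value: float, observed_value: float, span: float) -> float:
    """Per-axis closeness: c_a = clip(1 - |b_a - o_a| / span, 0, 1)."""
    if span <= 0:
        raise ValueError("span must be positive")
    raw = 1.0 - abs(brand_value - observed_value) / span
    return max(0.0, min(1.0, raw))  # clip to [0, 1]


def verdict(score: float) -> str:
    """Map a 0-100 score to the three bands (assumed half-open intervals)."""
    if score >= 80:
        return "pass"
    if score >= 55:
        return "revise"
    return "reject"


def strongest_and_weakest(per_axis: dict[str, float]) -> tuple[str, str]:
    """Return the axis names with the highest and lowest closeness."""
    strongest = max(per_axis, key=per_axis.get)
    weakest = min(per_axis, key=per_axis.get)
    return strongest, weakest


# Example: brand heading weight 700, observed 500, over an assumed span of 800.
print(closeness(700, 500, 800))  # 0.75
print(verdict(75.0))             # revise
```

Keeping the clip inside `closeness` means an observation further than one full span from the brand value bottoms out at 0 instead of pulling the aggregate negative.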

§What’s deterministic, what’s LLM

The split is published per axis in schema/grammar-v2.json under the extraction field. Reproducibility lives on the left; taste lives on the right.

Method	Axes	Why
deterministic	`surface.` · `palette.` · `type.heading_weight` · `type.tracking` · `whitespace.section_gap` · `voice.hedging` · `motion.budget`	Hex values, font weights, DOM counts. Same input → same output, every run. No prompts, no temperature.
model-graded	`voice.archetype` · `emphasis.mechanism` · `motion.character` · `surface.texture` · `imagery.strategy`	Genuinely categorical taste calls (“is this voice engineer or concierge?”). An LLM with the grammar in context returns a value; absence of an API key returns `n/a` rather than a guess.
mixed	`type.pairing` · `emphasis.cta_treatment` · `whitespace.discipline` · `voice.formality`	Deterministic signal where available; model fallback when DOM evidence is ambiguous.
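
To make "same input → same output" concrete, here is a rough sketch of regex-based token extraction for two of the deterministic signals (radius and font weight). The patterns and the `extract_tokens` name are assumptions for illustration; the real tools/extract_features.py may work quite differently.

```python
from __future__ import annotations

import re

# Hypothetical patterns: capture the declared value of each property.
RADIUS_RE = re.compile(r"border-radius\s*:\s*([^;}]+)", re.IGNORECASE)
WEIGHT_RE = re.compile(r"font-weight\s*:\s*(\d{3}|bold|normal)", re.IGNORECASE)


def extract_tokens(css: str) -> dict[str, list[str]]:
    """Collect distinct radius and weight values from a CSS string.

    Values are de-duplicated and sorted so the result does not depend on
    declaration order: identical input always yields identical output.
    """
    return {
        "radius": sorted({m.strip() for m in RADIUS_RE.findall(css)}),
        "weights": sorted({m.lower() for m in WEIGHT_RE.findall(css)}),
    }


css = "h1{font-weight:700;border-radius: 8px} .btn{border-radius:8px;font-weight:500}"
print(extract_tokens(css))
# {'radius': ['8px'], 'weights': ['500', '700']}
```

No prompt or sampling temperature is involved, so reruns on the same stylesheet cannot drift.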

§The Golden Set

Five hand-graded reference brands anchor the calibration chain. They were chosen for legibility, not popularity: maximal axis coverage and minimal interpretive ambiguity, across surface, palette, type, voice, and motion.

anthropic
apple
stripe
linear
acne-studios

Per-axis agreement between the judge and the hand grades is published live at /eval. Where the judge currently disagrees materially with hand grades — a small slice of axes, mostly around voice.archetype and surface.texture — those rows are flagged on the calibration page rather than averaged away.

§Tier promotion

Tier is a function of grammar coverage and validation status, not editorial opinion. The pipeline promotes entries deterministically; reference-tier is reserved for hand-graded anchors.

unrated → community (≥ 15 axes extracted by pipeline) → verified (passes tools/taste_gate.py, 0 critical fails) → reference (hand-graded only)

A pipeline-extracted entry enters as community. To be promoted, it runs through the severity-weighted checklist in taste_gate.py, where critical items must show zero fails. An entry that later drops below a threshold is demoted to community or unrated rather than silently retained. See TIER-RUBRIC.md for the full gate.
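
The promotion chain can be read as a pure function of an entry's current state. The sketch below assumes exactly that; the `tier` function and its parameter names are hypothetical, and the actual checklist lives in taste_gate.py and TIER-RUBRIC.md.

```python
def tier(axes_extracted: int, gate_passed: bool, critical_fails: int,
         hand_graded: bool) -> str:
    """Derive a tier from coverage and validation status.

    Because the tier is recomputed from current state, an entry whose
    coverage or gate result regresses is demoted automatically.
    """
    if hand_graded:
        return "reference"   # reserved for hand-graded anchors
    if axes_extracted < 15:
        return "unrated"     # below the community coverage threshold
    if gate_passed and critical_fails == 0:
        return "verified"
    return "community"


print(tier(axes_extracted=18, gate_passed=True, critical_fails=0, hand_graded=False))
# verified
print(tier(axes_extracted=18, gate_passed=True, critical_fails=1, hand_graded=False))
# community
```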

§Failure modes we surface

The judge can be wrong. Per-axis output exists so you can see exactly where. Four classes of failure are listed openly, not hidden behind an average.

CSS-in-JS opacity: Palette and token extraction degrade on heavily runtime-styled sites where the rendered DOM contains few authored class names and most colors come from inline style attributes injected by the framework.
missing API key → n/a: The LLM-graded axes (voice.archetype, motion.character, etc.) require a model API key. Without one, those axes return n/a and are excluded from the weighted score rather than guessed.
judge disagreement is visible: When the judge disagrees with a hand grade, the per-axis row on /eval shows it. There is no single fidelity number that absorbs the disagreement; you can always drill to the axis that broke.
tasteHQ is a mirror, not the source: Canonical state for a brand lives at the brand’s own domain under /.well-known/taste. tasteHQ aggregates and judges; it does not own the truth. A brand can republish a corrected grammar at any time and tasteHQ re-mirrors.
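
The n/a rule in the list above (LLM-graded axes dropped from the weighted score when no API key is present) could look roughly like this. It is a sketch under assumptions: `None` stands in for n/a, the weights dictionary stands in for the catalog priors, and renormalizing over the remaining weights is one plausible reading of "excluded", not a confirmed detail of the scorer.

```python
from __future__ import annotations

from typing import Optional


def weighted_score(closeness_by_axis: dict[str, Optional[float]],
                   weights: dict[str, float]) -> Optional[float]:
    """Weighted 0-100 score over graded axes only.

    Axes whose closeness is None (n/a) are skipped, and the remaining
    weights are renormalized so a missing axis neither helps nor hurts.
    Returns None when no weighted axis could be graded.
    """
    graded = {a: c for a, c in closeness_by_axis.items() if c is not None}
    total_weight = sum(weights.get(a, 0.0) for a in graded)
    if total_weight <= 0:
        return None
    weighted_sum = sum(weights.get(a, 0.0) * c for a, c in graded.items())
    return 100.0 * weighted_sum / total_weight


scores = {"type.tracking": 1.0, "type.heading_weight": 0.5, "voice.archetype": None}
priors = {"type.tracking": 2.0, "type.heading_weight": 2.0, "voice.archetype": 5.0}
print(weighted_score(scores, priors))  # 75.0
```

Note that the heavily weighted `voice.archetype` axis has no effect here: treating n/a as 0 would instead have dragged the result down to about 33.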

§The protocol

Grammar v2 is an open spec. Anyone implementing it can host their own canonical entry at /.well-known/taste. tasteHQ’s job is to aggregate those entries and judge against them — not to be the registry of record. The full schema, conformance levels, and badge live at /spec.