tasteHQ · the 60-second explanation

How judgment actually works

The evaluation loop in one page. What gets measured deterministically, what gets handed to an LLM, how a verdict is reached, and where the judge is honestly allowed to be wrong.

Reading time: ~3 min
Grammar: v2.0.0
Calibration: /eval
Full spec: RFC-001
Endpoint: POST /api/score

§Thesis

Generation is cheap; evaluation is the moat.

Models of the 2026 generation all produce technically acceptable output. What they cannot reliably produce is differentiated output: work anchored to a specific brand’s point of view. tasteHQ judges output against a brand’s stored grammar, per axis, with a deliberate split: deterministic where the answer is reproducible (palette, type weight, density, radius) and LLM where the question is genuinely subjective (voice archetype, emphasis device, motion character). The split is the bet.

§The loop

A single round-trip from input to verdict. Implemented in api/score.py on top of tools/extract_features.py and tools/grade_grammar.py.

  INPUT     URL (or rendered HTML)  +  target brand  |  brief
              │
              ▼
  L1 · DETERMINISTIC EXTRACTION
     palette via k-means over rendered pixels
     CSS tokens via regex (radius, borders, tracking, weights)
     density via DOM counters (sections, gaps, element counts)
     voice signals via text-stats (sentence length, hedging frequency)
              │
              ▼
  L2 · LLM JUDGE (subjective axes only)
     voice.archetype · emphasis.mechanism · motion.character
     surface.texture · imagery.strategy
              │
              ▼
  L3 · PER-AXIS SCORE
     each axis a: c_a = clip( 1 − |b_a − o_a| / span, 0, 1 )
     weighted by catalog_prior from /api/weights.json
              │
              ▼
  VERDICT   pass (≥80) · revise (55–79) · reject (<55)
            + strongest axis, weakest axis, next_action
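
A minimal sketch of the L3 and VERDICT steps, under stated assumptions. It reads b_a as the brand's target value on axis a and o_a as the value observed in the output; the diagram does not spell those out, so that reading is a guess. The weighted-mean aggregation onto a 0–100 scale and the function names are also hypothetical and are not taken from api/score.py.

```python
def axis_score(brand_value: float, observed_value: float, span: float) -> float:
    """Per-axis closeness: clip(1 - |b_a - o_a| / span, 0, 1)."""
    if span <= 0:
        raise ValueError("span must be positive")
    raw = 1.0 - abs(brand_value - observed_value) / span
    return max(0.0, min(1.0, raw))  # clip to [0, 1]


def overall_score(axis_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-axis scores on a 0-100 scale (assumed aggregation rule)."""
    total_weight = sum(weights[axis] for axis in axis_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(score * weights[axis] for axis, score in axis_scores.items())
    return 100.0 * weighted / total_weight


def verdict(score: float) -> str:
    """Thresholds from the diagram: pass >= 80, revise 55-79, reject < 55."""
    if score >= 80:
        return "pass"
    if score >= 55:
        return "revise"
    return "reject"


# Example: a heading weight of 600 against a brand target of 700, on a span of 800.
weight_axis = axis_score(700, 600, 800)
```

Fractional scores between 79 and 80 fall into revise in this sketch; the diagram does not say how the real endpoint rounds.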

Every score carries an audit block (grammar version, judge model, weights version, input fingerprint, timestamp) so any verdict is reproducible and contestable.
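
One way such an audit block could be assembled, as a sketch only. The five fields come from the sentence above; the field names, the use of SHA-256 over the rendered HTML as the input fingerprint, and the ISO 8601 UTC timestamp are assumptions, not the published format.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditBlock:
    grammar_version: str
    judge_model: str
    weights_version: str
    input_fingerprint: str
    timestamp: str


def fingerprint(rendered_html: str) -> str:
    """Hex SHA-256 of the input (assumed fingerprint scheme)."""
    return hashlib.sha256(rendered_html.encode("utf-8")).hexdigest()


def make_audit(rendered_html: str, judge_model: str, weights_version: str,
               grammar_version: str = "2.0.0") -> AuditBlock:
    """Bundle everything needed to re-run and contest a verdict."""
    return AuditBlock(
        grammar_version=grammar_version,
        judge_model=judge_model,
        weights_version=weights_version,
        input_fingerprint=fingerprint(rendered_html),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical values, for illustration only.
audit = make_audit("<p>hello</p>", judge_model="example-judge", weights_version="w1")
audit_dict = asdict(audit)
```

Because the fingerprint depends only on the input, two verdicts with the same fingerprint, grammar version, and weights version should agree on every deterministic axis.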

§What’s deterministic, what’s LLM

The split is published per axis in schema/grammar-v2.json under the extraction field. Reproducibility lives on the left; taste lives on the right.

Method | Axes | Why
deterministic | surface.* · palette.* · type.heading_weight · type.tracking · whitespace.section_gap · voice.hedging · motion.budget | Hex values, font weights, DOM counts. Same input → same output, every run. No prompts, no temperature.
model-graded | voice.archetype · emphasis.mechanism · motion.character · surface.texture · imagery.strategy | Genuinely categorical taste calls (“is this voice engineer or concierge?”). An LLM with the grammar in context returns a value; absence of an API key returns n/a rather than a guess.
mixed | type.pairing · emphasis.cta_treatment · whitespace.discipline · voice.formality | Deterministic signal where available; model fallback when DOM evidence is ambiguous.
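
The routing in the table can be sketched roughly as follows. The axis sets are copied from the table; the function name, the `judge` callable, and the use of `None` to stand for n/a are hypothetical choices made for illustration.

```python
from typing import Callable, Optional

# (axis, page_text) -> categorical value; stands in for the LLM call.
Judge = Callable[[str, str], str]

MODEL_GRADED = frozenset({
    "voice.archetype", "emphasis.mechanism", "motion.character",
    "surface.texture", "imagery.strategy",
})
MIXED = frozenset({
    "type.pairing", "emphasis.cta_treatment",
    "whitespace.discipline", "voice.formality",
})


def grade_axis(axis: str, page_text: str, deterministic_value: Optional[str],
               api_key: Optional[str], judge: Judge) -> Optional[str]:
    """Return a value for the axis, or None (reported as n/a) rather than a guess."""
    if axis not in MODEL_GRADED and axis not in MIXED:
        return deterministic_value  # deterministic axis: never consults the model
    if axis in MIXED and deterministic_value is not None:
        return deterministic_value  # DOM evidence was unambiguous
    if not api_key:
        return None  # no key: n/a
    return judge(axis, page_text)


def fake_judge(axis: str, page_text: str) -> str:
    """Placeholder judge used only to exercise the routing."""
    return "engineer"
```

The key property is that a missing API key degrades model-graded axes to n/a and leaves deterministic results untouched.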

§The Golden Set

Five hand-graded reference brands anchor the calibration chain. They were chosen for legibility, not popularity: maximal axis coverage and minimal interpretive ambiguity, across surface, palette, type, voice, and motion.

Per-axis agreement between the judge and the hand grades is published live at /eval. Where the judge currently disagrees materially with hand grades — a small slice of axes, mostly around voice.archetype and surface.texture — those rows are flagged on the calibration page rather than averaged away.
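
As an illustration of what per-axis agreement could mean, the sketch below computes the share of reference brands on which the judge matches the hand grade, then flags axes that fall under a cutoff. The exact-match metric, the 0.8 cutoff, the brand keys, and the `palette.primary` axis name are all assumptions; the page does not state how /eval computes its figures.

```python
Grades = dict[str, dict[str, str]]  # grades[brand][axis] -> value


def agreement_by_axis(hand: Grades, judged: Grades) -> dict[str, float]:
    """Fraction of brands, per axis, where the judge equals the hand grade."""
    matches: dict[str, int] = {}
    totals: dict[str, int] = {}
    for brand, axes in hand.items():
        for axis, hand_value in axes.items():
            judge_value = judged.get(brand, {}).get(axis)
            if judge_value is None:
                continue  # n/a rows are skipped, not counted as disagreement
            totals[axis] = totals.get(axis, 0) + 1
            matches[axis] = matches.get(axis, 0) + int(judge_value == hand_value)
    return {axis: matches[axis] / totals[axis] for axis in totals}


def flagged_axes(agreement: dict[str, float], cutoff: float = 0.8) -> list[str]:
    """Axes to flag individually instead of folding into an average."""
    return sorted(axis for axis, rate in agreement.items() if rate < cutoff)


# Made-up grades for two placeholder brands.
hand = {
    "brand_a": {"voice.archetype": "engineer", "palette.primary": "#111111"},
    "brand_b": {"voice.archetype": "concierge", "palette.primary": "#ffffff"},
}
judged = {
    "brand_a": {"voice.archetype": "concierge", "palette.primary": "#111111"},
    "brand_b": {"voice.archetype": "concierge", "palette.primary": "#ffffff"},
}
rates = agreement_by_axis(hand, judged)
```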

§Tier promotion

Tier is a function of grammar coverage and validation status, not editorial opinion. The pipeline promotes entries deterministically; reference-tier is reserved for hand-graded anchors.

unrated
community  (≥ 15 axes extracted by pipeline)
verified  (passes tools/taste_gate.py, 0 critical fails)
reference  (hand-graded only)

A pipeline-extracted entry enters as community. To promote, it runs through the severity-weighted checklist in taste_gate.py; critical items must be at zero fails. Entries that drop below threshold are demoted to community or unrated rather than silently retained. See TIER-RUBRIC.md for the full gate.
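
The promotion rule reads as a pure function of coverage and gate results. The sketch below is one hedged interpretation: the function name and arguments are hypothetical, it assumes verified still requires the 15-axis coverage floor, and it does not reproduce the actual checklist in tools/taste_gate.py.

```python
from typing import Optional

MIN_AXES_FOR_COMMUNITY = 15  # ">= 15 axes extracted by pipeline"


def tier(axes_extracted: int, critical_fails: Optional[int],
         hand_graded: bool = False) -> str:
    """Recompute the tier from scratch, so demotion falls out automatically.

    critical_fails is None when the gate has not been run on the entry.
    """
    if hand_graded:
        return "reference"  # reserved for hand-graded anchors
    if axes_extracted < MIN_AXES_FOR_COMMUNITY:
        return "unrated"
    if critical_fails == 0:
        return "verified"
    return "community"
```

Recomputing the tier on every run, rather than storing it, is what keeps an entry from being silently retained: an entry that later fails a critical item comes back as community, and one that loses coverage comes back as unrated.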

§Failure modes we surface

The judge can be wrong. Per-axis output exists so you can see exactly where. Four classes of failure are listed openly, not hidden behind an average.

§The protocol

Grammar v2 is an open spec. Anyone implementing it can host their own canonical entry at /.well-known/taste. tasteHQ’s job is to aggregate those entries and judge against them — not to be the registry of record. The full schema, conformance levels, and badge live at /spec.