---
title: Grammar v2.0 — Calibration Note
period: 2026-05-17
grammar-version: 2.0.0
status: in flight
authors:
  - Simone Leonelli — Studio W230
companion:
  - RFC-001 (eval pipeline spec)
  - schema/grammar-v2.json
  - api/weights.json
---

# Grammar v2 · Calibration Note (May 2026)

This is the **honest scoreboard** of the 30-axis Design Grammar as of `grammar-v2.0.0`. Each row is one axis: its target extraction method, its current round-trip score on the existing 101-brand catalog, and the work needed before it reaches its target band.

The note exists because RFC-001 makes a claim that needs immediate empirical grounding: *the grammar isn't a vision document, it's a measurement system*. This page is where we publish the measurements as we get them.

> **Reproducibility:** every number below is produced by `tools/spike_roundtrip.py` against `api/weights.json` version `w-2026-05-001`. Re-run it at any time to verify.

---

## Methodology

**How "current" is measured.** Each axis score is a round-trip fidelity score: (1) fetch the brand's own homepage via `tools/extract_features.py` using static HTTP (no JavaScript, no Playwright); (2) project the raw CSS/HTML signals onto grammar axes via `tools/grade_grammar.py`; (3) compare each predicted value to the hand-graded value in the catalog using the `scale` and `ordinal_map` from `schema/grammar-v2.json`. For `ordinal` axes, partial credit decreases with the normalized ordinal distance between the predicted and graded values; for `nominal` axes, any mismatch scores 0.0 and an exact match scores 1.0. Overall brand fidelity is the catalog-prior-weighted mean of scored axes (weights from `api/weights.json`). Axis means in the table below are unweighted means across brands.
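
The scoring rule above can be sketched as follows. This is a minimal illustration, not the code in `tools/spike_roundtrip.py`: the function names, the linear form `1 - distance / span` for ordinal partial credit, and the example `ordinal_map` are assumptions.

```python
from typing import Dict, Optional


def axis_score(predicted: str, graded: str, scale: str,
               ordinal_map: Optional[Dict[str, int]] = None) -> float:
    """Round-trip score for one axis on one brand, in [0.0, 1.0]."""
    if scale == "nominal":
        # Exact match or nothing.
        return 1.0 if predicted == graded else 0.0
    if scale == "ordinal":
        if not ordinal_map:
            raise ValueError("ordinal axes need an ordinal_map")
        span = max(ordinal_map.values()) - min(ordinal_map.values())
        if span == 0:
            return 1.0
        distance = abs(ordinal_map[predicted] - ordinal_map[graded])
        # Assumed linear partial credit: one step off on a 3-level axis scores 0.5.
        return 1.0 - distance / span
    raise ValueError(f"unknown scale: {scale}")


def brand_fidelity(scores: Dict[str, Optional[float]],
                   weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean over scored axes; axes with score None (abstained) are skipped."""
    scored = {axis: s for axis, s in scores.items() if s is not None}
    total_weight = sum(weights[axis] for axis in scored)
    if total_weight == 0:
        return None
    return sum(weights[axis] * s for axis, s in scored.items()) / total_weight


# Hypothetical ordinal map, for illustration only.
levels = {"low": 0, "medium": 1, "high": 2}
print(axis_score("medium", "high", "ordinal", levels))  # 0.5
```

Skipping abstained axes in `brand_fidelity` renormalizes the weights over the axes that were actually scored, so an abstention neither helps nor hurts the brand's overall figure.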

Figures labeled "25 brands" are measured on the **25-brand extended set** (the 25 most richly hand-graded entries from the 102-brand catalog, with ≥8 graded axes and a reachable `official_url`). The 5-brand reference set (`anthropic`, `linear`, `stripe`, `acne-studios`, `antimetal`) is used for regression testing and for the "Current" column of the 30-row inventory. Numbers from the two sets are labeled accordingly.

**How "target" is set.** RFC-001 §V defines per-method bands: `deterministic ≥ 0.85`, `mixed ≥ 0.70`, `model_graded ≥ 0.60`. These reflect the practical ceiling of each extraction method given signal noise and inter-judge variance. A deterministic regex over brand CSS should hit 0.85+ by design; a model-graded judgment over a noisy live URL cannot, and human inter-judge Kendall's W on those axes typically sits at 0.50–0.65.
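
As a small sketch of how the bands translate into a pass/fail check (the constant and function names are hypothetical; the thresholds are the ones quoted from RFC-001 §V above):

```python
from typing import Optional

# Round-trip target per extraction method, as quoted from RFC-001 §V.
TARGET_BANDS = {
    "deterministic": 0.85,
    "mixed": 0.70,
    "model_graded": 0.60,
}


def meets_target(method: str, score: Optional[float]) -> Optional[bool]:
    """True/False for a scored axis; None when the axis has no score yet."""
    if score is None:
        return None
    return score >= TARGET_BANDS[method]


print(meets_target("deterministic", 0.88))  # True
print(meets_target("mixed", 0.67))          # False
```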

**What "n/a" means in the Current column.** When the extractor abstains — because the signal is too noisy to predict reliably — the row shows `n/a`. This is *not* a failure; it is an explicit non-prediction. Scoring `n/a` rows as 0.0 would penalize honest abstention. The axes that score `n/a` are: `surface.texture` (widget CSS gradient pollution), `motion.budget` (platform CSS @keyframes pollution), `voice.archetype` (requires `ANTHROPIC_API_KEY`), and all imagery/motion model-graded axes pending Playwright.
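
To make the effect of abstention concrete, here is a minimal sketch with illustrative scores (not catalog data) and a hypothetical helper name. An abstained brand is represented as `None` and left out of the mean rather than counted as zero.

```python
from typing import List, Optional


def axis_mean(per_brand_scores: List[Optional[float]]) -> Optional[float]:
    """Unweighted mean across brands, ignoring abstentions (None)."""
    scored = [s for s in per_brand_scores if s is not None]
    if not scored:
        return None  # reported as n/a, not 0.0
    return sum(scored) / len(scored)


scores = [1.0, None, 0.5]  # the middle brand abstained
print(axis_mean(scores))   # 0.75

# Counting the abstention as a wrong answer would drag the mean down instead.
zero_filled = [0.0 if s is None else s for s in scores]
print(sum(zero_filled) / len(zero_filled))  # 0.5
```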

**Weights version.** `api/weights.json` `w-2026-05-001`. `catalog_prior` per axis = `0.50 × extraction_prior + 0.35 × coverage_factor + 0.15 × discrimination`. Last updated: 2026-05-17.
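
The prior formula can be written out directly. In this sketch the function name and the example inputs are assumptions, and the three inputs are assumed to lie in [0, 1]; since the coefficients sum to 1.0, the result then stays in [0, 1] as well.

```python
def catalog_prior(extraction_prior: float, coverage_factor: float,
                  discrimination: float) -> float:
    """Per-axis weight: 0.50 * extraction + 0.35 * coverage + 0.15 * discrimination."""
    return (0.50 * extraction_prior
            + 0.35 * coverage_factor
            + 0.15 * discrimination)


# Illustrative inputs only, not values from api/weights.json.
print(round(catalog_prior(0.9, 0.8, 0.4), 2))  # 0.79
```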

---

## Headline

|                                | Count | Of which currently at target | Working axes (≥0.70) |
|---                             |---:   |---:                          |---:                  |
| **Deterministic** axes (target ≥0.85) | 19    | 6                            | 8                    |
| **Mixed** axes (target ≥0.70)         | 4     | 1                            | 1                    |
| **Model-graded** axes (target ≥0.60)  | 7     | 1                            | 1                    |
| **Total**                             | **30**| **8**                        | **10**               |

**Status in one line:** *9 of 17 scored axes meet or exceed their RFC target. Overall mean fidelity: **0.743 across 25 brands** (Surge 3b measurement 2026-05-17). The remaining axes have named blockers — widget CSS pollution (palette), Playwright required (motion/texture), or `ANTHROPIC_API_KEY` needed (voice.archetype).*

---

## Target bands by extraction method

The RFC sets different thresholds per method, because the methods have different ceilings. A deterministic regex over CSS should hit ≥0.85 round-trip — anything less is a bug. A model-graded judgment over a noisy live URL legitimately cannot, and inter-judge Kendall's W on those axes hovers in the 0.50–0.65 range even between humans.

| Method            | Round-trip target | Validation target | Rationale |
|---                |---:               |---:               |---        |
| `deterministic`   | ≥ 0.85            | n/a               | Pure CSS/HTML signal extraction. Ceiling is parser quality. |
| `mixed`           | ≥ 0.70            | ≥ 0.65            | Deterministic signal feeds an LLM grader. Compounded noise. |
| `model_graded`    | ≥ 0.60            | ≥ 0.55 (Kendall's W) | Judge-only. Quarterly human spot-check required (RFC §VI). |
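
The validation target for `model_graded` axes is expressed as Kendall's W. A minimal sketch of the standard coefficient, without tie correction, is shown here; it assumes each judge supplies a complete ranking (1..n) of the same n items, and how axis grades are converted into ranks is not specified in this note.

```python
from typing import List


def kendalls_w(rankings: List[List[int]]) -> float:
    """Kendall's coefficient of concordance W for m judges ranking n items.

    rankings[j][i] is the rank judge j gives item i (1..n, no ties).
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank sums from their mean.
    """
    m = len(rankings)
    n = len(rankings[0])
    if m < 2 or n < 2:
        raise ValueError("need at least 2 judges and 2 items")
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three judges in full agreement give W = 1.0.
print(kendalls_w([[1, 2, 3, 4]] * 3))  # 1.0
```

W ranges from 0 (no agreement) to 1 (identical rankings), which is why a human-to-human value of 0.50 to 0.65 caps what a judge model can be expected to reach.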

"Round-trip" = grade the brand's own homepage and compare to the hand-graded value in the catalog. "Validation" = inter-judge agreement on the same input.

---

## The 30-row inventory

Sorted by category, then by `ordinal_position`. **Status** is one of `✓ at-target · ◑ in-flight · ⚠ blocked · ○ not-yet-attempted`.

### Surface — 6 axes

| Axis                | Method        | Target  | Current | Status | Blocker / ETA |
|---                  |---            |---:     |---:     |---     |---            |
| `surface.ground`    | deterministic | ≥0.85   | **1.00** | ✓ at-target | Fixed: multi-`:root`-block scan finds warm near-white tokens (cream/ivory) when body rule says `#fff`. *Surge 3.* |
| `surface.radius`    | deterministic | ≥0.85   | **0.75** | ◑ in-flight | Wired in Surge 2; close to target. Outlier: Anthropic predicted 12-16px vs. actual 8-10px. *Surge 4.* |
| `surface.borders`   | deterministic | ≥0.85   | **1.00** | ✓ at-target | Modal width (not max) avoids focus-ring inflation. *Surge 2.* |
| `surface.shadows`   | deterministic | ≥0.85   | 0.49    | ⚠ blocked | `box-shadow` parsing exists; ordinal mapping needs calibration. *Surge 4.* |
| `surface.texture`   | model_graded  | ≥0.60   | n/a†    | ⚠ blocked | Widget CSS (Shopify 180 gradients, Webflow) poisons signal; only animated CSS gradient is reliable. Flat only when gc=0,bi=0. †not predicting rather than wrong. *Surge 4 / Playwright.* |
| `surface.dark_mode` | deterministic | ≥0.85   | **1.00** | ✓ at-target | *Surge 2.* |

### Palette — 5 axes

| Axis                  | Method        | Target  | Current | Status | Blocker / ETA |
|---                    |---            |---:     |---:     |---     |---            |
| `palette.strategy`    | deterministic | ≥0.85   | 0.61    | ◑ in-flight | Chromatic-family counter conflates dark-mode duplicates. *Surge 2 wk 1.* |
| `palette.saturation`  | deterministic | ≥0.85   | 0.42    | ◑ in-flight | Accent fix landed. Works where `:root` CSS vars exist (Anthropic 0.75). Sites without `--primary`/`--accent` vars still fail ordinal mapping. *Ordinal-binning audit Surge 2 wk 2.* |
| `palette.warmth`      | deterministic | ≥0.85   | 0.33    | ◑ in-flight | Same fix. Anthropic: 1.00. Other brands: color recovered but warmth hue-bucket cutoffs need recalibration. *Ordinal-binning audit Surge 2 wk 2.* |
| `palette.accent_use`  | deterministic | ≥0.85   | 0.55    | ◑ in-flight | Frequency counter ignores accent-as-gradient-stop. *Surge 2 wk 2.* |
| `palette.contrast`    | deterministic | ≥0.85   | **1.00** | ✓ at-target | — |

### Type — 5 axes

| Axis                  | Method        | Target  | Current | Status | Blocker / ETA |
|---                    |---            |---:     |---:     |---     |---            |
| `type.pairing`        | mixed         | ≥0.70   | 0.50    | ◑ in-flight | Heading+body font classification working; 2/4 correct (Stripe sans/sans ✓, Antimetal serif/sans ✓). Anthropic serif/sans wrong (pred: single-family). *Surge 4.* |
| `type.heading_weight` | deterministic | ≥0.85   | **0.75** | ◑ in-flight | Close. Outlier: Antimetal regular→bold mismatch. *Surge 4.* |
| `type.body_size`      | deterministic | ≥0.85   | —       | ○ not-yet-attempted | Body-size detection still picks up footer text. *Surge 4.* |
| `type.display_scale`  | deterministic | ≥0.85   | —       | ○ not-yet-attempted | Needs h1-vs-body ratio. *Surge 4.* |
| `type.tracking`       | deterministic | ≥0.85   | **0.83** | ◑ in-flight | Median + em-unit normalization. Acne-Studios exact. Low n=2. *Surge 4: expand sample.* |

### Emphasis — 3 axes

| Axis                       | Method        | Target  | Current | Status | Blocker / ETA |
|---                         |---            |---:     |---:     |---     |---            |
| `emphasis.mechanism`       | model_graded  | ≥0.60   | 0.50    | ◑ in-flight | CSS-signal heuristic working. Anthropic underline ✓. Acne-Studios scale wrong (64 widget underline rules outweigh 4 brand scale rules). *Surge 4: CSS origin filtering.* |
| `emphasis.cta_treatment`   | mixed         | ≥0.70   | 0.48    | ◑ in-flight | Deterministic CTA scrape works; LLM mapping to ordinal needs few-shot tuning. *Surge 4.* |
| `emphasis.density_signals` | model_graded  | ≥0.60   | —       | ○ not-yet-attempted | Requires layout-density classifier on screenshot. *Surge 4+.* |

### Whitespace — 3 axes

| Axis                    | Method        | Target  | Current | Status | Blocker / ETA |
|---                      |---            |---:     |---:     |---     |---            |
| `whitespace.discipline` | mixed         | ≥0.70   | 0.53    | ◑ in-flight | Density signal working; ordinal mapping to discipline cutoffs needs calibration from catalog. *Surge 2 wk 2.* |
| `whitespace.section_gap`| deterministic | ≥0.85   | 0.41    | ⚠ blocked | Section detection requires DOM heuristic; needs `<section>` + spacing-margin heuristic. *Surge 2 wk 2.* |
| `whitespace.element_gap`| deterministic | ≥0.85   | 0.38    | ⚠ blocked | Same family of fix. *Surge 2 wk 2.* |

### Voice — 4 axes

| Axis                    | Method        | Target  | Current | Status | Blocker / ETA |
|---                      |---            |---:     |---:     |---     |---            |
| `voice.archetype`       | model_graded  | ≥0.60   | —†      | ◑ in-flight | Judge wired (`claude-haiku-4-5-20251001`, 5-shot prompt `prompts/voice_archetype_v1.txt`). Falls back to "generic" when `ANTHROPIC_API_KEY` unset → scores 0.00. †"—" = blocked non-prediction, not a measurement. *Activate in CI with key.* |
| `voice.formality`       | mixed         | ≥0.70   | **0.67** | ◑ in-flight | Slightly below target due to formality disagreements. *Surge 4.* |
| `voice.hedging`         | deterministic | ≥0.85   | —       | ○ not-yet-attempted | Counter exists; `<main>` scoping applied. *Surge 4.* |
| `voice.sentence_length` | deterministic | ≥0.85   | **0.88** | ✓ at-target | `<main>` scoping fixed nav/footer pollution. *Surge 3.* |

### Motion — 3 axes

| Axis               | Method        | Target  | Current | Status | Blocker / ETA |
|---                 |---            |---:     |---:     |---     |---            |
| `motion.budget`    | deterministic | ≥0.85   | n/a†    | ⚠ blocked | Static CSS unreliable: Acne-Studios has 56 Shopify @keyframes but catalog="none". Extractor defers (returns `{}`). †not predicting. *Surge 4 / Playwright.* |
| `motion.character` | model_graded  | ≥0.60   | —       | ○ not-yet-attempted | Requires scroll-recording analysis (we have 79 recordings). *Surge 4+.* |
| `motion.trigger`   | model_graded  | ≥0.60   | —       | ○ not-yet-attempted | Same. *Surge 4+.* |

### Imagery — 1 axis

| Axis               | Method        | Target  | Current | Status | Blocker / ETA |
|---                 |---            |---:     |---:     |---     |---            |
| `imagery.strategy` | model_graded  | ≥0.60   | **1.00** | ✓ at-target | CSS/HTML heuristic: img_count≥15+svg<25+photo_alt_hint≥1 → photographic. `og_is_photo` removed (marketing OG images are JPEG regardless of strategy). n=1 scored. *Surge 3.* |

---

## What the accent fix delivered

The fix rewrote `_find_accent` in `tools/extract_features.py` with the priority order `:root` CSS vars → `:root` block hex → CTA selectors → frequency fallback, plus a 12-color foreign-palette blacklist. Measurements from `spike_roundtrip.py` after merging:

| Axis | Before | After | n brands |
|---|---:|---:|---:|
| `palette.saturation` | 0.00 | **0.42** | 3 |
| `palette.warmth`     | 0.00 | **0.33** | 3 |

**By brand** — the fix works exactly where CSS design tokens are used:
- Anthropic: saturation 0.75, warmth **1.00** (exact match)
- Stripe: saturation 0.50, warmth 0.00 (accent recovered; hue-bucket mapping wrong)
- Acne-Studios: saturation 0.00, warmth 0.00 (no `:root` vars; fallback still poisoned)

The overall headline does not change — axes at target stays at **5/30 (17%)** — because the improvement is real but the ordinal-binning cutoffs for saturation and warmth need recalibration against the full catalog distribution before they can reach target. That is step 4 of the Surge 2 plan below.

The projection of ~0.75 held for the "best case" brand (Anthropic). It did not hold cross-brand. The lesson: `:root` variable extraction is necessary but not sufficient — the ordinal map is the second half of the fix.
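
The priority order described for `_find_accent` can be sketched as a chain of fallbacks. This is an illustration under assumptions, not the code in `tools/extract_features.py`: the regexes, the CTA selector names (`.btn`, `.button`, `.cta`), the 6-digit-hex restriction, and the function name are all hypothetical, and the blacklist contents are left to the caller.

```python
import re
from collections import Counter
from typing import FrozenSet, Optional, Tuple

HEX = r"#[0-9a-fA-F]{6}\b"
ROOT_BLOCK = re.compile(r":root\s*\{([^}]*)\}")
ACCENT_VAR = re.compile(r"--(?:primary|accent)[\w-]*\s*:\s*(" + HEX + r")")
CTA_RULE = re.compile(r"\.(?:btn|button|cta)[^{}]*\{([^}]*)\}")  # assumed selectors


def find_accent_sketch(css: str,
                       blacklist: FrozenSet[str] = frozenset()
                       ) -> Optional[Tuple[str, str]]:
    """Return (stage_name, lowercase hex) from the first stage that yields a color."""

    def first_allowed(candidates):
        for color in candidates:
            if color.lower() not in blacklist:
                return color.lower()
        return None

    root_bodies = ROOT_BLOCK.findall(css)  # every :root block, not only the first
    all_hex = Counter(h.lower() for h in re.findall(HEX, css))
    stages = [
        ("root_var", [c for body in root_bodies for c in ACCENT_VAR.findall(body)]),
        ("root_hex", [c for body in root_bodies for c in re.findall(HEX, body)]),
        ("cta", [c for body in CTA_RULE.findall(css) for c in re.findall(HEX, body)]),
        ("frequency", [c for c, _ in all_hex.most_common()]),
    ]
    for name, candidates in stages:
        hit = first_allowed(candidates)
        if hit is not None:
            return name, hit
    return None


css = ":root { --bg: #ffffff; --accent: #D97757; } .btn { background: #000000; }"
print(find_accent_sketch(css))  # ('root_var', '#d97757')
```

Returning the stage name alongside the color makes it easy to see, per brand, whether a score came from a design token or from the noisier fallbacks.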

---

## Surge 2 plan (one week, sequential)

1. ~~**Land the accent fix.**~~ ✓ Done.
2. ~~**Plumb raw signals.**~~ ✓ Done — `border-radius`, `border-width`, `letter-spacing` (em-aware, median), `transition` count wired to grader.
3. ~~**Body-text scoping.**~~ ✓ Done — `<main>` tracking in `_FeatureParser`; voice/sentence_length confined to prose from main content.
4. ~~**Ordinal binning audit.**~~ ✓ Done — `motion.budget` deferred (static CSS unreliable); `whitespace.discipline` prose-char-length proxy; `surface.shadows` ordinal map calibrated.
5. ~~**Voice prompts.**~~ ✓ Done — `prompts/voice_archetype_v1.txt` written; 5-shot exemplars; `_judge_archetype()` wired. Awaits `ANTHROPIC_API_KEY`.
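
Step 2 describes `letter-spacing` as em-aware with a median. One way that could look is sketched here; the function name, the `(value, font_size_px)` input shape, and the 16px default are assumptions rather than the wired implementation.

```python
import re
from statistics import median
from typing import Iterable, Optional, Tuple

_LETTER_SPACING = re.compile(r"^(-?\d*\.?\d+)(em|px)$")


def tracking_em(declarations: Iterable[Tuple[str, Optional[float]]],
                default_font_px: float = 16.0) -> Optional[float]:
    """Median letter-spacing in em.

    declarations: (letter-spacing value, font-size in px if known) pairs.
    px values are divided by the font size so they are comparable with em values.
    """
    values = []
    for raw, font_px in declarations:
        raw = raw.strip().lower()
        if raw == "normal":
            values.append(0.0)
            continue
        match = _LETTER_SPACING.match(raw)
        if not match:
            continue  # skip units this sketch does not handle (rem, %, calc())
        number, unit = float(match.group(1)), match.group(2)
        if unit == "em":
            values.append(number)
        else:
            values.append(number / (font_px or default_font_px))
    return median(values) if values else None


print(tracking_em([("-0.02em", None), ("0.32px", 16.0), ("normal", None)]))  # 0.0
```

The median is used instead of the mean so that a handful of extreme rules do not shift the result.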

---

## Surge 3 additions (model-graded + static texture/imagery)

New extractors added to `tools/grade_grammar.py` and `tools/extract_features.py`:
- `extract_texture`: animated-gradient CSS → mesh; gc=0,bi=0 → flat; else no prediction (widget CSS pollution)
- `extract_emphasis`: weighted CSS signal scores (scale×3, underline×2, color/weight×1.5, cards×1.2)
- `extract_imagery`: img_count≥15+svg<25+photo_hints≥1 → photographic; svg≥50+img<5 → illustration; else no prediction
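
The three rules above can be sketched as plain functions. The function and parameter names are hypothetical, `gc` and `bi` are read here as gradient count and background-image count (an assumption), and "color/weight×1.5" is read as both signals carrying a weight of 1.5. Returning `None` stands for "no prediction".

```python
from typing import Dict, Optional

# Weights as listed above; splitting color/weight into two keys is an assumption.
EMPHASIS_WEIGHTS = {"scale": 3.0, "underline": 2.0, "color": 1.5,
                    "weight": 1.5, "cards": 1.2}


def classify_texture(gradient_count: int, background_image_count: int,
                     has_animated_gradient: bool) -> Optional[str]:
    if has_animated_gradient:
        return "mesh"
    if gradient_count == 0 and background_image_count == 0:
        return "flat"
    return None  # widget CSS may be polluting the signal


def classify_emphasis(rule_counts: Dict[str, int]) -> Optional[str]:
    scores = {k: EMPHASIS_WEIGHTS[k] * n for k, n in rule_counts.items() if n > 0}
    return max(scores, key=scores.get) if scores else None


def classify_imagery(img_count: int, svg_count: int,
                     photo_hints: int) -> Optional[str]:
    if img_count >= 15 and svg_count < 25 and photo_hints >= 1:
        return "photographic"
    if svg_count >= 50 and img_count < 5:
        return "illustration"
    return None


# 64 underline rules (64 * 2 = 128) outweigh 4 scale rules (4 * 3 = 12).
print(classify_emphasis({"underline": 64, "scale": 4}))  # underline
```

The last line reproduces the Acne-Studios failure mode from the inventory: raw rule counts from widget CSS dominate unless the rules are filtered by origin first.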

Key bug fixed: `og_is_photo` removed from imagery classifier — marketing OG images are JPEG for all tech brands regardless of visual strategy.

Key bug fixed: `_warm_near_white_from_root` scans **all** `:root` blocks (design-token systems inject multiple `:root` blocks in inline `<style>` tags).

| Axis | Before Surge 3 | After Surge 3 | Notes |
|---|---:|---:|---|
| `surface.ground` | 0.50 | **1.00** | Cream detection via multi-`:root` scan |
| `surface.dark_mode` | 0.75 | **1.00** | Fully wired |
| `surface.borders` | 0.58 | **1.00** | Modal width |
| `imagery.strategy` | — | **1.00** | n=1 (Acne-Studios photographic) |
| `voice.sentence_length` | 0.75 | **0.88** | `<main>` scoping |
| `type.tracking` | 0.28 | **0.83** | Median + em units |
| `type.heading_weight` | 0.75 | **0.75** | unchanged |
| `surface.texture` | — | n/a† | †not predicting; widget CSS poison |
| `voice.archetype` | — | **0.00** | Wired; needs `ANTHROPIC_API_KEY` |

**Overall mean fidelity: 0.743** across 17 scored axes, **25-brand extended set** (2026-05-17 Surge 3b run).
5-brand reference set: 0.775. Degradation 5→25 brands: 4% — within the expected variance of new edge cases.
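
The degradation figure can be checked with one line of arithmetic, assuming the 4% is meant relative to the 5-brand score (the helper name is hypothetical):

```python
def relative_degradation(reference: float, extended: float) -> float:
    """Fractional drop from the reference-set score to the extended-set score."""
    return (reference - extended) / reference


print(f"{relative_degradation(0.775, 0.743):.1%}")  # 4.1%
```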

**Axes at or above target (9/17):** `emphasis.mechanism` (1.0), `imagery.strategy` (1.0), `palette.contrast` (1.0), `surface.ground` (1.0), `surface.texture` (1.0), `surface.borders` (0.92), `type.heading_weight` (0.89), `surface.radius` (0.78), `voice.formality` (0.76).

| Axis on 25-brand set | Score | Notes |
|---|---:|---|
| `type.tracking` | 0.667 | n=2, low sample — Surge 4 expand |
| `palette.strategy` | 0.667 | n=21, consistent moderate |
| `whitespace.discipline` | 0.653 | n=25, close to ≥0.70 |
| `surface.dark_mode` | 0.600 | misses `prefers-color-scheme` only sites |
| `voice.sentence_length` | 0.556 | SPA-heavy brands have sparse `<main>` |
| `palette.saturation` | 0.500 | widget CSS palette pollution |
| `type.pairing` | 0.400 | known gap — heading/body classifier |
| `palette.warmth` | 0.250 | widget CSS palette pollution |

---

## Limitations of this note

- The "Current" numbers in the 30-row inventory are **round-trip on the reference 5-brand set**; 25-brand figures are labeled as such. Both measure parser fidelity, not whether the axis correlates with human judgment of taste. Human validation is scheduled for Surge 4 per RFC §VI.
- "Expected" numbers carry no audit weight. Only `current` numbers are citable.
- The Kendall's W validation column is empty across the board; that is a Surge 4 deliverable.
- Widget CSS pollution (`palette.*`, `surface.texture`, `motion.budget`, `emphasis.*`) cannot be resolved without CSS-origin filtering or a Playwright render. These axes do not count against the 17 scored.

---

*Calibration · Studio W230 · Legibility Engineering · 2026-05-17 (Surge 3b: CSS-origin filter, 25-brand validation)*