---
rfc: 001
title: The Open Evaluation Pipeline for Creative AI
status: Draft
author: Simone Leonelli — Studio W230
date: 2026-05-17
target-version: tasteHQ v2.0
supersedes: —
companion-essays:
  - https://mustbesimo.github.io/taste-layer/taste-layer/
  - https://mustbesimo.github.io/taste-layer/curation-stack/
---

# RFC-001 · The Open Evaluation Pipeline for Creative AI

> The frustration with AI tools is not that they produce bad work, but that they produce **undifferentiated** work.

This RFC specifies the next major capability of `tasteHQ`: an open, agent-native evaluation pipeline for creative AI output. It is the operational counterpart to the two published essays — *The Taste Layer* (enterprise quality gates) and *The Curation Stack* (marketplace quality infrastructure) — and the open-source response to closed industry benchmarks.

The pipeline answers four questions an agent or human cannot currently answer except by hand:

1. **How close is this output to a target brand?** *(brand-fidelity score)*
2. **Which brand in the catalog does this output most resemble?** *(style discovery)*
3. **How do two outputs differ along which axes of taste?** *(curatorial diff)*
4. **What lineage does this output belong to, and is it advancing or repeating it?** *(tradition arc)*

These are not separate features. They are four interrogations of one underlying object — the Design Grammar — viewed from four angles.

---

## I · Why now

The creative-AI category has, in the last twelve months, crossed a threshold the rest of the AI industry crossed two years ago: **generation is no longer the bottleneck**. The 2026 cohort of image, video, and code models all produce *technically acceptable* output most of the time. What they cannot reliably produce is *differentiated* output — work with a specific point of view, anchored to a specific brand, located in a specific tradition.

The April 2026 Human Creativity Benchmark (HCB) from Contra Labs documented this empirically: across 15,555 human judgments, no current model is reliably good at both *prompt adherence* (the convergent end of evaluation) and *visual appeal* (the divergent end). Contra Labs named the symptom. They did not name the cure, because the cure is not better models but **a vocabulary for taste**.

tasteHQ already has that vocabulary: the 30-axis Design Grammar, anchored in 101 hand-curated brand entries, half of them with frame-accurate scroll-recordings. What it does not yet have is a public, queryable function that turns the vocabulary into an evaluation. That function is the subject of this RFC.

The window is open and closing. Whoever publishes the open creative-eval standard in 2026 sets the terminology for the decade. Closed benchmarks will produce numbers; an open pipeline produces a *language*.

---

## II · Foundations — three masters, three layers

The pipeline is anchored in three traditions of taste, each mapped to a layer of the system. This is the philosophical scaffold; the engineering follows from it, not the other way around.

### Eco · Taste as code

Umberto Eco's *Opera Aperta* (1962) is the proposition that an aesthetic object is not a fixed message but a **code with an authorized range of readings**. The work is open; the code is structured. A brand, in this reading, is a code: a finite, decomposable set of signs (typography, color, motion, rhythm, copy register, density, restraint) within which the brand "speaks." When a brand is used well, it is because the author respected the code while making a fresh utterance inside it.

This is the layer the **Design Grammar** lives in. Each of the 30 axes is one dimension of a brand's code. The grammar is not a checklist — it is a *generative system*. A brand score is a vector in the grammar's space. An AI output's score is another vector. Their relationship is computable.

**System mapping:**
- `axes/` — the 30 dimensions
- `tokens/` — extracted live values per brand (color, type, density)
- `voice/` — the documented register, the brand's verbal code
- `signature_move/` — the irreducible gesture that signals "this is how this brand *speaks*"

### Gombrich · Taste as tradition

E. H. Gombrich's *The Story of Art* (1950) and *Art and Illusion* (1960) argue that no artist creates from nothing — every work is **schema-and-correction**, an inherited convention adjusted under tension. Style is the diff between what a tradition expects and what a particular work proposes. This is why "originality" is legible only against a tradition: with no inheritance, novelty cannot be perceived.

This is the **lineage** layer. Every brand entry in tasteHQ already carries an `era & lineage` field. What the pipeline adds is the ability to trace a work backward — what conventions does it inherit, what does it productively violate, what does it merely repeat — and forward, identifying which living brands are heirs of which historical schools.

**System mapping:**
- `lineage_graph` — directed graph: era → school → brand → derivative work
- `inheritance_score` — how much of axis-X does this output share with predecessor-Y
- `productive_violation_score` — where the output diverges *meaningfully* from inherited convention vs. randomly
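
The lineage mapping can be pictured as a small directed graph walked from a work back toward its era. The sketch below is illustrative only: the `PARENTS` dictionary, its node names, and the exact-match definition of `inheritance_score` are assumptions made for the example, not the tasteHQ schema.

```python
# Hypothetical edges, child -> parents, mirroring era -> school -> brand -> work.
# Node names are placeholders, not tasteHQ catalog entries.
PARENTS = {
    "work:output-123": ["brand:example"],
    "brand:example": ["school:swiss-international"],
    "school:swiss-international": ["era:example-era"],
}


def ancestors(node: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every ancestor of `node`, nearest first (breadth-first walk)."""
    seen, order, frontier = {node}, [], [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return order


def inheritance_score(output: dict[str, int], predecessor: dict[str, int]) -> float:
    """Share of common axes on which the output keeps the predecessor's grade.

    Assumption: "sharing" an axis means an identical ordinal grade.
    """
    shared = output.keys() & predecessor.keys()
    if not shared:
        return 0.0
    return sum(output[a] == predecessor[a] for a in shared) / len(shared)
```

Calling `ancestors("work:output-123", PARENTS)` walks brand, then school, then era, which is the backward trace described above.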

### Daverio · Taste as the curatorial act

Philippe Daverio, in his decades of television curation and museum work, made a single point repeatedly: **taste is the act of placing one work next to another and being able to defend the placement.** The judgment lives not in the object but in the *adjacency*. A curator's authority is their willingness to put their reasoning on the page next to the work.

This is the **comparison** layer. It is also the user-facing one. Every score the pipeline returns must come with a *reading* — a paragraph of explanation in the vocabulary of the grammar, signed by the model that produced it, citable.

**System mapping:**
- `compare/` — already in MCP; promoted to a first-class scoring mode
- `reading/` — every score has an attached natural-language defense
- `judge_signature` — every reading is signed (model, prompt, grammar version) for audit
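
One way to read "signed (model, prompt, grammar version)" is as a deterministic digest over those fields plus the reading text. The function below is a minimal sketch under that assumption; its name, field layout, and `sha256:` prefix are illustrative, and a content digest like this is weaker than a keyed signature, which would need a secret (for example an HMAC).

```python
import hashlib
import json


def judge_signature(model: str, prompt: str, grammar_version: str, reading: str) -> str:
    """Deterministic SHA-256 digest over the fields that identify a reading."""
    payload = json.dumps(
        {
            "model": model,
            # Hash the prompt so the digest does not embed the prompt text itself.
            "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "grammar_version": grammar_version,
            "reading": reading,
        },
        sort_keys=True,             # stable key order
        separators=(",", ":"),      # stable whitespace
    )
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the payload is serialized with sorted keys and fixed separators, the same inputs always yield the same digest, which is what makes a reading citable and re-checkable.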

The three layers compose: **Eco gives the axes, Gombrich gives the arc, Daverio gives the verdict.**

---

## III · The pipeline — what it actually is

```
                ┌─────────────────────────────────────────────────────┐
                │                  INPUT                              │
                │   image URL  ·  video URL  ·  rendered HTML  ·      │
                │   live URL   ·  uploaded asset  ·  code snippet     │
                └──────────────┬──────────────────────────────────────┘
                               │
                ┌──────────────▼─────────────┐
                │  L0 · Identity & provenance │   from Taste Layer essay
                │  hash, source, blind id     │
                └──────────────┬─────────────┘
                               │
                ┌──────────────▼─────────────┐
                │  L1 · Feature extraction    │
                │  → color, type, density,    │
                │    motion, voice, layout    │
                └──────────────┬─────────────┘
                               │
                ┌──────────────▼─────────────┐
                │  L2 · Grammar projection    │
                │  → 30-axis vector v         │
                └──────────────┬─────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
   ┌────▼─────┐         ┌──────▼──────┐        ┌──────▼──────┐
   │ MODE A   │         │   MODE B    │        │  MODE C/D   │
   │ Fidelity │         │  Discovery  │        │  Diff /     │
   │ to brand │         │  closest    │        │  Lineage    │
   │   x      │         │   brand     │        │             │
   └────┬─────┘         └──────┬──────┘        └──────┬──────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                ┌──────────────▼─────────────┐
                │  L3 · Curatorial reading    │
                │  natural-language verdict   │
                │  signed, citable, JSON      │
                └──────────────┬─────────────┘
                               │
                ┌──────────────▼─────────────┐
                │           OUTPUT             │
                │   scores + reading + audit   │
                └──────────────────────────────┘
```

The pipeline has four modes — A, B, C, D — over a shared four-layer (L0–L3) substrate. The substrate is the contribution; the modes are recombinations of it.

---

## IV · The four modes

### Mode A · Brand-fidelity scoring  *(the headline feature)*

**Question:** How close is this output to brand `<slug>`?

```http
POST /api/score
{
  "input":  { "url": "https://…/output.png" },
  "against": "stripe",
  "version": "grammar-v2"
}
```

**Response:**

```json
{
  "verdict": "close-but-divergent",
  "overall": 0.71,
  "convergence": {
    "color":           { "score": 0.94, "axis_quote": "tightly within Stripe range" },
    "typography":      { "score": 0.88, "axis_quote": "weights and tracking aligned" },
    "density":         { "score": 0.62, "axis_quote": "denser than Stripe norm" }
  },
  "divergence": {
    "voice":           { "score": 0.31, "axis_quote": "register reads as VC, not Stripe" },
    "motion":          { "score": 0.18, "axis_quote": "heavier easing than Stripe convention" }
  },
  "reading": "The output reads as a Stripe-influenced page from outside Stripe …",
  "audit":   { "judge_model": "claude-opus-4-6", "grammar_version": "v2", "judge_sig": "…" }
}
```

The split between `convergence` and `divergence` is the HCB contribution operationalized: **convergent axes are those where evaluators agree** (typography, color, density tend to converge), **divergent axes are those where evaluators legitimately disagree** (voice, motion, signature gestures tend to diverge). Both are scored. Neither is collapsed into the other.
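
The split could be derived mechanically from the per-axis convergence weights. The sketch below assumes a cut-off on the weight (`threshold=0.5`, a hypothetical value) and a hypothetical `split_axes` helper; the RFC does not fix either.

```python
def split_axes(
    scores: dict[str, float],
    weights: dict[str, float],
    threshold: float = 0.5,  # hypothetical cut-off on the convergence weight
) -> dict[str, dict[str, float]]:
    """Partition per-axis scores into the two blocks of the response.

    Axes whose weight meets the threshold go to "convergence"; the rest,
    including axes with no published weight, go to "divergence".
    """
    convergence, divergence = {}, {}
    for axis, score in scores.items():
        if weights.get(axis, 0.0) >= threshold:
            convergence[axis] = score
        else:
            divergence[axis] = score
    return {"convergence": convergence, "divergence": divergence}
```

Both blocks keep their scores, so neither side is collapsed into the other.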

### Mode B · Style discovery

**Question:** Which brand in the catalog does this most resemble?

```http
POST /api/discover
{ "input": { "url": "…" }, "k": 5 }
```

Returns top-`k` brands ranked by grammar-vector distance, each with its own short reading explaining the match. This is the reverse-direction query that makes the pipeline useful to agents who don't already know which brand they're hunting for.
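
A minimal sketch of the ranking step follows. The RFC says only "grammar-vector distance", so Euclidean distance is an assumption here, as are the `discover` function name and the plain-dictionary catalog; Surge 4 describes a vector index for the real endpoint.

```python
import heapq
import math


def discover(
    output: dict[str, float],
    catalog: dict[str, dict[str, float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k catalog brands nearest to `output`, closest first."""

    def distance(brand_vec: dict[str, float]) -> float:
        axes = output.keys() & brand_vec.keys()
        if not axes:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((output[a] - brand_vec[a]) ** 2 for a in axes))

    ranked = ((distance(vec), slug) for slug, vec in catalog.items())
    return [(slug, dist) for dist, slug in heapq.nsmallest(k, ranked)]
```

A linear scan like this is enough for a catalog of about a hundred brands; the reading attached to each match would come from the L3 layer, which this sketch leaves out.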

### Mode C · Curatorial diff *(Daverio mode)*

**Question:** Given two outputs, two brands, or an output and a brand, along which axes do they diverge?

```http
POST /api/diff
{ "left":  { "url": "…" },
  "right": { "brand": "linear" } }
```

Returns the per-axis differences with a per-axis natural-language reading. This is `compare_brands` from the existing MCP, generalized to accept arbitrary inputs and ground every comparison in the grammar.
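
The numeric half of a diff can be sketched as signed per-axis gaps over two grammar vectors. The `axis_diff` helper below is hypothetical and omits the per-axis reading, which would come from the judge layer.

```python
def axis_diff(left: dict[str, int], right: dict[str, int]) -> list[tuple[str, int]]:
    """Signed per-axis differences (left minus right), largest gap first.

    Axes with identical grades are omitted; ties in gap size keep
    alphabetical axis order because Python's sort is stable.
    """
    shared = sorted(left.keys() & right.keys())
    deltas = [(axis, left[axis] - right[axis]) for axis in shared if left[axis] != right[axis]]
    return sorted(deltas, key=lambda item: -abs(item[1]))
```

The sign carries direction: a negative value means the left input grades lower on that axis than the right one.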

### Mode D · Lineage trace *(Gombrich mode)*

**Question:** What tradition does this output belong to, and what is it doing with that tradition?

```http
POST /api/lineage
{ "input": { "url": "…" } }
```

Returns:
- nearest historical school (Swiss International, Memphis, Brutalist Web, NorthFace-era utilitarian, etc.)
- 1–3 living brands that are heirs of that school
- the **productive-violation map** — which axes are inherited verbatim, which are violated meaningfully, which are violated unconvincingly

This is the layer the HCB framework has no language for, and it is the one that most rewards a deep dataset. Every entry in tasteHQ already has an `era & lineage` field — Mode D operationalizes it.
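
The productive-violation map can be sketched as a three-way labelling of axes. Whether a violation is meaningful is a judgment the grammar vectors alone cannot make, so this sketch assumes that set of axes is supplied externally (for instance by the judge quorum); the function name and label strings are likewise illustrative.

```python
def violation_map(
    output: dict[str, int],
    school: dict[str, int],
    judged_meaningful: set[str],
) -> dict[str, str]:
    """Label each shared axis as inherited or as one of two kinds of violation."""
    labels = {}
    for axis in sorted(output.keys() & school.keys()):
        if output[axis] == school[axis]:
            labels[axis] = "inherited"
        elif axis in judged_meaningful:
            # Divergence the judges could defend as deliberate.
            labels[axis] = "violated-meaningfully"
        else:
            labels[axis] = "violated-unconvincingly"
    return labels
```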

---

## V · The scoring math

Each grammar axis is graded on a finite ordinal scale (currently 1–5). For an axis `a`:

- `b_a` = target brand's grade on axis `a` (from tasteHQ catalog)
- `o_a` = output's projected grade on axis `a` (computed by L1+L2)
- `w_a` = the axis's **convergence weight** — empirically estimated from inter-judge agreement on that axis across brands (Kendall's W per axis, HCB style)

**Per-axis convergence score** (where evaluators tend to agree):
```
c_a = clip( 1 − |b_a − o_a| / max_grade,  0,  1 )
```

**Per-axis divergence score** (where evaluators tend to disagree — distance becomes signal, not error):
```
d_a = |b_a − o_a| / max_grade
```

**Overall fidelity score** (weighted by convergence reliability):
```
F = Σ_a (w_a · c_a) / Σ_a w_a
```
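
The three formulas above translate directly into code. This is a sketch under stated assumptions: grades and weights are plain dictionaries keyed by axis name, and the function names are illustrative.

```python
MAX_GRADE = 5  # top of the current 1-5 ordinal scale


def convergence(b: float, o: float, max_grade: float = MAX_GRADE) -> float:
    """c_a = clip(1 - |b_a - o_a| / max_grade, 0, 1)."""
    return min(1.0, max(0.0, 1.0 - abs(b - o) / max_grade))


def divergence(b: float, o: float, max_grade: float = MAX_GRADE) -> float:
    """d_a = |b_a - o_a| / max_grade."""
    return abs(b - o) / max_grade


def fidelity(
    brand: dict[str, float],
    output: dict[str, float],
    weights: dict[str, float],
    max_grade: float = MAX_GRADE,
) -> float:
    """F = sum_a(w_a * c_a) / sum_a(w_a), over axes present in all three inputs."""
    axes = brand.keys() & output.keys() & weights.keys()
    total = sum(weights[a] for a in axes)
    if total == 0:
        raise ValueError("no weighted axes in common")
    return sum(weights[a] * convergence(brand[a], output[a], max_grade) for a in axes) / total
```

As a worked case, a brand graded `color=5, voice=2` and an output graded `color=4, voice=5` with weights `0.8` and `0.2` give `c_color = 0.8`, `c_voice = 0.4`, and `F = (0.8·0.8 + 0.2·0.4) / 1.0 = 0.72`. One property worth noting: on a 1–5 scale the largest possible gap is 4, so with the formula as written `c_a` never drops below 0.2 and the clip never triggers; dividing by the scale span (`max_grade − 1`) instead would use the full 0–1 range.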

**Verdict bands:**

| Range | Verdict | Reading |
|---|---|---|
| `F ≥ 0.85` | `on-brand` | Inherits the code well |
| `0.65 ≤ F < 0.85` | `close-but-divergent` | Reads adjacent; specific axes drift |
| `0.40 ≤ F < 0.65` | `family-resemblance` | Same tradition, different brand |
| `F < 0.40` | `out-of-distribution` | Different code entirely |
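
The band table maps onto a simple threshold lookup. A minimal sketch, with a hypothetical `verdict` function name:

```python
# (lower bound, verdict) pairs from the table above, highest band first.
BANDS = [
    (0.85, "on-brand"),
    (0.65, "close-but-divergent"),
    (0.40, "family-resemblance"),
]


def verdict(f: float) -> str:
    """Map an overall fidelity score F to its verdict band."""
    for lower_bound, label in BANDS:
        if f >= lower_bound:
            return label
    return "out-of-distribution"
```

The Mode A example response is consistent with this table: its `overall` of 0.71 falls in the `close-but-divergent` band.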

The weights `w_a` are *published per grammar version* in `/api/weights.json` so any judge can be audited and reproduced. This is the deliberate inversion of HCB's closed methodology: every step is queryable.
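
For auditing a published weight, Kendall's W for one axis can be recomputed from the raw judge grades. The sketch below assumes each judge grades the same list of brands on that axis and uses the textbook formula without tie correction; since 1–5 grades tie often, a production estimate would likely want the tie-corrected variant, so treat the numbers from this version as a lower bound.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(grades_by_judge: list[list[float]]) -> float:
    """Kendall's W for m judges grading the same n items (no tie correction).

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank sums from their mean. Requires n >= 2.
    """
    m, n = len(grades_by_judge), len(grades_by_judge[0])
    rank_sums = [0.0] * n
    for grades in grades_by_judge:
        for item, rank in enumerate(average_ranks(grades)):
            rank_sums[item] += rank
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

Three judges who order four brands identically give W = 1; two judges with exactly reversed orderings give W = 0.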

---

## VI · The audit trail — separation of generator and judge

From *The Taste Layer*: the system that generates should never be the system that judges.

Every score the pipeline returns carries an `audit` block:

```json
{
  "audit": {
    "grammar_version":   "v2.1",
    "judge_model":       "claude-opus-4-6",
    "judge_prompt_hash": "sha256:…",
    "weights_version":   "w-2026-05",
    "evaluator_quorum":  ["claude-opus-4-6", "gpt-5", "gemini-3-pro"],
    "quorum_agreement":  0.83,
    "timestamp":         "2026-05-17T13:00:00Z"
  }
}
```

A score is not authoritative unless it carries this block. The block makes every score *reproducible* (re-run the same prompt against the same grammar version) and *contestable* (anyone can fork the prompts, the weights, the model choice).

An `evaluator_quorum` of three judges with stated agreement is the default. Single-judge scores are returned but flagged `low_authority: true`.
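
A consumer could enforce these two rules with a small check. The sketch assumes that all seven fields in the example block are required and that only a quorum of fewer than two judges counts as single-judge; the RFC does not say how a two-judge quorum is treated, and the `check_audit` name is hypothetical.

```python
# Assumption: every field shown in the example audit block is mandatory.
REQUIRED_AUDIT_FIELDS = {
    "grammar_version",
    "judge_model",
    "judge_prompt_hash",
    "weights_version",
    "evaluator_quorum",
    "quorum_agreement",
    "timestamp",
}


def check_audit(response: dict) -> dict:
    """Report whether a score carries a complete audit block."""
    audit = response.get("audit") or {}
    missing = sorted(REQUIRED_AUDIT_FIELDS - audit.keys())
    quorum = audit.get("evaluator_quorum") or []
    return {
        "authoritative": not missing,      # a score without the block is not authoritative
        "missing_fields": missing,
        "low_authority": len(quorum) < 2,  # single-judge (or no-judge) scores are flagged
    }
```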

---

## VII · The dataset side — what we publish

Every score the pipeline returns is logged (with an explicit consent flag) to `/api/scores/<id>.json`. Aggregated across thousands of inputs, the score log becomes the second open dataset tasteHQ provides:

- `/api/scores.json` — the full log (anonymized, license CC-BY 4.0)
- `/api/scores/by-brand/<slug>.json` — scores grouped by target brand
- `/api/scores/by-model/<model>.json` — scores grouped by generating model
- `/api/leaderboard/<brand>.json` — which models score highest on which brands

This is the open counterpart to HCB's closed leaderboard. Anyone — research labs, agencies, individual creators — can submit outputs (with API key, rate-limited) and the public ledger updates.

The leaderboard is brand-conditioned, not absolute. There is no single "best creative AI model." There is "GPT-5 leads on Stripe-fidelity," "Claude Opus leads on Anthropic-fidelity," "Gemini 3 leads on Memphis-lineage outputs." This is the *correct* shape of the answer, and HCB's data already gestures at it. We make it queryable.
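
A brand-conditioned leaderboard is a group-by over the score log. The sketch below assumes each log row carries `brand`, `model`, and `overall` fields (hypothetical names, since the log schema is not yet specified) and ranks models by mean overall score within each brand.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(score_log: list[dict]) -> dict[str, list[tuple[str, float]]]:
    """Per target brand, rank generating models by mean overall score, best first."""
    grouped: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for row in score_log:
        grouped[row["brand"]][row["model"]].append(row["overall"])
    return {
        brand: sorted(
            ((model, mean(values)) for model, values in models.items()),
            key=lambda entry: -entry[1],
        )
        for brand, models in grouped.items()
    }
```

Because the result is keyed by brand, there is no global ranking to read off it, which matches the claim that no single "best" model exists.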

---

## VIII · Implementation roadmap

The pipeline is built in five surges. Each surge is 1–2 weeks. Each surge ships a usable artifact.

### Surge 1 — Grammar v2 lock  *(week 1)*

Goal: freeze the grammar so all subsequent work points at a stable target.

- Publish `schema/grammar-v2.json` with all 30 axes, ordinal scale, axis description
- Generate `api/weights.json` with `w_a` estimated from the inter-axis variance of the existing 101-brand catalog
- Migrate all existing brand entries to grammar-v2 grades (write `tools/migrate_grammar.py`)
- Tag a `grammar-v2.0.0` release on the repo

### Surge 2 — L1 feature extraction  *(weeks 2–3)*

Goal: turn an arbitrary input into a deterministic feature vector.

- `tools/extract_features.py` — given a URL or asset, returns color palette, type characteristics, density signal, motion signal, copy register
- `tools/project_to_grammar.py` — feature vector → 30-axis grammar vector
- Test on the 101 existing entries — measure round-trip fidelity (input the brand's homepage, recover the brand's known grades)
- Calibration target: ≥ 0.7 per-axis agreement with hand-graded entries
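
The round-trip check can be sketched as a per-axis comparison between hand grades and projected grades. The agreement metric is not defined above, so this sketch assumes the simplest reading, the fraction of brands whose projected grade exactly matches the hand grade; the function names are hypothetical.

```python
def per_axis_agreement(
    hand: dict[str, dict[str, int]],
    projected: dict[str, dict[str, int]],
) -> dict[str, float]:
    """Per axis, the fraction of brands whose projected grade equals the hand grade."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for slug, grades in hand.items():
        for axis, grade in grades.items():
            if axis not in projected.get(slug, {}):
                continue  # skip axes the projection did not produce
            totals[axis] = totals.get(axis, 0) + 1
            hits[axis] = hits.get(axis, 0) + int(projected[slug][axis] == grade)
    return {axis: hits[axis] / totals[axis] for axis in totals}


def failing_axes(agreement: dict[str, float], target: float = 0.7) -> list[str]:
    """Axes that miss the calibration target, in alphabetical order."""
    return sorted(axis for axis, value in agreement.items() if value < target)
```

Exact match is strict for an ordinal scale; a tolerance of one grade or a rank correlation would be reasonable alternatives once the metric is decided.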

### Surge 3 — Mode A · brand-fidelity  *(week 4)*

Goal: ship `POST /api/score`.

- New WSGI route in `app.py`
- Implements the scoring math from §V
- Returns the JSON shape from §IV
- Reading generated by judge-quorum (Claude/GPT/Gemini) under the audit-trail constraint from §VI
- MCP tool `score_against(input, brand)` exposes the same to agents
- Documentation: extend `API.md` with a worked Stripe-vs-Wise example

### Surge 4 — Modes B, C, D  *(weeks 5–6)*

Goal: ship the three remaining modes together, since they share the same L0–L3 substrate.

- `POST /api/discover` (Mode B) — vector-index search over catalog
- `POST /api/diff` (Mode C) — generalized `compare_brands`
- `POST /api/lineage` (Mode D) — graph traversal on `lineage_graph` + projection
- New MCP tools mirror each
- Each mode reuses the same reading generator from Mode A

### Surge 5 — Open ledger & RFC publication  *(week 7+)*

Goal: turn the running system into a public coordination artifact.

- Score-log endpoints publish under CC-BY 4.0
- Public leaderboard pages auto-generate per brand and per model
- This RFC ships as `RFC-001-eval-pipeline.md` in `_design/`, with a versioned permalink on the taste-layer site (`/rfc/001/`)
- A 3,000-word essay version is published on the taste-layer site for general audiences — fewer math symbols, more Eco-Gombrich-Daverio

---

## IX · What success looks like

In twelve months, success is:

1. The eval pipeline is cited as **the** open standard in at least three research papers on creative AI evaluation.
2. At least one major model lab (Anthropic, OpenAI, Google, Black Forest Labs) has run its models against tasteHQ and posted the per-brand leaderboard in its release notes.
3. The grammar v2 schema is *forked* — not because anyone hates it, but because researchers want to specialize it for sub-domains (typography-only, motion-only). Forks are the deepest signal.
4. The phrase **"score against tasteHQ"** appears unprompted in agent system prompts the way "lint with prettier" appears in dev workflows.

In ten years, success is:

The vocabulary of taste — *signature move, lineage, productive violation, convergence axis, curatorial reading* — has the same status that *test coverage, type safety, p99 latency* have in software engineering. It is how the industry talks about quality.

This is the leverage point. The grammar is the asset. The pipeline is what makes the asset queryable. The RFC is what makes the queryability legible.

---

## X · Open questions

These are deliberately unsolved, to be resolved in implementation:

1. **Bootstrapping `w_a`** — initial axis weights come from the existing catalog's inter-axis variance, but they should converge to HCB-style Kendall's W as judge data accumulates. What's the smoothing function between catalog-prior and judge-empirical?
2. **Adversarial inputs** — what's the response when an input is designed to game the score (e.g. matching Stripe's color but with deliberately broken type)? Mode D's *productive-violation* score should help, but it needs an adversarial test set.
3. **Cultural locality** — the current grammar privileges a Western design-history canon. Gombrich himself was Western. How do we extend the grammar for, say, Japanese editorial, or Brazilian post-tropicalist, without collapsing them into Western analogues?
4. **The voice axis** — copy register sits at the most divergent end of the spectrum and is the hardest to score deterministically. Is voice a separate sub-pipeline that operates on text only, or is it a first-class grammar axis?

These questions are listed not as defects but as the open frontier of the work.

---

## XI · References

- Eco, Umberto. *Opera Aperta*. Bompiani, 1962.
- Eco, Umberto. *A Theory of Semiotics*. Indiana University Press, 1976.
- Gombrich, E. H. *The Story of Art*. Phaidon, 1950 (current ed. 2006). [Goodreads](https://www.goodreads.com/book/show/222078.The_Story_of_Art)
- Gombrich, E. H. *Art and Illusion: A Study in the Psychology of Pictorial Representation*. Phaidon, 1960.
- Daverio, Philippe. *Il museo immaginato*. Rizzoli, 2011. Television series *Passepartout*, RAI, 2002–2012.
- Contra Labs. *The Human Creativity Benchmark*. April 30, 2026. [contralabs.com/research/human-creativity-benchmark](https://contralabs.com/research/human-creativity-benchmark)
- Contra Labs. *The Undifferentiated Work Problem*. May 5, 2026.
- Leonelli, Simone. *The Taste Layer*. Studio W230, January 2026. [mustbesimo.github.io/taste-layer/taste-layer/](https://mustbesimo.github.io/taste-layer/taste-layer/)
- Leonelli, Simone. *The Curation Stack*. Studio W230, February 2026. [mustbesimo.github.io/taste-layer/curation-stack/](https://mustbesimo.github.io/taste-layer/curation-stack/)

---

*RFC-001 · Studio W230 · Legibility Engineering · May 2026*
*Status: draft for review. Comments welcome via PR to `_design/RFC-001-eval-pipeline.md`.*
