Phase 1 Spec: Closing the Feedback Loop
Goal: Move the evolutionary loop from “Prelife” (feedback dies after Quality Gate) to “Proto-Life” (feedback flows into skill evaluation and triggers improvement).
Scope: Minimal Viable Feedback Loop. No automatic mutation — only measure, make visible, set triggers.
Inspired by: EvoFlow (Pareto: Quality + Cost), EvoAgentX (Evaluation Layer), MCE (History of skills, executions, evaluations)
TL;DR
Phase 1 closes the evolutionary feedback loop by measuring skill fitness (quality + cost). SQL schema with 3 views: per-skill metrics, Pareto classification, and per-agent breakdown. Advisory alerts only — no automatic changes.
Current State (AS-IS)
| Component | Tool | What It Provides | Where It Ends |
|---|---|---|---|
| Skill Execution | orchestrator-routing | Which skill for which intent | Log in knowledge.db (skills_usage) |
| Token Usage | Pulse | Tokens/Session, Velocity | Pulse DB (SQLite) |
| Quality | Quality-Gate / signal-check | Quality score per output | memory/quality-scores.md |
| Routing History | routing-log | Skill → Trigger → Outcome | knowledge.db |
The Problem: These 4 data streams are not connected. We don’t know: “Skill X costs on average Y tokens and delivers quality score Z.”
What Needs to Be Built (TARGET)
Section Summary
4 components:
- Central SQL table logging every skill execution
- Aggregated metrics view
- Pareto classification query (quality vs. cost)
- Advisory alert triggers
Total build estimate: ~2 hours.
1. Skill Performance Log (Core)
What: After each skill execution, write a structured entry to a central table.
CREATE TABLE IF NOT EXISTS skill_performance (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
skill_name TEXT NOT NULL,
task_type TEXT, -- 'research', 'writing', 'analysis', 'code', 'review'
quality_score REAL, -- 0.0-1.0, from Quality-Gate/signal-check
token_cost INTEGER, -- Tokens consumed (Input + Output)
duration_seconds REAL, -- Wall time
outcome TEXT, -- 'success', 'partial', 'failure', 'skipped'
project TEXT, -- Project name (from Working Directory / Project State)
notes TEXT -- Optional: Why success/failure
);
Where: ~/.claude/data/knowledge.db (already exists, used for skills_usage)
When: At the end of each skill execution. Can be implemented as a hook or manually via routing log update.
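A minimal sketch of the logging step, using Python's standard sqlite3 module against the table defined above. The helper name `log_skill_performance` and the sample values are hypothetical, and an in-memory database stands in for knowledge.db.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS skill_performance (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    skill_name TEXT NOT NULL,
    task_type TEXT,
    quality_score REAL,
    token_cost INTEGER,
    duration_seconds REAL,
    outcome TEXT,
    project TEXT,
    notes TEXT
);
"""

def log_skill_performance(conn, skill_name, *, task_type=None, quality_score=None,
                          token_cost=None, duration_seconds=None, outcome=None,
                          project=None, notes=None):
    """Insert one execution record and return its row id (hypothetical helper)."""
    if quality_score is not None and not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be within 0.0-1.0")
    # ISO 8601 UTC timestamps sort correctly as plain text, so MIN/MAX work on them.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    cur = conn.execute(
        "INSERT INTO skill_performance (timestamp, skill_name, task_type, quality_score,"
        " token_cost, duration_seconds, outcome, project, notes)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (stamp, skill_name, task_type, quality_score, token_cost,
         duration_seconds, outcome, project, notes),
    )
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")  # assumption: replace with the knowledge.db path
conn.executescript(SCHEMA)
row_id = log_skill_performance(conn, "example-skill", task_type="research",
                               quality_score=0.8, token_cost=1200,
                               duration_seconds=4.2, outcome="success")
```

Parameter binding with `?` keeps free-text fields such as notes from breaking the statement.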
2. Aggregated Skill Metrics (View)
CREATE VIEW IF NOT EXISTS skill_metrics AS
SELECT
skill_name,
COUNT(*) as total_uses,
AVG(quality_score) as avg_quality,
AVG(token_cost) as avg_cost,
AVG(duration_seconds) as avg_duration,
SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) as success_rate,
MIN(timestamp) as first_used,
MAX(timestamp) as last_used
FROM skill_performance
GROUP BY skill_name;
Usage: SELECT * FROM skill_metrics ORDER BY avg_quality DESC; instantly shows the best/worst skills.
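To make the aggregation concrete, here is a small sketch that loads three made-up rows into an in-memory database and reads the view back. The skill names and values are illustrative only, and the table is trimmed to the columns the view needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE skill_performance (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    skill_name TEXT NOT NULL,
    quality_score REAL,
    token_cost INTEGER,
    duration_seconds REAL,
    outcome TEXT
);
CREATE VIEW skill_metrics AS
SELECT skill_name,
       COUNT(*) AS total_uses,
       AVG(quality_score) AS avg_quality,
       AVG(token_cost) AS avg_cost,
       AVG(duration_seconds) AS avg_duration,
       SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS success_rate,
       MIN(timestamp) AS first_used,
       MAX(timestamp) AS last_used
FROM skill_performance
GROUP BY skill_name;
""")

# Illustrative rows; skill-b has no quality score recorded.
rows = [
    ("2025-01-01T10:00:00", "skill-a", 1.0, 1000, 3.0, "success"),
    ("2025-01-02T10:00:00", "skill-a", 0.5, 3000, 5.0, "partial"),
    ("2025-01-03T10:00:00", "skill-b", None, 500, 1.0, "failure"),
]
conn.executemany(
    "INSERT INTO skill_performance (timestamp, skill_name, quality_score,"
    " token_cost, duration_seconds, outcome) VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

metrics = {
    r[0]: r
    for r in conn.execute(
        "SELECT skill_name, total_uses, avg_quality, avg_cost, success_rate FROM skill_metrics"
    )
}
```

One detail worth knowing: AVG skips NULL values, so a skill whose runs were never scored ends up with a NULL avg_quality rather than 0. Downstream queries and alerts need to handle that case.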
3. Pareto View (Quality vs. Cost)
-- Skills that deliver above-average quality for their cost
SELECT
skill_name,
avg_quality,
avg_cost,
total_uses,
CASE
WHEN avg_quality > (SELECT AVG(avg_quality) FROM skill_metrics)
AND avg_cost < (SELECT AVG(avg_cost) FROM skill_metrics)
THEN 'PARETO-OPTIMAL'
WHEN avg_quality < (SELECT AVG(avg_quality) FROM skill_metrics)
AND avg_cost > (SELECT AVG(avg_cost) FROM skill_metrics)
THEN 'DOMINATED'
ELSE 'TRADE-OFF'
END as pareto_status
FROM skill_metrics
WHERE total_uses >= 3 -- At least 3 uses for reliable data
ORDER BY avg_quality DESC;
💡 Key Insight
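The Pareto view is the evolutionary "selection lens": it shows at a glance which skills deliver above-average quality at below-average cost. The labels are assigned relative to the averages across all skills, so the view approximates a Pareto front rather than computing a strict one. DOMINATED skills are candidates for mutation or retirement.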
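The query above labels skills by quadrant relative to the averages. For comparison, here is a sketch of the textbook dominance test, in which a skill is dominated if some other skill is at least as good on both objectives and strictly better on one. This is a different technique from the SQL query, shown only as a possible refinement. The function name and the sample values are hypothetical.

```python
def pareto_front(skills):
    """Return the names of non-dominated skills.

    skills maps a skill name to (avg_quality, avg_cost);
    higher quality is better, lower cost is better.
    """
    front = set()
    for name, (quality, cost) in skills.items():
        dominated = any(
            other_q >= quality and other_c <= cost
            and (other_q > quality or other_c < cost)
            for other, (other_q, other_c) in skills.items()
            if other != name
        )
        if not dominated:
            front.add(name)
    return front

# Illustrative values only.
skills = {
    "a": (0.9, 3000),
    "b": (0.7, 1000),
    "c": (0.6, 2500),  # b is better on both quality and cost
    "d": (0.9, 3500),  # a has equal quality at lower cost
}
front = pareto_front(skills)
```

The pairwise comparison is quadratic in the number of skills, which should be acceptable for a skill catalogue of modest size.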
4. Automatic Triggers (Alerts, Not Actions)
| Trigger | Condition | Action |
|---|---|---|
| Quality Drop | avg_quality < 0.5 AND total_uses >= 5 | “Skill X has low quality. Review needed?” |
| Cost Outlier | avg_cost > 2× median of all skills | “Skill X costs more than twice the median.” |
| Unused Skill | last_used more than 30 days ago | “Skill X unused for 30 days. Still relevant?” |
| New Champion | New skill has higher avg_quality than current default for task_type | “Skill Y outperforms Skill X for [task_type].” |
Important: Phase 1 only triggers alerts to the user. No automatic changes. The human decides.
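The first three triggers can be sketched as a pure function over rows of skill_metrics. This assumes the rows are available as dictionaries keyed by the view's column names and that timestamps are naive ISO 8601 strings. The function name is hypothetical. New Champion is left out because it needs a "current default" per task_type, which this spec does not define.

```python
import statistics
from datetime import datetime, timedelta

def build_alerts(metrics, now):
    """Return advisory alert strings; nothing is changed automatically."""
    alerts = []
    costs = [m["avg_cost"] for m in metrics if m["avg_cost"] is not None]
    median_cost = statistics.median(costs) if costs else None
    for m in metrics:
        name = m["skill_name"]
        quality, cost = m["avg_quality"], m["avg_cost"]
        # Quality Drop: avg_quality < 0.5 AND total_uses >= 5
        if quality is not None and quality < 0.5 and m["total_uses"] >= 5:
            alerts.append(f"Quality Drop: {name} has low quality. Review needed?")
        # Cost Outlier: avg_cost > 2x median of all skills
        if median_cost is not None and cost is not None and cost > 2 * median_cost:
            alerts.append(f"Cost Outlier: {name} costs more than twice the median.")
        # Unused Skill: last use more than 30 days ago
        if now - datetime.fromisoformat(m["last_used"]) > timedelta(days=30):
            alerts.append(f"Unused Skill: {name} unused for 30 days. Still relevant?")
    return alerts

# Illustrative rows only.
metrics = [
    {"skill_name": "a", "total_uses": 6, "avg_quality": 0.4,
     "avg_cost": 1000.0, "last_used": "2025-03-01T00:00:00"},
    {"skill_name": "b", "total_uses": 10, "avg_quality": 0.9,
     "avg_cost": 5000.0, "last_used": "2025-03-02T00:00:00"},
    {"skill_name": "c", "total_uses": 4, "avg_quality": 0.8,
     "avg_cost": 1200.0, "last_used": "2025-01-01T00:00:00"},
]
alerts = build_alerts(metrics, now=datetime(2025, 3, 3))
```

Passing `now` in as a parameter keeps the function deterministic and easy to test.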
Integration into Existing Components
Section Summary
4 existing components are involved. Three are modified: orchestrator-routing (write performance data), Quality-Gate (feed scores into SQL), and the improve skill (consume metrics for suggestions). Pulse is only read for token data. No new tools, only new connections between existing ones.
orchestrator-routing (modify)
After skill execution → INSERT INTO skill_performance. The routing log entry (skills_usage) is either extended with performance fields OR skill_performance is maintained as a separate, more detailed log.
Recommendation: Separate log. skills_usage remains the routing log (why was it routed), skill_performance is the performance log (how good was the result).
Quality-Gate (modify)
Quality score is currently written to memory/quality-scores.md. Additionally: INSERT INTO skill_performance with the score.
Pulse (read, don’t modify)
Pulse tracks token usage per session. For Phase 1: Manually derive the token count per skill usage from the session. For Phase 2: Link Pulse events directly to skill_performance.
improve Skill (modify)
The improve skill (“improve routing”) gets access to the skill_metrics view. Its improvement suggestions are then based on real performance data rather than on routing logs alone.
Implementation Plan
Section Summary
6 steps over ~2 hours of active work + 2 weeks of data collection. Steps 1–4 can be done in a single session. Step 5 requires the improve skill. Step 6 is a 2-week observation period targeting ≥50 logged executions.
Step 1: Create Schema (5 min)
sqlite3 ~/.claude/data/knowledge.db < schema.sql
Step 2: Define Logging Hook (30 min)
- Option A: Platform hook (post_tool_call) that writes the performance entry after a skill tool call
- Option B: Manual block at the end of each skill: “Log Performance”
- Recommendation: Option A (Hook) — automatic, no manual effort
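The payload that a post_tool_call hook receives is not specified here, so the following sketch covers only the measurement side. It is a context manager that times a skill run, records success or failure, and hands the record to a callback. Both `track_skill` and the `sink` callback are hypothetical names. In practice the sink would be whatever writes the row to skill_performance.

```python
import time
from contextlib import contextmanager

@contextmanager
def track_skill(skill_name, sink):
    """Time one skill run and pass a record dict to `sink` when it ends."""
    record = {"skill_name": skill_name, "outcome": "success", "notes": None}
    start = time.monotonic()
    try:
        yield record  # the caller may add fields such as token_cost
    except Exception as exc:
        record["outcome"] = "failure"
        record["notes"] = repr(exc)
        raise  # do not swallow the original error
    finally:
        record["duration_seconds"] = time.monotonic() - start
        sink(record)

logged = []
with track_skill("example-skill", logged.append) as rec:
    rec["token_cost"] = 1200  # illustrative value, filled in once known
```

Because the record is emitted in the `finally` branch, failed runs are logged as well, which matters for the success_rate metric.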
Step 3: Extend Routing Log Entry (15 min)
-- Update existing skills_usage entry with outcome
UPDATE skills_usage SET outcome = '<success|partial|failure>' WHERE id = <last_insert>;
Step 4: Test Metrics Query (15 min)
- Create view
- Insert test data
- Validate with SELECT * FROM skill_metrics;
Step 5: Build Trigger Logic into improve Skill (1h)
- Alerts as part of the improve output
- Pareto classification per skill
Step 6: Collect for 2 Weeks, Then Evaluate
- At least 50 skill executions for reliable statistics
- First insights: Which skills are Pareto-optimal? Which are dominated?
⚠️ Scope Boundary
Phase 1 is measurement only. No automatic changes, no mutations, no A/B testing. That's Phase 2.
What Phase 1 Does NOT Do
- ❌ Automatic mutation of skills
- ❌ A/B testing of skill variants
- ❌ Crossover (new skills from existing ones)
- ❌ Population management (Niching)
- ❌ Automatic workflow topology changes
All of that is Phase 2+. Phase 1 delivers the data foundation, without which no evolution is possible.
Success Criteria
💡 Key Insight
The final success criterion — "at least 1 skill change based on performance data" — is the proof that the feedback loop actually closes. Data without action is just logging.
| Criterion | Measurement |
|---|---|
| Performance data is being collected | ≥ 50 entries after 2 weeks |
| Pareto view works | At least 1 skill identified as PARETO-OPTIMAL, 1 as DOMINATED |
| First alert fires | At least 1 Quality Drop or Cost Outlier alert |
| improve skill uses data | improve output references skill_metrics |
| User decision is made | At least 1 skill change based on performance data |
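The first criterion can be checked mechanically. Here is a sketch that counts the entries logged within the last 14 days, assuming timestamps are stored as ISO 8601 text in a single consistent format so that string comparison matches chronological order. The function name and the sample data are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

def collection_status(conn, now, days=14, minimum=50):
    """Return (entry count within the window, whether the target is met)."""
    cutoff = (now - timedelta(days=days)).isoformat(timespec="seconds")
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM skill_performance WHERE timestamp >= ?", (cutoff,)
    ).fetchone()
    return count, count >= minimum

conn = sqlite3.connect(":memory:")
# Trimmed stand-in for the real table; only the columns needed here.
conn.execute(
    "CREATE TABLE skill_performance (timestamp TEXT NOT NULL, skill_name TEXT NOT NULL)"
)
now = datetime(2025, 3, 15, 12, 0, 0)
recent = [((now - timedelta(hours=h)).isoformat(timespec="seconds"), "example-skill")
          for h in range(50)]
old = [((now - timedelta(days=20 + d)).isoformat(timespec="seconds"), "example-skill")
       for d in range(5)]
conn.executemany("INSERT INTO skill_performance VALUES (?, ?)", recent + old)

count, target_met = collection_status(conn, now)
```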
Connection to Nowak
Section Summary
Phase 1 maps to Nowak's fitness measurement (fᵢ and φ). The Pareto view implements multi-objective selection from EvoFlow. Triggers create selection pressure without forcing action — the first step of the evolutionary phase transition.
Phase 1 implements the measurement apparatus: Nowak’s fᵢ (fitness per individual) and φ (population average). Without this measurement there is no selection pressure, only drift.
The Pareto view (Quality vs. Cost) is the first step toward multi-objective selection — EvoFlow’s core idea, applied in practice.
The triggers are the selection pressure — they don’t force change, but they make visible WHERE change would be needed. This is the transition from “no signal” to “signal present” — the first step of the phase transition.
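For reference, fᵢ and φ appear together in the replicator equation of evolutionary dynamics. The mapping onto this system is an assumption made here for illustration: xᵢ is read as the share of executions handled by skill i, and fᵢ as a fitness proxy such as avg_quality.

```latex
\dot{x}_i = x_i \,(f_i - \phi), \qquad \phi = \sum_{i=1}^{n} x_i f_i
```

As a worked example under that mapping, take two skills with usage shares x = (0.75, 0.25) and fitness f = (0.8, 0.4). This gives φ = 0.75 · 0.8 + 0.25 · 0.4 = 0.7. The first skill sits above the population average and the second below it. In Phase 1 this difference only becomes visible. Acting on it is left to the user.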