# Phase 1 Implementation
## TL;DR

SQLite schema + shell scripts to measure skill fitness across 5 agents. 3 commands to start: create DB, log executions, view metrics. Pareto classification after ≥3 uses per skill.

- Status: ✅ Deployed (2026-03-22)
- Platform: Multi-agent system (5 agents: main + α/β/γ/δ)
- Spec: Phase 1: Feedback Loop
## Quickstart (3 commands)

```shell
# 1. Create the database
sqlite3 ~/.openclaw/data/metrics.db < implementation/phase-1/schema.sql

# 2. Log a skill execution
skill-log.sh deep-research success --quality 0.85 --tokens 15000 --duration 120

# 3. View metrics
skill-metrics.sh overview
```
## What Was Built

### 1. SQLite Schema (`metrics.db`)

Extended the existing `~/.openclaw/data/metrics.db` with:

- `skill_performance` table: logs every skill execution with `agent_id`, `quality_score`, `token_cost`, `duration`, and `outcome`
- `skill_metrics` view: aggregated per-skill stats (avg quality, cost, success rate)
- `skill_pareto` view: Pareto classification (PARETO-OPTIMAL / DOMINATED / TRADE-OFF)
- `agent_skill_metrics` view: per-agent breakdown
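The actual DDL lives in implementation/phase-1/schema.sql; as a minimal sketch, the table and the aggregate view could look as follows. Column names beyond those listed above (`id`, `skill_name`, `logged_at`) are assumptions for illustration, not the deployed schema.

```shell
# Minimal sketch of skill_performance plus the skill_metrics view.
# Columns not named in the text (id, skill_name, logged_at) are assumed.
DB=$(mktemp)
sqlite3 "$DB" <<'SQL'
CREATE TABLE IF NOT EXISTS skill_performance (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  skill_name    TEXT NOT NULL,
  agent_id      TEXT NOT NULL DEFAULT 'main',
  outcome       TEXT NOT NULL,             -- 'success' / 'failure'
  quality_score REAL DEFAULT 0.5,          -- 0..1
  token_cost    INTEGER,
  duration      INTEGER,                   -- seconds
  logged_at     TEXT DEFAULT (datetime('now'))
);
CREATE VIEW IF NOT EXISTS skill_metrics AS
SELECT skill_name,
       COUNT(*)                 AS uses,
       AVG(quality_score)       AS avg_quality,
       AVG(token_cost)          AS avg_cost,
       AVG(outcome = 'success') AS success_rate
FROM skill_performance
GROUP BY skill_name;
SQL
sqlite3 "$DB" "INSERT INTO skill_performance (skill_name, outcome, quality_score, token_cost, duration)
               VALUES ('deep-research', 'success', 0.85, 15000, 120);"
sqlite3 "$DB" "SELECT skill_name, uses, avg_quality FROM skill_metrics;"
```

Keeping the aggregates in a view rather than a second table means the stats can never drift out of sync with the raw log.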
### 2. Logging Script (`skill-log.sh`)

```shell
~/.openclaw/scripts/skill-log.sh <skill_name> <outcome> [--agent <id>] [--quality <0-1>] [--tokens <n>] [--duration <sec>] [--task-type <type>] [--project <name>] [--notes <text>]
```
Called after every skill execution; all 5 agents list this as a mandatory step in their AGENTS.md.
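A hedged sketch of what such a logger could look like, with flag parsing reduced to a subset of the flags shown above; the deployed script may differ.

```shell
# Sketch of a skill-log.sh-style logger; not the deployed script.
# Unmodeled flags (--task-type, --project, --notes) are ignored here.
DB=$(mktemp)
sqlite3 "$DB" "CREATE TABLE skill_performance (skill_name TEXT, agent_id TEXT,
               outcome TEXT, quality_score REAL, token_cost INTEGER, duration INTEGER);"
skill_log() {
  local skill="$1" outcome="$2"; shift 2
  local agent="main" quality="0.5" tokens="" duration=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --agent)    agent="$2";    shift 2 ;;
      --quality)  quality="$2";  shift 2 ;;
      --tokens)   tokens="$2";   shift 2 ;;
      --duration) duration="$2"; shift 2 ;;
      *)          shift ;;               # skip flags this sketch ignores
    esac
  done
  sqlite3 "$DB" "INSERT INTO skill_performance VALUES
    ('$skill', '$agent', '$outcome', $quality, ${tokens:-NULL}, ${duration:-NULL});"
}
skill_log deep-research success --quality 0.85 --tokens 15000 --duration 120
```

Defaulting `quality` to 0.5 mirrors the neutral score used when no signal is available (see the auto-quality heuristic below in this document).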
### 3. Metrics Script (`skill-metrics.sh`)

```shell
~/.openclaw/scripts/skill-metrics.sh overview        # Summary
~/.openclaw/scripts/skill-metrics.sh pareto          # Quality vs. cost
~/.openclaw/scripts/skill-metrics.sh agents          # Per-agent breakdown
~/.openclaw/scripts/skill-metrics.sh alerts          # Quality drops, cost outliers
~/.openclaw/scripts/skill-metrics.sh detail <name>   # History for one skill
~/.openclaw/scripts/skill-metrics.sh recent [n]      # Last n entries
```
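A hypothetical sketch of the classification the `pareto` subcommand could run. The sample rows are invented, and this sketch collapses TRADE-OFF into the non-dominated class; the real view distinguishes all three labels.

```shell
# Hypothetical sketch of Pareto classification over aggregated skill stats.
# Sample data and column set are illustrative, not the deployed schema.
DB=$(mktemp)
sqlite3 "$DB" <<'SQL'
CREATE TABLE skill_metrics (skill_name TEXT, uses INTEGER, avg_quality REAL, avg_cost REAL);
INSERT INTO skill_metrics VALUES
  ('deep-research', 5, 0.85, 15000),
  ('summarize',     4, 0.70, 20000),
  ('routine',       3, 0.60,  5000);
SQL
# A skill is DOMINATED if some other skill is at least as good on both quality
# and cost, and strictly better on at least one of them.
classes=$(sqlite3 "$DB" <<'SQL'
SELECT m.skill_name,
       CASE WHEN EXISTS (
         SELECT 1 FROM skill_metrics o
         WHERE o.skill_name <> m.skill_name
           AND o.avg_quality >= m.avg_quality
           AND o.avg_cost    <= m.avg_cost
           AND (o.avg_quality > m.avg_quality OR o.avg_cost < m.avg_cost)
       ) THEN 'DOMINATED' ELSE 'PARETO-OPTIMAL' END
FROM skill_metrics m
WHERE m.uses >= 3;   -- classify only after >=3 uses, per the TL;DR
SQL
)
echo "$classes"
```

Here `summarize` comes out DOMINATED (worse quality and cost than `deep-research`), while `routine` survives as a cheap/low-quality trade-off point.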
### 4. Alert Triggers
| Trigger | Condition | Action |
|---|---|---|
| Quality Drop | avg_quality < 0.5 AND uses >= 5 | Alert to user |
| Cost Outlier | avg_cost > 2× median | Alert to user |
| Unused Skill | last_used > 30 days | Alert to user |
Alerts are advisory only — no automatic changes (Phase 1 principle).
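Two of the three triggers translate directly into SQL; a hedged sketch on invented sample rows. The cost-outlier check needs a median, which SQLite has no built-in aggregate for, so it is omitted here.

```shell
# Sketch of the quality-drop and unused-skill checks (advisory output only).
# Table layout and sample rows are illustrative, not the deployed schema.
DB=$(mktemp)
sqlite3 "$DB" <<'SQL'
CREATE TABLE skill_metrics (skill_name TEXT, uses INTEGER, avg_quality REAL, avg_cost REAL, last_used TEXT);
INSERT INTO skill_metrics VALUES
  ('flaky-skill', 6, 0.42, 9000, datetime('now')),
  ('ok-skill',    8, 0.80, 8000, datetime('now')),
  ('stale-skill', 3, 0.70, 7000, datetime('now', '-45 days'));
SQL
# Quality drop: avg_quality < 0.5 AND uses >= 5
drops=$(sqlite3 "$DB" "SELECT 'QUALITY DROP: ' || skill_name FROM skill_metrics
                       WHERE avg_quality < 0.5 AND uses >= 5;")
# Unused skill: last_used more than 30 days ago (ISO strings compare correctly)
stale=$(sqlite3 "$DB" "SELECT 'UNUSED: ' || skill_name FROM skill_metrics
                       WHERE last_used < datetime('now', '-30 days');")
echo "$drops"
echo "$stale"
```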
## Architecture Decisions

- Reused the existing `metrics.db` instead of creating a new DB: keeps everything in one place
- Shell scripts, not Python: zero dependencies, works from any agent
- `agent_id` field: tracks which of the 5 agents executed which skill, enabling Q3 (Collaboration Gain) analysis later
- Manual logging, not hooks: the platform doesn't have post-execution hooks yet; agents are instructed via AGENTS.md to log after every skill use
## Adaptation from Spec

The original spec targets `knowledge.db` and `post_tool_call` hooks. This implementation adapts it to the target platform:
| Original Spec | Adapted Implementation |
|---|---|
| knowledge.db | metrics.db (already existed) |
| post_tool_call hook | Manual logging via AGENTS.md instruction |
| Single agent | 5 agents (main + α/β/γ/δ) with agent_id tracking |
| improve-Skill | skill-metrics.sh alerts command |
## Connection to Nowak
This implements the measurement apparatus — Nowak’s fᵢ (fitness per individual) and φ (population average). Without this measurement, there is no selection pressure, only drift.
- `skill_metrics` view = population fitness landscape
- `skill_pareto` view = multi-objective selection (EvoFlow's core idea)
- `alerts` = selection-pressure signals (not automatic action)
- `agent_skill_metrics` = niche performance (MAP-Elites analogy per agent)
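In replicator notation the measured quantities read as follows; identifying a skill's measured quality with its fitness f_i and its usage share with the frequency x_i is our interpretation, not something the spec states.

```latex
% phi: average population fitness; x_i: usage share of skill i;
% f_i: measured fitness of skill i (e.g. avg_quality)
\phi = \sum_i x_i f_i, \qquad \dot{x}_i = x_i \,\bigl(f_i - \phi\bigr)
```

A skill whose fitness f_i exceeds the average φ would grow in usage share under selection; Phase 1 only measures these quantities and leaves the selection step to later phases.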
## Updates (2026-03-22, same day)

### 5. Feedback Script (`skill-feedback.sh`)

```shell
~/.openclaw/scripts/skill-feedback.sh <quality_score> [--skill <name>] [--notes <text>]
```
Updates the `quality_score` of the last skill execution. Called when the user gives feedback ("good" → 0.95, "wrong" → 0.2).
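A minimal sketch of the update logic such a script could use; the schema, the function name, and the "last execution = highest id" assumption are all hypothetical.

```shell
# Sketch: overwrite quality_score of the most recent execution.
# Schema and sample rows are illustrative, not the deployed ones.
DB=$(mktemp)
sqlite3 "$DB" <<'SQL'
CREATE TABLE skill_performance (id INTEGER PRIMARY KEY, skill_name TEXT, quality_score REAL);
INSERT INTO skill_performance (skill_name, quality_score) VALUES
  ('deep-research', 0.5),
  ('summarize',     0.5);
SQL
skill_feedback() {  # usage: skill_feedback <quality_score> [skill_name]
  local score="$1" skill="${2:-}"
  if [ -n "$skill" ]; then
    # Target the latest execution of the named skill
    sqlite3 "$DB" "UPDATE skill_performance SET quality_score = $score
                   WHERE id = (SELECT MAX(id) FROM skill_performance WHERE skill_name = '$skill');"
  else
    # No skill given: target the latest execution overall
    sqlite3 "$DB" "UPDATE skill_performance SET quality_score = $score
                   WHERE id = (SELECT MAX(id) FROM skill_performance);"
  fi
}
skill_feedback 0.95   # user said "good": last entry gets 0.95
```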
### 6. Auto-Quality Heuristic

Documented in `skills/auto-quality/SKILL.md`. Binary signals instead of the 0.5 default:
| Signal | Score |
|---|---|
| Code builds/deploys without errors | 0.85 |
| User says "good"/"perfect" | 0.95 |
| Needs rework (1×) | 0.6 |
| Wrong approach | 0.1 |
| Research delivered + in vault | 0.8 |
| Routine task success | 1.0 |
Result: 76% of entries now have real scores (was 28% before).
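The table above amounts to a lookup; a sketch with hypothetical signal labels (the real heuristic lives in `skills/auto-quality/SKILL.md` and may key off different identifiers).

```shell
# Map detected signals to quality scores; signal names are hypothetical labels.
auto_quality() {
  case "$1" in
    build-ok)       echo 0.85 ;;   # code builds/deploys without errors
    user-good)      echo 0.95 ;;   # user says "good"/"perfect"
    rework-once)    echo 0.6  ;;   # needed rework (1x)
    wrong-approach) echo 0.1  ;;   # wrong approach
    research-vault) echo 0.8  ;;   # research delivered + in vault
    routine-ok)     echo 1.0  ;;   # routine task success
    *)              echo 0.5  ;;   # unknown signal: fall back to the old default
  esac
}
auto_quality build-ok   # prints 0.85
```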
### 7. Pareto View v2 (`skill_pareto_v2`)

Computed directly on `skill_performance` (not via `skill_metrics`). Includes:

- `quality_per_10k_tokens`: efficiency metric
- `quality_trend`: last 5 vs. first 5 runs
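The efficiency column is straightforward to reproduce; a sketch on illustrative data (the `quality_trend` window comparison is omitted for brevity, and the sample rows are invented).

```shell
# Sketch: efficiency metric computed directly on skill_performance.
# Sample rows are illustrative; quality_trend (last 5 vs. first 5) not shown.
DB=$(mktemp)
sqlite3 "$DB" <<'SQL'
CREATE TABLE skill_performance (id INTEGER PRIMARY KEY, skill_name TEXT, quality_score REAL, token_cost INTEGER);
INSERT INTO skill_performance (skill_name, quality_score, token_cost) VALUES
  ('deep-research', 0.8, 14000),
  ('deep-research', 0.9, 16000);
SQL
# quality per 10k tokens = avg quality / (avg tokens / 10000)
eff=$(sqlite3 "$DB" "SELECT skill_name,
         ROUND(AVG(quality_score) / (AVG(token_cost) / 10000.0), 2)
       FROM skill_performance GROUP BY skill_name;")
echo "$eff"   # deep-research|0.57
```

On these invented rows the metric lands at 0.57 quality/10k tokens, the same order of magnitude as the Day 1 figure reported below.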
### 8. Token Usage Script (`token-usage.sh`)

```shell
~/.openclaw/scripts/token-usage.sh session   # Current session
~/.openclaw/scripts/token-usage.sh today     # All sessions today
~/.openclaw/scripts/token-usage.sh all       # Summary with cost estimate
```
Reads real token counts from the platform’s sessions.json (input, output, cache read/write).
### 9. Dashboard (`skill-dashboard.sh`)

```shell
~/.openclaw/scripts/skill-dashboard.sh
```
Combined view: Pareto ranking + recent executions + stats + token usage.
## First Real Data (Day 1)
| Metric | Value |
|---|---|
| Total entries | 21 |
| Unique skills | 7 |
| Real quality scores (≠0.5) | 76% |
| Top skill by efficiency | deep-research (0.57 quality/10k tokens) |
## Next Steps (Phase 2 Prerequisites)
- Accumulate ≥50 entries across all agents
- First Pareto analysis with ≥3 uses per skill
- First alert fires (quality drop or cost outlier)
- Evaluate: Is manual logging sustainable or do we need automation?
- Skill-routing based on Pareto data (planned W14)
- Token tracking from API response instead of estimates (planned W14)