Phase 1 Implementation

TL;DR

SQLite schema + shell scripts to measure skill fitness across 5 agents. 3 commands to start: create DB, log executions, view metrics. Pareto classification after ≥3 uses per skill.

Status: ✅ Deployed (2026-03-22)
Platform: Multi-agent system (5 agents: main + α/β/γ/δ)
Spec: Phase 1: Feedback Loop

Quickstart (3 commands)

# 1. Create the database
sqlite3 ~/.openclaw/data/metrics.db < implementation/phase-1/schema.sql

# 2. Log a skill execution
skill-log.sh deep-research success --quality 0.85 --tokens 15000 --duration 120

# 3. View metrics
skill-metrics.sh overview

What Was Built

1. SQLite Schema (metrics.db)

Extended the existing ~/.openclaw/data/metrics.db with:

  • skill_performance table — logs every skill execution with agent_id, quality_score, token_cost, duration, outcome
  • skill_metrics view — aggregated per-skill stats (avg quality, cost, success rate)
  • skill_pareto view — Pareto classification (PARETO-OPTIMAL / DOMINATED / TRADE-OFF)
  • agent_skill_metrics view — per-agent breakdown
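The contents of schema.sql are not reproduced here. As a rough illustration only, a table and aggregate view of this shape could look like the sketch below. The columns agent_id, quality_score, token_cost, duration and outcome come from the description above, and the 0.5 quality default is mentioned later in this document. Everything else (id, logged_at, skill_name, the 'main' default, the task_type/project/notes columns inferred from the skill-log.sh flags, and the exact view columns) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the project's schema.sql.
# Usage: create_skill_schema /tmp/test-metrics.db
create_skill_schema() {
  local db="$1"
  sqlite3 "$db" <<'SQL'
CREATE TABLE IF NOT EXISTS skill_performance (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,          -- assumed surrogate key
  logged_at     TEXT    NOT NULL DEFAULT (datetime('now')), -- assumed timestamp column
  skill_name    TEXT    NOT NULL,                           -- assumed column name
  agent_id      TEXT    NOT NULL DEFAULT 'main',            -- assumed default
  outcome       TEXT    NOT NULL,                           -- e.g. 'success'
  quality_score REAL    DEFAULT 0.5 CHECK (quality_score BETWEEN 0 AND 1),
  token_cost    INTEGER,
  duration      INTEGER,                                    -- seconds
  task_type     TEXT,                                       -- assumed from --task-type
  project       TEXT,                                       -- assumed from --project
  notes         TEXT                                        -- assumed from --notes
);

-- Per-skill aggregates: average quality, average cost, success rate.
CREATE VIEW IF NOT EXISTS skill_metrics AS
SELECT skill_name,
       COUNT(*)                           AS uses,
       ROUND(AVG(quality_score), 3)       AS avg_quality,
       ROUND(AVG(token_cost), 0)          AS avg_cost,
       ROUND(AVG(outcome = 'success'), 3) AS success_rate,
       MAX(logged_at)                     AS last_used
FROM skill_performance
GROUP BY skill_name;
SQL
}
```

Pointing the function at a scratch file rather than ~/.openclaw/data/metrics.db keeps the experiment away from real data.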

2. Logging Script (skill-log.sh)

~/.openclaw/scripts/skill-log.sh <skill_name> <outcome> [--agent <id>] [--quality <0-1>] [--tokens <n>] [--duration <sec>] [--task-type <type>] [--project <name>] [--notes <text>]

Call it after every skill execution. The AGENTS.md of each of the 5 agents makes this step mandatory.
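To make the option handling concrete, here is a minimal sketch of how the flags in the usage line could be parsed and turned into an INSERT statement. It is not the deployed skill-log.sh: the function names, the 'main' default for --agent, and the column names are assumptions, and the statement is printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the option parsing; the real skill-log.sh may differ.
# It prints the INSERT statement instead of executing it.

sql_quote() {                       # SQL string literal, or NULL when empty
  local q="'" s="$1"
  [ -n "$s" ] || { printf 'NULL'; return; }
  printf "'%s'" "${s//$q/$q$q}"     # double any embedded single quote
}

build_log_insert() {
  [ "$#" -ge 2 ] || { echo "usage: build_log_insert <skill_name> <outcome> [options]" >&2; return 2; }
  local skill="$1" outcome="$2"; shift 2
  local agent="main" quality="0.5" tokens="" duration="" task_type="" project="" notes=""
  while [ "$#" -gt 0 ]; do
    [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
    case "$1" in
      --agent)     agent="$2" ;;
      --quality)   quality="$2" ;;
      --tokens)    tokens="$2" ;;
      --duration)  duration="$2" ;;
      --task-type) task_type="$2" ;;
      --project)   project="$2" ;;
      --notes)     notes="$2" ;;
      *) echo "unknown option: $1" >&2; return 2 ;;
    esac
    shift 2
  done
  # Validate numeric inputs before they reach the SQL text.
  local re_q='^(0(\.[0-9]+)?|1(\.0+)?)$' re_n='^[0-9]*$'
  [[ "$quality" =~ $re_q ]] || { echo "--quality must be in [0,1]" >&2; return 2; }
  [[ "$tokens" =~ $re_n && "$duration" =~ $re_n ]] || { echo "--tokens/--duration must be integers" >&2; return 2; }
  printf 'INSERT INTO skill_performance (skill_name, outcome, agent_id, quality_score, token_cost, duration, task_type, project, notes) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);\n' \
    "$(sql_quote "$skill")" "$(sql_quote "$outcome")" "$(sql_quote "$agent")" \
    "$quality" "${tokens:-NULL}" "${duration:-NULL}" \
    "$(sql_quote "$task_type")" "$(sql_quote "$project")" "$(sql_quote "$notes")"
}
```

Piping the printed statement into sqlite3 against a scratch database would complete the round trip. Quoting and validating in one place matters here because --notes is free text.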

3. Metrics Script (skill-metrics.sh)

~/.openclaw/scripts/skill-metrics.sh overview   # Summary
~/.openclaw/scripts/skill-metrics.sh pareto     # Quality vs. Cost
~/.openclaw/scripts/skill-metrics.sh agents     # Per-agent breakdown
~/.openclaw/scripts/skill-metrics.sh alerts     # Quality drops, cost outliers
~/.openclaw/scripts/skill-metrics.sh detail <name>  # History for one skill
~/.openclaw/scripts/skill-metrics.sh recent [n]     # Last n entries
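The definition behind the pareto subcommand is not shown in this document. The three labels (PARETO-OPTIMAL / DOMINATED / TRADE-OFF) suggest a quadrant-style rule rather than strict Pareto dominance, so the sketch below compares each skill against the population means. That rule, the function name, and the input format (skill, uses, avg_quality, avg_cost per line) are assumptions; only the labels and the minimum of 3 uses come from the text.

```shell
#!/usr/bin/env bash
# Hypothetical quadrant classification; the real skill_pareto view may differ.
# Input lines: <skill> <uses> <avg_quality> <avg_cost>
classify_pareto() {
  awk -v min_uses=3 '
    $2 >= min_uses {                  # skills with too few uses are skipped
      n++; name[n] = $1; q[n] = $3; c[n] = $4
      sum_q += $3; sum_c += $4
    }
    END {
      if (n == 0) exit
      mean_q = sum_q / n; mean_c = sum_c / n
      for (i = 1; i <= n; i++) {
        if      (q[i] >= mean_q && c[i] <= mean_c) label = "PARETO-OPTIMAL"
        else if (q[i] <  mean_q && c[i] >  mean_c) label = "DOMINATED"
        else                                       label = "TRADE-OFF"
        printf "%s\t%s\n", name[i], label
      }
    }'
}
```

For example, feeding it a skill with high quality at high cost next to a cheaper skill of similar quality would label the first TRADE-OFF and the second PARETO-OPTIMAL under this rule.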

4. Alert Triggers

| Trigger | Condition | Action |
|---|---|---|
| Quality Drop | avg_quality < 0.5 AND uses >= 5 | Alert to user |
| Cost Outlier | avg_cost > 2× median | Alert to user |
| Unused Skill | last_used > 30 days | Alert to user |

Alerts are advisory only — no automatic changes (Phase 1 principle).
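The three trigger conditions can be sketched in a few lines of awk. This is an illustration under assumptions, not the alerts subcommand itself: the input format (skill, uses, avg_quality, avg_cost, days since last use), the alert labels, and the choice to take the median over per-skill average costs are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the alert rules; it only prints, it changes nothing.
# Input lines: <skill> <uses> <avg_quality> <avg_cost> <days_since_last_use>
skill_alerts() {
  awk '
    { n++; name[n] = $1; uses[n] = $2; q[n] = $3; cost[n] = $4; idle[n] = $5; s[n] = $4 + 0 }
    END {
      if (n == 0) exit
      # Insertion sort on a copy of the costs, then take the median.
      for (i = 2; i <= n; i++) {
        v = s[i]
        for (j = i - 1; j >= 1 && s[j] > v; j--) s[j + 1] = s[j]
        s[j + 1] = v
      }
      median = (n % 2) ? s[(n + 1) / 2] : (s[n / 2] + s[n / 2 + 1]) / 2
      for (i = 1; i <= n; i++) {
        if (q[i] < 0.5 && uses[i] >= 5) print "QUALITY_DROP\t" name[i]
        if (cost[i] > 2 * median)       print "COST_OUTLIER\t" name[i]
        if (idle[i] > 30)               print "UNUSED_SKILL\t" name[i]
      }
    }'
}
```

Because the function only prints candidate alerts, it stays within the advisory-only principle: acting on them is left to the user.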

Architecture Decisions

  1. Reused existing metrics.db instead of creating a new DB — keeps everything in one place
  2. Shell scripts, not Python — zero dependencies, works from any agent
  3. agent_id field — tracks which of the 5 agents executed which skill, enabling Q3 (Collaboration Gain) analysis later
  4. Manual logging, not hooks — The platform doesn’t have post-execution hooks yet; agents are instructed via AGENTS.md to log after every skill use
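Until post-execution hooks exist, one middle ground between fully manual logging and a hook would be a small wrapper that times a command and then calls the logger. The sketch below is not part of the deployed setup. The function name run_and_log, the SKILL_LOG override variable, and the "failure" outcome value are assumptions; only the script path and the --duration flag come from this document.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, time it, then call the logger.
# SKILL_LOG can point at a stub for dry runs.
SKILL_LOG="${SKILL_LOG:-$HOME/.openclaw/scripts/skill-log.sh}"

run_and_log() {
  local skill="$1"; shift
  local start end status=0 outcome
  start=$(date +%s)
  "$@" || status=$?                 # run the wrapped command, keep its exit code
  end=$(date +%s)
  if [ "$status" -eq 0 ]; then
    outcome="success"
  else
    outcome="failure"               # assumed outcome value
  fi
  "$SKILL_LOG" "$skill" "$outcome" --duration "$((end - start))"
  return "$status"
}
```

A call such as `run_and_log deep-research ./some-task.sh` (the task script is a placeholder) would log duration and outcome, while quality and token counts would still need to be supplied separately.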

Adaptation from Spec

The original spec targets knowledge.db and post_tool_call hooks. This implementation adapts to the target platform:

| Original Spec | Adapted Implementation |
|---|---|
| knowledge.db | metrics.db (already existed) |
| post_tool_call hook | Manual logging via AGENTS.md instruction |
| Single agent | 5 agents (main + α/β/γ/δ) with agent_id tracking |
| improve-Skill | skill-metrics.sh alerts command |

Connection to Nowak

This implements the measurement apparatus — Nowak’s fᵢ (fitness per individual) and φ (population average). Without this measurement, there is no selection pressure, only drift.

  • skill_metrics view = population fitness landscape
  • skill_pareto view = multi-objective selection (EvoFlow’s core idea)
  • alerts = selection pressure signals (not automatic action)
  • agent_skill_metrics = niche performance (MAP-Elites analogy per agent)
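For reference, fᵢ and φ are the quantities of the replicator equation from evolutionary dynamics. Reading xᵢ as a skill's share of logged executions and fᵢ as its average quality score is an interpretation assumed here, not something the spec states:

```latex
\phi = \sum_{i=1}^{n} x_i f_i , \qquad \dot{x}_i = x_i \, (f_i - \phi)
```

Under this reading, a skill with fᵢ above φ would gain share once selection is applied. Phase 1 only measures fᵢ and φ and leaves xᵢ untouched, which matches the advisory-only alerts.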

Updates (2026-03-22, same day)

5. Feedback Script (skill-feedback.sh)

~/.openclaw/scripts/skill-feedback.sh <quality_score> [--skill <name>] [--notes <text>]

Updates the quality_score of the last skill execution. Called when the user gives feedback (“good” → 0.95, “wrong” → 0.2).
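Updating only the most recent row comes down to a single UPDATE with a subquery. The sketch below assumes an integer id column that grows with insertion order and a skill_name column. Neither is confirmed by this document, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real skill-feedback.sh may select the row differently.
# Usage: update_last_quality <db> <quality_score> [skill_name]
update_last_quality() {
  local db="$1" score="$2" skill="${3:-}"
  local re='^(0(\.[0-9]+)?|1(\.0+)?)$'
  [[ "$score" =~ $re ]] || { echo "quality_score must be in [0,1]" >&2; return 2; }
  local q="'" filter=""
  # Optional restriction to one skill; embedded single quotes are doubled.
  [ -z "$skill" ] || filter="WHERE skill_name = '${skill//$q/$q$q}'"
  sqlite3 "$db" "UPDATE skill_performance SET quality_score = $score
                 WHERE id = (SELECT id FROM skill_performance $filter
                             ORDER BY id DESC LIMIT 1);"
}
```

With a skill name given, the update targets the latest execution of that skill; without one, it targets the latest execution overall.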

6. Auto-Quality Heuristic

Documented in skills/auto-quality/SKILL.md. Each yes/no signal maps to a fixed score that replaces the 0.5 default:

| Signal | Score |
|---|---|
| Code builds/deploys without errors | 0.85 |
| User says “good”/“perfect” | 0.95 |
| Needs rework (1×) | 0.6 |
| Wrong approach | 0.1 |
| Research delivered + in vault | 0.8 |
| Routine task success | 1.0 |

Result: 76% of entries now carry a real score, up from 28%.
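The signal table above amounts to a lookup. A minimal sketch follows; the signal keys are hypothetical labels invented for this example, while the scores and the 0.5 fallback are the ones listed in this document.

```shell
#!/usr/bin/env bash
# Signal keys are hypothetical labels for the rows of the signal table.
auto_quality() {
  case "$1" in
    build-ok)       echo 0.85 ;;   # code builds/deploys without errors
    user-praise)    echo 0.95 ;;   # user says "good" or "perfect"
    rework-once)    echo 0.6  ;;   # needs rework (1x)
    wrong-approach) echo 0.1  ;;
    research-vault) echo 0.8  ;;   # research delivered and in vault
    routine-ok)     echo 1.0  ;;
    *)              echo 0.5  ;;   # no recognized signal: keep the default
  esac
}
```

The result could then be passed on, for example as `--quality "$(auto_quality build-ok)"` when calling skill-log.sh.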

7. Pareto View v2 (skill_pareto_v2)

Defined directly on skill_performance rather than on top of skill_metrics. It adds:

  • quality_per_10k_tokens — efficiency metric
  • quality_trend — last 5 vs. first 5 runs
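The exact formulas behind these two columns are not given here. One plausible reading, sketched below in awk, takes efficiency as average quality divided by average tokens, scaled to 10k tokens, and trend as the mean of the last 5 runs minus the mean of the first 5. Both formulas, the function name, and the input format (one run per line, oldest first, as quality and tokens) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch for one skill's history.
# Input lines, oldest first: <quality_score> <token_cost>
skill_efficiency() {
  awk '
    { n++; q[n] = $1; sum_q += $1; sum_t += $2 }
    END {
      if (n == 0 || sum_t == 0) exit 1
      per10k = (sum_q / n) / (sum_t / n) * 10000
      k = (n < 5) ? n : 5              # with fewer than 10 runs the windows overlap
      for (i = 1; i <= k; i++)         first += q[i]
      for (i = n - k + 1; i <= n; i++) last  += q[i]
      printf "quality_per_10k_tokens=%.2f quality_trend=%+.2f\n", per10k, (last - first) / k
    }'
}
```

With a single run of quality 0.85 at 15000 tokens, this reading yields 0.57 quality per 10k tokens, which is consistent with the deep-research figure reported for Day 1.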

8. Token Usage Script (token-usage.sh)

~/.openclaw/scripts/token-usage.sh session  # Current session
~/.openclaw/scripts/token-usage.sh today    # All sessions today
~/.openclaw/scripts/token-usage.sh all      # Summary with cost estimate

Reads real token counts from the platform’s sessions.json (input, output, cache read/write).

9. Dashboard (skill-dashboard.sh)

~/.openclaw/scripts/skill-dashboard.sh

Combined view: Pareto ranking + recent executions + stats + token usage.

First Real Data (Day 1)

| Metric | Value |
|---|---|
| Total entries | 21 |
| Unique skills | 7 |
| Real quality scores (≠0.5) | 76% |
| Top skill by efficiency | deep-research (0.57 quality/10k tokens) |

Next Steps (Phase 2 Prerequisites)

  • Accumulate ≥50 entries across all agents
  • First Pareto analysis with ≥3 uses per skill
  • First alert fires (quality drop or cost outlier)
  • Evaluate: Is manual logging sustainable or do we need automation?
  • Skill-routing based on Pareto data (planned W14)
  • Token tracking from API response instead of estimates (planned W14)

CC BY-SA 4.0 — Evolving Agents — A living research collection.