Deep Dive: EvoFlow, MCE, AgentFactory — and What AgentField Can Learn
Reading: 3 of 5 · After reading this, you’ll understand how EvoFlow implements niching selection for workflows, how MCE formalizes skill co-evolution, and a concrete 4-phase upgrade path from Prelife to full agent evolution.
Date: 2026-03-19 Type: Comparative Analysis (3 Papers → 1 System)
📑 On this page
Deep Dive: EvoFlow, MCE, AgentFactory — und was AgentField daraus lernen kann
Lesehinweis: 3 von 5 · Nach der Lektüre verstehst du, wie EvoFlow Niching-Selektion für Workflows implementiert, wie MCE Skill-Ko-Evolution formalisiert, und einen konkreten 4-Phasen-Upgrade-Pfad von Prelife zu voller Agent-Evolution.
Datum: 2026-03-19 Typ: Vergleichsanalyse (3 Papers → 1 System)
📑 Auf dieser Seite
1. The Three Papers at a Glance
Section Summary
EvoFlow evolves a population of workflows (beat o1-preview at 12.4% cost). MCE co-evolves skills via meta-agent (+16.9% avg). AgentFactory stores solutions as executable code, not text. Together they show three independent paths from manual to evolutionary agent design.
| EvoFlow | MCE | AgentFactory | |
|---|---|---|---|
| What evolves | Workflow graphs (operator DAGs) | Skills (prompt + tool configs) | Subagent code modules |
| Evolution mechanism | Crossover + mutation + niching | Meta-agent co-evolution | Accumulate & reuse executable subagents |
| Selection | Pareto (quality vs. cost) | Meta-agent evaluation (+16.9%) | Task performance filtering |
| Diversity strategy | Niching / MAP-Elites style | Skill specialization | Code library growth |
| Storage format | Text (prompts + graph) | Text (SKILL.md style) | Executable code |
| Best result | Beat o1-preview at 12.4% cost; +4.15pp avg over 6 benchmarks; cross-domain transfer works | +16.9% avg improvement | Autonomous skill accumulation |
| Nowak analogy | Population with selection (P2+P3) | Heredity via meta-agent (P7) | Replicator accumulation (P1) |
| arXiv | 2502.07373 | 2601.21557 | 2603.18000 |
EvoFlow (2502.07373) — Population-Based Workflow Evolution
Core Idea: Instead of searching for ONE optimal workflow, evolve a POPULATION of heterogeneous workflows. Use niching selection (related to MAP-Elites) to maintain diversity and quality simultaneously.
Two core problems of prior approaches:
- Missing LLM heterogeneity: Previous systems use a single expensive LLM (e.g., GPT-4o) for all agents
- Missing complexity diversity: Optimization converges to a single, complex workflow; simple queries get unnecessarily expensive treatment
Architecture — Hierarchical Search Space (3 Levels):
Workflow G = (Oˢ, Eᵃ) ← Directed graph of operator nodes + inter-operator connectivity
└── Operator Oⱼ = (Iᵒⱼ, Eᵒⱼ) ← Subset of Invoking Nodes + internal connectivity
└── Invoking Node Iᵢ = (Mᵢ, Pᵢ, τᵢ) ← LLM instance + Prompt + Temperature
where Mᵢ = (|Mᵢ|, Cᵢ, Lᵢ) → model size, token cost, latency
Feasible Space: I = M × P × ℝ[0,1]
Optimization Target (Multi-Objective):
G⋆ = arg max [u(G, T), −c(G, T)]ᵀ
G∈H(O,Eᵃ)
Where u(·) = Performance-Evaluator, c(·) = Cost-Evaluator. G⋆ is the Pareto-optimal set.
Operator Repository (8 Types)
| Operator | Description |
|---|---|
| CoT | Step-by-step Reasoning |
| LLM-Debate | 3 Debaters, max 2 rounds, Majority Voting |
| Take-a-Step-Back | First identify principles, then solve |
| Self-Consistency | 5 CoT paths aggregated via Majority Voting |
| Self-Refine | CoT + iterative self-reflection (max 5 iterations) |
| Ensemble | 3 LLM agents from different sources + Pairwise Ranking |
| ReAct | Agent uses tools (Code Interpreter, Web Search, Knowledge DB) |
| ExpertPrompt | Dynamic control flows, agent selects experts |
Evolutionary Algorithm (Pseudocode)
Input: Dataset D (train/test), Operator Set O
Output: Optimized Population P
// Phase 1: Population Initialization
FOR k = 1 TO N:
Gₖ ← {Oₖ, Eᵃₖ} // Random operator combination
Tags ← f_tag(Gₖ) // LLM-generated utility tags (κ=5 each)
P⁽⁰⁾ = {G₁, G₂, ..., Gₙ}
// Phase 2: Evolutionary Optimization (query-by-query)
FOR each query qₜ IN D_train:
// Step 1: Tag-based Retrieval
Parents {Gₜ₁,...,GₜK} = TopK(S({Gᵢ}|qₜ), K)
// S(Gᵢ|qₜ) = Σⱼ cosine_sim(v(κᵢ,ⱼ), v(qₜ))
// v(·) = all-MiniLM-L6-v2 Embeddings
// Step 2: Crossover (LLM-facilitated)
G°⁽ᵗ⁾ ← Crossover(Gₜ₁, ..., GₜK)
// Step 3: Mutation (3 types, applied sequentially)
G⊚⁽ᵗ⁾ ← µₒ(µₚ(µₗ(G°⁽ᵗ⁾)))
// Step 4: Niching-based Selection
// 4a: Identify Niching Area
PNA = TopK({−Rank(Gᵢ)}, E)
// Rank = RankS (Tag-Similarity) + Rankc (Cost-Similarity)
// 4b: Execute all relevant workflows
FOR Gᵢ IN PNA ∪ Parents ∪ {G⊚}:
Execute Gᵢ on qₜ → Update running avg cost c⁽ᵗ⁾ + performance p⁽ᵗ⁾
// 4c: Compute fitness (Pareto-Indicator-based)
F(G) = Σ exp(I(G, G⊚) / (φ · Imax)) // φ = 0.05
// 4d: Eliminate worst individual
G_worst = argmax F(G) over PNA ∪ {G⊚}
P⁽ᵗ⁺¹⁾ ← P⁽ᵗ⁾ \ G_worst ∪ G⊚⁽ᵗ⁾
RETURN P
The 4 Evolutionary Operators in Detail
| Operator | Type | Mechanism |
|---|---|---|
| LLM Crossover | Recombination | LLM-facilitated: combines structural elements of K parent workflows into a new offspring. Not random — the LLM understands the parent architectures and produces a meaningful synthesis. |
LLM Mutation µₗ | Backbone swap | LLM-powered decision based on LLM Experience Pool P_LLM. Pool contains per-LLM triplets (Workflow, Positive/Negative/None rating, qualitative feedback). Swaps the LLM backbone of an Invoking Node based on historical performance. |
Prompt Refinement µₚ | Prompt engineering | LLM-powered, uses Workflow Experience Pool P_WF with triplets (Workflow, Query, Evaluation with quantitative + qualitative feedback). Modifies instructions, adds few-shot examples. |
Operator Mutation µₒ | Topology change | Adds (O_add) or removes (O_del) operators from the workflow DAG, rewires topology. Changes the structural composition of the workflow. |
| Niching Selection | Diversity preservation | Identifies a “niching area” (similar workflows by tag + cost), evaluates all candidates, eliminates the worst using a Pareto-dominance-preserving binary indicator. Prevents population collapse to a single workflow type. |
Seed Population: 4 initial workflow types (Reflective Agent I/O, Arithmetic Collaborator, Lightweight Programmer, Advanced Multi-programmer).
Benchmark Results
Homogeneous Settings (all methods use gpt-4o-mini):
| Method | GSM8K | MATH | MultiArith | HumanEval | MBPP | ALFWorld | Avg. |
|---|---|---|---|---|---|---|---|
| Vanilla | 87.45 | 46.29 | 96.85 | 87.08 | 71.83 | 38.71 | 71.37 |
| CoT | 87.10 | 46.40 | 96.31 | 88.13 | 71.83 | 39.92 | 71.62 |
| SC (CoT×5) | 87.57 | 47.91 | 96.58 | 88.60 | 73.60 | 40.55 | 72.47 |
| LLM-Debate | 89.47 | 48.54 | 97.33 | 88.68 | 70.29 | 44.68 | 73.17 |
| DyLAN | 89.98 | 48.63 | 97.12 | 90.42 | 77.30 | 53.32 | 76.13 |
| AgentVerse | 89.91 | 47.35 | 97.50 | 89.29 | 74.28 | 45.03 | 73.89 |
| GPTSwarm | 89.14 | 47.88 | 96.79 | 89.32 | 77.43 | 53.19 | 75.63 |
| ADAS | 86.12 | 43.18 | 96.02 | 84.19 | 68.13 | 47.66 | 70.88 |
| AgentSquare | 87.62 | 48.51 | 97.77 | 89.08 | 78.46 | 66.42 | 78.14 |
| AFlow | 91.16 | 51.28 | 96.22 | 90.93 | 81.67 | 59.16 | 78.40 |
| EvoFlow | 92.90 | 57.70 | 98.80 | 92.85 | 84.50 | 68.57 | 82.55 |
Improvement over SOTA (AFlow): +1.23% to +29.86% per benchmark. Average +4.15pp.
Heterogeneous Settings (MATH & MBPP, Open-Source LLM Pool):
| Method | MATH Acc. | MATH Cost ($) | MBPP pass@1 | MBPP Cost ($) |
|---|---|---|---|---|
| Qwen-2.5-72b (Single) | 63.80% | 0.032 | 69.76% | 0.009 |
| o1-preview (Single) | 70.20% | 7.841 | 89.65% | 3.209 |
| AFlow_Qwen | 66.38% | 3.846 | 80.84% | 1.598 |
| DyLAN_Qwen | 64.17% | 16.864 | 75.63% | 8.973 |
| EvoFlow | 72.90% | 0.973 | 87.62% | 0.565 |
Key findings:
- EvoFlow outperforms o1-preview by +2.7% on MATH at only 12.4% of the cost
- Training cost: $0.459 vs AFlow $1.223 (37.5%)
- Inference cost: $0.513 vs AFlow $2.623 (19.5%)
- Uses exclusively open-source models: LLaMa-3.1-70b, Qwen-2.5-72b, Deepseek-V2.5, Hermes-3-70b
Cross-Domain Settings (MATH + MBPP combined training):
| Method | LLM | MATH (solo) | MBPP (solo) | MATH+MBPP →MATH | MATH+MBPP →MBPP |
|---|---|---|---|---|---|
| DyLAN | Deepseek | 46.20 | 80.13 | 43.85 ↓ | 78.62 ↓ |
| GPTSwarm | Deepseek | 45.36 | 77.52 | 39.18 ↓ | 74.09 ↓ |
| AFlow | Deepseek | 48.65 | 79.14 | 43.22 ↓ | 77.02 ↓ |
| EvoFlow | LLM Pool | 72.90 | 87.62 | 72.69 (≈) | 88.35 ↑ |
EvoFlow benefits from cross-domain training (MBPP rises 87.62→88.35), while all baselines degrade. This demonstrates true complexity-adaptive routing.
Ablation Study — What Matters Most
| Variant | MATH | MBPP | Impact |
|---|---|---|---|
| Full EvoFlow | ~73% | ~88% | — |
| w/o Tag-based Retrieval | ↓ significant | ↓ significant | Critical — high variance increase |
| w/o LLM Mutation | ↓ significant | ↓ significant | Critical — high variance increase |
| w/o Prompt Mutation | ↓ moderate | ↓ moderate | Important but not critical |
| w/o Operator Mutation | ↓ 3.5–7.3% | ↓ 3.5–7.3% | Contributes meaningful gains |
Key takeaway: Tag-based Retrieval and LLM Mutation are the two most critical components — removing either causes both performance drops AND variance increase (unreliable results).
Parameter Sensitivity
| Parameter | Optimum | Observation |
|---|---|---|
| K (parent count) | 3 | Too small → low offspring diversity; Too large → LLM can’t aggregate |
| κ (tags per workflow) | 5 | Stable range |
| N (population size) | 15 (trade-off) | N=5→25: +3.1% performance, but cost rises from 8e-4 to 2e-3 $/query |
Pareto Front (evolved population)
The evolved population forms a clear Pareto front:
- Simplest workflows: Basic I/O + Self-Refine → 38.7% Acc, $0.00018/query
- Medium workflows: Multi-agent Debate → balanced cost/performance
- Most complex workflows: Iterative Generation + Ensembling → 72.57% Acc, $0.0037/query
EvoFlow automatically selects the appropriate workflow at inference time based on query complexity.
Unique Capabilities vs. All Baselines
| Feature | EvoFlow | AFlow | AgentSquare | ADAS | GPTSwarm |
|---|---|---|---|---|---|
| Prompt Optimize | ✔ | ✔ | ✗ | ✔ | ✗ |
| Agent Topology | ✔ | ✔ | ✔ | ✗ | ✔ |
| Agent Profile | ✔ | ✔ | ✔ | ✔ | ✗ |
| LLM Backbone (heterogeneous) | ✔ | ✗ | ✗ | ✗ | ✗ |
| Complexity Adaptivity | ✔ | ✗ | ✗ | ✗ | ✗ |
Limitations (derived from paper)
- Operator repository is manually curated: The initial 8 operators are hand-designed; the system can evolve new topologies but starts from manual primitives.
- LLM-dependent mutation: Crossover and mutation are LLM-facilitated → quality depends on the orchestrator LLM’s ability to generate sensible workflows.
- Evaluation on limited domains: 6 benchmarks (Math, Code, Embodied) — no evaluation on open-ended generation, long-context reasoning, or multimodal tasks.
- Population scalability: Each query requires executing Parents + Niching Area + Offspring (K + E + 1 workflows). At N=15, K=3, E=5 that’s ~9 workflow executions per training query.
- No evaluation with stronger closed-source models in pool: Heterogeneous settings tested only with open-source models.
- Cross-domain generalization limited: Only MATH+MBPP combined, no broader multi-domain evaluation.
💡 Key Insight
EvoFlow beat o1-preview on MATH at just 12.4% of the cost — by evolving a population of heterogeneous workflows with open-source models. Population diversity + multi-objective selection is the key mechanism. The ablation confirms: Tag-based Retrieval and LLM Mutation are the two non-negotiable components.
MCE — Meta Context Engineering (2601.21557) — Skill Co-evolution
Core Idea: Context Engineering (CE) should not be designed manually. Instead: A Meta-Agent that automatically refines CE skills while a Base-Agent executes those skills.
Bi-Level Architecture:
Meta-Agent (upper level)
├── Deliberative Search over: historical skills, executions, evaluations
├── "Agentic Crossover": Combine successful skill elements
└── Refine skills iteratively
Base-Agent (lower level)
├── Executes skills
├── Optimizes context artifacts
└── Provides feedback to Meta-Agent
Results: 5.6–53.8% improvement over SOTA CE methods (mean: 16.9%), tested across 5 domains, offline + online.
💡 Key Insight
MCE's bi-level architecture (Meta-Agent refines skills, Base-Agent executes them) is directly applicable to AgentField: our Quality-Gate + Pulse already provide the evaluation signal — we just need to close the loop.
AgentFactory (2603.18000) — Executable Subagent Accumulation
Core Idea: Don’t store successful solutions as text experience (fragile), but as executable subagent code (Python). These subagents are continuously refined and reused.
Key Innovation:
- Textual reflection (e.g., “do it differently next time”) ≠ reliable re-execution
- Executable code = deterministically repeatable
- Standardized Python documentation → portability across systems
- Library grows over time → progressive effort reduction
1. Die drei Papers im Überblick
Abschnitt-Zusammenfassung
EvoFlow evolviert eine Population von Workflows (o1-preview geschlagen bei 12,4% der Kosten). MCE ko-evolviert Skills via Meta-Agent (+16,9% Ø). AgentFactory speichert Lösungen als ausführbaren Code, nicht Text. Zusammen zeigen sie drei unabhängige Wege von manuellem zu evolutionärem Agent-Design.
| EvoFlow | MCE | AgentFactory | |
|---|---|---|---|
| Was evolviert | Workflow-Graphen (Operator-DAGs) | Skills (Prompt + Tool-Configs) | Subagent-Code-Module |
| Evolutionsmechanismus | Crossover + Mutation + Niching | Meta-Agent-Ko-Evolution | Akkumuliere & nutze ausführbare Subagents |
| Selektion | Pareto (Qualität vs. Kosten) | Meta-Agent-Evaluation (+16,9%) | Task-Performance-Filtering |
| Diversitätsstrategie | Niching / MAP-Elites-Stil | Skill-Spezialisierung | Code-Library-Wachstum |
| Speicherformat | Text (Prompts + Graph) | Text (SKILL.md-Stil) | Ausführbarer Code |
| Bestes Ergebnis | o1-preview geschlagen bei 12,4% Kosten; +4,15pp Ø über 6 Benchmarks; Cross-Domain-Transfer funktioniert | +16,9% Ø Verbesserung | Autonome Skill-Akkumulation |
| Nowak-Analogie | Population mit Selektion (P2+P3) | Vererbung via Meta-Agent (P7) | Replikator-Akkumulation (P1) |
| arXiv | 2502.07373 | 2601.21557 | 2603.18000 |
EvoFlow (2502.07373) — Populationsbasierte Workflow-Evolution
Kernidee: Statt EINEN optimalen Workflow zu suchen, evolve eine POPULATION heterogener Workflows. Nutze Niching-Selektion (verwandt mit MAP-Elites) um Diversität und Qualität gleichzeitig zu erhalten.
Zwei Kernprobleme bisheriger Ansätze:
- Fehlende LLM-Heterogenität: Bisherige Systeme nutzen ein einzelnes teures LLM (z.B. GPT-4o) für alle Agenten
- Fehlende Komplexitätsdiversität: Optimierung konvergiert zu einem einzigen, komplexen Workflow; einfache Queries werden unnötig aufwändig bearbeitet
Architektur — Hierarchischer Search Space (3 Ebenen):
Workflow G = (Oˢ, Eᵃ) ← Gerichteter Graph von Operator-Knoten + Inter-Operator-Konnektivität
└── Operator Oⱼ = (Iᵒⱼ, Eᵒⱼ) ← Subset von Invoking Nodes + interne Konnektivität
└── Invoking Node Iᵢ = (Mᵢ, Pᵢ, τᵢ) ← LLM-Instanz + Prompt + Temperatur
wobei Mᵢ = (|Mᵢ|, Cᵢ, Lᵢ) → Modellgröße, Token-Cost, Latenz
Feasible Space: I = M × P × ℝ[0,1]
Optimierungsziel (Multi-Objective):
G⋆ = arg max [u(G, T), −c(G, T)]ᵀ
G∈H(O,Eᵃ)
Wobei u(·) = Performance-Evaluator, c(·) = Cost-Evaluator. G⋆ ist das Pareto-optimale Set.
Operator-Repository (8 Typen)
| Operator | Beschreibung |
|---|---|
| CoT | Step-by-step Reasoning |
| LLM-Debate | 3 Debatter, max 2 Runden, Majority Voting |
| Take-a-Step-Back | Erst Prinzipien identifizieren, dann lösen |
| Self-Consistency | 5 CoT-Pfade aggregiert via Majority Voting |
| Self-Refine | CoT + iterative Selbstreflexion (max 5 Iterationen) |
| Ensemble | 3 LLM-Agenten aus verschiedenen Quellen + Pairwise Ranking |
| ReAct | Agent nutzt Tools (Code Interpreter, Web Search, Knowledge DB) |
| ExpertPrompt | Dynamische Control Flows, Agent wählt Experten |
Evolutionärer Algorithmus (Pseudocode)
Input: Dataset D (train/test), Operator Set O
Output: Optimierte Population P
// Phase 1: Population Initialization
FOR k = 1 TO N:
Gₖ ← {Oₖ, Eᵃₖ} // Zufällige Operator-Kombination
Tags ← f_tag(Gₖ) // LLM-generierte Utility-Tags (κ=5 Stück)
P⁽⁰⁾ = {G₁, G₂, ..., Gₙ}
// Phase 2: Evolutionäre Optimierung (query-by-query)
FOR each query qₜ IN D_train:
// Schritt 1: Tag-based Retrieval
Parents {Gₜ₁,...,GₜK} = TopK(S({Gᵢ}|qₜ), K)
// S(Gᵢ|qₜ) = Σⱼ cosine_sim(v(κᵢ,ⱼ), v(qₜ))
// v(·) = all-MiniLM-L6-v2 Embeddings
// Schritt 2: Crossover (LLM-facilitated)
G°⁽ᵗ⁾ ← Crossover(Gₜ₁, ..., GₜK)
// Schritt 3: Mutation (3 Typen, sequenziell angewandt)
G⊚⁽ᵗ⁾ ← µₒ(µₚ(µₗ(G°⁽ᵗ⁾)))
// Schritt 4: Niching-based Selection
// 4a: Niching Area identifizieren
PNA = TopK({−Rank(Gᵢ)}, E)
// Rank = RankS (Tag-Similarity) + Rankc (Cost-Similarity)
// 4b: Alle relevanten Workflows ausführen
FOR Gᵢ IN PNA ∪ Parents ∪ {G⊚}:
Execute Gᵢ on qₜ → Update Running Avg Cost c⁽ᵗ⁾ + Performance p⁽ᵗ⁾
// 4c: Fitness berechnen (Pareto-Indikator-basiert)
F(G) = Σ exp(I(G, G⊚) / (φ · Imax)) // φ = 0.05
// 4d: Schlechtestes Individuum eliminieren
G_worst = argmax F(G) über PNA ∪ {G⊚}
P⁽ᵗ⁺¹⁾ ← P⁽ᵗ⁾ \ G_worst ∪ G⊚⁽ᵗ⁾
RETURN P
Die 4 Evolutions-Operatoren im Detail
| Operator | Typ | Mechanismus |
|---|---|---|
| LLM Crossover | Rekombination | LLM-facilitated: kombiniert strukturelle Elemente von K Eltern-Workflows zu einem neuen Offspring. Nicht zufällig — das LLM versteht die Eltern-Architekturen und produziert eine sinnvolle Synthese. |
LLM Mutation µₗ | Backbone-Tausch | LLM-gesteuerte Entscheidung basierend auf LLM Experience Pool P_LLM. Pool enthält pro LLM Triplets (Workflow, Positive/Negative/None Bewertung, qualitatives Feedback). Tauscht das LLM-Backbone eines Invoking Node basierend auf historischer Performance. |
Prompt Refinement µₚ | Prompt Engineering | LLM-gesteuert, nutzt Workflow Experience Pool P_WF mit Triplets (Workflow, Query, Evaluation mit quantitativem + qualitativem Feedback). Modifiziert Instruktionen, fügt Few-shot-Beispiele hinzu. |
Operator Mutation µₒ | Topologie-Änderung | Fügt Operatoren hinzu (O_add) oder entfernt sie (O_del), verdrahtet Topologie neu. Ändert die strukturelle Komposition des Workflows. |
| Niching Selection | Diversitätserhalt | Identifiziert eine “Niching Area” (ähnliche Workflows nach Tag + Cost), evaluiert alle Kandidaten, eliminiert den schlechtesten via Pareto-Dominanz-preserving Binary Indicator. Verhindert Populations-Kollaps auf einen einzigen Workflow-Typ. |
Seed Population: 4 initiale Workflow-Typen (Reflective Agent I/O, Arithmetic Collaborator, Lightweight Programmer, Advanced Multi-programmer).
Benchmark-Ergebnisse
Homogene Settings (alle Methoden nutzen gpt-4o-mini):
| Methode | GSM8K | MATH | MultiArith | HumanEval | MBPP | ALFWorld | Avg. |
|---|---|---|---|---|---|---|---|
| Vanilla | 87,45 | 46,29 | 96,85 | 87,08 | 71,83 | 38,71 | 71,37 |
| CoT | 87,10 | 46,40 | 96,31 | 88,13 | 71,83 | 39,92 | 71,62 |
| SC (CoT×5) | 87,57 | 47,91 | 96,58 | 88,60 | 73,60 | 40,55 | 72,47 |
| LLM-Debate | 89,47 | 48,54 | 97,33 | 88,68 | 70,29 | 44,68 | 73,17 |
| DyLAN | 89,98 | 48,63 | 97,12 | 90,42 | 77,30 | 53,32 | 76,13 |
| AgentVerse | 89,91 | 47,35 | 97,50 | 89,29 | 74,28 | 45,03 | 73,89 |
| GPTSwarm | 89,14 | 47,88 | 96,79 | 89,32 | 77,43 | 53,19 | 75,63 |
| ADAS | 86,12 | 43,18 | 96,02 | 84,19 | 68,13 | 47,66 | 70,88 |
| AgentSquare | 87,62 | 48,51 | 97,77 | 89,08 | 78,46 | 66,42 | 78,14 |
| AFlow | 91,16 | 51,28 | 96,22 | 90,93 | 81,67 | 59,16 | 78,40 |
| EvoFlow | 92,90 | 57,70 | 98,80 | 92,85 | 84,50 | 68,57 | 82,55 |
Verbesserung gegenüber SOTA (AFlow): +1,23% bis +29,86% je nach Benchmark. Durchschnitt +4,15pp.
Heterogene Settings (MATH & MBPP, Open-Source LLM Pool):
| Methode | MATH Acc. | MATH Kosten ($) | MBPP pass@1 | MBPP Kosten ($) |
|---|---|---|---|---|
| Qwen-2.5-72b (Single) | 63,80% | 0,032 | 69,76% | 0,009 |
| o1-preview (Single) | 70,20% | 7,841 | 89,65% | 3,209 |
| AFlow_Qwen | 66,38% | 3,846 | 80,84% | 1,598 |
| DyLAN_Qwen | 64,17% | 16,864 | 75,63% | 8,973 |
| EvoFlow | 72,90% | 0,973 | 87,62% | 0,565 |
Zentrale Erkenntnisse:
- EvoFlow übertrifft o1-preview um +2,7% auf MATH bei nur 12,4% der Kosten
- Training-Kosten: $0,459 vs AFlow $1,223 (37,5%)
- Inferenz-Kosten: $0,513 vs AFlow $2,623 (19,5%)
- Nutzt ausschließlich Open-Source-Modelle: LLaMa-3.1-70b, Qwen-2.5-72b, Deepseek-V2.5, Hermes-3-70b
Cross-Domain Settings (MATH + MBPP kombiniertes Training):
| Methode | LLM | MATH (solo) | MBPP (solo) | MATH+MBPP →MATH | MATH+MBPP →MBPP |
|---|---|---|---|---|---|
| DyLAN | Deepseek | 46,20 | 80,13 | 43,85 ↓ | 78,62 ↓ |
| GPTSwarm | Deepseek | 45,36 | 77,52 | 39,18 ↓ | 74,09 ↓ |
| AFlow | Deepseek | 48,65 | 79,14 | 43,22 ↓ | 77,02 ↓ |
| EvoFlow | LLM Pool | 72,90 | 87,62 | 72,69 (≈) | 88,35 ↑ |
EvoFlow profitiert von Cross-Domain-Training (MBPP steigt 87,62→88,35), während alle Baselines degradieren. Das demonstriert echte komplexitäts-adaptive Routing-Fähigkeit.
Ablation Study — Was am meisten zählt
| Variante | MATH | MBPP | Einfluss |
|---|---|---|---|
| Vollständiges EvoFlow | ~73% | ~88% | — |
| w/o Tag-based Retrieval | ↓ signifikant | ↓ signifikant | Kritisch — hoher Varianzanstieg |
| w/o LLM Mutation | ↓ signifikant | ↓ signifikant | Kritisch — hoher Varianzanstieg |
| w/o Prompt Mutation | ↓ moderat | ↓ moderat | Wichtig, aber nicht kritisch |
| w/o Operator Mutation | ↓ 3,5–7,3% | ↓ 3,5–7,3% | Trägt messbare Gewinne bei |
Zentrale Erkenntnis: Tag-based Retrieval und LLM Mutation sind die beiden kritischsten Komponenten — das Entfernen einer der beiden verursacht sowohl Performance-Einbrüche ALS AUCH Varianzanstieg (unzuverlässige Ergebnisse).
Parameter-Sensitivität
| Parameter | Optimum | Beobachtung |
|---|---|---|
| K (Eltern-Anzahl) | 3 | Zu klein → geringe Offspring-Diversität; Zu groß → LLM kann nicht aggregieren |
| κ (Tags pro Workflow) | 5 | Stabiler Bereich |
| N (Populationsgröße) | 15 (Trade-off) | N=5→25: +3,1% Performance, aber Cost steigt von 8e-4 auf 2e-3 $/query |
Pareto-Front (evolvierte Population)
Die evolvierte Population bildet eine klare Pareto-Front:
- Einfachste Workflows: Basic I/O + Self-Refine → 38,7% Acc, $0,00018/query
- Mittlere Workflows: Multi-agent Debate → balancierte Kosten/Performance
- Komplexeste Workflows: Iterative Generation + Ensembling → 72,57% Acc, $0,0037/query
EvoFlow wählt zur Inferenz automatisch den passenden Workflow basierend auf Query-Komplexität.
Unique Capabilities vs. alle Baselines
| Feature | EvoFlow | AFlow | AgentSquare | ADAS | GPTSwarm |
|---|---|---|---|---|---|
| Prompt Optimize | ✔ | ✔ | ✗ | ✔ | ✗ |
| Agent Topology | ✔ | ✔ | ✔ | ✗ | ✔ |
| Agent Profile | ✔ | ✔ | ✔ | ✔ | ✗ |
| LLM Backbone (heterogen) | ✔ | ✗ | ✗ | ✗ | ✗ |
| Complexity Adaptivity | ✔ | ✗ | ✗ | ✗ | ✗ |
Limitationen (aus dem Paper abgeleitet)
- Operator-Repository ist manuell kuratiert: Die initialen 8 Operatoren werden manuell definiert; das System kann neue Topologien evolvieren, startet aber von Hand-designed Primitives.
- LLM-abhängige Mutation: Crossover und Mutation sind LLM-facilitated → Qualität hängt von der Fähigkeit des Orchestrator-LLMs ab, sinnvolle Workflows zu generieren.
- Evaluation auf begrenzten Domains: 6 Benchmarks (Math, Code, Embodied) — keine Evaluation auf Open-ended Generation, Long-Context Reasoning oder multimodale Tasks.
- Skalierbarkeit der Population: Jede Query erfordert Ausführung von Parents + Niching Area + Offspring (K + E + 1 Workflows). Bei N=15, K=3, E=5 werden pro Training-Query ~9 Workflows ausgeführt.
- Keine Evaluation mit stärkeren Closed-Source-Modellen im Pool: Heterogene Settings nur mit Open-Source-Modellen getestet.
- Cross-Domain-Generalisierung begrenzt getestet: Nur MATH+MBPP kombiniert, keine breitere Multi-Domain-Evaluation.
💡 Kernerkenntnis
EvoFlow schlägt o1-preview auf MATH bei nur 12,4% der Kosten — durch Evolution einer Population heterogener Workflows mit Open-Source-Modellen. Populationsdiversität + Multi-Objective-Selektion ist der Schlüsselmechanismus. Die Ablation bestätigt: Tag-based Retrieval und LLM Mutation sind die beiden nicht-verhandelbaren Komponenten.
MCE — Meta Context Engineering (2601.21557) — Skill-Koevolution
Kernidee: Context Engineering (CE) sollte nicht manuell designt werden. Stattdessen: Ein Meta-Agent, der CE-Skills automatisch verfeinert, während ein Base-Agent diese Skills ausführt.
Bi-Level-Architektur:
Meta-Agent (obere Ebene)
├── Deliberative Search über: historische Skills, Ausführungen, Evaluationen
├── "Agentic Crossover": Kombiniere erfolgreiche Skill-Elemente
└── Verfeinere Skills iterativ
Base-Agent (untere Ebene)
├── Führt Skills aus
├── Optimiert Context-Artefakte
└── Liefert Feedback an Meta-Agent
Ergebnisse: 5.6–53.8% Verbesserung über SOTA CE-Methoden (Mittel: 16.9%), getestet in 5 Domänen, offline + online.
💡 Kernerkenntnis
MCEs Bi-Level-Architektur (Meta-Agent verfeinert Skills, Base-Agent führt sie aus) ist direkt auf AgentField übertragbar: Unser Quality-Gate + Pulse liefern bereits das Evaluationssignal — wir müssen nur den Loop schließen.
AgentFactory (2603.18000) — Executable Subagent Accumulation
Kernidee: Erfolgreiche Lösungen nicht als Text-Erfahrung speichern (fragil), sondern als ausführbaren Subagent-Code (Python). Diese Subagents werden kontinuierlich verfeinert und wiederverwendet.
Schlüsselinnovation:
- Textuelle Reflexion (z.B. “nächstes Mal anders machen”) ≠ zuverlässige Re-Execution
- Ausführbarer Code = deterministisch wiederholbar
- Standardisierte Python-Dokumentation → Portabilität über Systeme hinweg
- Bibliothek wächst über Zeit → progressive Aufwandsreduktion
2. The Isomorphism Table: EvoFlow → MCE → AgentFactory → AgentField
Section Summary
A concept-by-concept mapping across all three papers and AgentField. Of 11 evolutionary concepts:
- ✅ 3 fully available — base unit, composition, system
- ⚠️ 4 partial — selection, retrieval, population, feedback
- ❌ 4 missing — mutation, crossover, diversity, cost awareness
⚠️ Gap Analysis
5 of 11 concepts are missing or partial: Mutation (not automated), Crossover (missing entirely), Diversity mechanism (none), Cost awareness (no tracking), and Feedback loop (open). These are the targets for Phases 1–3.
| Concept | EvoFlow | MCE | AgentFactory | AgentField | Status |
|---|---|---|---|---|---|
| Base Unit | Invoking Node (M, P, τ) | CE Skill | Executable Subagent (Python) | Skill (SKILL.md) | ✅ Available |
| Composition | Operator Node (node cluster) | Skill chain | Subagent library | Skill-Chain (max 4) | ✅ Available |
| Overall System | Workflow (DAG) | Meta+Base-Agent | Growing library | Orchestrator + Skills | ✅ Available |
| Selection | Niching (Pareto: Quality/Cost) | Evaluation + Deliberative Search | Execution Feedback | Quality-Gate | ⚠️ Quality only, no Cost |
| Mutation | LLM/Prompt/Operator Mutation | Skill refinement | Code refinement via feedback | Manual skill editing | ❌ Not automated |
| Crossover | Workflow recombination | “Agentic Crossover” | — | — | ❌ Missing entirely |
| Retrieval | Tag-based (κ tags/individual) | History-based | Code search in library | orchestrator-routing | ⚠️ Intent-based, not tag-based |
| Population | N workflows in parallel | Skill variants | Subagent library | ~30 Skills | ⚠️ Static, no evolution |
| Diversity | Niching Selection enforces diversity | Domain diversity | Natural growth | Manually curated | ❌ No diversity mechanism |
| Feedback Loop | Benchmark evaluation → Selection | Meta-Agent evaluation | Execution feedback → Refinement | Pulse metrics + Quality-Gate | ⚠️ Feedback exists, loop missing |
| Cost Awareness | Multi-Objective (Quality + Cost) | Efficiency as secondary goal | Effort reduction as goal | — | ❌ No token cost tracking |
Legend: ✅ = we have it, ⚠️ = partial, ❌ = missing
⚠️ Gap: Selection
Quality-Gate evaluates quality only. EvoFlow uses Pareto-optimal multi-objective selection (Quality + Cost). Adding cost as a second axis is Phase 1.
2. Die Isomorphie-Tabelle: EvoFlow → MCE → AgentFactory → AgentField
Abschnitt-Zusammenfassung
Ein Konzept-für-Konzept-Mapping über alle drei Papers und AgentField. Von 11 evolutionären Konzepten:
- ✅ 3 vollständig — Basiseinheit, Komposition, System
- ⚠️ 4 teilweise — Selektion, Retrieval, Population, Feedback
- ❌ 4 fehlen — Mutation, Crossover, Diversität, Cost-Awareness
⚠️ Gap-Analyse
5 von 11 Konzepten fehlen oder sind unvollständig: Mutation (nicht automatisiert), Crossover (fehlt komplett), Diversitäts-Mechanismus (keiner), Cost-Awareness (kein Tracking), Feedback-Loop (offen). Das sind die Ziele für Phase 1–3.
| Konzept | EvoFlow | MCE | AgentFactory | AgentField | Status |
|---|---|---|---|---|---|
| Basiseinheit | Invoking Node (M, P, τ) | CE Skill | Executable Subagent (Python) | Skill (SKILL.md) | ✅ Vorhanden |
| Komposition | Operator Node (Knoten-Cluster) | Skill-Kette | Subagent-Bibliothek | Skill-Chain (max 4) | ✅ Vorhanden |
| Gesamtsystem | Workflow (DAG) | Meta+Base-Agent | Wachsende Bibliothek | Orchestrator + Skills | ✅ Vorhanden |
| Selektion | Niching (Pareto: Quality/Cost) | Evaluation + Deliberative Search | Execution Feedback | Quality-Gate | ⚠️ Nur Quality, kein Cost |
| Mutation | LLM/Prompt/Operator-Mutation | Skill-Verfeinerung | Code-Refinement via Feedback | Manuelles Skill-Editing | ❌ Nicht automatisiert |
| Crossover | Workflow-Rekombination | “Agentic Crossover” | — | — | ❌ Fehlt komplett |
| Retrieval | Tag-based (κ Tags/Individual) | History-basiert | Code-Suche in Bibliothek | orchestrator-routing | ⚠️ Intent-basiert, nicht tag-basiert |
| Population | N Workflows parallel | Skill-Varianten | Subagent-Bibliothek | ~30 Skills | ⚠️ Statisch, keine Evolution |
| Diversität | Niching Selection erzwingt Diversität | Domänen-Diversität | Natürliches Wachstum | Manuell kuratiert | ❌ Kein Diversitäts-Mechanismus |
| Feedback-Loop | Benchmark-Evaluation → Selection | Meta-Agent-Evaluation | Execution Feedback → Refinement | Pulse-Metriken + Quality-Gate | ⚠️ Feedback existiert, Loop fehlt |
| Cost-Awareness | Multi-Objective (Quality + Cost) | Effizienz als Nebenziel | Aufwandsreduktion als Ziel | — | ❌ Kein Token-Cost-Tracking |
Legende: ✅ = haben wir, ⚠️ = teilweise, ❌ = fehlt
⚠️ Gap: Selektion
Quality-Gate bewertet nur Qualität. EvoFlow nutzt Pareto-optimale Multi-Objective-Selektion (Quality + Cost). Cost als zweite Achse hinzuzufügen ist Phase 1.
3. What AgentField Can Concretely Learn
Section Summary
Four concrete upgrade steps, ordered by effort:
- Cost tracking via Pulse — low effort, immediate
- Tag-based skill retrieval — medium effort
- Automated skill mutation via MCE pattern — high effort
- Crossover for new skills from existing ones — requires mutation first
3.1 Immediately Actionable: Cost Tracking as a Second Optimization Axis
EvoFlow Insight: Multi-Objective (Quality + Cost) is better than Single-Objective (Quality only).
For Us: Pulse already tracks token usage. Integrate this signal as a second axis in Quality-Gate:
- Quality score ≥ threshold AND token cost ≤ budget → Keep
- Alternatively: Pareto visualization per skill (Quality vs. Cost)
Effort: Low. Pulse data exists, Quality-Gate just needs to be extended.
3.2 Short-term: Tag-based Skill Retrieval
EvoFlow Insight: Tag-based Retrieval (κ tags per workflow) is better than free text search for skill selection. Ablation: -3–4% performance without tags.
For Us: orchestrator-routing works intent-based (text classification). Supplement with structured tags:
- Each skill gets tags:
domain,complexity,input-type,output-type - Routing uses tag matching as first stage, intent classification as fallback
Effort: Medium. Define tag schema, tag skills, adjust routing.
3.3 Medium-term: Automated Skill Mutation
MCE Insight: A Meta-Agent can automatically refine skills through “deliberative search over history of skills, executions, and evaluations.”
AgentFactory Insight: Executable code > textual reflection for reliable reuse.
For Us: Our skills ARE already semi-structured code (SKILL.md with workflow steps). The step toward automatic mutation:
- Prompt Mutation: Small variations in skill instructions, A/B test via Pulse metrics
- Workflow Mutation: Add/remove steps, change ordering
- Evaluation: Quality-Gate + token cost as fitness
Effort: High. Requires: skill versioning, A/B testing infrastructure, automatic evaluation.
3.4 Long-term: Crossover — Combining Existing Skills into New Ones
EvoFlow Insight: Crossover (combining elements of two parent workflows) produces solutions that neither parent could achieve alone.
MCE Insight: “Agentic Crossover” = not random recombination, but deliberative search over historical successes.
For Us: Could we, for example, cross research-pipeline + challenge into an “adversarial research” skill? Or express + signal-check into a “self-checked writing” skill?
Prerequisite: Skill mutation must work first. Crossover is the next step.
3. Was AgentField konkret lernen kann
Abschnitt-Zusammenfassung
Vier konkrete Upgrade-Schritte, sortiert nach Aufwand:
- Cost-Tracking via Pulse — geringer Aufwand, sofort machbar
- Tag-basiertes Skill-Retrieval — mittlerer Aufwand
- Automatisierte Skill-Mutation via MCE-Pattern — hoher Aufwand
- Crossover für neue Skills aus bestehenden — setzt Mutation voraus
3.1 Sofort umsetzbar: Cost-Tracking als zweite Optimierungsachse
EvoFlow-Insight: Multi-Objective (Quality + Cost) ist besser als Single-Objective (nur Quality).
Für uns: Pulse trackt bereits Token-Verbrauch. Dieses Signal als zweite Achse in Quality-Gate einbauen:
- Quality-Score ≥ Schwelle UND Token-Cost ≤ Budget → Keep
- Alternativ: Pareto-Darstellung pro Skill (Quality vs. Cost)
Aufwand: Gering. Pulse-Daten existieren, Quality-Gate muss nur erweitert werden.
3.2 Kurzfristig: Tag-basiertes Skill-Retrieval
EvoFlow-Insight: Tag-based Retrieval (κ Tags pro Workflow) ist besser als freie Textsuche für Skill-Auswahl. Ablation: -3-4% Performance ohne Tags.
Für uns: orchestrator-routing arbeitet intent-basiert (Textklassifikation). Ergänzung um strukturierte Tags:
- Jeder Skill bekommt Tags:
domain,complexity,input-type,output-type - Routing nutzt Tag-Matching als erste Stufe, Intent-Klassifikation als Fallback
Aufwand: Mittel. Tag-Schema definieren, Skills taggen, Routing anpassen.
3.3 Mittelfristig: Automatisierte Skill-Mutation
MCE-Insight: Ein Meta-Agent kann Skills durch “deliberative search over history of skills, executions, and evaluations” automatisch verfeinern.
AgentFactory-Insight: Executable Code > textuelle Reflexion für zuverlässige Wiederverwendung.
Für uns: Unsere Skills SIND bereits semi-strukturierter Code (SKILL.md mit Workflow-Steps). Der Schritt zu automatischer Mutation:
- Prompt-Mutation: Kleine Variationen in Skill-Anweisungen, A/B-Test über Pulse-Metriken
- Workflow-Mutation: Schritte hinzufügen/entfernen, Reihenfolge ändern
- Bewertung: Quality-Gate + Token-Cost als Fitness
Aufwand: Hoch. Braucht: Skill-Versionierung, A/B-Test-Infrastruktur, automatische Evaluation.
3.4 Langfristig: Crossover — neue Skills aus bestehenden kombinieren
EvoFlow-Insight: Crossover (Elemente zweier Eltern-Workflows kombinieren) erzeugt Lösungen, die keiner der Eltern allein erreicht hätte.
MCE-Insight: “Agentic Crossover” = nicht zufällige Rekombination, sondern deliberative Search über historische Erfolge.
Für uns: Könnten wir z.B. research-pipeline + challenge zu einem “adversarial research” Skill kreuzen? Oder express + signal-check zu einem “self-checked writing” Skill?
Voraussetzung: Skill-Mutation muss erst funktionieren. Crossover ist der nächste Schritt.
4. The Nowak Bridge — Formal Correspondences
Section Summary
Mapping EvoFlow concepts to Nowak's evolutionary dynamics reveals that AgentField is in the "Prelife" phase: selection exists (Quality-Gate), but replication (automatic inheritance of successful patterns) is missing. The phase transition to "Life" requires closing exactly that gap.
| Nowak | EvoFlow | AgentField (Current) | AgentField (Target) |
|---|---|---|---|
| Sequence/Replicator | Invoking Node (M,P,τ) | Skill (SKILL.md) | Skill with version + tags |
| Fitness fᵢ | Utility u(G,q) | Quality score | Quality + Cost (Pareto) |
| Mutation | 3 mutation types | Manual editing | Automated skill mutation |
| Selection (φ) | Niching (Pareto) | Quality-Gate (binary) | Multi-Objective Niching |
| Replication | Seed population → inheritance | SKILL.md remains static | Skill versioning + inheritance |
| Population size | N workflows | ~30 Skills | 30+ Skills, actively managed |
| Error Threshold | Max. mutations/iteration | Deviation Rules (max 3 fixes) | Formalized: 1 change/iteration |
| Prelife → Life | Seed → evolved population | Manual → ? | Manual → automatic |
| Cooperation (Γ) | Workflow composition | Skill chains | Measurable cooperation gains |
Key Observation: AgentField is currently in the “Prelife” phase according to Nowak’s terminology — there is diversity and selection (Quality-Gate), but no replication (automatic inheritance of successful patterns). EvoFlow shows what the phase transition to “Life” could look like.
4. Die Nowak-Brücke — formale Entsprechungen
Abschnitt-Zusammenfassung
Das Mapping von EvoFlow-Konzepten auf Nowaks Evolutionsdynamik zeigt: AgentField ist in der "Prelife"-Phase. Selektion existiert (Quality-Gate), aber Replikation (automatische Vererbung erfolgreicher Muster) fehlt. Der Phasenübergang zu "Life" erfordert genau diese Lücke zu schließen.
| Nowak | EvoFlow | AgentField (Ist) | AgentField (Soll) |
|---|---|---|---|
| Sequenz/Replikator | Invoking Node (M,P,τ) | Skill (SKILL.md) | Skill mit Version + Tags |
| Fitness fᵢ | Utility u(G,q) | Quality-Score | Quality + Cost (Pareto) |
| Mutation | 3 Mutations-Typen | Manuelles Editing | Automatisierte Skill-Mutation |
| Selektion (φ) | Niching (Pareto) | Quality-Gate (binary) | Multi-Objective Niching |
| Replikation | Seed-Population → Vererbung | SKILL.md bleibt statisch | Skill-Versionierung + Vererbung |
| Populationsgröße | N Workflows | ~30 Skills | 30+ Skills, aktiv gemanagt |
| Error Threshold | Max. Mutationen/Iteration | Deviation Rules (max 3 Fixes) | Formalisiert: 1 Änderung/Iteration |
| Prelife → Life | Seed → evolvierte Population | Manuell → ? | Manuell → automatisch |
| Kooperation (Γ) | Workflow-Komposition | Skill-Chains | Messbare Kooperationsgewinne |
Schlüsselbeobachtung: AgentField befindet sich derzeit in der “Prelife”-Phase nach Nowaks Terminologie — es gibt Diversität und Selektion (Quality-Gate), aber keine Replikation (automatische Vererbung erfolgreicher Muster). EvoFlow zeigt, wie der Phasenübergang zu “Life” aussehen könnte.
5. The Upgrade Path: Prelife → Life for AgentField
Section Summary
A concrete 5-phase roadmap from manual skill curation (Phase 0 / Prelife) to full evolutionary skill management (Phase 4 / Life). Each phase builds on the previous. The critical transition point rₓ is where evolution outpaces manual curation.
💡 The Roadmap
4 phases: (0) Manual curation → (1) Close the feedback loop → (2) Automated mutation → (3) Population management. Each phase builds on the previous one. Phase 1 is specified and ready to implement.
graph TD
P0["🧪 Phase 0: Prelife<br/><small>Manual skill curation<br/>Quality-Gate without consequences<br/>Metrics collected but unused</small>"]
P1["📊 Phase 1: Feedback Loop<br/><small>Cost tracking (2nd axis)<br/>Tag-based retrieval<br/>Skill performance history</small>"]
P2["🧬 Phase 2: Automated Mutation<br/><small>A/B prompt testing<br/>MCE: Meta-Agent proposes changes<br/>Human-in-the-Loop selection</small>"]
P3["🌿 Phase 3: Population Management<br/><small>Multiple skill variants per task<br/>Niching: preserve diversity<br/>Crossover: combine skills</small>"]
P4["🔄 Phase 4: Full Evolution = Life<br/><small>Autonomous cycle: Select → Execute → Evaluate → Mutate<br/>r > rₓ: evolution outpaces manual curation</small>"]
P0 -->|"Measure everything"| P1
P1 -->|"Let metrics drive changes"| P2
P2 -->|"Scale variation"| P3
P3 -->|"r > rₓ: phase transition"| P4
style P0 fill:#1e1e2e,stroke:#f87171,color:#f87171
style P1 fill:#1e1e2e,stroke:#f59e0b,color:#f59e0b
style P2 fill:#1e1e2e,stroke:#38bdf8,color:#38bdf8
style P3 fill:#1e1e2e,stroke:#a78bfa,color:#a78bfa
style P4 fill:#1e1e2e,stroke:#34d399,color:#34d399
5. Der Upgrade-Pfad: Prelife → Life für AgentField
Abschnitt-Zusammenfassung
Eine konkrete 5-Phasen-Roadmap von manueller Skill-Kuration (Phase 0 / Prelife) zu vollem evolutionärem Skill-Management (Phase 4 / Life). Jede Phase baut auf der vorherigen auf. Der kritische Übergangspunkt rₓ ist dort, wo Evolution manuelle Kuration überholt.
💡 Die Roadmap
4 Phasen: (0) Manuelle Kuration → (1) Feedback-Loop schließen → (2) Automatisierte Mutation → (3) Populationsmanagement. Jede Phase baut auf der vorherigen auf. Phase 1 ist spezifiziert und implementierungsbereit.
graph TD
P0["🧪 Phase 0: Prelife<br/><small>Manuelle Skill-Kuration<br/>Quality-Gate ohne Konsequenz<br/>Metriken gesammelt aber ungenutzt</small>"]
P1["📊 Phase 1: Feedback-Loop<br/><small>Cost-Tracking (2. Achse)<br/>Tag-basiertes Retrieval<br/>Skill-Performance-Historie</small>"]
P2["🧬 Phase 2: Automatisierte Mutation<br/><small>A/B Prompt-Testing<br/>MCE: Meta-Agent schlägt Änderungen vor<br/>Human-in-the-Loop Selektion</small>"]
P3["🌿 Phase 3: Population Management<br/><small>Mehrere Skill-Varianten pro Task<br/>Niching: Diversität erhalten<br/>Crossover: Skills kombinieren</small>"]
P4["🔄 Phase 4: Full Evolution = Life<br/><small>Autonomer Zyklus: Auswahl → Ausführung → Evaluation → Mutation<br/>r > rₓ: Evolution überholt manuelle Kuration</small>"]
P0 -->|"Alles messen"| P1
P1 -->|"Metriken steuern Änderungen"| P2
P2 -->|"Variation skalieren"| P3
P3 -->|"r > rₓ: Phasenübergang"| P4
style P0 fill:#1e1e2e,stroke:#f87171,color:#f87171
style P1 fill:#1e1e2e,stroke:#f59e0b,color:#f59e0b
style P2 fill:#1e1e2e,stroke:#38bdf8,color:#38bdf8
style P3 fill:#1e1e2e,stroke:#a78bfa,color:#a78bfa
style P4 fill:#1e1e2e,stroke:#34d399,color:#34d399
6. Critical Assessment
Section Summary
None of the three papers addresses persistent memory, human-in-the-loop feedback, or rich project context — all of which AgentField already has. This puts us at a unique intersection: population diversity (EvoFlow) + skill co-evolution (MCE) + persistent memory + human feedback.
What EvoFlow Does NOT Address (and We Must Consider)
- No persistent memory: EvoFlow evaluates on benchmarks, has no session history. AgentField’s memory system is a fundamental advantage.
- No Human-in-the-Loop: EvoFlow is fully automatic. Our human feedback is a richer signal than automatic evaluation.
- No context: EvoFlow works on isolated tasks. Real-world agent projects have rich context (project state, persistent memory, long-term knowledge stores).
- Benchmark ≠ Real World: EvoFlow’s 29.86% improvement is on MATH/GSM8K. Whether this transfers to open, creative tasks (our core domain) remains unclear.
What MCE Does NOT Address
- No population diversity: MCE co-evolves ONE skill set, not a population. No niching.
- No cost optimization: Only performance as a goal.
What AgentFactory Does NOT Address
- Code tasks only: Executable subagents = Python code. Our skills are broader (Research, Writing, Analysis).
- No composition: Subagents are accumulated individually, not combined.
Our Unique Position
AgentField sits at an intersection that none of the three papers covers:
- Population diversity (EvoFlow) + Skill co-evolution (MCE) + Persistent Memory (us)
- Additionally: Human-in-the-Loop as the richest feedback signal
- And: Project context as information that none of the benchmark papers have
6. Kritische Einordnung
Abschnitt-Zusammenfassung
Keines der drei Papers adressiert persistentes Gedächtnis, Human-in-the-Loop-Feedback oder reichen Projektkontext — alles Dinge, die AgentField bereits hat. Das positioniert uns an einer einzigartigen Kreuzung: Populationsdiversität (EvoFlow) + Skill-Koevolution (MCE) + persistentes Gedächtnis + menschliches Feedback.
Was EvoFlow NICHT adressiert (und wir berücksichtigen müssen)
- Kein persistentes Gedächtnis: EvoFlow evaluiert auf Benchmarks, hat keine Session-History. AgentField’s memory-System ist ein fundamentaler Vorteil.
- Keine Human-in-the-Loop: EvoFlow ist voll automatisch. Unser Human-Feedback ist reicheres Signal als automatische Evaluation.
- Kein Kontext: EvoFlow arbeitet auf isolierten Tasks. Reale Agent-Projekte haben reichen Kontext (Projekt-State, persistentes Memory, Langzeit-Wissenspeicher).
- Benchmark ≠ Real World: EvoFlow’s 29.86% Verbesserung ist auf MATH/GSM8K. Ob das auf offene, kreative Aufgaben (unsere Kerndomäne) übertragbar ist: unklar.
Was MCE NICHT adressiert
- Keine Populationsdiversität: MCE ko-evolviert EIN Skill-Set, nicht eine Population. Kein Niching.
- Keine Cost-Optimierung: Nur Performance als Ziel.
Was AgentFactory NICHT adressiert
- Nur Code-Aufgaben: Executable Subagents = Python-Code. Unsere Skills sind breiter (Research, Writing, Analysis).
- Keine Komposition: Subagents werden einzeln akkumuliert, nicht kombiniert.
Unsere einzigartige Position
Agent systems sit an einer Kreuzung, die keines der drei Papers abdeckt:
- Populationsdiversität (EvoFlow) + Skill-Koevolution (MCE) + Persistent Memory (wir)
- Dazu: Human-in-the-Loop als reichstes Feedback-Signal
- Und: Projektkontext als Information, die keines der Benchmark-Papers hat
💡 Kernerkenntnis
AgentFields einzigartiger Vorteil ist die Kombination dessen, was keines der Papers hat: persistentes Gedächtnis + menschliches Feedback + Projektkontext. Wenn wir die evolutionären Mechanismen von EvoFlow/MCE hinzufügen, entsteht etwas genuinely Neues — keine bloße Re-Implementierung.