🧬 Evolve Your Hermes Agent: A Practical Guide to Self-Improving AI

TL;DR

You can automatically improve your Hermes Agent's skills, prompts, and tools using evolutionary algorithms. No GPU is needed, only API calls (roughly $2–10 per run). This guide shows how to run the self-evolution pipeline on your existing Ubuntu VPS alongside Claude Code and Gemini CLI, using real execution traces to drive targeted improvements.

Context

You're running:

A Hermes Agent instance on Ubuntu
Claude Code CLI (Anthropic) and Gemini CLI (Google) on the same Ubuntu host

These CLIs generate valuable execution traces — real-world usage logs showing what worked, what failed, and why failures happened. The Hermes Agent Self-Evolution project uses these traces to drive evolutionary optimization of:

Skill files (SKILL.md)
Tool descriptions
System prompts
(Later) tool implementation code

The key insight: failure traces contain more signal than success traces. The GEPA optimizer reads why things fail and proposes targeted mutations, then evaluates each candidate against three gates: the test suite, size limits, and semantic-preservation checks.

The Approach

How the Evolution Pipeline Works

Execution traces (from Claude Code, Gemini, Hermes)
        │
        ▼
GEPA Optimizer ──► Mutates prompts/skills based on failure reasons
        │
        ▼
Candidate variants (e.g., 10 mutated skill descriptions)
        │
        ▼
Constraint gates:
  • Full test suite (pytest)
  • Size limits (≤15KB for skills)
  • Semantic preservation
        │
        ▼
Best variant ──► Creates PR for hermes-agent repo
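
The constraint gates in the diagram can be pictured as a chain of pass/fail checks. What follows is a minimal Python sketch under stated assumptions, not the project's actual implementation: GateResult, size_gate, make_pytest_gate, make_required_terms_gate, and run_gates are hypothetical names, and the required-terms check is only a crude stand-in for whatever semantic-preservation check the pipeline really uses. Only the 15KB limit and the use of pytest come from the text above.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Sequence

MAX_SKILL_BYTES = 15 * 1024  # the 15KB skill limit from the diagram


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


Gate = Callable[[str], GateResult]


def size_gate(candidate: str) -> GateResult:
    """Fail candidates whose UTF-8 encoding exceeds the size limit."""
    size = len(candidate.encode("utf-8"))
    return GateResult("size", size <= MAX_SKILL_BYTES, f"{size} bytes")


def make_required_terms_gate(terms: Sequence[str]) -> Gate:
    """Crude stand-in for semantic preservation: key terms must survive."""
    def gate(candidate: str) -> GateResult:
        lowered = candidate.lower()
        missing = [t for t in terms if t.lower() not in lowered]
        return GateResult("required-terms", not missing, f"missing: {missing}")
    return gate


def make_pytest_gate(repo_dir: str) -> Gate:
    """Run the repo's pytest suite; exit code 0 means every test passed.

    Assumes the candidate has already been written into repo_dir.
    """
    def gate(candidate: str) -> GateResult:
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=repo_dir, capture_output=True, text=True,
        )
        return GateResult("pytest", proc.returncode == 0, proc.stdout[-200:])
    return gate


def run_gates(candidate: str, gates: Sequence[Gate]) -> List[GateResult]:
    """Run gates in order, stopping at the first failure."""
    results: List[GateResult] = []
    for gate in gates:
        result = gate(candidate)
        results.append(result)
        if not result.passed:
            break
    return results


def passes_all(results: Sequence[GateResult]) -> bool:
    return bool(results) and all(r.passed for r in results)


# Cheap checks first; the pytest gate is left out so this demo has no side effects.
demo_gates = [size_gate, make_required_terms_gate(["code review"])]
print(passes_all(run_gates("# Code Review\nReview the diff carefully.", demo_gates)))
```

Ordering the gates from cheapest to most expensive means an oversized candidate is rejected before the test suite is ever started.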

On Your VPS: Step-by-Step

1. Install Hermes Agent Self-Evolution

# Clone the repo
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution

# Install with dev dependencies
pip install -e ".[dev]"

# Point at your existing hermes-agent repo
export HERMES_AGENT_REPO=/path/to/your/hermes-agent

2. Import real execution traces from Claude Code & Gemini

# The repo includes importers for:
# - Claude Code sessions
# - Gemini CLI history
# - Hermes' own history
# Adjust the paths below to wherever your CLI versions store session data.

python -m evolution.importers.import_claude_sessions \
    --claude-log-dir ~/.claude/logs \
    --output ./datasets/session_db

# Similarly for Gemini
python -m evolution.importers.import_gemini_sessions \
    --gemini-history ~/.gemini/history.json \
    --output ./datasets/session_db
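
To make "execution trace" concrete, here is a small Python sketch of what a normalized trace record could look like once sessions from different CLIs are imported. The on-disk format the importers write is not shown in this guide, so the JSONL layout, the field names (skill, success, failure_reason), and the names TraceRecord, parse_jsonl, and split_by_outcome are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class TraceRecord:
    source: str                          # e.g. "claude-code", "gemini-cli", "hermes"
    skill: str
    success: bool
    failure_reason: Optional[str] = None


def parse_jsonl(lines: Iterable[str], source: str) -> List[TraceRecord]:
    """Parse one JSON object per line (hypothetical schema); skip blank lines."""
    records: List[TraceRecord] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        records.append(TraceRecord(
            source=source,
            skill=raw["skill"],
            success=bool(raw["success"]),
            failure_reason=raw.get("failure_reason"),
        ))
    return records


def split_by_outcome(
    records: Iterable[TraceRecord],
) -> Tuple[List[TraceRecord], List[TraceRecord]]:
    """Return (failures, successes); failures carry the optimizer's main signal."""
    records = list(records)
    failures = [r for r in records if not r.success]
    successes = [r for r in records if r.success]
    return failures, successes


sample_lines = [
    '{"skill": "github-code-review", "success": true}',
    '{"skill": "github-code-review", "success": false, '
    '"failure_reason": "tool description too vague"}',
    "",
]
records = parse_jsonl(sample_lines, source="claude-code")
failures, successes = split_by_outcome(records)
print(len(failures), len(successes))
```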

3. Run skill evolution using real session data

# Evolve a specific skill using imported session history
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 10 \
    --eval-source sessiondb \
    --session-db ./datasets/session_db

The optimizer will:

Read execution traces where the github-code-review skill failed or succeeded
Generate 10–20 mutated variants of the skill's SKILL.md
Evaluate each variant against your tests
Select the best performer that passes all guardrails
Open a PR against your hermes-agent repo
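
The steps above amount to a propose, gate, score, select loop. The sketch below shows that loop shape in Python with toy stand-ins; it is a generic greedy search, not GEPA itself. The function evolve, the toy_* helpers, and the KEYWORDS scoring rule are hypothetical, and in a real run the proposal step would be an LLM call informed by failure traces while the score would come from test results.

```python
from typing import Callable, List, Tuple

KEYWORDS = ("tests", "security", "style")  # toy stand-in for real evaluation criteria


def evolve(
    seed: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    passes_gates: Callable[[str], bool],
    iterations: int,
    variants_per_iteration: int,
) -> Tuple[str, float]:
    """Greedy loop: keep the best-scoring candidate that passes every gate."""
    best, best_score = seed, score(seed)
    for _ in range(iterations):
        for candidate in propose(best, variants_per_iteration):
            if not passes_gates(candidate):
                continue  # gated-out candidates are never scored
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


def toy_propose(parent: str, n: int) -> List[str]:
    """Toy mutation: append an instruction for up to n missing keywords."""
    missing = [k for k in KEYWORDS if k not in parent]
    return [f"{parent} Check {k}." for k in missing[:n]]


def toy_score(text: str) -> float:
    """Toy fitness: fraction of keywords the text mentions."""
    return sum(k in text for k in KEYWORDS) / len(KEYWORDS)


def toy_gate(text: str) -> bool:
    """Toy gate: enforce the 15KB size limit."""
    return len(text.encode("utf-8")) <= 15 * 1024


best, best_score = evolve(
    "Review the diff.", toy_propose, toy_score, toy_gate,
    iterations=3, variants_per_iteration=2,
)
print(best, best_score)
```

Because a candidate only replaces the current best when its score is strictly higher, the loop can end with the seed unchanged, which is the right outcome when no mutation helps.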

4. (Optional) Try synthetic evaluation first

# Quick test without real traces
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 5 \
    --eval-source synthetic

Why It Worked

1. Failure-driven mutation is more efficient than random search

GEPA doesn't blindly mutate — it parses execution traces to understand root causes:

"Tool description too vague → model didn't call it"
"Skill exceeds context limit → got truncated"
"Example format inconsistent → parsing failed"

Then it mutates specifically to address those causes.
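
One way to picture "mutating to address causes" is to aggregate failure reasons and hand the most frequent ones to the model that proposes edits. The Python sketch below does only that aggregation and prompt assembly. It is an illustration, not GEPA's actual prompt or parsing logic, and summarize_failures, build_reflection_prompt, and the prompt wording are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_failures(reasons: Iterable[str], top_k: int = 3) -> List[Tuple[str, int]]:
    """Normalize free-text reasons and return the top_k most frequent."""
    normalized = (r.strip().lower() for r in reasons if r and r.strip())
    return Counter(normalized).most_common(top_k)


def build_reflection_prompt(skill_text: str, summary: List[Tuple[str, int]]) -> str:
    """Assemble a prompt asking for an edit that targets the listed causes."""
    lines = [
        "The skill below failed in real sessions.",
        "Most frequent failure reasons:",
    ]
    for reason, count in summary:
        lines.append(f"- {reason} (seen {count}x)")
    lines += [
        "",
        "Propose a minimal edit that addresses these causes",
        "without changing the skill's purpose.",
        "",
        "--- SKILL.md ---",
        skill_text,
    ]
    return "\n".join(lines)


reasons = [
    "Tool description too vague",
    "Skill exceeds context limit",
    "tool description too vague ",
    "Example format inconsistent",
    "Tool description too vague",
    "Skill exceeds context limit",
]
summary = summarize_failures(reasons)
prompt = build_reflection_prompt("# github-code-review (placeholder body)", summary)
print(prompt)
```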

2. API-driven means no GPU bottleneck

Your Ubuntu VPS doesn't need expensive hardware. Each run:

Calls Claude/Gemini APIs for mutations (pennies)
Evaluates variants using existing test suite (CPU only)
Total cost: $2–10 per optimization run
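
As a sanity check on the per-run figure, the arithmetic can be sketched in a few lines of Python. Every number below (calls per run, tokens per call, and especially the per-million-token prices) is a made-up placeholder, not any provider's actual rate; substitute your own provider's pricing and your observed token counts.

```python
def estimate_run_cost(
    iterations: int,
    variants_per_iteration: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate API spend for one run, assuming one LLM call per variant."""
    calls = iterations * variants_per_iteration
    per_call_micro_usd = (
        input_tokens_per_call * usd_per_million_input
        + output_tokens_per_call * usd_per_million_output
    )
    return calls * per_call_micro_usd / 1_000_000


# Placeholder inputs: 10 iterations x 15 variants, 6k tokens in, 2k tokens out per call.
cost = estimate_run_cost(
    iterations=10,
    variants_per_iteration=15,
    input_tokens_per_call=6_000,
    output_tokens_per_call=2_000,
    usd_per_million_input=3.0,    # placeholder price
    usd_per_million_output=15.0,  # placeholder price
)
print(f"${cost:.2f}")
```

With these placeholder inputs the estimate lands inside the $2–10 range quoted above. Runs that also use LLM calls during evaluation would cost more than this one-call-per-variant model suggests.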

3. Multiple trace sources create richer evolution

Claude Code, Gemini CLI, and Hermes Agent each produce traces with different characteristics, and therefore different failure modes:

Claude: verbose, good at multi-step planning
Gemini: fast, sometimes truncates
Hermes: function-calling specific

Combining traces from all three sources covers more failure modes than any single source.

4. Guardrails prevent regressions

Every candidate must pass:

Full pytest suite (your existing tests)
Size limits (prevents bloat)
Semantic drift checks (no changing skill's core purpose)

This means an accepted variant does not regress on the metrics the gates measure; it is not merely different. The gates cannot judge intent, so the generated PR still needs human review.

Apply It Yourself

Prerequisites

Ubuntu VPS (20.04+)
Hermes Agent installed and tested
Claude Code CLI + Gemini CLI (optional but valuable)
An API key for at least one LLM provider (OpenAI, Anthropic, or Google Gemini)
Python 3.10+

Configuration Checklist

# 1. Set an API key for at least one provider (used for mutation and evaluation)
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
# or
export GOOGLE_API_KEY="..."

# 2. Point at your hermes-agent repo
export HERMES_AGENT_REPO="$HOME/hermes-agent"

# 3. Run a single skill evolution as a test
python -m evolution.skills.evolve_skill --skill your-skill-name --iterations 3 --eval-source synthetic

# 4. Examine the generated candidates
ls evolution/reports/latest/candidates/
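
Before the first run, a small preflight check can catch a missing key or a wrong repo path. The Python sketch below only inspects the environment variables named in the checklist; preflight and API_KEY_VARS are hypothetical names and are not part of the tool.

```python
import os
from pathlib import Path
from typing import List, Mapping

API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def preflight(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("no API key set: export one of " + ", ".join(API_KEY_VARS))
    repo = env.get("HERMES_AGENT_REPO")
    if not repo:
        problems.append("HERMES_AGENT_REPO is not set")
    elif not Path(repo).expanduser().is_dir():
        problems.append(f"HERMES_AGENT_REPO is not a directory: {repo}")
    return problems


# Check the real environment and report what is missing, if anything.
for problem in preflight(os.environ):
    print("preflight:", problem)
```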

Typical Workflow for Your Team

Run evolution weekly against the last 7 days of Claude Code and Gemini traces
Have a human review the generated PR (the guardrails ensure it passes tests, but the semantics still need a human check)
Merge the evolved skill back into your main branch
Measure the improvement with before/after benchmarks (the tool generates comparison reports)
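
For the weekly run, the trace set has to be narrowed to a 7-day window. How the tool itself selects a date range is not covered in this guide, so the Python sketch below shows only the filtering logic on ISO 8601 timestamps; recent_timestamps is a hypothetical helper and the timestamps are invented examples.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def recent_timestamps(timestamps: Iterable[str], now: datetime, days: int = 7) -> List[str]:
    """Keep timestamps that fall within the last `days` days, inclusive."""
    cutoff = now - timedelta(days=days)
    return [ts for ts in timestamps if cutoff <= datetime.fromisoformat(ts) <= now]


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
stamps = [
    "2025-01-14T09:30:00+00:00",  # about one day old: kept
    "2025-01-08T12:00:00+00:00",  # exactly seven days old: kept (inclusive cutoff)
    "2025-01-01T00:00:00+00:00",  # two weeks old: dropped
]
print(recent_timestamps(stamps, now))
```

Using timezone-aware timestamps throughout avoids off-by-hours errors when the VPS clock and the CLI logs use different zones.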

Expected Outcomes

Metric	Typical Improvement
Skill success rate	+15–30%
Token usage	10–20% lower (shorter, more precise prompts)
Edge case handling	+40–60% (from failure traces)
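
When you check your own results against a table like this, it matters whether a gain is reported in percentage points or as a relative change. The Python sketch below computes both from before/after counts. BenchmarkRun and compare are hypothetical names, the numbers are invented for illustration, and none of this reflects the format of the tool's comparison reports.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BenchmarkRun:
    successes: int
    attempts: int
    total_tokens: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts

    @property
    def tokens_per_attempt(self) -> float:
        return self.total_tokens / self.attempts


def compare(before: BenchmarkRun, after: BenchmarkRun) -> Dict[str, float]:
    """Report success-rate change (points and relative) and token change."""
    return {
        "success_rate_points": (after.success_rate - before.success_rate) * 100,
        "success_rate_relative_pct": (after.success_rate / before.success_rate - 1) * 100,
        "tokens_per_attempt_pct": (after.tokens_per_attempt / before.tokens_per_attempt - 1) * 100,
    }


# Invented example: 60/100 successes before, 75/100 after, with fewer tokens used.
before = BenchmarkRun(successes=60, attempts=100, total_tokens=200_000)
after = BenchmarkRun(successes=75, attempts=100, total_tokens=170_000)
report = compare(before, after)
print({name: round(value, 1) for name, value in report.items()})
```

In this invented example the same change reads as 15 points or 25% relative, so state which one you mean when reporting results.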

Source

GitHub Repository: NousResearch/hermes-agent-self-evolution
Research Paper: ICLR 2026 Oral (see repo for citation)
Full Architecture: PLAN.md in the repo