Hermes Agent: Self-Improving Guide
Automatically improve your Hermes Agent's skills, prompts, and tools using evolutionary algorithms driven by execution traces, requiring only API calls and no GPU.
🧬 Evolve Your Hermes Agent: A Practical Guide to Self-Improving AI
TL;DR
You can automatically improve your Hermes Agent's skills, prompts, and tools using evolutionary algorithms — no GPU, just API calls ($2–10/run). This guide shows you how to run the self-evolution pipeline on your existing Ubuntu VPS alongside Claude Code and Gemini CLI, using real execution traces to drive targeted improvements.
Context
You're running:
- A Hermes Agent instance on Ubuntu
- Claude Code CLI (Anthropic) and Gemini CLI (Google) on the same Ubuntu machine
These CLIs generate valuable execution traces — real-world usage logs showing what worked, what failed, and why failures happened. The Hermes Agent Self-Evolution project uses these traces to drive evolutionary optimization of:
- Skill files (SKILL.md)
- Tool descriptions
- System prompts
- (Later) tool implementation code
The key insight: failure traces contain more signal than success traces. The GEPA optimizer reads why things fail and proposes targeted mutations, then evaluates each candidate against three constraints: the test suite, size limits, and semantic preservation.
The Approach
How the Evolution Pipeline Works
Execution traces (from Claude Code, Gemini, Hermes)
│
▼
GEPA Optimizer ──► Mutates prompts/skills based on failure reasons
│
▼
Candidate variants (e.g., 10 mutated skill descriptions)
│
▼
Constraint gates:
• Full test suite (pytest)
• Size limits (≤15KB for skills)
• Semantic preservation
│
▼
Best variant ──► Creates PR for hermes-agent repo
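The gate-then-select step at the bottom of the diagram can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `Candidate`, `select_best`, the toy gate, and the hard-coded scores are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    text: str     # full text of a mutated skill or prompt
    score: float  # evaluation score, higher is better


def select_best(
    variants: Iterable[str],
    gates: Iterable[Callable[[str], bool]],
    score: Callable[[str], float],
) -> Optional[Candidate]:
    """Return the highest-scoring variant that passes every gate, or None."""
    gate_list = list(gates)
    survivors = [
        Candidate(text=v, score=score(v))
        for v in variants
        if all(gate(v) for gate in gate_list)
    ]
    return max(survivors, key=lambda c: c.score, default=None)


# Hypothetical scores, as if produced by an earlier evaluation step.
evaluated = {
    "Review the diff and list issues.": 0.62,
    "Review the diff, list issues, and suggest fixes.": 0.71,
    "Summarize the repository.": 0.90,  # highest score, but no longer a review skill
}

best = select_best(
    evaluated,  # iterating a dict yields its keys, i.e. the variant texts
    gates=[lambda text: "review" in text.lower()],  # toy stand-in for a semantic gate
    score=evaluated.__getitem__,
)
print(best.text if best else "no candidate passed the gates")
```

Note that the third variant has the highest score but is filtered out before selection. That ordering, gates first and ranking second, is the point of the sketch.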
On Your VPS: Step-by-Step
1. Install Hermes Agent Self-Evolution
# Clone the repo
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
# Install with dev dependencies
pip install -e ".[dev]"
# Point at your existing hermes-agent repo
export HERMES_AGENT_REPO=/path/to/your/hermes-agent
2. Import real execution traces from Claude Code & Gemini
# The repo includes importers for:
# - Claude Code sessions
# - Gemini CLI history
# - Hermes' own history
python -m evolution.importers.import_claude_sessions \
--claude-log-dir ~/.claude/logs \
--output ./datasets/session_db
# Similarly for Gemini
python -m evolution.importers.import_gemini_sessions \
--gemini-history ~/.gemini/history.json \
--output ./datasets/session_db
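This guide does not document the schema the importers write, so the following is only a hypothetical picture of what a unified trace record could look like once the sources are merged. The `Trace` class, its field names, and `traces_for_skill` are assumptions made for illustration; check the repo for the real format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Trace:
    """Hypothetical normalized record; the real session_db schema may differ."""
    source: str                           # e.g. "claude", "gemini", or "hermes"
    skill: str                            # skill the session exercised
    succeeded: bool
    failure_reason: Optional[str] = None  # only meaningful when succeeded is False


def traces_for_skill(
    traces: Iterable[Trace], skill: str
) -> tuple[list[Trace], list[Trace]]:
    """Split the traces for one skill into (failures, successes)."""
    relevant = [t for t in traces if t.skill == skill]
    failures = [t for t in relevant if not t.succeeded]
    successes = [t for t in relevant if t.succeeded]
    return failures, successes


sample = [
    Trace("claude", "github-code-review", True),
    Trace("gemini", "github-code-review", False, "output truncated"),
    Trace("hermes", "other-skill", False, "tool not called"),
]
failures, successes = traces_for_skill(sample, "github-code-review")
print(len(failures), "failure(s),", len(successes), "success(es)")
```

Keeping the source on each record lets you see later whether a failure pattern comes from one CLI or from all of them.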
3. Run skill evolution using real session data
# Evolve a specific skill using imported session history
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--iterations 10 \
--eval-source sessiondb \
--session-db ./datasets/session_db
The optimizer will:
- Read execution traces where the github-code-review skill failed or succeeded
- Generate 10–20 mutated variants of the skill's SKILL.md
- Evaluate each variant against your tests
- Select the best performer that passes all guardrails
- Open a PR against your hermes-agent repo
4. (Optional) Try synthetic evaluation first
# Quick test without real traces
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--iterations 5 \
--eval-source synthetic
Why It Works
1. Failure-driven mutation is more efficient than random search
GEPA doesn't blindly mutate — it parses execution traces to understand root causes:
- "Tool description too vague → model didn't call it"
- "Skill exceeds context limit → got truncated"
- "Example format inconsistent → parsing failed"
Then it mutates specifically to address those causes.
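To get a feel for why root causes matter, you can tally failure reasons yourself before running the optimizer. The sketch below is a plain frequency count and is not how GEPA analyzes traces internally; `rank_failure_reasons` and the sample strings are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def rank_failure_reasons(reasons: Iterable[Optional[str]]) -> list[tuple[str, int]]:
    """Return (reason, count) pairs, most frequent first, ignoring blanks."""
    normalized = [r.strip().lower() for r in reasons if r and r.strip()]
    return Counter(normalized).most_common()


# Hypothetical failure reasons extracted from imported traces.
reasons = [
    "Tool description too vague",
    "Skill exceeds context limit",
    "tool description too vague",
    "Example format inconsistent",
    "Tool description too vague ",
    "Skill exceeds context limit",
    None,
]

for reason, count in rank_failure_reasons(reasons):
    print(f"{count}x {reason}")
```

If one cause dominates, as the vague tool description does here, that is where a targeted mutation is most likely to pay off.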
2. API-driven means no GPU bottleneck
Your Ubuntu VPS doesn't need expensive hardware. Each run:
- Calls Claude/Gemini APIs for mutations (pennies)
- Evaluates variants using existing test suite (CPU only)
- Total cost: $2–10 per optimization run
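A rough way to sanity-check that range for your own setup is to multiply calls by tokens by price. Every number below is a placeholder, not real provider pricing or a measured token count, and `estimate_run_cost` is a hypothetical helper.

```python
def estimate_run_cost(
    iterations: int,
    variants_per_iteration: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
    calls_per_variant: int = 2,  # assumption: one mutation call plus one evaluation call
) -> float:
    """Estimate API spend in USD for one optimization run."""
    calls = iterations * variants_per_iteration * calls_per_variant
    total_tokens = calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: substitute your provider's current price and your own token counts.
cost = estimate_run_cost(
    iterations=10,
    variants_per_iteration=10,
    tokens_per_call=8_000,
    usd_per_million_tokens=3.0,
)
print(f"${cost:.2f}")  # prints $4.80
```

With these inputs the run makes 200 calls and uses 1.6 million tokens. Doubling the iterations or the tokens per call doubles the estimate, which is the easiest lever to watch.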
3. Multiple trace sources create richer evolution
Claude Code, Gemini CLI, and Hermes Agent each contribute traces with different characteristics, and therefore different failure modes:
- Claude: verbose, good at multi-step planning
- Gemini: fast, sometimes truncates
- Hermes: function-calling specific
Combining traces from all three sources gives broader coverage than any one alone.
4. Guardrails prevent regressions
Every candidate must pass:
- Full pytest suite (your existing tests)
- Size limits (prevents bloat)
- Semantic drift checks (no changing skill's core purpose)
Combined with score-based selection, this means a merged variant is one that improves the evaluation score without regressing your tests, not one that is merely different.
Apply It Yourself
Prerequisites
- Ubuntu VPS (20.04+)
- Hermes Agent installed and tested
- Claude Code CLI + Gemini CLI (optional but valuable)
- API keys for at least one LLM (OpenAI, Anthropic, or Gemini)
- Python 3.10+
Configuration Checklist
# 1. Set an API key for at least one provider (used for mutation/evaluation)
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
# or
export GOOGLE_API_KEY="..."
# 2. Point at your hermes-agent repo
export HERMES_AGENT_REPO="$HOME/hermes-agent"
# 3. Run a single skill evolution as a test
python -m evolution.skills.evolve_skill --skill your-skill-name --iterations 3 --eval-source synthetic
# 4. Examine the candidate PR
ls evolution/reports/latest/candidates/
Typical Workflow for Your Team
- Weekly evolution run against last 7 days of Claude Code + Gemini traces
- Human reviews the generated PR (guardrails ensure it passes tests, but semantics need human check)
- Merge the evolved skill back to your main branch
- Measure improvement via before/after benchmarks (the tool generates comparison reports)
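For the weekly run, you need some way to keep only the last 7 days of traces. The sketch below assumes each trace carries a timezone-aware timestamp, which this guide does not confirm; `within_last_days` is a hypothetical helper, not a flag or function of the tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def within_last_days(
    timestamps: Iterable[datetime],
    days: int = 7,
    now: Optional[datetime] = None,
) -> list[datetime]:
    """Keep timestamps that fall inside the window [now - days, now]."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [ts for ts in timestamps if cutoff <= ts <= now]


# Fixed reference time so the example is reproducible.
reference = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
stamps = [
    reference - timedelta(days=1),
    reference - timedelta(days=6, hours=23),
    reference - timedelta(days=8),
]
recent = within_last_days(stamps, days=7, now=reference)
print(len(recent), "of", len(stamps), "traces fall inside the window")
```

Passing `now` explicitly makes the filter easy to test and keeps scheduled runs reproducible when you rerun them later.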
Expected Outcomes
| Metric | Typical Improvement |
|---|---|
| Skill success rate | +15–30% |
| Token efficiency | -10–20% (shorter, more precise prompts) |
| Edge case handling | +40–60% (from failure traces) |
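The table does not say whether these figures are percentage points or relative changes, and the two can differ a lot. When you compare your own before/after runs, it helps to compute both. The helper names and sample outcomes below are hypothetical.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of runs that succeeded."""
    results = list(outcomes)
    if not results:
        raise ValueError("need at least one outcome")
    return sum(1 for ok in results if ok) / len(results)


def improvement(before: Iterable[bool], after: Iterable[bool]) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    b, a = success_rate(before), success_rate(after)
    absolute_points = (a - b) * 100
    relative_percent = (a - b) / b * 100 if b else float("inf")
    return absolute_points, relative_percent


# Hypothetical benchmark: 10/20 successes before evolution, 13/20 after.
before = [True] * 10 + [False] * 10
after = [True] * 13 + [False] * 7
points, relative = improvement(before, after)
print(f"+{points:.1f} points, +{relative:.1f}% relative")
```

In this example the same change reads as +15 points or +30% relative, so state which one you mean when you report results to your team.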
Source
- GitHub Repository: NousResearch/hermes-agent-self-evolution
- Research Paper: ICLR 2026 Oral (see repo for citation)
- Full Architecture: PLAN.md in the repo