How to Create Skills for AI Agents
Build reliable AI agent Skills by focusing on on-demand loading, minimal tools, appropriate models, progressive disclosure of information, and rigorous validation through testing, scoring, and iteration.
Learn how to build Skills that agents actually use—reliable, triggerable, and easy to maintain.
TL;DR:
A well‑designed Skill is not a long prompt stuffed into a file. It’s a lightweight entry point that loads on demand, uses minimal tools, picks the right model, and separates content into references, scripts, and assets. Most importantly, you must validate it with test data, scoring, and iteration. Without evaluation, your Skill is just an unverified text file.
The Problem: One Big File Doesn’t Work
Many developers write Skills by dumping everything—background, rules, examples, edge cases—into a single SKILL.md. The file looks thorough, but in practice it’s slow, wasteful, and often fails.
Why? Because agent context is finite. When you load a massive Skill file, you crowd out other important information, increase latency, and risk triggering context compaction, where the agent condenses earlier context and can lose details it still needs.
The real value of a Skill is different:
When a user makes a request, the agent should recognise the scenario, load only what’s needed, follow a fixed workflow, use appropriate tools, and produce consistent results.
That’s the promise. Here’s how to build Skills that deliver it.
1. Load on Demand – The Core Principle
Skills are not like CLAUDE.md (always present in context).
They are not like slash commands (manual invocation only).
Skills sit in the middle: they are on‑demand.
The full SKILL.md is injected only when the user’s input matches the Skill’s description.
Why the Description Matters Most
The description is always in the matching pool. It decides whether your Skill ever gets loaded.
❌ Bad description: Create an image.
✅ Good description for a cover‑image Skill: Generate social media covers, article illustrations, Xiaohongshu thumbnails, or tech posters from a title or brief.
You must include the actual phrases users say:
- “Make a cover for this article”
- “Create a YouTube thumbnail”
- “Design a Twitter header”
- “Turn this into a LinkedIn banner”
If your description is too narrow, the Skill will never trigger. If it’s too broad, it will trigger incorrectly. Both are failures.
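One cheap way to catch a too-narrow description before testing it in an agent is a keyword lint. The sketch below is only an assumption-laden approximation: real matching is done by the agent's model, not by word overlap, and the helper names `words` and `uncovered_phrases` are hypothetical. It flags user phrases that share no meaningful word with the description.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase words longer than two characters, with a crude plural strip."""
    found = re.findall(r"[a-z0-9]+", text.lower())
    return {w[:-1] if w.endswith("s") else w for w in found if len(w) > 2}


def uncovered_phrases(description: str, phrases: list[str]) -> list[str]:
    """Return the phrases that share no word with the description."""
    desc_words = words(description)
    return [p for p in phrases if not (words(p) & desc_words)]


description = (
    "Generate social media covers, article illustrations, Xiaohongshu "
    "thumbnails, or tech posters from a title or brief."
)
phrases = [
    "Make a cover for this article",
    "Create a YouTube thumbnail",
    "Design a Twitter header",
    "Turn this into a LinkedIn banner",
]

# Phrases listed here are candidates for new wording in the description.
missing = uncovered_phrases(description, phrases)
print(missing)
```

Here the lint reports the "header" and "banner" phrases, which suggests the description may need those words. Treat the result as a hint, then confirm with real trigger tests.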
Disable: Save Context for Rare Skills
Not every Skill needs to be auto‑loadable.
For infrequent tasks (year‑end summaries, resume rewrites, organising course notes), set disable: true. This turns the Skill into a manual slash command.
Why bother?
Every auto‑load Skill keeps its description in the matching pool. With 20+ Skills, that’s a constant token cost. Disabling rarely‑used Skills keeps context lean.
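To get a feel for that constant cost, here is a rough sketch that sums the description sizes of all Skills that stay auto-loadable. It assumes about four characters per token, which is a common rule of thumb for English text and not an exact tokenizer count. The `skills` dictionary shape and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def description_budget(skills: dict[str, dict]) -> int:
    """Sum estimated description tokens for Skills that are not disabled."""
    return sum(
        estimate_tokens(skill["description"])
        for skill in skills.values()
        if not skill.get("disable", False)
    )


skills = {
    "cover-image": {"description": "Generate social media covers from a title or brief."},
    "year-end-summary": {"description": "Write a year-end summary.", "disable": True},
}

# Only the auto-loadable Skill counts toward the always-present cost.
print(description_budget(skills))
```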
2. Limit Tools – Principle of Least Privilege
A Skill should declare exactly which tools it is allowed to use.
This is not just security—it’s reliability.
| Skill Type | Allowed Tools Example |
|---|---|
| Study note organisation | Read, Grep, Write |
| Cover image generation | Read, Write, ImageGenerate |
| Code analysis | Read, Grep, Bash (read‑only) |
| Batch file processing | Read, Write, Edit, Bash |
Golden rule:
Give the minimum permissions needed to complete the task.
A suggestion‑only Skill should never modify files. A read‑only analyser should never run arbitrary commands.
More tools do not make a Skill stronger; they make it riskier.
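The allowlist idea can be sketched as a simple set check. This is an illustration only: the `ALLOWED_TOOLS` mapping and the `disallowed_calls` helper are hypothetical, and a real agent runtime enforces tool permissions itself. The tool names are taken from the table above.

```python
# Hypothetical per-Skill allowlists, mirroring the table above.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "study-notes": {"Read", "Grep", "Write"},
    "cover-image": {"Read", "Write", "ImageGenerate"},
}


def disallowed_calls(skill: str, requested: list[str]) -> list[str]:
    """Return requested tools that are not on the Skill's allowlist.

    An unknown Skill has an empty allowlist, so everything is denied.
    """
    allowed = ALLOWED_TOOLS.get(skill, set())
    return [tool for tool in requested if tool not in allowed]


print(disallowed_calls("study-notes", ["Read", "Bash"]))
```

Denying by default for unknown Skills keeps the check consistent with the least-privilege rule: nothing is permitted unless it was declared.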
3. Assign the Right Model
Don’t use the same model for everything. Different tasks have different cost/quality trade‑offs.
| Task | Recommended Model Type |
|---|---|
| Writing documentation | Strong summariser, stable tone |
| Generating images | Multimodal or design‑tuned |
| Data analysis (small tables) | Cheap, fast reasoning |
| Web scraping / extraction | Low‑cost, high‑throughput |
| Long‑form article writing | High‑quality writer |
| Resume rewriting | Cautious, detail‑oriented |
Specifying a model per Skill is often overlooked, but it’s one of the most effective ways to reduce costs and improve reliability.
4. Progressive Disclosure – Keep SKILL.md Short
SKILL.md should be under 500 lines.
That’s not a magic number—it’s a warning sign. If you exceed it, you’re probably mixing too much into one file.
Treat SKILL.md as an entry point. It should contain:
- Trigger conditions
- Core principles
- Step‑by‑step workflow
- Required tools
- Where to find additional materials
- How to verify success
Everything else goes into separate folders:
    my-skill/
        SKILL.md
        references/   # long docs, style guides, platform specs
        scripts/      # deterministic code (check, resize, export)
        assets/       # templates, schemas, sample outputs
Why this works:
- The agent loads only SKILL.md first.
- If it needs details, it reads from references/.
- Deterministic operations run via scripts/ (more reliable than LLM generation).
- Templates live in assets/.
This is progressive disclosure: load on demand, execute on demand.
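The layout and the 500-line warning sign can be checked mechanically. The following is a minimal sketch, assuming the folder names used above. The function name `check_layout` is hypothetical, and the three supporting folders are treated as optional.

```python
from pathlib import Path

MAX_LINES = 500  # the warning threshold named in the text, not a hard limit


def check_layout(skill_dir: Path) -> list[str]:
    """Return warnings about a Skill folder's entry file and layout."""
    entry = skill_dir / "SKILL.md"
    if not entry.is_file():
        return ["SKILL.md is missing"]

    warnings = []
    line_count = len(entry.read_text(encoding="utf-8").splitlines())
    if line_count >= MAX_LINES:
        warnings.append(
            f"SKILL.md has {line_count} lines; consider moving material out"
        )
    for folder in ("references", "scripts", "assets"):
        if not (skill_dir / folder).is_dir():
            warnings.append(f"{folder}/ not found (optional)")
    return warnings
```

Calling `check_layout(Path("my-skill"))` returns an empty list when the entry file is short and all three folders exist.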
5. Validate, Score, Iterate – The Eval Loop
A Skill is not finished when you stop writing it.
It’s finished when it passes three levels of validation.
Level 1: Does it run?
Check basic execution:
- File paths correct?
- Tool permissions sufficient?
- Scripts executable?
- References readable?
- Output format as expected?
If the Skill fails on a real task, nothing else matters.
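Part of this checklist can be automated before any real task is run. The sketch below covers only the file-level items (entry file present, scripts executable, references readable) and assumes a POSIX-style filesystem and the folder names from the previous section. The `smoke_check` name is hypothetical. Tool permissions and output format still need a real run.

```python
import os
from pathlib import Path


def smoke_check(skill_dir: Path) -> list[str]:
    """Return file-level problems that would stop a Skill from running."""
    problems = []
    if not (skill_dir / "SKILL.md").is_file():
        problems.append("SKILL.md not found")

    # Scripts must carry an executable bit to be run directly.
    for script in sorted((skill_dir / "scripts").glob("*")):
        if script.is_file() and not os.access(script, os.X_OK):
            problems.append(f"not executable: {script.name}")

    # Reference files must be readable by the current user.
    for ref in sorted((skill_dir / "references").glob("*")):
        if ref.is_file() and not os.access(ref, os.R_OK):
            problems.append(f"not readable: {ref.name}")
    return problems
```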
Level 2: Does it trigger correctly?
Prepare a set of realistic user phrases (not just “run skill X”).
- For a cover‑image Skill: “make a thumbnail”, “design a poster”, “create a LinkedIn banner”.
- For a long‑form Skill: “expand this into a thread”, “write a long X post”, “turn this opinion into an article”.
If the Skill doesn’t trigger on these → expand the description.
If it triggers on unrelated phrases → narrow the description.
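The two rules above can be turned into a small report. This sketch assumes you can wrap your agent in a function that says whether the Skill loaded for a given phrase. That `triggers` callable is hypothetical and is stubbed with a keyword check here purely for demonstration.

```python
from typing import Callable


def trigger_report(
    triggers: Callable[[str], bool],
    should_trigger: list[str],
    should_not_trigger: list[str],
) -> dict[str, list[str]]:
    """Collect phrases the Skill missed and phrases it wrongly fired on."""
    return {
        "missed": [p for p in should_trigger if not triggers(p)],
        "false_triggers": [p for p in should_not_trigger if triggers(p)],
    }


def advice(report: dict[str, list[str]]) -> list[str]:
    """Map the report onto the two fixes named in the text."""
    tips = []
    if report["missed"]:
        tips.append("expand the description")
    if report["false_triggers"]:
        tips.append("narrow the description")
    return tips


# Stand-in for a real agent call (assumption): fires on two keywords only.
def fake_triggers(phrase: str) -> bool:
    return "thumbnail" in phrase or "poster" in phrase


report = trigger_report(
    fake_triggers,
    should_trigger=["make a thumbnail", "design a poster", "create a LinkedIn banner"],
    should_not_trigger=["summarise this PDF", "print this poster"],
)
print(report, advice(report))
```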
Level 3: Does it produce better results?
This is the most skipped step—and the most important.
Prepare test data (5–10 examples).
For a cover‑image Skill: 5 titles, 5 content types, some reference images.
For a resume Skill: different job descriptions, original resumes, ideal edits.
Score each output (0–10):
| Score | Meaning |
|---|---|
| 0‑2 | Completely failed |
| 3‑4 | Barely related, missed key requirements |
| 5‑6 | Usable but has obvious problems |
| 7‑8 | Stable, minor improvements possible |
| 9‑10 | Excellent, could be a demo |
Set a minimum bar (e.g., average > 5).
If any important case scores below 5, the Skill is not ready.
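The readiness rule can be written down directly. This is a minimal sketch of the two conditions stated above, an average above the bar and no important case below it. The bar of 5 comes from the example in the text, and the `is_ready` name and the `(score, important)` tuple shape are assumptions.

```python
from statistics import mean


def is_ready(cases: list[tuple[float, bool]], bar: float = 5.0) -> bool:
    """Decide readiness from (score, important) pairs on the 0-10 scale."""
    if not cases:
        return False  # no test data means nothing has been validated
    scores = [score for score, _ in cases]
    if mean(scores) <= bar:
        return False
    # Any important case below the bar blocks release, whatever the average.
    return all(score >= bar for score, important in cases if important)


print(is_ready([(8, True), (6, False), (7, True)]))
print(is_ready([(9, True), (9, False), (4, True)]))
```

The second call shows why the average alone is not enough: a high mean can hide one important case that failed.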
Baseline Comparison (A/B test)
Run the same test twice:
- Without the Skill
- With the Skill
If the scores are the same, the Skill adds no value.
If they improve significantly (e.g., 4 → 8), the Skill works.
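The comparison reduces to a paired difference over the same test cases. The sketch below computes the average lift. The `skill_lift` name is hypothetical, and what counts as a significant improvement is left to you, since the text gives only the 4 to 8 example.

```python
from statistics import mean


def skill_lift(without: list[float], with_skill: list[float]) -> float:
    """Average per-case score change when the Skill is enabled.

    Both lists must score the same test cases in the same order.
    """
    if not without or len(without) != len(with_skill):
        raise ValueError("need two non-empty score lists of equal length")
    return mean(after - before for before, after in zip(without, with_skill))


baseline = [4, 4, 5]
with_skill = [8, 7, 8]
print(skill_lift(baseline, with_skill))
```

A lift near zero means the Skill adds no measurable value on this test set.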
Iterate from Scores
Low score → identify the failure mode:
| Failure | Fix |
|---|---|
| Didn’t trigger | Add examples to description |
| Wrong trigger | Narrow description |
| Missing steps | Revise workflow in SKILL.md |
| Unstable format | Add output template in assets/ |
| Inconsistent checks | Write a script in scripts/ |
| Reference too long | Move to references/ |
| Tool missing | Add to allowed-tools |
| Wrong model | Change model specification |
Then run the eval loop again.
For high‑frequency Skills or Skills shared with others, this loop is mandatory. Without it, you have a prompt—not a Skill.
What Defines a Good Skill?
A great Skill doesn’t need to be complex. It just needs to:
- Trigger correctly when it should, and stay silent when it shouldn’t.
- Use minimal necessary tools – no over‑permissioning.
- Select the right model for the task’s cost and difficulty.
- Stay small in its main file, with supporting material externalised.
- Prove its value through test data, scoring, and iteration.
When you write a Skill, don’t only ask:
“What should I tell the model?”
Ask yourself:
- Which user phrases should load this Skill?
- What exact steps should it follow?
- Which tools are allowed – and which are forbidden?
- Which model fits best?
- What goes into SKILL.md vs references/ vs scripts/?
- How will I measure success?
- How will I maintain and improve it?
That shift in thinking, from “writing a prompt” to “engineering a triggerable, executable, testable workflow”, is what separates effective Skills from bloated text files.
Now go build Skills that agents actually thank you for.