How to Create Skills for AI Agents

Learn how to build Skills that agents actually use—reliable, triggerable, and easy to maintain.

TL;DR:

A well‑designed Skill is not a long prompt stuffed into a file. It’s a lightweight entry point that loads on demand, uses minimal tools, picks the right model, and separates content into references, scripts, and assets. Most importantly, you must validate it with test data, scoring, and iteration. Without evaluation, your Skill is just an unverified text file.

The Problem: One Big File Doesn’t Work

Many developers write Skills by dumping everything—background, rules, examples, edge cases—into a single SKILL.md. The file looks thorough, but in practice it’s slow, wasteful, and often fails.

Why? Because agent context is finite. A massive Skill file crowds out other important information, increases latency, and makes it more likely that the agent compacts its context and loses earlier details.

The real value of a Skill is different:

When a user makes a request, the agent should recognise the scenario, load only what’s needed, follow a fixed workflow, use appropriate tools, and produce consistent results.

That’s the promise. Here’s how to build Skills that deliver it.

1. Load on Demand – The Core Principle

Skills are not like CLAUDE.md (always present in context).
They are not like slash commands (manual invocation only).

Skills sit in the middle: they are on‑demand.
The full SKILL.md is loaded only when the agent judges that the user’s request matches the Skill’s description.

Why the Description Matters Most

The description is always in the matching pool. It decides whether your Skill ever gets loaded.

❌ Bad description:
Create an image.

✅ Good description for a cover‑image Skill:
Generate social media covers, article illustrations, Xiaohongshu thumbnails, or tech posters from a title or brief.

You must include the actual phrases users say:

“Make a cover for this article”
“Create a YouTube thumbnail”
“Design a Twitter header”
“Turn this into a LinkedIn banner”

If your description is too narrow, the Skill will never trigger. If it’s too broad, it will trigger incorrectly. Both are failures.
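A quick lexical check can flag phrases a description may be missing before you test with the agent itself. The sketch below is only a crude proxy: the agent matches on meaning, not keywords, and `uncovered_phrases`, `content_words`, and the stopword list are illustrative names rather than part of any agent API.

```python
import re

# Words too generic to say anything about coverage (illustrative list).
STOPWORDS = {"a", "an", "the", "this", "for", "into", "from", "or", "to", "of"}

def content_words(text: str) -> set[str]:
    """Lowercase the text, drop stopwords, and strip one plural 's'."""
    words = set()
    for w in re.findall(r"[a-z0-9]+", text.lower()):
        if w in STOPWORDS:
            continue
        words.add(w[:-1] if len(w) > 3 and w.endswith("s") else w)
    return words

def uncovered_phrases(description: str, phrases: list[str]) -> list[str]:
    """Return the user phrases that share no content word with the description."""
    desc_words = content_words(description)
    return [p for p in phrases if not (content_words(p) & desc_words)]

description = (
    "Generate social media covers, article illustrations, Xiaohongshu "
    "thumbnails, or tech posters from a title or brief."
)
phrases = [
    "Make a cover for this article",
    "Create a YouTube thumbnail",
    "Design a Twitter header",
    "Turn this into a LinkedIn banner",
]
missing = uncovered_phrases(description, phrases)
```

Here `missing` contains the "header" and "banner" phrases, which suggests adding those words to the description. A phrase that passes this check can still fail to trigger, so treat the result as a hint, not a verdict.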

Disable: Save Context for Rare Skills

Not every Skill needs to be auto‑loadable.
For infrequent tasks (year‑end summaries, resume rewrites, organizing course notes), turn off automatic invocation (in Claude Code, disable-model-invocation: true in the frontmatter). This turns the Skill into a manual slash command.

Why bother?
Every auto‑load Skill keeps its description in the matching pool. With 20+ Skills, that’s a constant token cost. Disabling rarely‑used Skills keeps context lean.
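To get a feel for that constant cost, a back-of-the-envelope estimate is enough. The sketch below assumes roughly four characters per token, a common rule of thumb for English text and not an exact tokenizer count; `estimate_pool_tokens` is a hypothetical helper.

```python
def estimate_pool_tokens(descriptions: list[str], chars_per_token: float = 4.0) -> int:
    """Rough token count for all descriptions kept in context.

    chars_per_token is an assumed average, not a real tokenizer.
    """
    total_chars = sum(len(d) for d in descriptions)
    return round(total_chars / chars_per_token)

# Twenty Skills with 200-character descriptions each.
pool_cost = estimate_pool_tokens(["x" * 200] * 20)
```

With these assumed numbers the pool costs about 1,000 tokens on every request, whether or not any Skill is used.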

2. Limit Tools – Principle of Least Privilege

A Skill should declare exactly which tools it is allowed to use.
This is not just security—it’s reliability.

Skill Type	Allowed Tools Example
Study note organization	`Read`, `Grep`, `Write`
Cover image generation	`Read`, `Write`, `ImageGenerate`
Code analysis	`Read`, `Grep`, `Bash` (read‑only)
Batch file processing	`Read`, `Write`, `Edit`, `Bash`

Golden rule:
Give the minimum permissions needed to complete the task.
A suggestion‑only Skill should never modify files. A read‑only analyser should never run arbitrary commands.

More tools do not make a Skill stronger; they make it riskier.
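The rule can be checked mechanically once you write down what each kind of Skill is allowed. The sketch below reuses the tool names from the table above; the `POLICY` keys and the `excess_tools` helper are hypothetical and not tied to any agent's configuration format.

```python
# Maximum tool set per Skill type, taken from the table above.
POLICY = {
    "study-notes": {"Read", "Grep", "Write"},
    "cover-image": {"Read", "Write", "ImageGenerate"},
    "code-analysis": {"Read", "Grep", "Bash"},
    "batch-files": {"Read", "Write", "Edit", "Bash"},
}

def excess_tools(skill_type: str, declared: set[str]) -> list[str]:
    """Return declared tools that the policy does not allow for this type."""
    return sorted(declared - POLICY[skill_type])

# A note-taking Skill that quietly asks for shell access.
extra = excess_tools("study-notes", {"Read", "Write", "Bash"})
```

Here `extra` is `["Bash"]`, which is the kind of over-permissioning worth catching in review.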

3. Assign the Right Model

Don’t use the same model for everything. Different tasks have different cost/quality trade‑offs.

Task	Recommended Model Type
Writing documentation	Strong summariser, stable tone
Generating images	Multimodal or design‑tuned
Data analysis (small tables)	Cheap, fast reasoning
Web scraping / extraction	Low‑cost, high‑throughput
Long‑form article writing	High‑quality writer
Resume rewriting	Cautious, detail‑oriented

Specifying a model per Skill is often overlooked, but it’s one of the most effective ways to reduce costs and improve reliability.
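One simple way to keep these choices explicit is a lookup table with a default. The tier labels and Skill names below are placeholders chosen for illustration, not real model identifiers, and `model_for` is a hypothetical helper.

```python
# Tier labels are placeholders, not real model names.
MODEL_BY_SKILL = {
    "web-extraction": "low-cost",
    "small-table-analysis": "low-cost",
    "long-form-writing": "high-quality",
    "resume-rewrite": "high-quality",
}
DEFAULT_TIER = "mid-range"

def model_for(skill_name: str) -> str:
    """Return the tier configured for a Skill, or the default if none is set."""
    return MODEL_BY_SKILL.get(skill_name, DEFAULT_TIER)
```

Keeping the mapping in one place makes it easy to see which Skills are paying for a stronger model and whether they need it.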

4. Progressive Disclosure – Keep SKILL.md Short

SKILL.md should be under 500 lines.
That’s not a magic number—it’s a warning sign. If you exceed it, you’re probably mixing too much into one file.

Treat SKILL.md as an entry point. It should contain:

Trigger conditions
Core principles
Step‑by‑step workflow
Required tools
Where to find additional materials
How to verify success

Everything else goes into separate folders:

my-skill/
  SKILL.md
  references/     # long docs, style guides, platform specs
  scripts/        # deterministic code (check, resize, export)
  assets/         # templates, schemas, sample outputs

Why this works:

The agent loads only SKILL.md first.
If it needs details, it reads from references/.
Deterministic operations run via scripts/ (more reliable than LLM generation).
Templates live in assets/.

This is progressive disclosure: load on demand, execute on demand.
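The layout above is easy to lint. The following is a minimal sketch that assumes the folder names shown earlier and treats the 500-line figure as a warning threshold; `check_layout` is a hypothetical helper, not part of any agent tooling.

```python
from pathlib import Path

MAX_LINES = 500  # warning threshold from the text, not a hard limit
SUPPORT_DIRS = ("references", "scripts", "assets")

def check_layout(skill_dir: Path) -> list[str]:
    """Return human-readable notes about a Skill folder."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["missing SKILL.md"]
    notes = []
    line_count = len(skill_md.read_text(encoding="utf-8").splitlines())
    if line_count >= MAX_LINES:
        notes.append(
            f"SKILL.md has {line_count} lines; consider moving detail to references/"
        )
    for name in SUPPORT_DIRS:
        if not (skill_dir / name).is_dir():
            # Informational only: a small Skill may not need every folder.
            notes.append(f"no {name}/ folder")
    return notes
```

Calling `check_layout(Path("my-skill"))` returns an empty list when the entry file is short and all three support folders exist.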

5. Validate, Score, Iterate – The Eval Loop

A Skill is not finished when you stop writing it.
It’s finished when it passes three levels of validation.

Level 1: Does it run?

Check basic execution:

File paths correct?
Tool permissions sufficient?
Scripts executable?
References readable?
Output format as expected?

If the Skill fails on a real task, nothing else matters.
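Some of these checks can run before the agent is involved at all. The sketch below assumes SKILL.md mentions supporting files by relative path (for example scripts/check.sh) and that scripts are meant to be directly executable; `level1_problems` and the path pattern are illustrative assumptions.

```python
import os
import re
from pathlib import Path

# Matches relative paths such as scripts/resize.py or references/style.md.
PATH_PATTERN = re.compile(r"\b(?:references|scripts|assets)/[\w./-]+")

def level1_problems(skill_dir: Path) -> list[str]:
    """List files that SKILL.md mentions but that are missing or not executable."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    # Strip a sentence-ending period that the pattern may have picked up.
    mentioned = {m.rstrip(".") for m in PATH_PATTERN.findall(text)}
    problems = []
    for rel in sorted(mentioned):
        target = skill_dir / rel
        if not target.exists():
            problems.append(f"{rel}: not found")
        elif rel.startswith("scripts/") and not os.access(target, os.X_OK):
            problems.append(f"{rel}: not executable")
    return problems
```

An empty result only means the static checks passed; tool permissions and output format still need a real run. Note that the executable check is meaningful on Unix-like systems and less so on Windows.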

Level 2: Does it trigger correctly?

Prepare a set of realistic user phrases (not just “run skill X”).

For a cover‑image Skill: “make a thumbnail”, “design a poster”, “create a LinkedIn banner”.
For a long‑form Skill: “expand this into a thread”, “write a long X post”, “turn this opinion into an article”.

If the Skill doesn’t trigger on these → expand the description.
If it triggers on unrelated phrases → narrow the description.
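The two failure directions can be reported side by side. In this sketch, `triggers` is a hypothetical callable you supply yourself, for example a wrapper that sends a phrase to the agent and reports whether the Skill was loaded. The stub in the usage example is only a stand-in.

```python
from typing import Callable

def trigger_report(
    triggers: Callable[[str], bool],
    should_trigger: list[str],
    should_not_trigger: list[str],
) -> dict[str, list[str]]:
    """Split test phrases into misses (too narrow) and false alarms (too broad)."""
    return {
        "missed": [p for p in should_trigger if not triggers(p)],
        "false_alarms": [p for p in should_not_trigger if triggers(p)],
    }

# Stand-in for a real agent run: pretend the Skill loads on these two words.
def stub_triggers(phrase: str) -> bool:
    return "thumbnail" in phrase or "banner" in phrase

report = trigger_report(
    stub_triggers,
    should_trigger=["make a thumbnail", "design a poster", "create a LinkedIn banner"],
    should_not_trigger=["summarize this banner ad report"],
)
```

With the stub, "design a poster" lands in `missed` (expand the description) and the banner ad report lands in `false_alarms` (narrow it).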

Level 3: Does it produce better results?

This is the most skipped step—and the most important.

Prepare test data (5–10 examples).
For a cover‑image Skill: 5 titles, 5 content types, some reference images.
For a resume Skill: different job descriptions, original resumes, ideal edits.

Score each output (0–10):

Score	Meaning
0‑2	Completely failed
3‑4	Barely related, missed key requirements
5‑6	Usable but has obvious problems
7‑8	Stable, minor improvements possible
9‑10	Excellent, could be a demo

Set a minimum bar (e.g., average > 5).
If any important case scores below 5, the Skill is not ready.
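The readiness rule combines an average with a per-case floor, which is easy to get subtly wrong by hand. The sketch below encodes it directly, using the example thresholds from the text; the function name, the case labels, and the way important cases are marked are assumptions.

```python
from statistics import mean

def is_ready(
    scores: dict[str, float],
    important: set[str],
    min_average: float = 5.0,
    min_important: float = 5.0,
) -> bool:
    """Pass only if the average exceeds the bar and no important case falls below its floor."""
    if mean(scores.values()) <= min_average:
        return False
    return all(scores[case] >= min_important for case in important)

scores = {"tech-poster": 8, "article-cover": 7, "thumbnail": 4}
ready = is_ready(scores, important={"thumbnail"})
```

Here the average is above 5, but `ready` is `False` because the important "thumbnail" case scored 4.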

Baseline Comparison (A/B test)

Run the same test twice:

Without the Skill
With the Skill

If the scores are the same, the Skill adds no value.
If they improve significantly (e.g., 4 → 8), the Skill works.
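The comparison reduces to a per-case difference averaged over the test set. This is a minimal sketch that assumes both runs were scored on the same cases in the same order; `skill_lift` is a hypothetical helper.

```python
from statistics import mean

def skill_lift(baseline: list[float], with_skill: list[float]) -> float:
    """Mean per-case score improvement from using the Skill."""
    if not baseline or len(baseline) != len(with_skill):
        raise ValueError("need one score per test case in both runs")
    return mean(w - b for w, b in zip(with_skill, baseline))

lift = skill_lift(baseline=[4, 5, 3], with_skill=[8, 7, 6])
```

A lift near zero means the Skill adds no value. With only 5–10 test cases, a small positive lift may just be noise, so look at the individual cases as well as the mean.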

Iterate from Scores

Low score → identify the failure mode:

Failure	Fix
Didn’t trigger	Add examples to `description`
Wrong trigger	Narrow `description`
Missing steps	Revise workflow in `SKILL.md`
Unstable format	Add output template in `assets/`
Inconsistent checks	Write a script in `scripts/`
Reference too long	Move to `references/`
Tool missing	Add to `allowed-tools`
Wrong model	Change model specification

Then run the eval loop again.

For high‑frequency Skills or Skills shared with others, this loop is mandatory. Without it, you have a prompt—not a Skill.

What Defines a Good Skill?

A great Skill doesn’t need to be complex. It just needs to:

Trigger correctly when it should, and stay silent when it shouldn’t.
Use minimal necessary tools – no over‑permissioning.
Select the right model for the task’s cost and difficulty.
Stay small in its main file, with supporting material externalised.
Prove its value through test data, scoring, and iteration.

When you write a Skill, don’t only ask:
“What should I tell the model?”

Ask yourself:

Which user phrases should load this Skill?
What exact steps should it follow?
Which tools are allowed – and which are forbidden?
Which model fits best?
What goes into SKILL.md vs references/ vs scripts/?
How will I measure success?
How will I maintain and improve it?

That shift in thinking, from “writing a prompt” to “engineering a triggerable, executable, testable workflow”, is what separates effective Skills from bloated text files.

Now go build Skills that agents actually thank you for.