Let me tell you about the day I first used an LLM.

I was genuinely excited. I remember thinking: this is it, this is the thing that’s going to change everything. I asked it to help me with something technical and it gave me an answer that felt almost magical.

Then I spent the next three hours rewriting everything it gave me.

Not because the AI was bad. It wasn’t. It was producing decent, reasonable output for a generic developer. The problem was that I’m not a generic developer, and it had no idea who I was, how I worked, or what I actually needed.

That’s the context problem. And I’m willing to bet it’s not just me.

Your Expertise is Invisible to AI

Here’s what AI has access to when you open a fresh chat: nothing. Zero. It doesn’t know your codebase, your team conventions, your architecture decisions from last quarter, or the hard lessons you’ve learned over 10 years of shipping software.

Your experiences? Invisible.
Your ways of working? Invisible.
Your standards? Completely invisible.

And here’s the frustrating part. The more senior you are, the worse this gets. I call it the Seniority Issue.

When you’re a junior developer, you’re still discovering what “good” looks like. AI’s output is close enough to acceptable that you can work with it.

The more experience you accumulate, the clearer your expectations become. You know exactly what you want. You have opinions about folder structures, naming conventions, error handling, documentation style. And when AI hands you something that doesn’t match any of that… you rewrite it. All of it.

More expertise = more gap. Every single time.

There’s another thing worth understanding: AI is not a script.

Input A + Script = B           (always, deterministically)
Input A + AI     = Probably B  (maybe, probabilistically)

Scripts are deterministic. AI is probabilistic. Same prompt, different session, you might get a meaningfully different answer. That’s not a bug, it’s how the technology works. But it means you can’t treat AI the way you treat a build tool. If you ignore this distinction, you’ll design systems that feel unreliable when really they’re just being used the wrong way.
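
Here’s a toy sketch of that difference. Nothing in it calls a real model: `toy_model` is a hypothetical stand-in that samples from a made-up weighted list of answers, which is only loosely analogous to how an LLM picks its output.

```python
import random

def script(x: int) -> int:
    """A deterministic 'script': the same input always gives the same output."""
    return x * 2

def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM: samples one answer from a weighted list.

    The prompt is ignored here; the point is only that the output is sampled.
    """
    answers = ["B", "B, with caveats", "something else"]
    weights = [0.8, 0.15, 0.05]  # made-up probabilities, for illustration only
    return rng.choices(answers, weights=weights, k=1)[0]

# The script yields a single distinct result, however often it runs.
script_outputs = {script(21) for _ in range(100)}

# The toy model usually says "B", but not always.
rng = random.Random()  # unseeded on purpose: every run can differ
model_outputs = [toy_model("Input A", rng) for _ in range(100)]

print(script_outputs)
print(sorted(set(model_outputs)))
```

Run it a few times: the first line never changes, while the second can.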

Just a Chatbot? Or Maybe Not?

At this point you might be thinking: AI is just a chatbot. I go to ChatGPT, I ask a question, I get an answer, the execution is on me.

If that’s where you are, you’re only seeing part of the picture. You can do much more with AI than that.

The thing I want you to internalise is that AI today is not just a chatbot. It’s an agent. And the difference matters.

A chatbot answers. An agent acts.

The difference is what tools you connect to it. With the right tools, an agent can:

  • Read and modify files on your machine
  • Execute scripts and run code
  • Send emails and messages
  • Book meetings
  • Update documentation

You define the tools. The agent uses them. That’s it. That’s the whole shift.
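
A minimal sketch of “you define the tools, the agent uses them”. The tool names and the `run_tool` dispatcher are hypothetical. Real agent runtimes have their own tool interfaces, and the step where the model chooses a tool is replaced here by a hard-coded request.

```python
from typing import Callable, Dict

# Registry of tools the agent is allowed to use. You decide what goes in here.
TOOLS: Dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a plain function as a tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def word_count(text: str) -> str:
    """Count whitespace-separated words."""
    return str(len(text.split()))

@tool
def shout(text: str) -> str:
    """Upper-case the text."""
    return text.upper()

def run_tool(name: str, **kwargs: str) -> str:
    """Dispatch a tool request; unknown tools are refused rather than guessed."""
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](**kwargs)

# In a real agent, the model would produce this request. Here it is hard-coded.
request = {"name": "word_count", "args": {"text": "a chatbot answers, an agent acts"}}
result = run_tool(request["name"], **request["args"])
print(result)
```

The agent can only do what the registry allows, which is the whole point: the capabilities are yours to define.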

Once you see AI as an agent that acts, the question changes. It’s no longer “how do I write a better prompt?” but “how do I give this agent the right context and the right tools so it actually does the work the way I’d do it?”

That’s where the next two approaches come in.

The Two Paths

The fix is simple: if AI doesn’t know how you work, teach it once and let it remember forever.

Two ways to do that. Let’s walk through both.

Approach #1: (Claude) Skills

Simple, quick, and easy to build.

What is a Skill?

A skill is a set of instructions (plus optional documentation and tools) that tells an AI how to perform a specific task it couldn’t do reliably on its own. Things like calling an external API, generating an image, or following your specific workflow.

Think of it like a detailed brief you’d give a new colleague. Except instead of explaining it every morning, you write it down once and the AI reads it every time it needs to do that job.

Skills come in two flavours, and you’ll pick based on how much heavy lifting the skill needs to do on its own.

Simple: a single SKILL.md file

Just a markdown file with plain-text instructions. That’s it. Here’s a stripped-down example:

---
name: blog-post-reviewer
description: "Multi-perspective review system for technical blog posts. 
Spawns 4 contextually-relevant reviewers in parallel."
---

# Blog Post Reviewer

## What This Skill Does
Analyzes blog content, identifies the target audience, then runs
a multi-angle review with 4 parallel sub-agents.

## Review Perspectives
1. Domain expert (checks technical accuracy)
2. Tech friend (checks clarity and tone)
3. Target reader (checks accessibility)
4. General reader (checks flow and engagement)

## Output Format
Each reviewer produces ~350 words of specific, actionable feedback.
Consolidate into a prioritised list of improvements.

Advanced: a .skill file (a zip with instructions + tools)

When a skill needs to execute a more complex task that requires supporting code, like calling an API, running a script, or processing a file, you package the instructions together with Python tool scripts into a .skill file. The AI reads the instructions and executes the scripts. You get deterministic results for the parts that need to be deterministic.

Where Do You Get Skills?

You’ve got a couple of paths here.

The easiest one is to grab a skill someone else already built. The community keeps a curated list over at awesome-claude-skills, and dropping any of them into your project’s .claude/skills/ directory is enough for Claude to find it. Done.

If nothing fits what you need, you build your own. And here’s the fun part: you create skills using another skill. There’s a skill-creator skill (yes, that exists) that guides Claude Code through building a new skill from scratch. Meta? Yes. Useful? Absolutely. 🤓

A Real Example: My Blog Post Reviewer

Every time I wrote a post, I’d send a draft to friends for feedback. The problem? I knew the draft wasn’t great yet. I was burning their time on issues I could have caught myself, often across multiple review rounds. Bad deal for them, slow for me.

What I actually wanted was to hand them something already in much better shape, so their review time went to the real issues instead of the obvious ones.

So I built a skill that does this:

One command →
  Analyzes the content and identifies the target audience
  Spawns 4 sub-agents, each reviewing from a different perspective
  Collects all feedback
  Synthesizes it into prioritised, actionable suggestions
  Applies the fixes to the draft
  Repeats until the post is (almost) perfect

The whole thing runs from a single command in Claude Code. I use it on every post I publish, including this one.

Want to see it in action? Here’s the demo I showed at the conference:

Why Skills Are Great

  • Easy to create. Describe what you need in plain language, Claude Code scaffolds the rest.
  • Easy to use. Claude auto-detects when a skill is relevant based on what you’re doing.
  • Self-contained. A single file (or zip) you can share with anyone on your team.
  • Stable and repeatable. The Python tools inside are deterministic scripts. They don’t hallucinate.
  • Works across AI tools. Plain text + standard Python = not locked to Claude. The same skill works wherever the CLI supports it.

Where Skills Hit the Ceiling

  • Struggles with complex, multi-step tasks. Squeeze too many moving parts into one markdown file and things get messy fast.
  • Hard to combine. Chaining two skills without an external orchestrator isn’t straightforward.
  • Don’t scale well. Skills are self-contained by design, which is both their strength and their limit.

When you start feeling that ceiling, it’s time to look at the second approach.

Approach #2: The WAT Framework

WAT stands for Workflows, Agents, and Tools, and I first came across it in a Nate Herkelman YouTube video. It’s a framework from the AI Automation Society community, and once you know the name you’ll see it referenced all over the place. It solves exactly the problem skills can’t: complex, multi-step automation that needs to scale.

The core idea is divide and conquer. Instead of one big instruction file that does everything, you split concerns into three distinct layers.

Layer 1: Workflows (The Instructions)

A workflow is a markdown file, an SOP, that describes what needs to happen from start to finish. It includes the objective, the inputs the agent needs to ask for, the phases of the work, the tools to call along the way, and the edge cases to handle. Crucially, it evolves. As you use it, the agent updates the workflow with what it learns.

The shape of a workflow is simple:

# <What this workflow does>

**Objective**: <one-line description>

**Inputs**:
- <thing the agent should ask the user for>
- <or look up in another file>

---

## Phase 1: <name>
<plain-language steps the agent should take>

## Phase 2: <name>
<more steps>

> <a callout for an edge case or a hard rule>

## Phase 3: <name>
<...>

That is it. No code, no special syntax, just markdown that reads like onboarding documentation for a new team member. Except the team member is the AI.

Layer 2: Agents (The Decision-Maker)

Quick naming note: when we said earlier that AI is an agent, we meant the broader concept of an AI that takes actions. The “Agent” in WAT is more specific. Same word, narrower meaning here.

This is the AI. The agent reads the workflow and:

  • Makes decisions based on the instructions
  • Runs tasks in parallel using sub-agents when needed
  • Handles edge cases autonomously
  • Recovers from failures instead of bailing on the first error

The agent never tries to handle execution directly. If it needs to fetch a version number, it runs a tool. If it needs to push files to a repo, it runs a tool. The agent orchestrates. The tools execute.

Why does this matter? LLMs hallucinate. Not always, not dramatically, but often enough that you can’t fully trust any single step. Chain a few LLM steps together and the errors compound fast. Here’s the punchy version:

If each step is 90% accurate, you’re down to 59% success after just five steps.

The arithmetic assumes every step fails independently and at the same rate, which real-world errors don’t, but the point holds. By keeping the AI focused on reasoning and orchestration, and handing execution off to deterministic scripts (which don’t hallucinate), you keep accuracy high where it matters.
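
A quick sketch of how that figure falls out when you multiply the per-step accuracies together. `chain_success` is a hypothetical helper name, and the model is deliberately simple: same accuracy at every step, no dependence between steps.

```python
def chain_success(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_accuracy ** steps

# Print the overall success rate for chains of increasing length.
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps at 90%: {chain_success(0.9, n):.0%}")
```

Five steps lands at about 59%, and ten steps at roughly 35%. Every step you move from the LLM into a deterministic tool drops out of that product.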

Layer 3: Tools (The Execution)

Tools are Python scripts in a tools/ directory. They’re deterministic, reusable, testable, and they don’t hallucinate. That last point matters more than it sounds. The longer an AI session runs, the more the context grows and the more the model’s output starts to drift. Tools sidestep this: each one does exactly one thing, and does it reliably, every single time.

The shape is simple: a tools/ folder with one Python file per task.

tools/
├── fetch_latest_version.py   # check the latest release of a package
├── push_to_repo.py           # commit and PR in one call
├── trigger_pipeline.py       # kick off a CI run
└── ...

Each tool is called by the agent the same way you’d call any CLI. The agent reads its output and decides what to do next.

The key pattern: credentials never touch the AI. API keys live in .env. The Python scripts read them. The AI only ever sees results.
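
A minimal sketch of that pattern. The variable name `GITEA_TOKEN`, the header format, and the `build_auth_header` / `tool_output` helpers are all hypothetical, and the `.env` file is assumed to have been loaded into the environment before the tool runs.

```python
import json
import os

def build_auth_header() -> dict:
    """Read the secret inside the tool; it never appears in a prompt."""
    token = os.environ.get("GITEA_TOKEN")  # hypothetical variable name
    if not token:
        raise RuntimeError("GITEA_TOKEN is not set")
    return {"Authorization": f"token {token}"}  # illustrative header format

def tool_output(status: str, detail: str) -> str:
    """What the agent gets to read: a result, never the credential."""
    return json.dumps({"status": status, "detail": detail})

# Demo with a fake value so the sketch runs without a real .env file.
os.environ.setdefault("GITEA_TOKEN", "fake-token-for-demo")

headers = build_auth_header()  # would be handed to an HTTP client
report = tool_output("ok", "pull request opened")
print(report)  # only this line reaches the agent
```

The secret is used inside the script and never printed, so the only thing that ends up in the agent’s context is the JSON result.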

Getting Started With WAT

Honestly? It is not that complicated. Here is the shortest possible path:

✅ Drop a CLAUDE.md file into your project that describes the WAT framework
✅ Ask Claude to initialise the project following those instructions (it creates workflows/, tools/, etc.)
✅ Create your first workflow simply by asking
✅ Enjoy

That is it. From there, the framework does what it is designed to do: every time you use it, it gets a little better.

A Real Example: My Homelab

Clearly, I didn’t have enough hobbies. So I started a homelab. (Naturally.)

It started small: Home Assistant on a Raspberry Pi, a couple of containers on my Synology NAS. Then I discovered Proxmox and the magical world of homelabs. And here’s the thing: with me, it’s all or nothing. So I went all in. Proxmox cluster, LXC containers, Ansible, Terraform, Gitea, Semaphore, the full self-hosted enthusiast stack.

I didn’t build all of this to look impressive. I built it because once you go past a few services, doing things manually stops being fun. The whole stack exists for one reason: spend less time on operations, more time on the parts of the homelab I actually enjoy.

If you don’t have a homelab, mentally swap this for your side project’s deploy pipeline, or your team’s “release a new service” runbook. The shape is the same: a bunch of tools that need to fire in the right order to ship something.

Now I have one WAT system that handles everything:

  • Deploy services. Research, configure, and deploy new self-hosted apps end-to-end.
  • Manage infrastructure. Provision new VMs and containers via Terraform (IaC).
  • Handle updates. Coordinate updates across the entire infrastructure.
  • Write documentation. Generate and update docs for every change.

Quick glossary if these names aren’t familiar: Gitea is a self-hosted Git platform (think your GitHub). Semaphore is a CI/CD runner (think your GitHub Actions or Jenkins). LXC is a lightweight container, like Docker but more like a tiny VM.

Let me walk you through what this looks like in practice, layer by layer.

The workflow. When I want to deploy a new app, the agent reads deploy_app.md. Here’s the opening of it:

# Deploy Application

**Objective**: Research, plan, and deploy a self-hosted application 
end-to-end, from provisioning infrastructure to running the service 
and writing documentation.

**Inputs**:
- Application name
- Any specific requirements or constraints

---

## Phase 1: Discovery
Ask if not already provided:
- Application name and purpose
- Any hard requirements (GPU passthrough, USB devices, etc.)
- Any known installation method preference

## Phase 2: Research
**Step 1: Read the official documentation first. 
This is mandatory and non-negotiable.**

Use WebSearch + WebFetch to find and read the official installation guide.
Extract: supported install method, version requirements, known limitations.

> Do not decide on infrastructure until this step is complete.

## Phase 3: Plan + Clarification
Present a proposed deployment plan:
  Deployment method: [reasoning citing official docs]
  Hostname:          [appname]
  Resources:         [X cores, Y MB RAM, Z GB disk]
  IP:                [next available from IP Registry]
  
Wait for user approval before proceeding.

Notice the workflow specifies what to do, not how to do every individual step. Working out the how is the agent’s job. What the workflow does include is edge cases, decision criteria, and explicit gates like “wait for user approval before proceeding.”

The agent. A CLAUDE.md at the root of the project tells Claude Code how to operate inside this WAT system:

## How to Operate

**1. Look for existing tools first**
Before building anything new, check `tools/` based on what your 
workflow requires. Only create new scripts when nothing exists.

**2. Learn and adapt when things fail**
When you hit an error:
- Read the full error message and trace
- Fix the script and retest
- Document what you learned in the workflow (rate limits, timing quirks)
- Update the workflow so this never happens again

**3. Keep workflows current**
Workflows should evolve as you learn. When you find better methods,
discover constraints, or encounter recurring issues, update the workflow.

The tools. When the agent needs to know the latest stable version of a Docker image, the workflow has it call a Python tool rather than guess (which it would do, confidently and incorrectly). Running python tools/fetch_latest_version.py --image postgres returns the latest stable version, like 16.4, pulled from the Docker Hub or GitHub releases API. Boring, deterministic, fast.

And here is the part that matters: the workflow file requires the agent to do this before writing any image tag. From the real craft_compose.md:

## Step 1: Version Lookup (MANDATORY: run before writing any image)

For every image in the stack:
1. Run: python tools/fetch_latest_version.py --image <image-name>
2. If the tool returns a version, use it. Add a comment:
     image: postgres:16.4  # v16.4, source: dockerhub (2025-01-15)
3. If the tool fails, fall back to WebFetch on Docker Hub.
4. If both fail, flag explicitly with a TODO comment.

**Never** use `latest`, `stable`, or unversioned tags.

That rule (never use latest) is a hard-won lesson from running things in production. It’s now baked into the workflow, which means the AI reads it on every single run.
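
To make the tool side concrete, here is a rough sketch of the kind of logic a version-lookup tool might contain. It is not the actual fetch_latest_version.py: the registry call is left out, the tag list is hard-coded sample data, and `pick_latest_stable` and `pinned_image` are hypothetical helper names.

```python
import re
from typing import Iterable, Optional, Tuple

# Accept only plain numeric tags such as "16.4" or "1.2.3".
# "latest", "16-alpine", "17.0-rc1" and similar are rejected.
STABLE_TAG = re.compile(r"^\d+(\.\d+){1,2}$")

def parse(tag: str) -> Tuple[int, ...]:
    """Turn '16.4' into (16, 4) so versions compare numerically, not as text."""
    return tuple(int(part) for part in tag.split("."))

def pick_latest_stable(tags: Iterable[str]) -> Optional[str]:
    """Return the highest stable tag, or None if there is none."""
    stable = [t for t in tags if STABLE_TAG.match(t)]
    return max(stable, key=parse) if stable else None

def pinned_image(image: str, tags: Iterable[str]) -> str:
    """Build a pinned image reference; fail loudly instead of using 'latest'."""
    version = pick_latest_stable(tags)
    if version is None:
        raise LookupError(f"no stable version found for {image}")
    return f"{image}:{version}"

# Made-up sample of what a registry might list for one image.
sample_tags = ["latest", "16.4", "16.10", "15.8", "17.0-rc1", "16-alpine"]
print(pinned_image("postgres", sample_tags))
```

Two details are easy to get wrong by hand: comparing versions numerically (so 16.10 ranks above 16.4) and refusing to emit anything unpinned. Both are exactly the sort of rule a script enforces every time and an LLM enforces most of the time.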

Other tools in my setup:

  • gitea_ops.py: push files to Gitea and open PRs in one call
  • semaphore_ops.py: trigger Ansible playbooks via Semaphore with the right template names pre-configured
  • query_gitea.py: read-only repo inspection before pushing

Putting it all together. Here’s what the deploy workflow does when I say “deploy Immich”:

Phase 1: Asks any missing questions
Phase 2: Reads official Immich docs (mandatory, non-negotiable)
Phase 3: Proposes a deployment plan, waits for my approval
Phase 4: Invokes docker-compose-crafter agent →
           Fetches latest Immich version via tool
           Generates compose.yaml with pinned versions
           Validates against official docs
           Pushes to Gitea via tool, opens PR
Phase 5: Invokes provision_lxc workflow →
           Generates Terraform HCL
           Pushes to terraform-proxmox repo
           Triggers Semaphore via tool:
             Plan → Apply → Linux Docker → Deploy Compose
Phase 6: Invokes homelab-docs-writer agent →
           Writes deployment documentation

All of that from one instruction. And every phase has a corresponding workflow file defining the exact steps, the tools to call, the edge cases to handle.

Here’s a quick demo. This one shows the system handling an update to an existing container rather than a fresh deploy, but you’ll see how the workflow, agent, and tools fit together:

Why I Like WAT

  • Easy to create. Same as a skill. Just describe what you want.
  • Easy to handle complex tasks. This is where it really shines over skills.
  • Tools are reusable. A tool that lives outside a skill can be shared across multiple workflows.
  • Scales well. Adding new workflows is cheap.
  • Wide compatibility with LLMs. Skills depend on whether the CLI supports them. WAT is more flexible here, because the instructions are just plain text.

But It’s Not a Silver Bullet

  • Upfront setup before you see value. You need to build the workflows before the framework starts paying off.
  • Workflow discoverability isn’t always perfect. Skills are easier in this regard because Claude auto-detects them.
  • Overkill for one-off tasks. If you only need to do something once, just do it.
  • New team members need onboarding. It is a framework, and frameworks need explanation.

Skills vs WAT in Practice

I didn’t start with WAT. I started with a skill.

My original “deploy homelab app” skill worked. For a while. I wrote it to handle the basics: research, generate a compose file, deploy. It was great for simple apps.

Then I started adding cases. GPU passthrough. NPM-based apps that don’t use Docker. VMs instead of containers. Infrastructure provisioning via Terraform. Ansible playbooks. Documentation generation.

The skill file became a monster. Hundreds of lines of conditional logic crammed into one markdown file. The AI would get confused halfway through complex deployments. Edge cases broke things. I was spending more time debugging the skill than it saved me.

So I migrated to WAT. And here’s the thing: nothing was thrown away. The skill became the first workflow. The logic I’d written became the basis for deploy_app.md. The concepts transferred directly. I just split them across the three layers they always should have occupied.

The result is dramatically easier to manage. Each workflow is focused on one thing. Tools are reusable across workflows. When something breaks, I know exactly which layer to look at. And the system has kept growing (workflows for VM provisioning, Ansible playbooks, compose file generation, and more) without any of the complexity pain I had with the skill.

Start small, think big. Use a skill to prove the concept. Migrate when it hurts. That’s the path.

Side by Side: The Comparison Table

If you want the same picture in table form:

                    Skills                     WAT
Setup cost          Minutes                    Hours (upfront)
Best for            Focused, stable tasks      Complex, multi-step workflows
Composability       Hard to chain              Designed for it
Scalability         Hits a ceiling             Scales well
Discoverability     Auto-detected by Claude    Needs good workflow names
LLM compatibility   Depends on CLI support     Wide, instructions are portable
Team onboarding     Share a zip file           Needs some explanation
Reusability         Self-contained per skill   Tools shared across workflows

The decision tree is actually simple:

  • Quick win? One-off task? No tooling needed? → Use Skills
  • Multi-step workflow? Needs to scale? Team will build on it? → Use WAT
  • Still not sure? → Start with Skills. Easy to build, no setup required. Experiment, and when it starts feeling too complex, migrate to WAT. That is exactly the path I took.

Together is Better

One more thing worth knowing: you don’t have to choose. WAT and Skills can work together, and they work really well.

Use WAT as the base framework, the orchestration layer that ties it all together. Use Skills as tools inside it. You get rich, reusable tools that go beyond a plain script (well documented, more robust, shareable on their own) plus the structured orchestration that scales.

This is exactly the direction I’m taking with my homelab. WAT still handles the orchestration, and I’m gradually pulling out the more reusable pieces into proper skills, the kind I can share, version, and drop into other projects without rebuilding from scratch. So far it’s working well.

Honest Lessons

What worked, what didn’t, and what to watch out for.

AI is Not Always the Answer

Whatever you choose, remember this: AI is not always the answer.

Use AI where judgment matters. But deterministic tasks deserve deterministic tools. Nothing beats a well-written script for a purely deterministic task.

My CLAUDE.md has a whole section on web research tool priority. It starts with a Python script for Docker version lookups (under 1 second, always accurate), escalates to WebFetch for specific pages (3 seconds), then WebSearch for finding URLs (2 seconds), and only falls back to an AI research agent as a last resort (30 to 70 seconds, and the results need verification anyway).
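
That priority order is essentially a fallback chain: try the cheapest, most reliable source first and escalate only on failure. A rough sketch, with stub functions standing in for the real tools (every name here is hypothetical, and the stubs only simulate success or failure):

```python
from typing import Callable, List, Optional, Tuple

def script_lookup(query: str) -> Optional[str]:
    """Stub for the fast, deterministic script; None means 'no answer'."""
    known = {"postgres": "16.4"}  # sample data
    return known.get(query)

def web_fetch(query: str) -> Optional[str]:
    """Stub for fetching a specific page; simulated as failing here."""
    return None

def web_search(query: str) -> Optional[str]:
    """Stub for searching for a URL; simulated as failing here."""
    return None

def research_agent(query: str) -> Optional[str]:
    """Stub for the slow AI fallback; its answer still needs verification."""
    return f"unverified answer about {query}"

# Ordered from cheapest and most reliable to slowest and least reliable.
CHAIN: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("script", script_lookup),
    ("web_fetch", web_fetch),
    ("web_search", web_search),
    ("research_agent", research_agent),
]

def research(query: str) -> Tuple[str, str]:
    """Return (source, answer) from the first source that produces a result."""
    for source, lookup in CHAIN:
        answer = lookup(query)
        if answer is not None:
            return source, answer
    raise LookupError(f"no source could answer: {query}")

print(research("postgres"))
print(research("immich"))
```

Returning the source alongside the answer matters: anything that comes back tagged as the research agent is a hint that a human (or another tool) should check it.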

The principle: use AI where judgment matters. For everything else, write a script.

About These Slides

Quick aside, since this post is the written version of a talk I gave at AI Heroes 2026: yes, I also used AI to help build the slide deck for that talk.

The result was a bit… meh.

It works well for content. AI is genuinely good at structuring an argument, writing speaker notes, suggesting flow. But for everything else? Layout, alignment, visual hierarchy? It struggled. Maybe it was the tool I picked (I tried generating slides via the Google Workspace CLI). Maybe there were better options. The point is the result was not what I expected.

Which leads to the most important lesson of all.

Stay in the Loop

You are the quality gate of everything you do.

Automations can break. They will break. Plan for it.
AI can make mistakes. You are the last line of defense.

You need to understand what AI is doing well enough to catch it when it goes wrong. The self-improvement loop in WAT exists precisely because failure is expected. When something breaks, you fix the tool, update the workflow, and the system gets stronger. But that loop requires a human in it.

Stay curious. Stay sharp. The best asset in the loop? Still you.

Let’s Teach AI How to Replace Us

One more thing before we get to that. Yes, I am also using AI extensively at work. But the examples I shared in this post are not from the office. They are from my homelab, from my blog, from the things I do in my own time.

Why? Because work is the obvious use case. But that is not really the point. Life is where it gets interesting. You have the freedom to experiment when the stakes are low and the project is yours. Apply AI to whatever matters most to you, the weird personal projects, the things you do for fun. And you will find many ways to bring that knowledge back to your work. Faster than you expect.

So how do you avoid being replaced by AI? Simple. Teach it.

You can pretend the change isn’t happening, but it will hit you anyway. The only real lever you have is to engage with it, so you get to shape what it does for you instead of letting it shape what happens to you.

My recommended path: start small with a skill. Pick something you do over and over and write it down once. When that skill starts feeling cramped, that is your signal to move to WAT. One workflow, one tool, then another, and the whole thing gets a little better every time you use it. Stay in the loop while it runs. Stay curious about what it gets wrong. The specifics will keep changing, but the habit, encoding what you know so AI can use it, is what compounds.

If you want to dig deeper into WAT, the AI Automation Society community is where the framework comes from. Credit where credit is due. And if you want to keep the conversation going, come say hi on X / Twitter.

I hope you liked this post, and thanks for reading this far! 🙏