Give your agent a mirror

June 1, 2026

by Ted Spare

Experiment

Give your agent a mirror

A skill for long-running agents that iteratively improve their own work

The loop we keep rebuilding

When we colocate a coding agent with the code it ships, the bottleneck stops being writing software and starts being judging it. A one-shot agent will happily hand you a landing page that looks like purple slop, a React component that takes two seconds to filter, or an onboarding flow that dead-ends on step three — and then call it done.

The fix isn't a better prompt. It's a better loop. We want a long-running agent (Pi, OpenClaw, Hermes Agent — anything that can run for a while and call tools) to do what a good engineer does: ship a draft, look at it, write down what's wrong, fix the worst thing, and look again.

That behavior fits in one skill.md. This post builds it up across three examples of increasing difficulty, starting from the most naive version imaginable.

Here's where we begin:

---
name: iterate-ui
description: Build a UI, screenshot it, critique the screenshot, and improve it.
---
 
# Iterate on UI
 
1. Build the feature.
2. Run the dev server and take a screenshot.
3. Look at the screenshot and list what looks bad.
4. Fix the worst issues.
5. Repeat until it looks good.

That's it. Surprisingly, it almost works.

Example 1 — The purple-hell landing page

We pointed Claude Code at a deliberately lazy prompt — "build a mobile landing page for Cadence, a habit tracker for teams" — and let the skill above run. Iteration 0 is exactly the slop you'd expect: a centered purple gradient, a generic headline, emoji bullets, no hierarchy, no trust signals.

Landing page iteration 0 — generic purple slopLanding page mid iterationLanding page final, polished
Left: the one-shot baseline. Middle: after the agent critiques contrast and hierarchy. Right: the final pass — restrained palette, real type scale, a clear CTA, and social proof.

Just by screenshotting its own work and being asked "what looks bad?", the agent walked itself out of slop. But the loop was fragile. It would declare victory early, fixate on one detail, or "improve" something into a regression — because "until it looks good" has no definition. The skill needs a rubric and a budget.

---
name: iterate-ui
description: Build a UI, then critique and improve it against an explicit rubric for a fixed number of passes.
---
 
# Iterate on UI
 
1. Build the feature. Run it and screenshot at a fixed viewport (390×844 for mobile).
2. Score the screenshot 1–5 on each: visual hierarchy, color/contrast, type scale,
   spacing, CTA clarity, trust signals, "does this look generic/AI-made?".
3. Fix only the **two lowest-scoring** items. Re-screenshot at the same viewport.
4. Stop after 4 passes, or when nothing scores below 4. Keep a changelog.

Example 2 — Make it fast, keep it correct

Looks are subjective; speed isn't. We one-shotted a "Transactions dashboard" React component rendering ~10,000 rows with filtering, sorting, and a live summary — built naively, with derived data recomputed on every keystroke and every row mounted at once.

The visual rubric is useless here. Performance work needs a measurable objective and a guardrail: the component has to get faster and stay correct. So before optimizing, the agent writes a behavior test suite (filter results, sort order, summary totals) that must stay green through every change, and a browser benchmark that measures mount time and filter latency.

Iter 0 — naiveall 10k rows mounted, derived data recomputed each keystroke
mount
328 ms
filter
299 ms
Iter 1memoized derived data + debounced filter
mount
70 ms
filter
42 ms
Iter 2windowed list — render only visible rows
mount
25 ms
filter
3 ms
Real measurements across iterations. Each pass changed one thing, re-ran the benchmark, and kept the win only if the unit tests stayed green.

The skill grows two requirements — a target you can measure, and a regression gate:

# Iterate on performance
 
1. Define the metric (e.g. p50 filter latency in ms) and write a benchmark that prints it.
2. Write/keep a test suite that pins correct behavior. It must pass before and after every change.
3. Each pass: form one hypothesis, change one thing, re-run tests + benchmark.
   Keep the change only if tests pass AND the metric improved. Otherwise revert.
4. Stop when you hit the target or gains flatten. Log before/after numbers each pass.

Example 3 — Walk the onboarding like a user

The hardest case is a flow, not a screen. A single screenshot can't tell you that step three has no back button or that a validation error is a dead end. The agent has to use the product.

So we had Claude Code scaffold a five-step onboarding for Cadence, then drive it with computer use — actually filling fields, clicking through, and recording the run as a GIF. Where the agent hesitated or got stuck is exactly where a real user would.

Onboarding walkthrough iteration 0Onboarding walkthrough iteration 1Onboarding walkthrough iteration 2 — smooth
Three automated walkthroughs. The flow gains a progress indicator, inline validation, sensible defaults, and skippable steps as the agent removes the friction it kept hitting.

The final lesson for the skill: when the artifact is interactive, the success metric is "can the agent complete the task without confusion?" — so it has to exercise the feature end-to-end, not glance at it.

The skill

Put the three lessons together — explicit rubric, measurable objective with a regression gate, and exercising the artifact like a user — and you get one compact, copyable skill that works across UI, performance, and flows.

---
name: iterate
description: Improve your own work across iterations. After building anything, define how to judge it, measure the current state, fix the highest-impact problem, and re-measure — until it meets the bar or gains flatten.
when_to_use: After producing any artifact (a UI, a component, an API, a flow) where the first attempt is unlikely to be the best one.
---
 
# Iterate
 
The first version is a draft. Your job is the loop that follows.
 
## 1. Define the bar
Before improving anything, write down how this artifact will be judged:
- **Visual** → a 1–5 rubric (hierarchy, contrast, type scale, spacing, CTA clarity,
  trust signals, "does it look generic?"). Always screenshot at a fixed viewport.
- **Performance** → one measurable metric (e.g. p50 latency in ms) plus a benchmark
  that prints it, and a test suite that pins correct behavior.
- **Flow / interactive** → a concrete task to complete; success = the artifact can be
  used end-to-end without confusion. Drive it with browser/computer use, not a glance.
 
## 2. Measure the current state
Run it and capture evidence (screenshot, benchmark number, walkthrough recording).
Write down the top problems, ranked by impact.
 
## 3. Change one thing
Fix only the highest-impact problem this pass. Don't refactor opportunistically.
 
## 4. Re-measure and gate
Re-capture the same evidence at the same settings.
- Keep the change only if it improved the bar AND broke no guardrail (tests stay green).
- Otherwise revert and try the next hypothesis.
 
## 5. Stop deliberately
Stop when you meet the bar, run out of budget (cap the passes), or gains flatten.
Keep a short changelog: what was wrong, what you changed, before → after.

Closing

None of the three runs needed a smarter model — they needed a model that looks at its own output before declaring done. A screenshot, a benchmark, a walkthrough: cheap mirrors that turn one-shot generation into something closer to craft. Drop the skill above next to a long-running agent and let it grade its own homework.

Don't take our word for it — compare all three iterations live: read the landing pages, type into the dashboards to feel the speedup, and click through the onboarding flows yourself.

Acknowledgments

Thanks to the team for letting an agent redesign a landing page live, twice.

If this sparked an idea for your roadmap, let's talk.

Rubric is an applied AI lab helping teams design and ship intelligent products.