Friday, November 15, I was ready to test everything I’d built.
I had working patterns now. The second workflow proved it—one day instead of three weeks. So I built a third workflow. Then a fourth. All following the same patterns: flat coordination, workers with clear responsibilities, verification loops.
Time to see if they actually worked on real codebases.
The First Real Test
I launched the first workflow on an actual repository. Not the test files I’d been using. A real codebase with real problems.
And I waited.
And waited.
Twenty minutes later, it finished. Results looked okay. Some issues fixed. Some missed. Hard to tell if it was working correctly or not.
I ran it again on the same repo. This time: 25 minutes.
Wait, what?
The Timing Problem
Same workflow. Same repository. Different execution times. Not slightly different, either: across four runs, the slowest took two-thirds longer than the fastest.
I started keeping notes:
- Run 1: 20 minutes
- Run 2: 25 minutes
- Run 3: 15 minutes
- Run 4: 22 minutes
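To put a number on that spread, here is a minimal sketch that summarizes the four run times listed above. The figures come straight from the notes. The `summarize` helper and the choice of statistics are illustrative assumptions, not part of the workflows themselves.

```python
from statistics import mean, stdev

# Run times in minutes, copied from the notes above.
run_minutes = [20, 25, 15, 22]


def summarize(times):
    """Return basic spread statistics for a list of run durations.

    Hypothetical helper for illustration only.
    """
    avg = mean(times)
    spread = stdev(times)  # sample standard deviation
    return {
        "mean": avg,
        "min": min(times),
        "max": max(times),
        "range": max(times) - min(times),
        "stdev": spread,
        "cv": spread / avg,  # coefficient of variation: spread relative to the mean
        "slowest_over_fastest": max(times) / min(times),
    }


summary = summarize(run_minutes)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For these four runs the mean is 20.5 minutes, the slowest run took about 1.67 times as long as the fastest, and the standard deviation is roughly 20% of the mean. With only four samples this is a rough picture, but it is enough to show that a single run says little about typical duration.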
The inconsistency made evaluation impossible. Was the workflow actually working? Were the results correct? I couldn’t tell because I couldn’t even understand what was happening during execution.
The Evaluation Problem
Testing workflows is supposed to answer questions:
- Did it find all the issues?
- Did it fix them correctly?
- Did it miss anything?
- Are the results consistent?
But I couldn’t answer any of these. The workflows were black boxes. They’d run for an unpredictable amount of time, make some changes, and finish. I’d look at the results and have no idea if they were correct.
Issues were appearing. But were they real problems with the workflow? Or artifacts of the inconsistent execution? I couldn’t tell.
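A minimal sketch of how those questions could be turned into checks, assuming a test repository seeded with a known list of issues and a workflow that reports which issues it fixed. Neither assumption comes from my actual setup: the issue IDs, `evaluate_run`, and `consistent` are all hypothetical. Whether a fix is *correct* is not covered here; that would still need a test suite or a manual review.

```python
def evaluate_run(expected, fixed):
    """Compare the issues a run fixed against a known list of seeded issues.

    Hypothetical helper: assumes each issue has a stable ID.
    """
    expected, fixed = set(expected), set(fixed)
    return {
        "found_all": expected <= fixed,           # did it find all the issues?
        "missed": sorted(expected - fixed),       # did it miss anything?
        "unexpected": sorted(fixed - expected),   # changes nobody asked for
    }


def consistent(runs):
    """Return True if every run fixed exactly the same set of issues."""
    sets = [set(r) for r in runs]
    return all(s == sets[0] for s in sets)


# Hypothetical data: these issue IDs are made up for illustration.
expected_issues = {"unused-import", "missing-null-check", "typo-in-log"}
run_a = {"unused-import", "missing-null-check"}
run_b = {"unused-import", "typo-in-log", "renamed-variable"}

report_a = evaluate_run(expected_issues, run_a)
report_b = evaluate_run(expected_issues, run_b)

print(report_a)
print(report_b)
print("consistent:", consistent([run_a, run_b]))
```

The point of the known list is that it separates two failure modes: a run that misses a seeded issue, and two runs that disagree with each other. Without it, both look like "results I can't interpret."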
The Iteration Problem
Fast iteration requires fast feedback. You make a change, test it, see results, adjust.
But when each test run takes 15 to 25 minutes and gives inconsistent results, iteration grinds to a halt.
I’d make a change to a worker. Run the workflow. Wait 20 minutes. Look at results. And I couldn’t tell whether my change had helped or hurt, because the baseline itself was inconsistent.
This wasn’t testing. This was guessing.
Starting to Investigate
I started digging into what was actually happening during execution.
Read Claude’s output during runs. Watched which workers got invoked. Tried to understand the sequence of operations. Looked for patterns in when things were fast versus slow.
Nothing made sense. Sometimes workers would finish instantly. Sometimes the same worker on the same file would take minutes. No consistent pattern.
I needed to understand how agents actually execute. What happens when Claude invokes an agent? How does the Task tool work? Why would the same operation take vastly different amounts of time?
The Meta-Problem
Here’s what I realized: the workflows might be fine. The patterns might be sound. The architecture might be correct.
But I couldn’t evaluate any of that because the testing process itself was broken.
Before I could fix the workflows, I needed to fix how I was testing them. Before I could understand if results were correct, I needed to understand why execution was so inconsistent.
The bottleneck wasn’t the workflows. It was my ability to test them.
Friday Evening
Friday, November 15, I stopped trying to test workflows and started researching agent execution.
How do agents work? What affects their timing? Can execution be more predictable? Can I get visibility into what’s happening?
I had built workflows that theoretically worked. Now I needed to build a testing process that could actually tell me if they worked.
The patterns had scaled from one workflow to four. But the testing process hadn’t scaled at all. That was the real bottleneck.