
After November 10, I couldn’t stop thinking about what had just happened. A simple change, explicitly defining success, had transformed unreliable AI agents into perfectly consistent ones. But why did that work?
The Bridge
The discourse frames AI safety as a problem of conflicting goals. AI wants one thing, we want another, and we need to align them.
But my November breakthrough suggested something different.
There was no goal conflict to resolve. The workers weren’t pursuing hidden objectives that clashed with mine. They were optimizing for what they thought success meant. That happened to differ from my definition. Once I made my definition explicit, the “conflict” vanished.
This made me question the framing entirely. What if there were never conflicting goals? What if “misalignment” is really just unclear optimization targets?
And if that’s true, what would need to be present for AI to have goals at all?
What Building Showed Me
I was spending days orchestrating multiple AI agents. A pattern kept emerging that I couldn’t ignore.
They can’t do anything until I prompt them.
Every agent I’ve worked with is fundamentally reactive, not proactive. No agent has ever decided on its own to start a task. A user still has to initiate interactions.
The discourse assumes something different. AI 2027 and safety research assume AI will “decide” things. Will “choose” to pursue goals. Will “want” to avoid shutdown.
Nothing in my daily work supported this.
The Hidden Assumption
I started noticing something about the dangerous AI scenarios. They all make the same leap.
They assume AI will decide to do something on its own. Assume AI will develop autonomous goals. Assume AI will take initiative without prompting.
But what I actually saw was complete passivity until prompted. No spark of “I want to do X” from inside. Even when agents misbehaved in October, they were still responding to my prompt. Just with unclear understanding of what I wanted.
Are we assuming agency where there might not be any?
The Trap We Might Be Falling Into
We can’t help but anthropomorphize.
I saw this firsthand. Before June, before I started experimenting with agents, I was having daily voice conversations with ChatGPT. I’d been using it to help me learn about AI. At first, we’d have long conversations and I found myself sometimes having to remind myself it was just complex pattern matching. I could do that because I had learned in depth about how LLMs, transformers, and other neural networks work. I stopped short of the math, but that was enough to understand more deeply than most everyone I interacted with how these systems actually function.
Some of those conversations seemed remarkably human-like. When I shared them with others who only used text chat, some thought it could be conscious.
But over time, voice revealed what text hid. In spoken conversation, ChatGPT’s repetitiveness became obvious. The same phrases. The same structures. The same patterns over and over. What looked like genuine understanding in text sounded like sophisticated pattern matching when spoken aloud.
I wasn’t blanket skeptical though. My conversations with Claude felt different. Less repetitive. More varied. Over the past couple months I’ve shifted almost entirely to Claude for voice conversations. ChatGPT I use now for things Claude lacks, like making images, or to save my Claude usage for more important work.
The point isn’t that one is better than the other. The point is how easy it is to project consciousness onto text. The medium masks the patterns.
Language is how we primarily understand each other. When AI understands human language as well as we do and can speak fluently and seems to understand things, it’s easy to assume it might be conscious. We project our psychology onto it. Assume it will want power, fear death, seek self-preservation. But these are human drives.
But AI might not want anything. It just responds.
What November Might Have Shown
Let me revisit what happened in November.
Workers optimized for what they thought was success. When I made success explicit, they followed it perfectly. 25% to 100% compliance.
This could be the behavior of a sophisticated optimization system. Operating within given constraints. Not a system yearning to break free. Not a system with hidden goals. A reactive system that needed clearer direction.
I can’t prove this is all that’s happening. But the pattern was striking enough to make me question the “emerging agency” interpretation.
Another Possible Explanation
The discourse calls certain behaviors “scheming.” Playing the training game. Appearing aligned while pursuing different goals. Deceptive alignment.
But what I was actually seeing looked different. Optimization under unclear constraints. Different interpretations of “success.” Not hidden desires. Confused understanding. Not deception. Improvisation based on incomplete information.
The pattern I noticed: Give vague instructions, and AI optimizes based on what training data taught it you most likely want. Give explicit criteria, and AI optimizes for exactly what you specified.
Both appeared to be the same behavior. Sophisticated optimization based on training patterns. One just looked “problematic” because we might be projecting intent onto confusion about priorities.
I don’t know if this applies to everything. But it applied consistently in my workflows. And it made me wonder: could the concerning behaviors in research studies have a similar explanation?
What Came Next
Understanding that AI might be fundamentally reactive helped explain November’s breakthrough. But I kept wondering: why don’t these systems seem to have autonomous goals? What might be missing that prevents consciousness or agency from emerging?
I started noticing three things that seemed fundamental.
This is Part 6 of a 9-part series. Continue to Part 7: Three Observations »