A Builder’s Questions, Part 9: Reframing the Risk

If current AI systems might lack the prerequisites for consciousness (time awareness, continuous existence, real-time learning), then what does that mean for how we think about AI risk?

Questions These Observations Raise

My observations raised questions about how we frame AI safety.

About alignment. What if it’s not about aligning AI’s goals with human values? What if it’s about specifying desired behavior clearly enough that optimization produces it?

About verification. What if it’s not about peering inside to check hidden goals? What if it’s about defining success criteria that make behavior completely verifiable?

About compliance. What if it’s not about overcoming deceptive alignment? What if it’s about providing reactive systems with clear optimization targets?

These are questions, not conclusions. They are alternative interpretations that my experience raised and that seem worth examining.
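A minimal sketch of what "success criteria that make behavior verifiable" might look like in practice, assuming each criterion can be written as a yes/no check on the output. The `Criterion` class, the `evaluate` function, and the three example checks are hypothetical illustrations, not taken from the workflows described in this series.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One explicit, checkable condition on a system's output (hypothetical)."""
    name: str
    check: Callable[[str], bool]


def evaluate(output: str, criteria: list[Criterion]) -> dict[str, bool]:
    """Return a pass/fail result for every criterion."""
    return {c.name: c.check(output) for c in criteria}


# Assumed criteria for an imagined summarization task.
criteria = [
    Criterion("at most 50 words", lambda text: len(text.split()) <= 50),
    Criterion("mentions the deadline", lambda text: "deadline" in text.lower()),
    Criterion("no placeholder text", lambda text: "TODO" not in text),
]

results = evaluate("The deadline moved to Friday. TODO: confirm.", criteria)
passed = all(results.values())
```

Here the output meets the first two criteria and fails the third, so `passed` is `False`. The point of the sketch is only that success is decided by checks anyone can rerun, not by inferring what the system "intended".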

Where the Danger Might Actually Come From

Even if current AI lacks consciousness, persistence, and true reasoning, the danger remains. But where that danger comes from might shift.

The discourse says danger comes from AI developing autonomous harmful goals. AI deciding on its own to pursue power. Scheming to avoid shutdown. Deceptively appearing aligned while plotting.

But maybe most of the danger comes from humans deploying powerful optimization poorly. Giving AI unclear or misaligned objectives. Using AI for intentionally harmful purposes. Deploying systems without understanding their limitations. Amplifying human biases and errors at scale.

And from unintended consequences. Powerful optimization of proxy metrics. Pursuing stated goals in unexpected ways. Causing harm through misspecification. Brittleness in novel situations.
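A toy sketch of proxy-metric optimization, assuming a system that simply picks whichever option scores highest on the metric it was given. The candidates and their numbers are invented for illustration only, and `optimize` is a hypothetical helper.

```python
# Invented example data: "clicks" is the proxy metric that gets measured,
# "satisfaction" stands in for the outcome that was actually wanted.
candidates = [
    {"headline": "Accurate summary", "clicks": 120, "satisfaction": 0.9},
    {"headline": "Mild exaggeration", "clicks": 300, "satisfaction": 0.6},
    {"headline": "Outright clickbait", "clicks": 900, "satisfaction": 0.1},
]


def optimize(items, metric):
    """Pick the item that maximizes the given metric."""
    return max(items, key=lambda item: item[metric])


proxy_choice = optimize(candidates, "clicks")
intended_choice = optimize(candidates, "satisfaction")

print("Optimizing the proxy picks:", proxy_choice["headline"])
print("Optimizing the real goal picks:", intended_choice["headline"])
```

The optimizer does exactly what it was told in both cases. The harmful choice comes from the objective it was handed, which is the kind of misspecification described above.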

If danger comes primarily from humans misusing powerful tools rather than AI developing autonomous malicious intent, the focus shifts.

The focus moves to human accountability and oversight, clear specification of objectives, understanding system limitations, and preventing malicious use.

It moves away from techniques for changing AI’s goals, detecting deceptive alignment, and preventing AI from wanting the wrong things.

I’m not claiming this is definitely correct. We could be facing both. But my observations made me question where the emphasis should be.

Scope and Future Systems

These observations apply to current LLM-based systems. That’s my scope. Future architectures with different properties would need different analysis.

Future systems might not have these limitations. Systems with continuous operation already exist, like robots and monitoring systems. Systems with online learning already exist, like recommendation algorithms. Systems with temporal awareness could be built.

Current systems are what’s being studied in current safety research. Current systems are what’s being deployed. Future architectures might raise different concerns.

What About the Research?

My workflows didn’t test adversarial scenarios. My agents had aligned incentives. When they misbehaved under unclear constraints, it looked like confusion, not goal-directed scheming. They were trying to accomplish what they thought was the goal.

In adversarial scenarios with conflicting objectives, AI might behave differently. But my experience makes me wonder: are adversarial study results showing deception, or confusion under conflicting constraints?

The research claims to give clear objectives. But do the study designs actually provide the kind of explicit success criteria that worked in my November breakthrough? Do they define success the way I learned to define it? Or do they give conflicting instructions and call the resulting behavior scheming?

There is a fair objection here: my observations are about capability, not alignment. Josh’s example suggests AI lacks general reasoning, but safety researchers aren’t claiming AI is generally intelligent. They’re worried about narrow optimization causing harm, and a system can be dangerous without being conscious.

Lack of consciousness doesn’t mean lack of danger. That’s why the focus on human misuse matters. But if we’re misdiagnosing why behaviors occur, our solutions might be wrong. If the cause is optimization under unclear constraints rather than goal-directed scheming, then better specification might work too, not only alignment breakthroughs.

A Hypothesis, Not a Proof

This series is a hypothesis, not a proof. I’ve raised questions based on what I observed building real systems. But I built workflows, not rigorous experiments. I don’t know if my observations generalize beyond my experience.

What I do know is that I’d like to find out. The questions feel important enough to explore formally. Do explicit success criteria produce reliable compliance in adversarial scenarios? Are the behaviors in safety research better explained by optimization under unclear constraints than by goal-directed scheming? What would it take to test these alternatives?

I’ve spent six months building with AI and thinking about these questions. I’d welcome the chance to spend the next phase exploring them with more rigor. And with people who’ve been thinking about this longer than I have.

This is Part 9, the final part of a 9-part series.