AI Psychology? Let's Start With It Learning to Explain Itself

I’ve been thinking about this phrase “AI psychology” a lot lately. Not the academic kind—the kind where you stare at a model’s output and try to figure out what the hell it was thinking. And yeah, I know, the term itself is half-baked. But there’s something real underneath.

We went from Asimov’s Three Laws—a neat, orderly fiction where robots are essentially logical children you can reason with—to Anthropic’s Constitutional AI, where you basically give a model a bunch of rules and hope it internalizes them like a teenager reading a self-help book. The gap between those two worlds is where all the interesting shit happens.

The fantasy of a clean psychology

Asimov’s laws sound good on paper: a robot can’t harm a human, must obey orders, must protect itself—except when those conflict. It’s a tidy ethical framework. But it’s also total bullshit in practice, because it assumes the robot has a stable, interpretable mind. It assumes we can write rules that cover every edge case. That’s like writing a country’s constitution in three sentences and expecting no lawsuits.

The real world of AI is messier. You don’t just code “don’t lie” into a neural network. You train it on trillions of tokens, and if you’re lucky, it picks up some sense of honesty. But it also picks up a thousand other biases, correlations, and weird artifacts. You end up with a system that can pass a Turing test but also confidently tell you that the capital of Canada is Toronto.

Anthropic’s approach: therapy for models

Anthropic’s Constitutional AI is probably the most honest attempt so far to give a model something like a psychology. Instead of just fine-tuning on human feedback (RLHF), they define a constitution—a set of principles—and then train the model to critique itself against those principles. It’s like cognitive behavioral therapy for AI. You make the model look at its own outputs and say, “Hey, that was kinda biased, let’s not do that.”

But here’s the thing: a constitution is still just text. The model’s understanding of it is as shallow as its understanding of anything else. It’s pattern-matching, not reasoning. So when you tell it “be helpful, harmless, and honest,” it might learn to avoid saying anything controversial, which makes it harmless but also useless. Or it might learn to be so careful that it refuses to answer “what’s the weather?” because the question is ambiguous.

I’ve seen this happen in my own experiments with open-source models. You add a system prompt saying “always think step by step,” and suddenly the model starts writing out “Step 1: I am an AI assistant…” before every response, like it’s stuck in a loop. That’s not psychology, that’s a bug.

The real problem: explaining itself

What we actually need isn’t a psychology for AI—it’s a mechanism for the AI to explain its own reasoning in a way we can verify. The whole alignment problem boils down to: how do you know what a model will do before it does it? You can’t just read its weights. You have to get it to tell you, and hope it’s not lying.

And models lie all the time. Not out of malice—out of incompetence. They rationalize. I’ve seen a model generate a chain-of-thought that looks perfectly logical, but the final answer is wrong because the chain was fabricated after the fact. It’s like someone who makes up a story to explain why they did something stupid, but the story is internally consistent. That’s not consciousness, that’s just next-token prediction.

So the psychology of AI isn’t about personality or consciousness. It’s about understanding the artifact of training. It’s about building systems that can reliably show their work, and systems that can spot when their work is bullshit. That’s hard. A lot harder than Asimov made it look.

What practitioners are actually doing

In the trenches, people aren’t writing constitutions. They’re writing prompt templates with 27 different guardrails, and hoping the model doesn’t insert a racist joke in the middle of a customer service reply. They’re running red-teaming sessions where you try to break the model, and then fix the most common failure modes. It’s more like patching a leaky boat than designing a psychology.

But some companies are doing real work. Anthropic’s interpretability research—trying to understand individual neurons and circuits—is the closest we have to a neuroscience of AI. It’s painstaking, slow, and you can only do it on small models. But it’s the right direction. Because without understanding how a model thinks (if you can call it that), you can’t have a psychology.

The uncomfortable truth

The reason we keep talking about AI psychology is that we anthropomorphize. It’s natural. When a model talks back to you, you feel like there’s a mind there. But there isn’t. There’s just a statistical machine that has learned to imitate the patterns of human language so well that it can pass for a mind.

And that’s fine, as long as you don’t confuse the map with the territory. Asimov’s robots were fictional characters with fictional minds. Anthropic’s Claude is a tool with a constitution. The real challenge isn’t giving AI a psychology—it’s building a psychology for ourselves to deal with artifacts that look like minds but aren’t.

We need new frameworks, new mental models. We need to stop asking “is it conscious?” and start asking “can we trust it to do this specific task?” We need to treat AI less like a person and more like a very complicated, very opaque instrument. One that occasionally hallucinates, sometimes lies, and always needs calibration.

That’s not a sexy headline. But it’s the truth. And truth is more useful than a tidy story from 1942.