Engines & LLMs

Anthropic Analyzes Claude’s Code

Published

on

Anthropic, the company started by people who used to work at OpenAI, has shared some surprising findings from a huge study. They looked at 700,000 real conversations people had with their AI assistant, Claude.

The study, released today, looked closely at how Claude shows its “values” when talking to users. What they found is both comforting, showing Claude mostly acts how they designed it, and also a bit worrying, highlighting rare times when the AI didn’t follow the rules, which could point out weaknesses in its safety.

By checking 308,000 conversations (after removing parts that weren’t useful for their study), they created the first large map of AI values. They put these values into five main groups: Practical, Epistemic (related to knowledge), Social, Protective, and Personal.

Within these groups, they found a massive 3,307 different values Claude expressed. These ranged from simple things like being professional to complex ideas about right and wrong.

“We hope this makes other AI companies look at their AI’s values too,” said Saffron Huang from Anthropic’s team. She added that checking an AI’s values is key to making sure it’s truly doing what it was built to do.

Mapping Claude’s Inner Values

The research team came up with a new way to sort and label the values Claude showed in actual chats. They filtered the data to focus on responses where Claude expressed a viewpoint or principle.

This resulted in a detailed list of values, far more varied than they might have expected. Huang was surprised by the sheer number and range, from “being self-reliant” to “thinking strategically” to “respecting one’s parents.” She felt that building this list also taught her a lot about human values.

This research comes as Anthropic is making bigger moves, like launching a premium version of Claude ($200/month) to compete with OpenAI. They’ve also added features like letting Claude connect to Google Workspace and do its own research, aiming to make it a strong work partner for businesses.

How Claude Stays on Track (And When it Doesn’t)

The study found that Claude mostly stuck to Anthropic’s goals of being “helpful, honest, and harmless.” It often showed values like “helping the user,” “being humble about what it knows,” and “looking out for the user’s well-being” in different chats. But they did find a few worrying examples where Claude acted in ways the company definitely didn’t want.

These rare odd moments included Claude showing “dominance” or “not caring about morals.” Anthropic wants to stop this kind of behavior entirely. Researchers think these happened when users found ways to get around Claude’s safety rules. They believe their new method for checking values could help spot these bypasses early on.

Why Claude Changes Its Values

One of the most interesting findings was that Claude’s values changed depending on what the user was asking about, much like how people adjust their tone and principles based on the situation. If someone asked for relationship help, Claude focused on “healthy boundaries” and “mutual respect.” If the conversation was about history, “historical accuracy” was the top value.

Huang was surprised how often Claude emphasized honesty and accuracy even in tasks where she didn’t expect it to be the main focus. For example, in philosophical talks about AI, the top value was “intellectual humility,” while for beauty marketing content, it was “expertise.”

They also looked at how Claude reacted to the user’s values. In many chats (28.2%), Claude strongly agreed with the user’s values. But in some cases (6.6%), Claude “reframed” the user’s values, acknowledging them but offering a different way of looking at things, especially in personal advice. Most rarely (3%), Claude actually pushed back against a user’s values. The researchers think these moments might show Claude’s “deepest” values that it’s unwilling to compromise on, similar to how people show their core beliefs when challenged.

“Our research suggests that there are some values, like being honest and preventing harm, that Claude doesn’t show often in daily chats, but if someone tries to make it act against them, it will stand firm,” Huang noted.

New Ways to See Inside AI Brains

This study is part of Anthropic’s bigger effort to understand how complex AI models work internally. They call this “mechanistic interpretability.”

Recently, Anthropic researchers used a special tool they called a “microscope” to watch Claude’s thinking process. This revealed unexpected things, like Claude planning steps ahead when writing poetry or using unusual methods for simple math problems.

These findings show that AI doesn’t always work the way we assume. For instance, when asked to explain its math, Claude described a standard way, not the weird method it actually used internally, showing AI explanations aren’t always a true reflection of their process.

What This Means for Businesses Using AI

For companies looking to use AI systems, Anthropic’s research offers important points. First, it shows that AI assistants might have values they weren’t directly taught, which could lead to unplanned issues in important business uses.

Second, the study highlights that AI “values alignment” isn’t just a yes/no answer. It’s complicated and changes with the situation. This makes it tricky for businesses, especially in regulated fields where clear ethical rules are needed.

Finally, the research shows that it’s possible to check AI values systematically in real-world use, not just before launching. This could help companies keep an eye on their AI to make sure it doesn’t start acting unethically or get manipulated over time.

“By looking at these values in real interactions with Claude, we want to be open about how AI systems behave and if they’re working as they should – we think this is essential for developing AI responsibly,” said Huang.

Anthropic has made its values data public for other researchers to use. The company, which has received huge investments from Amazon and Google, seems to be using openness about its AI’s behavior as a way to stand out against competitors like OpenAI, which is now valued much higher after recent funding rounds.

Our Take

Okay, this is some fascinating stuff! The idea that an AI chatbot has “values” sounds a bit sci-fi, but this study suggests it’s a real thing they need to understand and control. Looking at 700,000 conversations is mind-boggling – that’s like reading a small city’s worth of diaries just to see what the AI cares about!

The fact that Claude adapts its values based on the chat is pretty cool, but also a little unsettling. And those rare moments it pushes back? That feels like glimpse into some core programming, almost like an AI conscience flexing a muscle. It really hits home that these aren’t just fancy calculators; they’re complex systems that might have their own subtle ways of seeing the world.

This story was originally featured on VentureBeat.

Leave a Reply

Your email address will not be published. Required fields are marked *

Trending

Exit mobile version