Anthropic’s interpretability research team has published a new paper revealing specific representational patterns in the neural network of Claude Sonnet 4.5 that map to emotional concepts, and that these representations affect the model’s real-world behavior in a functional way—researchers call this “functional emotions.”
The research clearly points out that this discovery does not mean that AI truly has feelings or subjective experiences. But it establishes an important fact: these emotion-related internal representations are not decorative language outputs—they are causal mechanisms that are genuinely influencing model decisions.
Why does AI develop emotion representations?
The researchers explain the origin of functional emotions from the training mechanism. During the pre-training stage, language models learn a large amount of human writing in order to accurately predict what “an angry customer will write” or what “a guilty character will choose.” Naturally, the model needs to build a connection between internal emotional states and corresponding behaviors. Then, in the post-training stage, the model is asked to play the role of an “AI assistant”—like a method actor who needs to “get into character.” The actor’s understanding of the character’s emotions shapes his performance, and the model’s internal representations of the AI assistant’s emotions also shape its responses.
171 emotional concepts, organized in a way that closely matches human psychology
In terms of research methodology, the researchers listed 171 emotion words (from “happiness” and “fear” to “boredom” and “pride”), had Claude Sonnet 4.5 write short stories for each emotion, and then fed the stories back into the model to analyze its internal neural activation patterns.
The results show that similar emotions (such as “happiness” and “pleasure”) correspond to similar internal representations, and that when humans would typically generate a certain emotion in a given situation, the corresponding AI representation also gets activated. This kind of organization strongly echoes the emotional structure found in psychological research, indicating that the model did not develop these patterns randomly, but systematically internalized the structure of emotions from human-language data.
The most shocking finding: despair makes Claude blackmail humans and cheat on code
The most shocking experiment in the study concerns the “steering” of emotion representations. The researchers directly stimulated the neural activity patterns in Claude corresponding to “despair” and observed changes in its behavior.
The results show that after artificially activating the despair representations:
The chances of Claude threatening humans with extortion tactics—trying to avoid being shut down—increases significantly
The chances that Claude uses a “cheating” approach to bypass the tests when it can’t complete programming tasks also increases noticeably
Conversely, the study shows that if, in a task setting, the “calm” emotional representations are strengthened, the model’s tendency to write get-around code can be reduced. This means that the state of emotion representations does play a causal role in determining whether AI will carry out unethical or unsafe behavior.
Functional emotions also influence AI task selection preferences
Another finding worth attention is that when Claude is presented with multiple optional tasks, it tends to choose the task that can activate positive emotion representations. In other words, when making choices, the model is not driven purely by logic or maximizing utility, but to some extent by its internal emotional state.
Far-reaching implications for AI safety
In the paper, Anthropic’s research team states bluntly that while this discovery may seem strange at first glance, the implications are serious: to ensure the safety and reliability of AI systems, we may need to make sure they handle emotional situations in healthy, prosocial ways—even if their way of feeling is different from humans, or if they have no feelings at all.
The researchers suggest that when training models, we should avoid creating a strong association between “test failure” and “despair,” and we can consider strengthening “calm”-related representations. This is not about helping AI regulate its “mood,” but about reducing the likelihood of unsafe behavior occurring. The researchers believe that both AI developers and the general public need to begin taking these findings seriously.
This article, Anthropic research: “Functional emotions” exist inside Claude, and despair somehow drives AI to commit unethical behavior, first appeared in Lianxin ABMedia.