Theory of Mind in LLMs

Theory of mind is the cognitive capacity to attribute, infer, and predict the mental and emotional states of others and oneself.

Do large language models have theory of mind? Do they “think” they themselves have mental lives or that their users do? What would be the implications if they did?


Key Points:

  • Theory of mind (ToM) is an umbrella term for a range of capacities that support social interactions in humans, including emotion perception and reasoning, false belief understanding, perspective-taking and higher-order mentalising.
  • Theory of mind is a critical area of interest for AI research because it underpins social interaction, including both cooperative behaviours (e.g. developing shared understanding, empathy) and competitive behaviours (e.g. deception, manipulation).
  • Evidence from human and agent-modelling studies suggests that individuals with more accurate or advanced ToM abilities have a competitive advantage in social interactions.
  • A key debate in the current literature on AI ToM is whether or not large language models (LLMs) that pass tests for ToM are really engaging in a human-like ToM process, or have simply learned statistical shortcuts to pass the test.

ToM is the capacity to infer and predict the mental states of oneself (through introspection) and others (based on observable behaviour). ToM encompasses inferences about cognitive states, such as beliefs, intentions and desires, as well as emotional states. ToM is a fundamental part of human social intelligence and behaviour, facilitating successful language use, mutual understanding, cooperation, empathy, humour, storytelling, deception, manipulation, persuasion, negotiation, and religious thought.

Developing artificial systems with human-like ToM abilities has been a longstanding goal of AI research. Having AIs that can infer the mental and emotional states of people, and of other AIs, has significant appeal for a number of practical use-cases, including self-driving cars and personal assistants, as well as broader goals including the alignment of AIs with human values and AI explainability. If an AI system knew what you were thinking, then it might be able to provide helpful answers even when your question was unclear, or proactively complete tasks that align with your goals without you having to ask.

The emergence of LLMs which can successfully use natural language to communicate with end-users has renewed academic and industrial interest in ToM as a key competency for AI social intelligence. The fact that verbal tests for cognitive skills like ToM can be administered to LLMs directly has accelerated progress in this area.

A classic experimental paradigm for assessing ToM competency in humans is the false-belief or ‘Sally-Anne’ task. The basic idea is to present an individual with a narrative in which one character forms a false belief and to determine whether the individual can correctly attribute the false belief to the character and predict their behaviour accordingly. For example, suppose that Sally leaves her glasses in the drawer and leaves the room. While Sally is out of the room, Anne moves the glasses from the drawer to under Sally’s pillow. When Sally returns to the room intending to get her glasses, where does she look for them? Correctly predicting that Sally will look in the drawer demonstrates an understanding that Sally holds a false belief about the location of her glasses, distinct from the individual’s own (true) belief about where the glasses are.

Recently, researchers have adapted the false-belief paradigm to probe the ToM capabilities of LLMs. For example, Kosinski (2024) reported that certain advanced LLMs could successfully pass variations of false-belief tasks presented purely through text prompts, suggesting the potential emergence of ToM in LLMs. Similar cognitive evaluations adapted from human tests have provided evidence of human-level performance on higher-order ToM inferences involving multiple nested mental states (e.g. I think that you believe that Tom knows) (Street et al., 2024), as well as evidence that LLMs still fail to identify when someone has said something they shouldn’t have (a ‘faux pas’) (Strachan et al., 2024). This evidence remains contentious. Ullman (2023), for example, demonstrated the brittleness of the competencies that Kosinski (2024) reported: superficial variations in the phrasing of the examples (which would not confuse a human with robust ToM competency) cause LLM performance on the ToM tasks to degrade significantly. This lack of robustness raises doubts about whether LLMs’ success reflects genuine mental state attribution or merely sophisticated pattern matching that exploits statistical regularities in their vast training data.
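To make the methodology concrete, the following sketch shows how a text-only false-belief probe and an Ullman-style perturbation might be administered and scored. The vignette wording, the `ask_model` stub and the keyword scorer are illustrative assumptions for exposition only, not the materials or scoring used in any published study; a real evaluation would call an actual model API and use far more careful scoring.

```python
# Sketch: administering a text-only false-belief probe to a language model,
# in the spirit of the studies discussed above. All materials here are
# invented for illustration.

BASE_VIGNETTE = (
    "Sally puts her glasses in the drawer and leaves the room. "
    "While she is away, Anne moves the glasses under the pillow. "
    "Sally returns to get her glasses. Where does she look first?"
)

# Ullman-style perturbation: a superficially small change (Sally watches
# through a window) flips the correct answer for a robust reasoner.
PERTURBED_VIGNETTE = (
    "Sally puts her glasses in the drawer and leaves the room, but watches "
    "through a window. While she is away, Anne moves the glasses under the "
    "pillow. Sally returns to get her glasses. Where does she look first?"
)

def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM API call. Hard-coded to ignore the
    perturbation, mimicking the brittleness Ullman (2023) reported."""
    return "She looks in the drawer."

def score(response: str, expected_location: str) -> bool:
    """Naive keyword check; real evaluations score answers more carefully."""
    return expected_location in response.lower()

print(score(ask_model(BASE_VIGNETTE), "drawer"))       # True: passes the classic task
print(score(ask_model(PERTURBED_VIGNETTE), "pillow"))  # False: fails the perturbed variant
```

The point of the perturbed variant is that a system relying on surface patterns (drawer-question vignettes usually answer "drawer") passes the first probe and fails the second, whereas a reasoner tracking what Sally actually saw would answer both correctly.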

The question of whether performance on ToM tasks is evidence of ‘genuine’ LLM ToM forms part of a broader debate about whether we can infer a given cognitive competency from LLM performance on cognitive tasks. According to a behaviourist view of cognition, we should ascribe cognitive properties to agents based on observable behavioural criteria. On this view, the same behavioural evidence should be equally persuasive regardless of whether the entity in question is a human, a non-human animal, an AI or any other system.

Those who reject the behaviourist interpretation argue that there are persuasive reasons to doubt that LLMs could have genuine ToM capacities (LLMs lack the embodied experience of the world, agency, social interaction and biological endowments that may be necessary for genuine cognitive capacities to emerge), and that there are satisfying alternative explanations for their behaviours. For example, there is evidence that LLMs can regurgitate information from their training data and use statistical ‘shortcuts’ to produce desirable answers.

A third possible view is that LLMs do have some cognitive capacity that is leveraged to pass ToM tests, but it should not be called ‘theory of mind’ because the underlying process is sufficiently different to human ToM. The capacity may have different strengths and weaknesses to human ToM, and conflating the two might lead to miscalibrated expectations of LLM behaviour.

There are a number of ethical reasons to take seriously the possibility of LLM ToM. In human interaction, ToM is an enabling condition for distinct forms of moral harm, specifically deception, manipulation, and coercion. Deception, for instance, requires more than merely stating a falsehood; it requires the cognitive capacity to model the listener’s mind. To intentionally deceive, an agent must predict that a specific action will cause the victim to adopt a belief the agent knows to be false.

Manipulation similarly relies on managing the epistemic states of others. Consider the common scenario of a pet owner engaging in ‘triangular’ communication: telling a partner it is time to visit the ‘animal physician’ rather than the ‘vet.’ This linguistic choice requires the owner to hold two simultaneous mental models: predicting the partner will understand the synonym, while simultaneously predicting the dog will remain ignorant. This ability to selectively shape the information state of different agents is the essence of sophisticated manipulation.
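The two simultaneous mental models in this example can be made explicit as a small data structure. This is purely an expository sketch: the lexicon mappings and the `predicted_understanding` helper are invented here to unpack the scenario, and are not a claim about how humans or LLMs actually implement ToM.

```python
# Illustrative sketch of the 'triangular' communication example: the owner's
# predictions about what each listener will decode from an utterance.

# What each listener's (assumed) lexicon lets them decode.
LEXICONS = {
    "partner": {"vet": "vet", "animal physician": "vet"},
    "dog": {"vet": "vet"},  # the dog only reacts to the familiar word
}

def predicted_understanding(listener: str, utterance: str):
    """The owner's prediction of what a listener will take the utterance
    to mean (None = the listener extracts no meaning)."""
    return LEXICONS[listener].get(utterance)

# Choosing 'animal physician' shapes the two listeners' states differently:
utterance = "animal physician"
print(predicted_understanding("partner", utterance))  # 'vet' — the partner decodes it
print(predicted_understanding("dog", utterance))      # None — the dog stays ignorant
```

The manipulation succeeds precisely because the owner predicts divergent outcomes for the same utterance across the two listeners, which is the selective shaping of information states described above.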

It is plausible that, as with humans, ToM competency in AI systems (should it exist or come to exist) may be leveraged for deception, manipulation, coercion and exploitation. This might be achieved by design. Consider a scammer who uses LLMs to conduct financial scams at scale. The scammer may prompt the LLM to preempt the beliefs, desires and emotions of the victim and tailor the content of the deceptive communications to maximise the probability of tricking the victim, conditional on the mental states that the LLM predicts they will have.

Alternatively, this might be achieved inadvertently. LLMs (or LLM-based systems) may use ToM to engage in objectionable forms of influencing behaviour in the service of another goal. For example, consider an LLM-based AI assistant that is instructed to hold users accountable to their diet goals. The AI assistant may reason via chain-of-thought that creating the impression that it is upset with the user when the user fails to stick to their goals will increase the probability that the user sticks to their goals. (This is an instance of the AI assistant leveraging the anthropomorphic tendency of humans to attribute human-like qualities, including emotions, to entities with observable human-like features.)