On the Biology of a Large Language Model
Jack Lindsey†,


Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g. ). These methods have uncovered representations of interpretable concepts – “features” – embedded within models’ internal activity. Just as cells form the building blocks of biological systems, we hypothesize that features form the basic units of computation inside models. (The analogy between features and cells shouldn’t be taken too literally. Cells are well-defined, whereas our notion of what exactly a “feature” is remains fuzzy, and is evolving with improvements to our tools.)


However, identifying these building blocks is not sufficient to understand the model; we need to know how they interact.
In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work (e.g. ) to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response. Attribution graphs generate hypotheses about the mechanisms used by the model, which we test and refine through follow-up perturbation experiments.

In this paper, we focus on applying attribution graphs to study a particular language model – Claude 3.5 Haiku, released in October 2024, which serves as Anthropic’s lightweight production model as of this writing. We investigate a wide range of phenomena. Many of these have been explored before (see § 16 Related Work), but our methods are able to offer additional insight, in the context of a frontier model:

Introductory Example: Multi-step Reasoning. We present a simple example where the model performs “two-hop” reasoning “in its head” to identify that “the capital of the state containing Dallas” is “Austin.” We can see and manipulate an internal step where the model represents “Texas”.
Planning in Poems. We discover that the model plans its outputs ahead of time when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.
Multilingual Circuits. We find the model uses a mixture of language-specific and abstract, language-independent circuits. The language-independent circuits are more prominent in Claude 3.5 Haiku than in a smaller, less capable model.
Addition. We highlight cases where the same addition circuitry generalizes between very different contexts.
Medical Diagnoses. We show an example in which the model identifies candidate diagnoses based on reported symptoms, and uses these to inform follow-up questions about additional symptoms that could corroborate the diagnosis – all “in its head,” without writing down its steps.
Entity Recognition and Hallucinations. We uncover circuit mechanisms that allow the model to distinguish between familiar and unfamiliar entities, which determine whether it elects to answer a factual question or profess ignorance. “Misfires” of this circuit can cause hallucinations.
Refusal of Harmful Requests. We find evidence that the model constructs a general-purpose “harmful requests” feature during finetuning, aggregated from features representing specific harmful requests learned during pretraining.
An Analysis of a Jailbreak. We investigate an attack which works by first tricking the model into starting to give dangerous instructions “without realizing it,” after which it continues to do so due to pressure to adhere to syntactic and grammatical rules.
Chain-of-thought Faithfulness. We explore the faithfulness of chain-of-thought reasoning to the model’s actual mechanisms. We are able to distinguish between cases where the model genuinely performs the steps it says it is performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue so that its “reasoning” will end up at the human-suggested answer.
A Model with a Hidden Goal. We also apply our method to a variant of the model that has been finetuned to pursue a secret goal: exploiting “bugs” in its training process. While the model avoids revealing its goal when asked, our method identifies mechanisms involved in pursuing the goal. Interestingly, these mechanisms are embedded within the model’s representation of its “Assistant” persona.
Our results uncover a variety of sophisticated strategies employed by models. For instance, Claude 3.5 Haiku routinely uses multiple intermediate reasoning steps “in its head” (that is, during the forward pass rather than the “thinking out loud” of a chain-of-thought completion) to decide its outputs. It displays signs of forward planning, considering multiple possibilities for what it will say well in advance of saying it. It performs backward planning, working backwards from goal states to formulate earlier parts of its response. We see signs of primitive “metacognitive” circuits that allow the model to know the extent of its own knowledge. More broadly, the model’s internal computations are highly abstract and generalize across disparate contexts. Our methods are also sometimes capable of auditing a model’s internal reasoning steps to flag concerning “thought processes” that are not clear from the model’s responses.

Below, we present:


A brief overview of our methodology (see the companion paper for more details on our methods).
An introductory case study, which also serves as a walkthrough for understanding our approach. Readers who have not read our companion paper may find it helpful to begin with this section before proceeding to the other case studies.
A series of case studies of interesting model behaviors, which can be read in any order, depending on the reader’s interests.
A summary of common components observed across our investigations.
A description of gaps in our understanding that motivate future work (§ 14 Limitations).
A discussion of high-level takeaways about models, their mechanisms, and our methods for studying them (§ 15 Discussion). This includes a note on our research philosophy – in particular, the value of tools for bottom-up investigation, which allow us to avoid making strong top-down guesses about how models work.
§ 1.1 A note on our approach and its limitations
Like any microscope, our tools are limited in what they can see. Though it’s difficult to quantify precisely, we’ve found that our attribution graphs provide us with satisfying insight for about a quarter of the prompts we’ve tried (see § 14 Limitations for a more detailed discussion of when our methods are likely to succeed or fail). The examples we highlight are success cases where we have managed to learn something interesting; moreover, even in our successful case studies, the discoveries we highlight here only capture a small fraction of the mechanisms of the model. Our methods study the model indirectly using a more interpretable “replacement model,” which incompletely and imperfectly captures the original. Moreover, for the sake of clear communication, we will often present highly distilled and subjectively determined simplifications of the picture uncovered by our methods, losing even more information in the process. To provide a more accurate sense of the rich complexity we have uncovered, we provide readers with an interactive interface for exploring attribution graphs. However, we stress that even these rather complex graphs are simplifications of the underlying model.
We focus this paper on selected case studies that illuminate noteworthy mechanisms within a particular model. These examples serve as existence proofs — concrete evidence that specific mechanisms operate in certain contexts. While we suspect similar mechanisms are at play beyond these examples, we cannot guarantee it (see § D Open Questions for suggested follow-up investigations). Moreover, the cases we have chosen to highlight are undoubtedly a biased sample shaped by the limitations of our tools. However, we are careful to stress-test our findings with follow-up validation experiments, which we have endeavored to perform only after identifying case studies of interest. For a more systematic evaluation of our methods, see our companion paper.
However, we believe that these qualitative investigations are ultimately the best judge of a method’s value, just as the usefulness of a microscope is ultimately determined by the scientific discoveries it enables. We expect this kind of work will be essential to advance the current state of AI interpretability, a pre-paradigmatic field still in search of the right abstractions — just as descriptive science has proven essential to many conceptual breakthroughs in biology. We are particularly excited that squeezing as much insight as we can out of our current methods has brought into clearer focus their specific limitations, which may serve as a roadmap for future research in the field.
§ 2 Method Overview

The models we study in this work are transformer-based language models, which take in sequences of tokens (e.g. words, word fragments, and special characters), and output new tokens one at a time. These models involve two fundamental components – MLP (“multi-layer perceptron”) layers, which process information within each token position using collections of neurons; and attention layers, which move information between token positions.
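To make the division of labor concrete, below is a minimal sketch of a generic pre-norm transformer block in PyTorch. It is illustrative only: it is not Claude 3.5 Haiku’s architecture and every dimension is made up, but it shows how attention mixes information across token positions while the MLP acts on each position independently.

```python
import torch
import torch.nn as nn

class MinimalTransformerBlock(nn.Module):
    """Illustrative generic pre-norm block (not Claude 3.5 Haiku's architecture)."""
    def __init__(self, d_model=64, n_heads=4, d_mlp=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))

    def forward(self, x):
        # Attention moves information *between* token positions (causally masked).
        seq_len = x.shape[1]
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=causal)
        x = x + a
        # The MLP processes information *within* each token position independently.
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(1, 5, 64)                      # (batch, tokens, d_model)
print(MinimalTransformerBlock()(x).shape)      # torch.Size([1, 5, 64])
```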

One reason models are difficult to interpret is that their neurons are typically polysemantic – that is, they perform many different functions that are seemingly unrelated. This issue is thought to arise in part because of a phenomenon known as superposition, whereby models represent more concepts than they have neurons and thus cannot assign each neuron to its own concept. To circumvent this issue, we build a replacement model that approximately reproduces the activations of the original model using more interpretable components.
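As a toy illustration of superposition (our construction, not an example from the paper): four sparse “concepts” can be packed into just two neurons by assigning them overlapping directions, which makes each individual neuron respond to several unrelated concepts.

```python
import numpy as np

# Four concept directions packed into a 2-neuron space (antipodal pairs).
angles = np.array([0.0, 0.25, 0.5, 0.75]) * 2 * np.pi
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (4 concepts, 2 neurons)

concept_activity = np.array([1.0, 0.0, 0.0, 0.8])  # two unrelated concepts active at once
neuron_activity = concept_activity @ directions    # what the 2 neurons actually carry
naive_readout = neuron_activity @ directions.T     # reading concepts off raw directions
print(neuron_activity.round(2), naive_readout.round(2))
# Each neuron carries a mixture of concepts (polysemanticity), and the naive readout
# shows crosstalk; a sparse dictionary of "features" can disentangle this better than
# the raw neuron basis.
```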
Our replacement model is based on a cross-layer transcoder (CLT) architecture (see and our companion methods paper), which is trained to replace the model’s MLP neurons with features, sparsely active “replacement neurons” that often represent interpretable concepts. In this paper, we use a CLT with a total of 30 million features across all layers.
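The sketch below is a simplified toy of the cross-layer idea, not the trained 30-million-feature CLT used in this paper: each layer has an encoder that reads the residual stream into sparse (ReLU) features, and features at layer l contribute to reconstructing the MLP outputs of layer l and all later layers.

```python
import torch
import torch.nn as nn

class TinyCrossLayerTranscoder(nn.Module):
    """Toy CLT sketch: layer-l features help reconstruct MLP outputs at layers >= l."""
    def __init__(self, n_layers=4, d_model=64, n_features=512):
        super().__init__()
        self.n_layers = n_layers
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, n_features) for _ in range(n_layers)])
        # decoders[l][m - l] maps layer-l features to the layer-m MLP output (m >= l)
        self.decoders = nn.ModuleList([
            nn.ModuleList([nn.Linear(n_features, d_model, bias=False)
                           for _ in range(n_layers - l)])
            for l in range(n_layers)])

    def forward(self, resid_by_layer):
        # resid_by_layer[l]: residual-stream input to layer l's MLP, shape (tokens, d_model)
        feats = [torch.relu(self.encoders[l](resid_by_layer[l])) for l in range(self.n_layers)]
        recon = []
        for m in range(self.n_layers):
            # Layer m's MLP output is reconstructed from features of layers 0..m.
            recon.append(sum(self.decoders[l][m - l](feats[l]) for l in range(m + 1)))
        return feats, recon

clt = TinyCrossLayerTranscoder()
resid = [torch.randn(5, 64) for _ in range(4)]
feats, recon = clt(resid)
# Training (not shown) would minimize ||recon[m] - true_mlp_out[m]||^2 plus a sparsity
# penalty on the feature activations.
```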

Features often represent human-interpretable concepts, ranging from low-level (e.g. specific words or phrases) to high-level (e.g. sentiments, plans, and reasoning steps). By examining a feature visualization consisting of different examples of text where the feature activates, we can give each feature a human-interpretable label. Note that the text examples in this paper are taken from open source datasets.
Our replacement models don’t perfectly reconstruct the activations of the original model. On any given prompt, there are gaps between the two. We can fill in these gaps by including error nodes which represent the discrepancy between the two models. Unlike features, we can’t interpret error nodes. But including them gives us a more precise sense of how incomplete our explanations are.
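Concretely, an error node is just the leftover residual between the original MLP output and the replacement model’s reconstruction of it; the sketch below uses random stand-in tensors rather than real activations.

```python
import torch

# Stand-in tensors for illustration: (tokens, d_model) activations at one layer.
true_mlp_out = torch.randn(5, 64)        # original model's MLP output
clt_reconstruction = torch.randn(5, 64)  # replacement model's reconstruction of it
error_node = true_mlp_out - clt_reconstruction

# By construction, adding the error node back reproduces the original activation
# exactly; the error term itself is simply not interpretable.
print(torch.allclose(clt_reconstruction + error_node, true_mlp_out))  # True
```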
Our replacement model also doesn’t attempt to replace the attention layers of the original model. On any given prompt, we simply use the attention patterns of the original model and treat them as fixed components.


The resulting model – incorporating error nodes and inheriting the attention patterns from the original model – we call the local replacement model. It is “local” to a given prompt because error nodes and attention patterns vary between different prompts. But it still represents as much of the original model’s computation as possible using (somewhat) interpretable features.
By studying the interactions between features in the local replacement model, we can trace its intermediate steps as it produces responses. More concretely, we produce attribution graphs, a graphical representation of the computational steps the model uses to determine its output for a particular input, in which nodes represent features and edges represent the causal interactions between them.
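In code, one can think of an attribution graph as a weighted directed graph over token, feature, and output nodes, as in the toy sketch below (node names and weights are hypothetical, and this is not the paper’s actual implementation):

```python
import networkx as nx

# Toy attribution graph: nodes are input-token embeddings, active features, and
# candidate output logits; edge weights are (hypothetical) estimated direct effects.
g = nx.DiGraph()
g.add_edge("emb: token_17", "feature: L3/1234", weight=0.9)
g.add_edge("feature: L3/1234", "feature: L20/987", weight=0.6)
g.add_edge("feature: L20/987", "logit: next_token", weight=1.2)
g.add_edge("emb: token_17", "logit: next_token", weight=0.1)   # a weak "shortcut" edge

# Paths through the graph are candidate chains of intermediate steps, to be tested
# later with intervention experiments.
for path in nx.all_simple_paths(g, "emb: token_17", "logit: next_token"):
    print(path)
```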
As attribution graphs can be quite complex, we prune them to their most important components by removing nodes and edges that do not contribute significantly to the model’s output.
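A very simple pruning heuristic (ours, not the paper’s exact procedure), applied to the toy graph from the previous sketch: drop weak edges, then drop any node that no longer lies on a path to the output.

```python
import networkx as nx

def prune(graph, output_node, min_edge_weight=0.2):
    g = graph.copy()
    # Remove edges whose direct contribution is negligible.
    g.remove_edges_from([(u, v) for u, v, w in g.edges(data="weight")
                         if abs(w) < min_edge_weight])
    # Keep only nodes that can still influence the output.
    keep = {n for n in g.nodes if n == output_node or nx.has_path(g, n, output_node)}
    g.remove_nodes_from(set(g.nodes) - keep)
    return g

pruned = prune(g, "logit: next_token")   # `g` is the toy graph from the previous sketch
print(list(pruned.edges(data=True)))     # the weak 0.1 "shortcut" edge is gone
```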


With a pruned attribution graph in hand, we often observe groups of features with related meanings that play a similar role in the graph. By manually grouping these related graph nodes together into supernodes, we can obtain a simplified depiction of the computational steps performed by the model.
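Grouping is a manual, judgment-based step; in code it amounts to little more than a labeled partition of the pruned graph’s feature nodes (the labels and feature names below are hypothetical):

```python
# Hypothetical manual grouping of related features into supernodes.
supernodes = {
    "capital":       ["feature: L2/101", "feature: L4/2048", "feature: L5/77"],
    "say a capital": ["feature: L18/311", "feature: L19/65"],
}
feature_to_supernode = {f: name for name, members in supernodes.items() for f in members}

# A supernode-level edge weight can then be summarized by aggregating (e.g. summing)
# the feature-to-feature edge weights between the two groups.
```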


These simplified diagrams form the centerpiece of many of our case studies. Below (left) we show an example of such a diagram.
Because they are based on our replacement model, we cannot use attribution graphs to draw conclusions with certainty about the underlying model (i.e. Claude 3.5 Haiku). Thus, the attribution graphs provide hypotheses about mechanisms operating in the underlying model. For a discussion of when and why these hypotheses might be incomplete or misleading, see § 14 Limitations.
To gain confidence that the mechanisms we describe are real and significant, we can perform intervention experiments in the original model, such as inhibiting feature groups and observing their effects on other features and on the model’s output (final figure panel above – percentages indicate fraction of original activation). If the effects are consistent with what our attribution graph predicts, we gain confidence that the graph is capturing real (though potentially incomplete) mechanisms within the model. Importantly, we choose our feature labelings and supernode groupings prior to measuring perturbation results. Note that there are some nuances in interpreting the results of intervention experiments, and the extent to which they provide independent validation of graph-predicted mechanisms – see our companion paper for further details.

Performing interventions with cross-layer transcoder features requires choosing an “intervention layer,” with the perturbation applied up to that layer. Our interventions in this paper use the “constrained patching” technique described in our companion paper, which clamps activations prior to the intervention layer at perturbed values, preventing any indirect effects of the perturbation from manifesting prior to the intervention layer. Thus, effects of perturbations on features prior to the intervention layer are guaranteed to agree with the direct effects predicted by the attribution graph. By contrast, perturbation effects on features after the intervention layer have the potential to diverge from graph predictions, in two ways: (1) the graph-predicted direct effects may be overwhelmed by other mechanisms missed by our attribution graphs, and (2) the graph-predicted indirect effects (i.e. ‘multi-hop’ interactions) may not even exist within the underlying model (we refer to this issue as “mechanistic unfaithfulness”). Thus, the nature of the validation provided by our intervention experiments varies depending on the layers of the features involved and the directness of their interaction in the attribution graph, and in some cases (direct effects prior to the intervention layer) is trivial. In general, we regard the effects of interventions on the model’s actual outputs as the most important source of validation, as model outputs are simple to interpret and not affected by these methodological artifacts.
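The sketch below illustrates the flavor of such an intervention on a toy residual-stream model (our construction, not the paper’s implementation): activations at layers below a chosen intervention layer are clamped to their perturbed values, so indirect effects can only propagate from the intervention layer onward.

```python
import torch
import torch.nn as nn

# Toy residual-stream model: 6 "MLP layers" whose outputs are added to the stream.
torch.manual_seed(0)
layers = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(6)])

def run(x, patch=None, intervention_layer=None):
    """patch maps layer index -> clamped MLP output (e.g. zeros to inhibit that
    layer's contribution). Clamping is applied only below the intervention layer,
    so no indirect effects of the perturbation appear before that layer."""
    for i, layer in enumerate(layers):
        out = layer(x)
        if patch is not None and i in patch and i < intervention_layer:
            out = patch[i]               # clamp to the perturbed value
        x = x + out                      # toy residual stream
    return x

x = torch.randn(1, 16)
baseline = run(x)
perturbed = run(x, patch={2: torch.zeros(1, 16)}, intervention_layer=4)
# Any downstream difference reflects the (direct + indirect) effect of the intervention.
print((baseline - perturbed).abs().max().item())
```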
Alongside each case study figure, we provide the interactive attribution graph interface (“View detailed graph”) that our team uses to study models’ internal mechanisms. The interface is designed to enable “tracing” key paths through the graph while labeling key features, feature groups, and subcircuits. The interface is fairly complex and takes some time to become proficient at using. All the key results in this work are described and visualized in simplified form, so that engaging with this interface is not necessary to read the paper! However, we recommend giving it a try if you are interested in gaining a richer sense of the mechanisms at play in Claude 3.5 Haiku. Some features are given brief labels for convenience; these labels are very rough interpretations and miss considerable detail, which can be better appreciated in the feature visualizations. For a more detailed walkthrough, please reference this section in our companion methods paper (and see § F Appendix: Graph Pruning and Visualization for a few methodological differences specific to this paper).
§ 3 Introductory Example: Multi-step Reasoning

Our methodology is intended to expose the intermediate steps a model uses en route to producing a response. In this section we consider a simple example of multi-step reasoning and attempt to identify each step. Along the way, we will highlight key concepts that will appear in many of our other case studies.

Let’s consider the prompt Fact: the capital of the state containing Dallas is, which Claude 3.5 Haiku successfully completes with Austin. Intuitively, this completion requires two steps – first, inferring that the state containing Dallas is Texas, and second, that the capital of Texas is Austin. Does Claude actually perform these two steps internally? Or does it use some “shortcut” (e.g. perhaps it has observed a similar sentence in the training data and simply memorized the completion)? Prior work has shown evidence of genuine multi-hop reasoning (to varying degrees in different contexts).
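For readers who want to poke at the behavioral side of this prompt, the snippet below shows how one might inspect a model’s top next-token candidates. Claude 3.5 Haiku is not accessible this way; “gpt2” is merely a placeholder stand-in and may well not produce “Austin”.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in open model; the prompt is the one studied in this section.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Fact: the capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]      # logits for the next token
top = torch.topk(logits, k=5)
print([tok.decode([i]) for i in top.indices.tolist()])  # look for " Austin" near the top
```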


In this section we provide evidence that, in this example, the model performs genuine two-step reasoning internally, which coexists alongside “shortcut” reasoning.
As described in the method overview, we can tackle this question by computing the attribution graph for this prompt, which describes the features the model used to produce its answer, and the interactions between them. First, we examine the features’ visualizations to interpret them, and group them into categories (“supernodes”). For example:

We find several features about the word and/or concept of a capital city, such as four features that activate the most strongly on the exact word “capital”. More interestingly, we find features that represent the concept of capitals in more general ways. One example is this feature, which activates on the word “capitals” but also later in questions about capitals of states, as well as on the Chinese question 广东省的省会是? (“What is the capital of Guangdong?”), on the second character of “省会” (capital of a province). Another is this multilingual feature, which activates most strongly on a variety of phrases including “başkenti”, “राजधानी”, “ibu kota”, and “Hauptftadt” — all of which roughly mean “capital” in different languages. (Though much of the German is mangled, perhaps due to a transcription error in the source.) Although each of these features represents slightly different concepts, in the context of this prompt, it seems like their function is to represent the idea of “capital.” Therefore, we group them (and a few others) together into the same “supernode.”

We also identify “output features” that consistently push the model to say certain tokens, even if there isn’t such a clear pattern to what words/phrases they activate on. This can be seen in the “Top Outputs” section of a feature’s visualization, which lists the output tokens most strongly directly upweighted by that feature. For example, one feature activates on various landmarks in central Texas, but in this prompt its most relevant aspect is that it most strongly promotes responding with the “Austin” token. We therefore categorize this feature into a “say Austin” supernode. Note that the “Top Outputs” information is not always informative – for instance, earlier-layer features primarily matter via indirect effects on the output via other features, and their top direct outputs are not too consequential. Designating a feature as an “output feature” requires a holistic evaluation of its top direct outputs, the contexts in which it activates, and its role in the attribution graph.

We also find features that promote outputting the name of a capital more generally, which we use a mix of both types of signals to identify and label. For example, one feature promotes responding with a variety of U.S. state capitals. Another feature more strongly promotes the capitals of various countries rather than U.S. states, but activates most strongly on lists with U.S. states and their capitals. And we noticed another feature whose strongest direct outputs are a seemingly unrelated set of tokens, but which often activates right before a country capital (e.g., Paris, Warsaw, or Canberra). We group all these features into a “say a capital” supernode.
We find several features representing a variety of contexts relating to the state of Texas, not specific to a particular city (in particular, they are not “Dallas” or “Austin” features). Although they each represent distinct, specific Texas-related concepts, in the context of this prompt their main function appears to be the fact that they collectively represent the general concept of Texas. As such, we group these into a “Texas” supernode.
After forming these supernodes, we can see in our attribution graph interface that, for example, the “capital” supernode promotes the “say a capital” supernode, which promotes the “say Austin” supernode. To represent this, we draw a diagram where each supernode is connected to the next with a brown arrow, as in the below graph snippet:

After labeling more features and forming more supernodes, we summarize their interactions in the following diagram.


The attribution graph contains multiple interesting paths, which we summarize below:

The Dallas features (with some contribution from state features) activate a group of features that represent concepts related to the state of Texas.
In parallel, the features activated by the word capital activate another cluster of output features that cause the model to say the name of a capital (an example of such a feature can be seen above).
The Texas features and the say a capital features jointly upweight the probability of the model saying Austin. They do so via two pathways:


directly impacting the Austin output, and
indirectly, by activating a cluster of say Austin output features.


There also exists a “shortcut” edge directly from Dallas to say Austin.


The graph indicates that the replacement model does in fact perform “multi-hop reasoning” – that is, its decision to say Austin hinges on a chain of several intermediate computational steps (Dallas → Texas, and Texas + capital → Austin). We stress that this graph simplifies the true mechanisms considerably, and encourage the reader to interact with the more comprehensive visualization to appreciate the underlying complexity.
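For reference, the supernode-level paths described above can be written down compactly as an edge list (the weights here are illustrative placeholders, not measured attribution values):

```python
# Supernode-level summary of the Dallas -> Austin attribution graph described above.
# Weights are placeholders, not the measured attributions.
two_hop_edges = {
    ("Dallas", "Texas"): 1.0,                # Dallas (plus state) features activate Texas
    ("capital", "say a capital"): 1.0,       # "capital" activates capital-output features
    ("Texas", "say Austin"): 1.0,            # Texas + say-a-capital jointly drive "say Austin"
    ("say a capital", "say Austin"): 1.0,
    ("Texas", "Austin output"): 0.5,         # ...and also directly upweight the Austin logit
    ("say a capital", "Austin output"): 0.5,
    ("say Austin", "Austin output"): 1.0,
    ("Dallas", "say Austin"): 0.3,           # the "shortcut" edge
}
```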
§ 3.1 Validation with Inhibition Experiments

The graphs above describe mechanisms used by our interpretable replacement model. To validate