In the world of artificial intelligence, understanding how models make decisions is crucial. Anthropic’s Circuit Tracer is an innovative open-source tool designed to help researchers visualize the internal computations of large language models (LLMs). By generating attribution graphs, users can trace the steps a model takes to arrive at a specific output, shedding light on the often opaque processes of AI.

The Circuit Tracer library supports popular open-weight models and is complemented by an interactive frontend hosted on Neuronpedia. This allows users to explore their generated attribution graphs in a user-friendly manner. The tool was developed by participants in Anthropic’s Fellows program in collaboration with Decode Research, showcasing a community-driven effort to enhance AI interpretability.

With the Circuit Tracer, researchers can trace circuits on supported models, visualize and annotate graphs, and even test hypotheses by modifying feature values to observe changes in model outputs. This capability is not just theoretical; Anthropic has already utilized these tools to study complex behaviors like multi-step reasoning and multilingual representations in models such as Gemma-2-2b and Llama-3.2-1b. Interested users can access a demo notebook for examples and further analysis.
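To make the intervention idea concrete, here is a minimal, hypothetical sketch of the underlying pattern: perturb an internal activation during a forward pass and compare the model's predictions before and after. Note that this is not the Circuit Tracer API; the real tool intervenes on learned interpretable features from attribution graphs rather than raw hidden dimensions, and the model name, layer index, and dimension below are illustrative assumptions only.

```python
# Conceptual sketch only: scale one hidden dimension in one transformer layer
# and compare next-token predictions. Circuit Tracer itself operates on learned
# features, not raw activations; LAYER, DIM, and SCALE here are arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"   # one of the supported open-weight models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")

LAYER, DIM, SCALE = 12, 512, 0.0   # hypothetical intervention target

def scale_dim(module, args, output):
    # Scale a single hidden dimension of this layer's output in place.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[..., DIM] = hidden[..., DIM] * SCALE
    return output

def top_token(logits):
    return tok.decode(logits[0, -1].argmax())

with torch.no_grad():
    baseline = model(**inputs).logits

handle = model.model.layers[LAYER].register_forward_hook(scale_dim)
with torch.no_grad():
    intervened = model(**inputs).logits
handle.remove()

print("baseline prediction:  ", top_token(baseline))
print("intervened prediction:", top_token(intervened))
```

If suppressing a feature-like direction changes the predicted token, that is evidence the feature participates in the circuit producing the original answer; the Circuit Tracer packages this workflow around the features surfaced in its attribution graphs.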

By open-sourcing these tools, Anthropic aims to help close the gap between our understanding of AI’s inner workings and the rapid advances in AI capabilities. The hope is that the broader community will use these tools to explore model behaviors and contribute to the ongoing development of AI transparency. For those eager to dive deeper, the Neuronpedia interface allows for generating and viewing personalized attribution graphs, while the code repository offers resources for more sophisticated research.

Explore the potential of the Circuit Tracer and join the movement towards greater transparency in AI by visiting Neuronpedia.