BTOM-Transformerlens

Python ★ 16 updated 22d ago

A Python research workspace for probing how Qwen language models handle Theory of Mind reasoning, using TransformerLens to capture internal activations and trace which attention heads contribute most to nested belief-state answers.

Python, PyTorch, TransformerLens, HuggingFace, CUDA, Jupyter
setup: hard, complexity 5/5

BTOM-TransformerLens is a research workspace for studying the internal behavior of large language models, specifically models from the Qwen2.5 and Qwen3 families. The goal is to understand how these models reason about situations that require understanding what different characters in a story believe or know, a type of reasoning called Theory of Mind. The dataset used for this analysis is called Hi-ToM, which contains questions about nested belief states (what person A thinks person B thinks about something).
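To make "nested belief states" concrete, here is a toy sketch in Python. It is not the Hi-ToM data format. The `Event` class, the `belief` function, and the story are hypothetical, and the rule it applies is a simplification: an agent updates a belief only on moves they witnessed, and a nested belief updates only when every agent in the chain was present.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str            # "enter", "exit", or "move"
    agent: str
    location: str = ""   # only meaningful for "move" events


def belief(events, chain):
    """Where does chain[0] think chain[1] thinks ... the object is?

    Simplifying assumption: a move updates the nested belief only if
    every agent named in the chain was in the room when it happened.
    """
    present = set()
    believed = None
    for ev in events:
        if ev.kind == "enter":
            present.add(ev.agent)
        elif ev.kind == "exit":
            present.discard(ev.agent)
        elif ev.kind == "move" and all(a in present for a in chain):
            believed = ev.location
    return believed


# Hypothetical story: Carl leaves before Bob moves the ball.
story = [
    Event("enter", "Anne"),
    Event("enter", "Bob"),
    Event("enter", "Carl"),
    Event("move", "Anne", "basket"),
    Event("exit", "Carl"),
    Event("move", "Bob", "box"),
]

first_order = belief(story, ["Anne"])            # Anne saw both moves
second_order = belief(story, ["Anne", "Carl"])   # Anne knows Carl missed the second move
```

Here the first-order question (where does Anne think the ball is?) and the second-order question (where does Anne think Carl thinks it is?) have different answers, which is the kind of distinction a nested belief question tests.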

The analysis uses TransformerLens, a library that lets researchers look inside transformer-based language models while they process text. Rather than observing only the answer a model produces, TransformerLens lets you capture the activations flowing through each layer and attention head at every step. This project builds on that capability to do attribution analysis, which traces which internal components contributed most to a specific output, and clustering, which groups attention heads that behave similarly across many examples.
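The clustering idea can be sketched with the standard library alone. This is a minimal illustration, not the repository's method, which this summary does not describe: the greedy cosine-similarity grouping, the `cluster_heads` function, the 0.9 threshold, the head names, and the per-example scores are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_heads(profiles, threshold=0.9):
    """Greedy grouping: a head joins the first cluster whose seed it resembles."""
    clusters = []  # list of (seed_profile, [head names])
    for name, vec in profiles.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


# Hypothetical per-head scores, one entry per example (made-up numbers).
profiles = {
    "L10H3": [0.9, 0.8, 0.1, 0.0],
    "L12H7": [0.8, 0.9, 0.0, 0.1],
    "L5H1": [0.0, 0.1, 0.7, 0.9],
    "L6H2": [0.1, 0.0, 0.8, 0.8],
}
clusters = cluster_heads(profiles)
```

With these made-up profiles, the two heads that score highly on the first two examples end up in one group and the two that score highly on the last two end up in another. A real analysis would more likely use a library clustering routine over many more examples.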

The workflow is centered on a Jupyter notebook (test.ipynb) that walks through loading a model, feeding it Hi-ToM questions, caching internal activations, building an attribution graph, and then visualizing clusters of attention heads. Supporting Python files handle the attribution logic, hook attachment for capturing intermediate values, clustering math, and visualization. A separate file handles quantized model weights for cases where GPU memory is limited.
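The load, run, and cache steps of that workflow might look roughly like the sketch below. It is not taken from test.ipynb. The helper names `attn_hook_names` and `cache_head_outputs` are hypothetical, and whether a given Qwen checkpoint name is accepted depends on the installed TransformerLens version. The `names_filter` argument restricts caching to the chosen layers, which is one way to keep memory use down.

```python
def attn_hook_names(layers, kind="z"):
    """Build TransformerLens hook names for the attention outputs of given layers."""
    return [f"blocks.{layer}.attn.hook_{kind}" for layer in layers]


def cache_head_outputs(model_name, prompt, layers):
    """Run one prompt and cache per-head attention outputs for selected layers."""
    # Imported lazily so attn_hook_names works without the library installed.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained(model_name)
    tokens = model.to_tokens(prompt)
    wanted = set(attn_hook_names(layers))
    # Only activations whose hook name passes the filter are stored.
    logits, cache = model.run_with_cache(
        tokens, names_filter=lambda name: name in wanted
    )
    # Each cached hook_z tensor has shape [batch, position, head, d_head].
    return logits, {name: cache[name] for name in wanted}


# Example call (downloads weights and needs enough memory; model name is an assumption):
# logits, acts = cache_head_outputs("Qwen/Qwen2.5-0.5B", "Where does Anne think ...", [10, 11])
```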

The README is written in Chinese and notes that the project requires Python 3.10 or newer and a CUDA-capable GPU. Specific version pins are listed for the main libraries including PyTorch, TransformerLens, and the Transformers library from HuggingFace. If GPU memory is tight, the README suggests reducing the number of samples, limiting analysis to fewer layers, or running only one of the two supported model loading paths.

Where it fits

Probe which attention heads in a Qwen model contribute most to answering nested Theory of Mind questions.
Cluster attention heads that behave similarly across many Hi-ToM belief-state examples.
Build an attribution graph tracing which internal model components drove a specific answer.
Study differences in Theory of Mind reasoning between Qwen2.5 and Qwen3 model families.
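As a rough illustration of the first use case, ranking heads by contribution to an answer, the sketch below applies a simplified form of direct logit attribution: each head's output vector is projected onto the difference between the directions for the correct and the incorrect answer token. It is a toy under stated assumptions, not the project's attribution graph. The vectors are made up, `head_attributions` is a hypothetical name, and the final layer normalization a real model applies is ignored.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def head_attributions(head_outputs, correct_dir, wrong_dir):
    """Score each head by how far it pushes toward the correct answer over the wrong one."""
    diff = [c - w for c, w in zip(correct_dir, wrong_dir)]
    return {name: dot(out, diff) for name, out in head_outputs.items()}


# Made-up 3-dimensional head outputs at the final token position.
head_outputs = {
    "L10H3": [2.0, 0.0, 1.0],
    "L12H7": [0.0, 1.0, 0.0],
    "L5H1": [-1.0, 1.0, 0.0],
}
correct_dir = [1.0, 0.0, 0.0]  # hypothetical direction for the correct answer token
wrong_dir = [0.0, 1.0, 0.0]    # hypothetical direction for the distractor token

scores = head_attributions(head_outputs, correct_dir, wrong_dir)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A positive score means the head pushes the model toward the correct answer, and a negative score means it pushes toward the distractor.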
