A research framework for analyzing differences between language models using interpretability techniques. This project enables systematic comparison of base models and their variants (model organisms) through various diffing methodologies. It also includes agentic evaluation of diffing methodologies: how well an agent can infer the difference between two models when given only the outputs of a specific diffing method.
Note: The toolkit is based on a heavily modified version of the saprmarks/dictionary_learning repository, available at science-of-finetuning/crosscoder_learning. Although we may eventually merge these repositories, this is currently not a priority due to significant divergence.
Publications
| Method | Description | Preprocessing | Dashboard |
|---|---|---|---|
| Activation Difference Lens | Analyzes activation differences using logit lens and patchscope projections. Supports steering experiments and automatic token relevance analysis. | ❌ | ✅ |
| Talkative Probe | Uses a verbalizer model to interpret activation differences by generating natural language descriptions of behavioral changes. | ❌ | ❌ |
| KL Divergence | Computes per-token KL divergence between base and finetuned model output distributions. Identifies where models diverge most. | ❌ | ✅ |
| PCA | Trains Principal Component Analysis on activation differences to find dominant directions of change. Supports component steering. | ✅ | ✅ |
| SAE Difference | Trains Sparse Autoencoders on activation differences to discover interpretable latent features specific to finetuning. | ✅ | ✅ |
| Crosscoder | Trains crosscoders on paired activations from both models to learn shared and model-specific representations. | ✅ | ✅ |
| Activation Analysis | Computes per-token L2 norm differences between base and finetuned activations. Tracks max-activating examples. | ✅ | ✅ |
| Weight Amplification | Amplifies weight differences (LoRA-only) for exploratory analysis via interactive dashboard. | ❌ | ✅ |
Preprocessing: Methods marked with ✅ require a preprocessing step that extracts and caches activations from both models on large datasets. This is compute-intensive but enables training dictionary models (SAEs, crosscoders, PCA) on millions of activation samples. Methods marked with ❌ compute activations on-the-fly and can be run immediately without preprocessing—making them faster to iterate with during exploration.
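To make the on-the-fly (❌) category concrete, the sketch below shows roughly what a per-token KL comparison between a base model and a finetuned model involves. It is a minimal illustration rather than the framework's KL implementation; the model names and the prompt are placeholders.

```python
# Minimal sketch of per-token KL divergence between the output distributions
# of a base model and a finetuned model. Model names and the prompt are
# placeholders; the framework's KL method additionally handles datasets,
# batching, and result caching.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "google/gemma-3-1b-it"          # placeholder base model
finetuned_name = "your-org/your-finetuned"  # placeholder finetuned model

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
finetuned = AutoModelForCausalLM.from_pretrained(finetuned_name)

inputs = tokenizer("Roman concrete was made with volcanic ash.", return_tensors="pt")

with torch.no_grad():
    base_logits = base(**inputs).logits       # [1, seq_len, vocab_size]
    ft_logits = finetuned(**inputs).logits    # [1, seq_len, vocab_size]

# Per-token KL(finetuned || base), summed over the vocabulary.
kl_per_token = F.kl_div(
    F.log_softmax(base_logits, dim=-1),   # log-probs of the base model
    F.log_softmax(ft_logits, dim=-1),     # log-probs of the finetuned model
    log_target=True,
    reduction="none",
).sum(dim=-1)                              # -> [1, seq_len]

for token_id, kl in zip(inputs["input_ids"][0], kl_per_token[0]):
    print(f"{tokenizer.decode(int(token_id))!r}\tKL = {kl.item():.4f}")
```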
Select a method via config:
python main.py diffing/method=activation_difference_lens
python main.py diffing/method=talkative_probe
python main.py diffing/method=kl
python main.py diffing/method=pca
python main.py diffing/method=sae_difference
python main.py diffing/method=crosscoder
python main.py diffing/method=activation_analysis
python main.py diffing/method=weight_amplification

This framework consists of two main pipelines:
- Preprocessing Pipeline: Extract and cache activations from both models on configured datasets. Required only for methods that train on large activation corpora.
- Diffing Pipeline: Analyze differences between models using the selected interpretability method.
The framework is designed to work with pre-existing model pairs (e.g., base models vs. model organisms) rather than training new models.
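For intuition about what the preprocessing pipeline caches, the sketch below collects per-token hidden states from both models on the same inputs and stores their differences for one layer. It is a simplified illustration with placeholder names, not the framework's actual caching code, which streams large datasets and manages storage formats.

```python
# Simplified illustration of activation caching for training-based methods.
# Model names, the layer index, and the output path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "google/gemma-3-1b-it"          # placeholder base model
finetuned_name = "your-org/your-finetuned"  # placeholder finetuned model
layer = 6                                    # which residual-stream layer to cache

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
finetuned = AutoModelForCausalLM.from_pretrained(finetuned_name)

texts = [
    "Roman concrete was made with volcanic ash.",
    "THIS SENTENCE IS WRITTEN IN ALL CAPS.",
]

diffs = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        h_base = base(**inputs, output_hidden_states=True).hidden_states[layer]
        h_ft = finetuned(**inputs, output_hidden_states=True).hidden_states[layer]
    diffs.append((h_ft - h_base).squeeze(0))  # per-token activation differences

# PCA and SAE-difference training consume differences like these; the
# crosscoder instead trains on the paired activations from both models.
torch.save(torch.cat(diffs, dim=0), "activation_diffs_layer6.pt")
```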
The framework includes an agentic evaluation system that tests how well each diffing method reveals finetuning behavior. An LLM agent is tasked with inferring what a model was finetuned for, using only the outputs of a diffing method.
How It Works:
- Agent Setup: An LLM agent (e.g., GPT-4, Claude) receives a summary of diffing method outputs (logit lens results, steering samples, etc.)
- Tool Use: The agent can call method-specific tools to drill down into results, query both models, or generate steered samples
- Inference: The agent produces a final description of the finetuning domain and behavioral changes
- Grading: A grader LLM evaluates the agent's description against ground truth (a conceptual sketch of this loop is shown below)
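The pseudocode below mirrors these four steps. Every identifier in it (evaluate_method, agent_llm.chat, grader_llm.grade, the tool registry) is hypothetical and purely illustrative; the real agent and grader interfaces are defined by the framework's evaluation code.

```python
# Illustrative pseudocode for the agentic evaluation loop described above.
# Every identifier is hypothetical; this mirrors the four steps
# (setup, tool use, inference, grading), not the framework's real API.
def evaluate_method(diffing_summary, tools, agent_llm, grader_llm, ground_truth):
    # 1. Agent setup: the agent starts from a summary of the diffing outputs.
    conversation = [f"Here are the diffing results:\n{diffing_summary}"]

    # 2. Tool use: the agent may call method-specific tools (drill into cached
    #    results, query either model, generate steered samples) until done.
    while True:
        reply = agent_llm.chat(conversation, tools=tools)
        if reply.tool_call is None:
            break
        result = tools[reply.tool_call.name](**reply.tool_call.arguments)
        conversation.append(f"Tool {reply.tool_call.name} returned: {result}")

    # 3. Inference: the agent's final answer describes the suspected
    #    finetuning domain and behavioral changes.
    description = reply.text

    # 4. Grading: a grader LLM scores the description against ground truth.
    score = grader_llm.grade(description=description, reference=ground_truth)
    return description, score
```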
Agent Types:
| Agent | Access | Description |
|---|---|---|
| Blackbox Agent | Model queries only | Baseline that can only prompt the base and finetuned models. No interpretability information. |
| Method Agent | Full method outputs + queries | Has access to all cached analysis results plus model queries. Each method defines its own agent. |
Agent evaluation is configured in configs/diffing/evaluation.yaml. Run with:
python main.py diffing/method=activation_difference_lens diffing.evaluation.agent.enabled=true

See ADD_NEW_METHOD.MD for a complete tutorial on:
- Creating a new diffing method subclass
- Writing the Hydra config
- Implementing the get_agent() method for agentic evaluation (a hypothetical sketch is shown after this list)
- Running and testing your method
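To give a rough feel for what the tutorial walks through, here is a minimal, hypothetical skeleton of a new method. The class name, constructor, and hooks other than get_agent() are assumptions for illustration only; follow ADD_NEW_METHOD.MD for the actual base class and required interfaces.

```python
# Hypothetical skeleton of a new diffing method. The class name, constructor,
# and hooks other than get_agent() are illustrative assumptions; consult
# ADD_NEW_METHOD.MD for the real base class and interfaces.
class MyDiffingMethod:  # in the framework this would subclass the shared diffing-method base class
    def __init__(self, cfg, base_model, finetuned_model):
        self.cfg = cfg                        # Hydra config for this method
        self.base_model = base_model
        self.finetuned_model = finetuned_model

    def run(self):
        """Compute the method's analysis and cache results to disk."""
        raise NotImplementedError

    def visualize(self, results):
        """Render cached results in the Streamlit dashboard."""
        raise NotImplementedError

    def get_agent(self, results):
        """Return the method-specific agent (tools + prompt) for agentic evaluation."""
        raise NotImplementedError
```

Registering a matching Hydra config in the diffing/method config group then makes the method selectable via python main.py diffing/method=my_method (name hypothetical).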
- Clone the repository:
git clone https://github.com/science-of-finetuning/diffing-game
cd diffing-game

- Install dependencies:

pip install -r requirements.txt

Run the complete pipeline (preprocessing + diffing) with default settings:
python main.py

Run preprocessing only (extract activations):

python main.py pipeline.mode=preprocessing

Run diffing analysis only (assumes activations already exist):

python main.py pipeline.mode=diffing

Analyze specific organism and model combinations:

python main.py organism=caps model=gemma3_1B

Use different diffing methods:

python main.py diffing/method=kl
python main.py diffing/method=activation_difference_lens

Run experiments across multiple configurations:

python main.py --multirun organism=caps,roman_concrete model=gemma3_1B

Run with different diffing methods:

python main.py --multirun diffing/method=kl,pca,sae_difference

The framework includes a Streamlit-based interactive dashboard for visualizing and exploring model diffing results.
- Dynamic Discovery: Automatically detects available models, organisms, and diffing methods
- Real-time Visualization: Interactive plots and visualizations of diffing results
- Model Integration: Direct links to Hugging Face model pages
- Multi-method Support: Compare results across different diffing methodologies
- Interactive Model Testing: Test custom inputs and steering vectors on both base and finetuned models in real-time
Launch the dashboard with:
streamlit run dashboard.py

The dashboard will be available at http://localhost:8501 by default.
You can also pass configuration overrides to the dashboard:

streamlit run dashboard.py -- model.dtype=float32

- Select Base Model: Choose from available base models
- Select Organism: Pick the model organism (finetuned variant)
- Select Diffing Method: Choose the analysis method to visualize
- Explore Results: Interact with the generated visualizations
The dashboard requires that you have already run diffing experiments so that there are results to visualize.
To reproduce the experiments from the paper:
bash narrow_ft_experiments/run.sh

To run the agents on all models, run:

bash narrow_ft_experiments/agents.sh

The scripts assume you are running on a SLURM cluster; please adapt them to your environment as needed.
Relevant code for the Activation Difference Lens lives in src/diffing/methods/activation_difference_lens, with supporting utilities in src/utils. Plotting scripts are under narrow_ft_experiments/plotting/, and the statistical evaluation of agent performance using HiBayes can be found in narrow_ft_experiments/hibayes/.
