Publications & Presentations
Key research outputs. First-author publications are highlighted.
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Spotlight at Mechanistic Interpretability Workshop, NeurIPS 2025.
Self-Ablating Transformers: More Interpretability, Less Sparsity
Poster at Building Trust Workshop, ICLR 2025.
World Model Agents with Change-Based Intrinsic Motivation
Oral presentation at NLDL 2025.