Publications & Presentations

Key research outputs. First-author publications are highlighted.

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Jeremias Lino Ferrao, et al.

Spotlight at Mechanistic Interpretability Workshop, NeurIPS 2025.

Self-Ablating Transformers: More Interpretability, Less Sparsity

Jeremias Lino Ferrao, et al.

Poster at Building Trust Workshop, ICLR 2025.

World Model Agents with Change-Based Intrinsic Motivation

Jeremias Lino Ferrao, et al.

Oral presentation at NLDL 2025.