Is This the Subspace You Are Looking For? An Interpretability Illusion for Subspace Activation Patching
We show that subspace interventions in mechanistic interpretability can be misleading - even when they successfully modify model behavior, they may do so by activating alternative pathways rather than manipulating the intended feature. We demonstrate this phenomenon in mathematical examples and real-world tasks, while also showing what successful interpretable interventions look like when guided by prior circuit analysis.
May 9, 2024