Paper accepted at XAI4CV, Workshop at CVPR 2023

Our work on “Disentangling Neuron Representations with Concept Vectors”, by Laura O’Mahony et al., will be presented at the 2nd Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR, in Vancouver on June 19, 2023.

Breaking a deep learning model down into interpretable units allows us to better understand how it stores representations. However, polysemantic neurons, i.e., neurons that respond to multiple unrelated features, make interpreting individual neurons challenging. This has motivated the search for meaningful directions in activation space, known as concept vectors, rather than the study of individual neurons.
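To make the idea concrete, here is a toy NumPy illustration (not the paper's method; all names and numbers are hypothetical): reading a model along a direction in activation space is simply a dot product with a concept vector, in contrast to reading off a single, axis-aligned neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 4
# Hypothetical concept direction: a unit-norm linear combination of
# neurons. It is not aligned with any single neuron's axis.
concept_vector = np.array([0.6, 0.8, 0.0, 0.0])

# Activations for a batch of inputs (rows = inputs, cols = neurons).
activations = rng.normal(size=(5, n_neurons))

# Axis-aligned readout: the activation of one neuron.
neuron_readout = activations[:, 0]
# Directional readout: the projection onto the concept vector.
concept_readout = activations @ concept_vector

print(neuron_readout.shape, concept_readout.shape)
```

Both readouts assign one scalar per input; the difference is only which direction in activation space is being measured.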

In this work, we propose an interpretability method to disentangle polysemantic neurons into concept vectors, linear combinations of neurons that each encapsulate a distinct feature. We found that monosemantic regions exist in activation space and that features are not axis-aligned. Our results suggest that exploring directions rather than neurons may lead us toward finding coherent fundamental units. We hope this work helps bridge the gap between understanding the fundamental units of models, an important goal of mechanistic interpretability, and concept discovery.
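The intuition behind such a disentanglement can be sketched as follows. This is a minimal sketch under strong assumptions, not the paper's algorithm: we simulate a neuron that fires for two unrelated synthetic features, cluster the activation vectors of its top-activating inputs with a tiny 2-means loop, and take each cluster's mean direction as a concept vector.

```python
import numpy as np

rng = np.random.default_rng(42)

n_neurons, n_inputs = 8, 400
# Two synthetic "features", each a distinct unit direction in
# activation space; both load on neuron 0, making it polysemantic.
feat_a = np.zeros(n_neurons); feat_a[[0, 1]] = [0.6, 0.8]
feat_b = np.zeros(n_neurons); feat_b[[0, 2]] = [0.6, -0.8]

# Half the inputs contain feature A, half feature B, plus noise.
coeffs = rng.uniform(1.0, 2.0, size=n_inputs)
acts = np.where(
    (np.arange(n_inputs) % 2 == 0)[:, None],
    coeffs[:, None] * feat_a,
    coeffs[:, None] * feat_b,
) + 0.05 * rng.normal(size=(n_inputs, n_neurons))

# Neuron 0 responds to both features; collect its top-activating inputs.
top = acts[np.argsort(acts[:, 0])[-100:]]

# Tiny 2-means on the normalized top activations: init one center at
# the first point and one at the point farthest from it.
x = top / np.linalg.norm(top, axis=1, keepdims=True)
centers = np.stack([x[0], x[np.argmax(((x - x[0]) ** 2).sum(1))]])
for _ in range(20):
    labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.stack([x[labels == k].mean(0) for k in (0, 1)])

# Each normalized cluster mean is a candidate concept vector.
concept_vectors = centers / np.linalg.norm(centers, axis=1, keepdims=True)

# Cosine similarity of each recovered vector with each true feature.
sims = np.abs(concept_vectors @ np.stack([feat_a, feat_b]).T)
print(np.round(sims, 2))
```

In this toy setting each recovered concept vector aligns closely with one of the two underlying feature directions, which is the behavior the method aims for on real polysemantic neurons.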

The code is available on GitHub.

A great post by Laura is available on Medium.

The proposed method for obtaining two concept vectors from a neuron is depicted for a neuron that activates for both apples and sports fields. See the full paper for details.