Notes on Neural Network Representations - by bolu ben-adeola

Table of Contents

- Introduction
- Decomposability
- Linearity
- Features
- Superposition
- Footnotes
- References



Introduction

Trained Neural Networks arrive at solutions that achieve superhuman performance on an increasing number of tasks. It would be at least interesting and probably important to understand these solutions.

Interesting, in the spirit of curiosity and getting answers to questions like “Are there human-understandable algorithms that capture how object-detection nets work?”1 This would add a new modality to our relationship with Neural Nets: beyond querying them for answers (Oracle Models) or sending them off on tasks (Agent Models), we could acquire an enriched understanding of our world by studying the interpretable internals of these networks’ solutions (Microscope Models). [1]

And important, for its use in pursuing the kinds of standards that we (should?) demand of increasingly powerful systems, such as operational transparency and guarantees on behavioural bounds. A common example of an idealised capability we could hope for is “lie detection”: monitoring the model’s internal state to tell whether it is being truthful. [2]
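To make “monitoring the model’s internal state” concrete, here is a minimal, hypothetical sketch of one common approach: training a linear probe on a layer’s activations to read off a property of interest. The model activations, labels, and dimensions below are stand-ins for illustration, not anything described in this post.

```python
# A minimal sketch (illustrative only) of probing internal state:
# fit a linear probe on hidden activations to predict a property of interest,
# here a hypothetical "truthful vs. not" label. All data is a stand-in.
import torch
import torch.nn as nn

hidden_dim = 512   # width of the probed layer (assumed)
n_examples = 1000  # number of (activation, label) pairs collected (assumed)

# In a real setting these would be activations captured from a chosen layer
# while the model processes labelled statements.
activations = torch.randn(n_examples, hidden_dim)
labels = torch.randint(0, 2, (n_examples,)).float()  # 1 = "truthful", 0 = not

probe = nn.Linear(hidden_dim, 1)  # the linear probe
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    optimizer.zero_grad()
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

# If the probe generalises to held-out activations, the property is (at least)
# linearly readable from the model's internal state at that layer.
```

Whether such a probe tracks the property we care about, rather than a correlate of it, is exactly the kind of question that motivates a more granular understanding of representations.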

Mechanistic interpretability (Mech Interp) is a subfield of interpretability research that seeks a granular understanding of these Networks.

One could describe two broad categories of mech interp inquiry: Representation interpretability, which asks what information a network represents and how it is structured, and Algorithmic (circuit) interpretability, which asks how model components compute with those representations.

This post is concerned with Representation interpretability. Structured as an exposition of neural network representation research,2 it discusses various qualities of model representations that range in epistemic confidence from the obvious to the speculative and the merely desired.

<aside>
Note: I’ll use ‘Models/Neural Nets’ and ‘Model Components’ interchangeably. A model component can be thought of as a layer, or some other conceptually meaningful ensemble of layers in a network.
</aside>

<aside>
Note: Until properly introduced with a technical definition, I use expressions like “input-*properties*” & “input-*qualities*” in place of the more colloquially used “*feature*.”
</aside>