In the spirit of the mechanistic interpretability agenda for neural networks, this post (and perhaps others to follow in a series) investigates the "algorithms" learned by a transformer model for tackling a narrow technical task: a modified version of the "Valid Parentheses" Leetcode problem.

While the task is much more modest in scope than the general next-token prediction you’d expect in an LLM, the exercise will help us explore some of the early intuitions, investigative tools, and general epistemological methodologies typically deployed to figure out what models are doing (and how we know they are doing it).

The ARENA Mechinterp monthly challenges were a huge influence on this post, and the first set of problems will come from there. (You should definitely check out the program.)


Series structure:

  1. Pick a Leetcode problem as a task. (Part 1)
  2. Train a minimum viable Transformer model on it. (Part 2)
  3. Investigate what the model learnt. (Part 3)

Part 1: Problem

The Valid Parentheses problem as seen on Leetcode:

[Screenshot: the Valid Parentheses problem statement as shown on Leetcode]
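
For reference, before we modify anything, here is a minimal sketch of the classic stack-based solution to the original problem. The function name `is_valid` and the example checks are illustrative, not part of the model we’ll train later:

```python
def is_valid(s: str) -> bool:
    """Classic stack-based check for balanced brackets."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in pairs:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            stack.append(ch)
    # Valid only if every opener was matched and popped.
    return not stack


assert is_valid("()[]{}")
assert not is_valid("(]")
```

Keeping this reference algorithm in mind is useful: when we later probe the trained transformer, one natural question is whether it has learned anything resembling this stack-like bookkeeping.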

Some modified constraints we’ll impose on the problem for our task: