I wanted a problem I could dive into to sharpen what I’ve been learning while self-studying the ARENA curriculum. I’m drawn to areas where I might have a comparative advantage, so I was excited to find this paper by Chughtai, Chan, and Nanda studying how a one-layer neural net learns to multiply elements of the symmetric group \(S_5\).

This paper is one of a few I’ve identified on the topic, all studying an identical architecture, very much in conversation with each other, and coming to different conclusions about which algorithm is actually learned. My plan is to tell the story as I understand it, noodle on the parts that don’t make sense to me, and take a crack at some of the interesting open questions.

So let’s get started!

The Papers

I’ll be looking at three papers:

Each paper proposes a different mechanism for how the model learns group multiplication, and critiques the answers given by the previous papers. I’m going to go through all of them, describe the algorithm each proposes, and lay out the evidence provided in its favor.

Mathematical Preliminaries

All three papers have a nice preliminary section where they explain some of the mathematical background that you need to pose the problem. I may decide eventually to make a Part Zero where I do the same, but since I’m primarily expecting this review to be of service to me, I’ll assume that my reader is already familiar with the concepts of a group, a representation, a trace and the more elementary theorems about them. If I decide to pull in more substantial results from representation theory, I’ll make sure to define terms before I use them.

I will give a little bit of notation.

I have been agnostic about the field, because in fact all of the above representations are realizable over \(\mathbb{Z}\), which makes things very nice.

I’ll denote the character of a representation \(\rho\) by \(\chi_\rho\).
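
To keep the notation concrete, here is a tiny numerical example of my own (not something taken from the papers): the natural 5-dimensional permutation representation of \(S_5\). Its matrices have integer entries, so it is realized over \(\mathbb{Z}\), and its character \(\chi_\rho(g)\) is just the number of fixed points of \(g\).

```python
# A tiny illustration (mine, not from the papers): the natural permutation
# representation of S_5, realized with integer matrices, and its character.
import itertools
import numpy as np

def perm_matrix(p):
    """The 5x5 matrix sending e_i to e_{p(i)}, for a permutation given as a tuple."""
    m = np.zeros((5, 5), dtype=int)
    for i in range(5):
        m[p[i], i] = 1
    return m

S5 = list(itertools.permutations(range(5)))     # all 120 group elements

def compose(p, q):
    """(p . q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(5))

# rho is a homomorphism: rho(p . q) = rho(p) rho(q)
p, q = S5[17], S5[42]
assert np.array_equal(perm_matrix(compose(p, q)), perm_matrix(p) @ perm_matrix(q))

# The character chi_rho(g) = tr(rho(g)) counts the fixed points of g.
g = S5[5]
assert np.trace(perm_matrix(g)) == sum(i == g[i] for i in range(5))
```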

Model details

As I said, every paper studies the exact same model, a fully connected neural net with one hidden layer. The parameters are

- \(W_1\) and \(W_2\), the embedding matrices: two separate \(120\times 256\) matrices which embed the one-hot encoded elements of \(S_5\) in 256-dimensional space. \(W_1\) and \(W_2\) are not tied in any way and are learned independently.
- \(W_E\), the hidden-layer matrix, which maps the summed embeddings into the hidden layer.
- \(W_U\), the unembedding matrix, which maps hidden activations to logits over the group elements.

The entire forward pass can be written explicitly as

\[(x,y)\to \text{ReLU}((xW_1 + yW_2)W_E)W_U\]

We will write \(W_L := W_1 \circ W_E\) and \(W_R := W_2 \circ W_E\), since neither \(W_1\) nor \(W_2\) shows up independently of \(W_E\).
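
To make sure I have the shapes straight, here is a minimal PyTorch sketch of that forward pass. The hidden width and initialization below are placeholders of mine rather than the papers’ exact settings, and I pass group elements as integer indices instead of literal one-hot vectors (indexing a row of \(W_1\) is the same operation).

```python
# A minimal sketch of the architecture described above. The hidden width and
# the initialization scale are assumptions of mine, not the papers' settings.
import torch
import torch.nn as nn

GROUP_SIZE = 120   # |S_5|
D_EMBED = 256
D_HIDDEN = 128     # placeholder width

class OneLayerGroupMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(GROUP_SIZE, D_EMBED) / D_EMBED**0.5)
        self.W2 = nn.Parameter(torch.randn(GROUP_SIZE, D_EMBED) / D_EMBED**0.5)
        self.WE = nn.Parameter(torch.randn(D_EMBED, D_HIDDEN) / D_EMBED**0.5)
        self.WU = nn.Parameter(torch.randn(D_HIDDEN, GROUP_SIZE) / D_HIDDEN**0.5)

    def forward(self, x, y):
        # x, y are integer tensors of group-element indices; indexing a row of
        # W1 is the same as multiplying a one-hot vector by W1.
        pre_act = self.W1[x] + self.W2[y]          # x W_1 + y W_2
        hidden = torch.relu(pre_act @ self.WE)     # ReLU((x W_1 + y W_2) W_E)
        return hidden @ self.WU                    # logits over all 120 products

model = OneLayerGroupMLP()
# The "effective" matrices used above: W_L = W_1 ∘ W_E and W_R = W_2 ∘ W_E.
W_L = model.W1 @ model.WE
W_R = model.W2 @ model.WE
```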

This model, trained on a fraction of all \(120\times 120 = 14400\) pairs, learns group multiplication extremely well.
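
For concreteness, here is how I think about the data: enumerate the group, form every ordered pair \((x, y)\) labelled with the product \(xy\), and hold out a random subset. The train fraction below is a placeholder of mine, not a number taken from any of the papers.

```python
# Sketch of the dataset: every ordered pair (x, y) labelled with its product.
# The train fraction is a placeholder, not a number taken from the papers.
import itertools
import torch

perms = list(itertools.permutations(range(5)))   # the 120 elements of S_5
index = {p: i for i, p in enumerate(perms)}

def compose(p, q):
    """(p . q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(5))

# Rows are (x, y, x*y) as integer indices; shape (14400, 3).
pairs = torch.tensor(
    [[index[p], index[q], index[compose(p, q)]] for p in perms for q in perms]
)

train_frac = 0.4                                  # placeholder split
shuffle = torch.randperm(len(pairs))
n_train = int(train_frac * len(pairs))
train, test = pairs[shuffle[:n_train]], pairs[shuffle[n_train:]]
x, y, labels = train[:, 0], train[:, 1], train[:, 2]
# Train with cross-entropy between model(x, y) and labels.
```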

In my next post(s) I’ll explain the proposed algorithms, and the evidence provided for them.

  1. Bilal was nice enough to have a video call with me about transitioning my research focus to interpretability. He identified some of the follow-ups to his paper that he found interesting, but we didn’t discuss the results in detail.