I wanted a problem I could dive into to sharpen what I’ve been learning while self-studying the ARENA curriculum. I’m drawn to areas where I might have a comparative advantage, so I was excited to find this paper by Chughtai, Chan, and Nanda studying how a one-layer neural net learns to multiply elements of the symmetric group \(S_5\).

This paper is one of a few I’ve identified on the topic, all studying an identical architecture, very much in conversation with each other, and coming to different conclusions about which algorithm is actually learned. My plan is to tell the story as I understand it, noodle on the parts that don’t make sense to me, and take a crack at some of the interesting open questions.

So let’s get started!

The Papers

I’ll be looking at three papers:

Each paper proposes a different mechanism for how the model learns group multiplication, and critiques the answers given by the previous papers. I’m going to go through all of them, describe the algorithm each proposes, and lay out the evidence provided in its favor.

Mathematical Preliminaries

All three papers have a nice preliminary section where they explain some of the mathematical background that you need to pose the problem. I may decide eventually to make a Part Zero where I do the same, but since I’m primarily expecting this review to be of service to me, I’ll assume that my reader is already familiar with the concepts of a group, a representation, a trace and the more elementary theorems about them. If I decide to pull in more substantial results from representation theory, I’ll make sure to define terms before I use them.

I will give a little bit of notation.

I have been agnostic about the field, because in fact all of the above representations are realizable over \(\mathbb{Z}\), which makes things very nice.

I’ll denote the character of a representation \(\rho\) by \(\chi_\rho\).
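
To keep the notation concrete, here is a tiny numerical example of my own (not something taken from the papers): the natural 5-dimensional permutation representation of \(S_5\). Its matrices have integer entries, so it is realized over \(\mathbb{Z}\), and its character \(\chi_\rho(g)\) is just the number of fixed points of \(g\).

```python
# A tiny illustration (mine, not from the papers): the natural permutation
# representation of S_5, realized with integer matrices, and its character.
import itertools
import numpy as np

def perm_matrix(p):
    """The 5x5 matrix sending e_i to e_{p(i)}, for a permutation given as a tuple."""
    m = np.zeros((5, 5), dtype=int)
    for i in range(5):
        m[p[i], i] = 1
    return m

S5 = list(itertools.permutations(range(5)))     # all 120 group elements

def compose(p, q):
    """(p . q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(5))

# rho is a homomorphism: rho(p . q) = rho(p) rho(q)
p, q = S5[17], S5[42]
assert np.array_equal(perm_matrix(compose(p, q)), perm_matrix(p) @ perm_matrix(q))

# The character chi_rho(g) = tr(rho(g)) counts the fixed points of g.
g = S5[5]
assert np.trace(perm_matrix(g)) == sum(i == g[i] for i in range(5))
```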

Model details

As I said, every paper studies the exact same model, a fully connected neural net with one hidden layer. The parameters are

- \(W_1\) and \(W_2\), the embedding matrices: two separate \(120\times 256\) matrices which embed the one-hot encoded elements of \(S_5\) in 256-dimensional space. \(W_1\) and \(W_2\) are not tied in any way and are learned independently.
- \(W_E\), the hidden-layer matrix, which maps the summed embeddings into the hidden layer.
- \(W_U\), the unembedding matrix, which maps hidden activations to logits over the group elements.

The entire forward pass can be written explicitly as

\[(x,y)\to \text{ReLU}((xW_1 + yW_2)W_E)W_U\]

We will write \(W_L := W_1 \circ W_E\) and \(W_R := W_2 \circ W_E\), since neither \(W_1\) nor \(W_2\) shows up independently of \(W_E\).
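
To make sure I have the shapes straight, here is a minimal PyTorch sketch of that forward pass. The hidden width and initialization below are placeholders of mine rather than the papers’ exact settings, and I pass group elements as integer indices instead of literal one-hot vectors (indexing a row of \(W_1\) is the same operation).

```python
# A minimal sketch of the architecture described above. The hidden width and
# the initialization scale are assumptions of mine, not the papers' settings.
import torch
import torch.nn as nn

GROUP_SIZE = 120   # |S_5|
D_EMBED = 256
D_HIDDEN = 128     # placeholder width

class OneLayerGroupMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(GROUP_SIZE, D_EMBED) / D_EMBED**0.5)
        self.W2 = nn.Parameter(torch.randn(GROUP_SIZE, D_EMBED) / D_EMBED**0.5)
        self.WE = nn.Parameter(torch.randn(D_EMBED, D_HIDDEN) / D_EMBED**0.5)
        self.WU = nn.Parameter(torch.randn(D_HIDDEN, GROUP_SIZE) / D_HIDDEN**0.5)

    def forward(self, x, y):
        # x, y are integer tensors of group-element indices; indexing a row of
        # W1 is the same as multiplying a one-hot vector by W1.
        pre_act = self.W1[x] + self.W2[y]          # x W_1 + y W_2
        hidden = torch.relu(pre_act @ self.WE)     # ReLU((x W_1 + y W_2) W_E)
        return hidden @ self.WU                    # logits over all 120 products

model = OneLayerGroupMLP()
# The "effective" matrices used above: W_L = W_1 ∘ W_E and W_R = W_2 ∘ W_E.
W_L = model.W1 @ model.WE
W_R = model.W2 @ model.WE
```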

This model, trained on a fraction of all \(120\times 120 = 14400\) pairs, learns group multiplication extremely well.
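
For concreteness, here is how I think about the data: enumerate the group, form every ordered pair \((x, y)\) labelled with the product \(xy\), and hold out a random subset. The train fraction below is a placeholder of mine, not a number taken from any of the papers.

```python
# Sketch of the dataset: every ordered pair (x, y) labelled with its product.
# The train fraction is a placeholder, not a number taken from the papers.
import itertools
import torch

perms = list(itertools.permutations(range(5)))   # the 120 elements of S_5
index = {p: i for i, p in enumerate(perms)}

def compose(p, q):
    """(p . q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(5))

# Rows are (x, y, x*y) as integer indices; shape (14400, 3).
pairs = torch.tensor(
    [[index[p], index[q], index[compose(p, q)]] for p in perms for q in perms]
)

train_frac = 0.4                                  # placeholder split
shuffle = torch.randperm(len(pairs))
n_train = int(train_frac * len(pairs))
train, test = pairs[shuffle[:n_train]], pairs[shuffle[n_train:]]
x, y, labels = train[:, 0], train[:, 1], train[:, 2]
# Train with cross-entropy between model(x, y) and labels.
```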

In my next post(s) I’ll explain the proposed algorithms, and the evidence provided for them.

  1. Bilal was nice enough to have a video call with me about transitioning my research focus to interpretability. He identified some of the follow-ups to his paper that he found interesting, but we didn’t discuss the results in detail.