Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
arXiv-2408.10920, arXiv - CS - Neural and Evolutionary Computing, 2024-08-20
Abstract
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
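The abstract does not spell out the construction, but the following minimal sketch (our illustration, not the authors' code) shows what a magnitude-based, layered encoding can look like: each token is stored at its own order of magnitude along a single scalar, and a position-specific scaling factor, analogous to the learned interventions described above, reads it back out. The `encode`/`decode` helpers and the choice of `base` are assumptions made for the example.

```python
# Toy illustration (not the paper's exact mechanism): store a token sequence
# at different orders of magnitude in one scalar, then recover any position
# with a position-specific scaling factor.

def encode(tokens, base=10.0):
    """Layer token i at order of magnitude base**-(i+1) within a single scalar."""
    return sum(t * base ** -(i + 1) for i, t in enumerate(tokens))

def decode(h, position, base=10.0):
    """Recover the token at `position` by rescaling, then stripping the
    larger-magnitude (earlier) and smaller-magnitude (later) layers."""
    scaled = h * base ** (position + 1) + 1e-6  # small epsilon guards against float error
    return int(scaled) % int(base)

tokens = [3, 1, 4, 1, 5]   # token ids, each smaller than `base`
h = encode(tokens)         # all positions share the same scalar "direction"
assert [decode(h, p) for p in range(len(tokens))] == tokens
```

Because the layers overlap in the same dimension rather than occupying distinct linear subspaces, no fixed linear probe isolates a single position; this is the sense in which such features resist a purely LRH-style reading.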