Feature Interactions Reveal Linguistic Structure in Language Models

Annual Meeting of the Association for Computational Linguistics. Pub Date: 2023-06-21. DOI: 10.48550/arXiv.2306.12181

Jaap Jumelet, Willem H. Zuidema

Citations: 2

Abstract

We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations for its input, and might provide an ideal starting point for the investigation into linguistic structure in language models. However, uncovering the exact role that these interactions play is also difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We work out a grey-box methodology, in which we train models to perfection on a formal language classification task, using PCFGs. We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings, we extend our evaluation to a case study on language models, providing novel insights into the linguistic structure that these models have acquired.
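
To make the notion of a feature interaction concrete, here is a minimal sketch that is not taken from the paper. It scores a pair of features with a second-order finite difference against a baseline input: the output with both features present, minus the outputs with each one alone, plus the output with neither. The model `toy_model`, the helper names, and the all-zero baseline are hypothetical choices for illustration only; the interaction attribution methods evaluated in the paper may be defined quite differently.

```python
from itertools import combinations


def toy_model(x):
    # Hypothetical model: an additive term on feature 0 plus a product
    # (AND-like) term that couples features 1 and 2.
    return 0.5 * x[0] + 2.0 * x[1] * x[2]


def masked(x, keep, baseline):
    # Keep the features whose index is in `keep`; replace the rest
    # with the corresponding baseline value.
    return [xi if i in keep else b for i, (xi, b) in enumerate(zip(x, baseline))]


def pairwise_interaction(model, x, baseline, i, j):
    # Second-order finite difference: the part of the output that appears
    # only when features i and j are present together. All other features
    # are held at the baseline, which is a simplifying assumption.
    both = model(masked(x, {i, j}, baseline))
    only_i = model(masked(x, {i}, baseline))
    only_j = model(masked(x, {j}, baseline))
    neither = model(masked(x, set(), baseline))
    return both - only_i - only_j + neither


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
scores = {
    (i, j): pairwise_interaction(toy_model, x, baseline, i, j)
    for i, j in combinations(range(len(x)), 2)
}
print(scores)
```

On this toy model only the pair (1, 2) gets a non-zero score, because those are the only two features joined by a product term; feature 0 contributes additively and therefore does not interact with anything. The result depends on the choice of baseline, which is one reason why different interaction attribution methods can disagree.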

Source journal

Annual Meeting of the Association for Computational Linguistics

Self-citation rate

0.00%

Number of publications