TexShape: Information Theoretic Sentence Embedding for Language Models

H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath
{"title":"TexShape:语言模型的信息论句子嵌入","authors":"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath","doi":"arxiv-2402.05132","DOIUrl":null,"url":null,"abstract":"With the exponential growth in data volume and the emergence of\ndata-intensive applications, particularly in the field of machine learning,\nconcerns related to resource utilization, privacy, and fairness have become\nparamount. This paper focuses on the textual domain of data and addresses\nchallenges regarding encoding sentences to their optimized representations\nthrough the lens of information-theory. In particular, we use empirical\nestimates of mutual information, using the Donsker-Varadhan definition of\nKullback-Leibler divergence. Our approach leverages this estimation to train an\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\ndata compression or for filtering out sensitive information, enhancing privacy\nand fairness. In this study, we employ a benchmark language model for initial\ntext representation, complemented by neural networks for information-theoretic\ncompression and mutual information estimations. Our experiments demonstrate\nsignificant advancements in preserving maximal targeted information and minimal\nsensitive information over adverse compression ratios, in terms of predictive\naccuracy of downstream models that are trained using the compressed data.","PeriodicalId":501433,"journal":{"name":"arXiv - CS - Information Theory","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TexShape: Information Theoretic Sentence Embedding for Language Models\",\"authors\":\"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath\",\"doi\":\"arxiv-2402.05132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the exponential growth in data volume and the emergence of\\ndata-intensive applications, particularly in the field of machine learning,\\nconcerns related to resource utilization, privacy, and fairness have become\\nparamount. This paper focuses on the textual domain of data and addresses\\nchallenges regarding encoding sentences to their optimized representations\\nthrough the lens of information-theory. In particular, we use empirical\\nestimates of mutual information, using the Donsker-Varadhan definition of\\nKullback-Leibler divergence. Our approach leverages this estimation to train an\\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\\ndata compression or for filtering out sensitive information, enhancing privacy\\nand fairness. In this study, we employ a benchmark language model for initial\\ntext representation, complemented by neural networks for information-theoretic\\ncompression and mutual information estimations. 
Our experiments demonstrate\\nsignificant advancements in preserving maximal targeted information and minimal\\nsensitive information over adverse compression ratios, in terms of predictive\\naccuracy of downstream models that are trained using the compressed data.\",\"PeriodicalId\":501433,\"journal\":{\"name\":\"arXiv - CS - Information Theory\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Information Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2402.05132\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2402.05132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

With the exponential growth in data volume and the emergence of data-intensive applications, particularly in the field of machine learning, concerns related to resource utilization, privacy, and fairness have become paramount. This paper focuses on textual data and addresses the challenge of encoding sentences into optimized representations through the lens of information theory. In particular, we use empirical estimates of mutual information, based on the Donsker-Varadhan representation of the Kullback-Leibler divergence. Our approach leverages this estimation to train an information-theoretic sentence embedding, called TexShape, for (task-based) data compression or for filtering out sensitive information, enhancing privacy and fairness. In this study, we employ a benchmark language model for the initial text representation, complemented by neural networks for information-theoretic compression and mutual information estimation. Our experiments demonstrate significant advances in preserving maximal targeted information and minimal sensitive information even at adverse compression ratios, measured by the predictive accuracy of downstream models trained on the compressed data.
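For orientation, the variational machinery the abstract names can be summarized as follows. This is the standard Donsker-Varadhan bound and the MINE-style mutual information estimator built on it, sketched here for context rather than reproduced from the paper:

```latex
% Donsker-Varadhan (DV) representation of the KL divergence: the
% supremum runs over functions T for which both expectations are finite.
% Mutual information is the KL divergence between the joint law and the
% product of the marginals, so a neural critic T_\theta yields a
% trainable lower bound (the MINE estimator).
\begin{align}
  D_{\mathrm{KL}}(P \,\|\, Q)
    &= \sup_{T}\, \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\bigl[e^{T}\bigr], \\
  I(X; Z) = D_{\mathrm{KL}}\bigl(P_{XZ} \,\|\, P_X \otimes P_Z\bigr)
    &\ge \mathbb{E}_{P_{XZ}}\bigl[T_\theta(X, Z)\bigr]
       - \log \mathbb{E}_{P_X \otimes P_Z}\bigl[e^{T_\theta(X, Z)}\bigr].
\end{align}
```

Training the critic to maximize the right-hand side tightens the bound, turning mutual information estimation into an ordinary gradient-based optimization problem.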
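To make the pipeline concrete, here is a minimal PyTorch sketch of how such an estimator can drive a compressed embedding: an encoder shrinks fixed sentence embeddings while DV critics estimate mutual information with a target label (to preserve) and a sensitive label (to filter out). All dimensions, architectures, the binary labels, the alternating update scheme, and the trade-off weight BETA are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

EMBED_DIM, CODE_DIM, BETA = 768, 64, 1.0   # assumed sizes and trade-off weight

class Critic(nn.Module):
    """Donsker-Varadhan critic T(z, y) scoring embedding/label pairs."""
    def __init__(self, code_dim: int, label_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + label_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, y], dim=-1))

def dv_mi_lower_bound(critic: Critic, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """E_joint[T] - log E_marginals[exp T]; marginal samples from shuffling y."""
    joint = critic(z, y).mean()
    y_marg = y[torch.randperm(y.size(0))]          # break the (z, y) pairing
    log_mean_exp = torch.logsumexp(critic(z, y_marg), dim=0) - math.log(y.size(0))
    return joint - log_mean_exp.squeeze()

# Encoder that compresses the initial sentence embedding (e.g. produced by
# a benchmark language model) into a smaller code.
encoder = nn.Sequential(nn.Linear(EMBED_DIM, 256), nn.ReLU(),
                        nn.Linear(256, CODE_DIM))
critic_target, critic_sens = Critic(CODE_DIM, 1), Critic(CODE_DIM, 1)

opt_critics = torch.optim.Adam(
    list(critic_target.parameters()) + list(critic_sens.parameters()), lr=1e-4)
opt_encoder = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Placeholder batch: random "sentence embeddings" with binary labels.
x = torch.randn(32, EMBED_DIM)
y_target = torch.randint(0, 2, (32, 1)).float()    # information to preserve
y_sens = torch.randint(0, 2, (32, 1)).float()      # information to filter out

# Step 1: tighten both DV bounds by updating only the critics; the bounds
# are only valid MI estimates when the critics are near their maxima.
z = encoder(x).detach()
critic_loss = -(dv_mi_lower_bound(critic_target, z, y_target)
                + dv_mi_lower_bound(critic_sens, z, y_sens))
opt_critics.zero_grad(); critic_loss.backward(); opt_critics.step()

# Step 2: update the encoder to keep target information and shed sensitive
# information, as measured by the current critics. (In a training loop,
# opt_critics.zero_grad() above clears the stale critic grads this leaves.)
z = encoder(x)
enc_loss = -(dv_mi_lower_bound(critic_target, z, y_target)
             - BETA * dv_mi_lower_bound(critic_sens, z, y_sens))
opt_encoder.zero_grad(); enc_loss.backward(); opt_encoder.step()
```

The alternation matters: the critics must maximize their bounds for the estimates to track mutual information, while the encoder plays against the sensitive critic, which is what lets a single objective trade compression fidelity against privacy.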