Compressibility of Distributed Document Representations

Blaž Škrlj, Matej Petković
{"title":"分布式文档表示的可压缩性","authors":"Blaž Škrlj, Matej Petković","doi":"10.1109/ICDM51629.2021.00166","DOIUrl":null,"url":null,"abstract":"Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec or similar. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is many times unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tunning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient to both significantly compress the initial representation, but also potentially improve its performance when considering the task of text classification. Having smaller and less noisy representations is the desired property during deployment, as orders of magnitude smaller models can significantly reduce the computational overload and with it the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. The CORE’S performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explored CORE’S behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between the compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"175 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Compressibility of Distributed Document Representations\",\"authors\":\"Blaž Škrlj, Matej Petković\",\"doi\":\"10.1109/ICDM51629.2021.00166\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec or similar. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is many times unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tunning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient to both significantly compress the initial representation, but also potentially improve its performance when considering the task of text classification. 
Having smaller and less noisy representations is the desired property during deployment, as orders of magnitude smaller models can significantly reduce the computational overload and with it the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. The CORE’S performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explored CORE’S behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between the compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.\",\"PeriodicalId\":320970,\"journal\":{\"name\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"volume\":\"175 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDM51629.2021.00166\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM51629.2021.00166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can not only significantly compress the initial representation but also potentially improve its performance on the task of text classification. Smaller and less noisy representations are desirable during deployment, as orders-of-magnitude smaller models can significantly reduce the computational overhead and, with it, the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. CORE's performance is showcased and studied on a collection of 17 real-life corpora from the biomedical, news, social media, and literary domains. We explored CORE's behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results, based on more than 100,000 compression experiments, indicate that recursive Singular Value Decomposition offers a very good trade-off between compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.
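
The abstract describes CORE only at a high level; the following is a minimal, hypothetical sketch of the recursive SVD idea it mentions, not the authors' implementation. It repeatedly halves the dimension of a document-embedding matrix with scikit-learn's TruncatedSVD until a target dimension is reached; the function name, the halving schedule, and the target dimension are assumptions made purely for illustration.

    # Minimal sketch (assumed, not the paper's CORE code): recursively compress
    # document embeddings with truncated SVD, halving the dimension each step.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def recursive_svd_compress(embeddings: np.ndarray, target_dim: int) -> np.ndarray:
        """Repeatedly apply truncated SVD, halving the dimension until target_dim is reached."""
        compressed = embeddings
        while compressed.shape[1] > target_dim:
            # Halve the current dimension, but never drop below the target.
            next_dim = max(target_dim, compressed.shape[1] // 2)
            svd = TruncatedSVD(n_components=next_dim, random_state=0)
            compressed = svd.fit_transform(compressed)
        return compressed

    if __name__ == "__main__":
        # Toy data: 1,000 documents with 768-dimensional embeddings,
        # compressed down to 64 dimensions over several recursive steps.
        docs = np.random.rand(1000, 768)
        small = recursive_svd_compress(docs, target_dim=64)
        print(small.shape)  # (1000, 64)

In a real pipeline, the compressed representations would replace the original embeddings as input to a downstream text classifier, with the compression level treated as a tunable parameter.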