Compressibility of Distributed Document Representations

Blaž Škrlj, Matej Petković
{"title":"分布式文档表示的可压缩性","authors":"Blaž Škrlj, Matej Petković","doi":"10.1109/ICDM51629.2021.00166","DOIUrl":null,"url":null,"abstract":"Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec or similar. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is many times unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tunning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient to both significantly compress the initial representation, but also potentially improve its performance when considering the task of text classification. Having smaller and less noisy representations is the desired property during deployment, as orders of magnitude smaller models can significantly reduce the computational overload and with it the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. The CORE’S performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explored CORE’S behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between the compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"175 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Compressibility of Distributed Document Representations\",\"authors\":\"Blaž Škrlj, Matej Petković\",\"doi\":\"10.1109/ICDM51629.2021.00166\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec or similar. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is many times unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tunning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient to both significantly compress the initial representation, but also potentially improve its performance when considering the task of text classification. 
Having smaller and less noisy representations is the desired property during deployment, as orders of magnitude smaller models can significantly reduce the computational overload and with it the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. The CORE’S performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explored CORE’S behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between the compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.\",\"PeriodicalId\":320970,\"journal\":{\"name\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"volume\":\"175 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Data Mining (ICDM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDM51629.2021.00166\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM51629.2021.00166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can not only significantly compress the initial representation but also potentially improve its performance on the task of text classification. Smaller and less noisy representations are desirable during deployment, as orders-of-magnitude smaller models can significantly reduce the computational overhead and, with it, the deployment costs. We propose CORE, a straightforward, compression-agnostic framework suitable for representation compression. CORE's performance is showcased and studied on a collection of 17 real-life corpora from the biomedical, news, social media, and literary domains. We explored CORE's behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results, based on more than 100,000 compression experiments, indicate that recursive Singular Value Decomposition offers a very good trade-off between compression efficiency and performance, making CORE useful in many existing, representation-dependent NLP pipelines.
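
The abstract describes CORE only at a high level; the following is a minimal, hypothetical sketch of the recursive SVD idea it mentions, not the authors' implementation. It repeatedly halves the dimension of a document-embedding matrix with scikit-learn's TruncatedSVD until a target dimension is reached; the function name, the halving schedule, and the target dimension are assumptions made purely for illustration.

    # Minimal sketch (assumed, not the paper's CORE code): recursively compress
    # document embeddings with truncated SVD, halving the dimension each step.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def recursive_svd_compress(embeddings: np.ndarray, target_dim: int) -> np.ndarray:
        """Repeatedly apply truncated SVD, halving the dimension until target_dim is reached."""
        compressed = embeddings
        while compressed.shape[1] > target_dim:
            # Halve the current dimension, but never drop below the target.
            next_dim = max(target_dim, compressed.shape[1] // 2)
            svd = TruncatedSVD(n_components=next_dim, random_state=0)
            compressed = svd.fit_transform(compressed)
        return compressed

    if __name__ == "__main__":
        # Toy data: 1,000 documents with 768-dimensional embeddings,
        # compressed down to 64 dimensions over several recursive steps.
        docs = np.random.rand(1000, 768)
        small = recursive_svd_compress(docs, target_dim=64)
        print(small.shape)  # (1000, 64)

In a real pipeline, the compressed representations would replace the original embeddings as input to a downstream text classifier, with the compression level treated as a tunable parameter.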