Lifting the Curse: Exploring Dimensionality Reduction on Text Clustering Applications

2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) Pub Date : 2022-07-18 DOI:10.1109/IISA56318.2022.9904383

Leonidas Akritidis, Panayiotis Bozanis


Citations: 0

Abstract

Huge amounts of text are now generated on the Web by a vast number of applications, including instant messengers, social networks, e-mail clients, news portals, blog communities, and commercial platforms. The need to effectively identify documents with similar content in these services has made text clustering one of the most prominent emerging problems in machine learning. However, the high dimensionality and natural sparseness of text introduce significant challenges that threaten the feasibility of even the most successful algorithms. Consequently, dimensionality reduction techniques become crucial for this problem. Motivated by these challenges, in this article we investigate the impact of dimensionality reduction on the performance of text clustering algorithms. More specifically, we experimentally analyze its effects on the effectiveness and running times of eight clustering algorithms using six high-dimensional text datasets. The results indicate that, in most cases, dimensionality reduction may significantly reduce algorithm execution times while sacrificing only a small amount of clustering quality.
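The kind of comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the abstract does not name the reduction techniques, clustering algorithms, or datasets used, so the Gaussian random projection, the plain k-means, the purity score, and the six-sentence toy corpus below are all stand-ins chosen for this example, not the authors' setup. Timings on a corpus this small are not meaningful; the point is the shape of the pipeline (vectorize, reduce, cluster, compare quality and time).

```python
import math
import random
import time
from collections import Counter


def tf_vectors(docs):
    """Dense, L2-normalised term-frequency vectors, one per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w, c in Counter(d.lower().split()).items():
            v[index[w]] = float(c)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def random_projection(vectors, k, seed=0):
    """Project vectors onto k dimensions with a Gaussian random matrix.

    Assumption: random projection is used here only as a simple,
    dependency-free example of dimensionality reduction.
    """
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [[sum(v[i] * R[i][j] for i in range(d)) for j in range(k)]
            for v in vectors]


def kmeans(vectors, n_clusters, iters=20, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, n_clusters)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center for every vector.
        labels = [
            min(range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of documents that belong to their cluster's majority class."""
    total = 0
    for c in set(labels):
        classes = [t for l, t in zip(labels, truth) if l == c]
        total += Counter(classes).most_common(1)[0][1]
    return total / len(labels)


# Hypothetical toy corpus: three finance and three sports sentences.
docs = [
    "stock market shares rise", "market investors buy shares",
    "bank raises interest rates", "team wins football match",
    "football player scores goal", "coach praises team players",
]
truth = [0, 0, 0, 1, 1, 1]

full = tf_vectors(docs)                  # one dimension per vocabulary term
reduced = random_projection(full, k=5)   # 5 dimensions after projection

for name, data in (("full", full), ("reduced", reduced)):
    start = time.perf_counter()
    labels = kmeans(data, n_clusters=2)
    elapsed = time.perf_counter() - start
    print(f"{name}: dims={len(data[0])} "
          f"purity={purity(labels, truth):.2f} time={elapsed:.6f}s")
```

On real high-dimensional text collections one would normally use sparse matrices and a library implementation of both steps; the pure-Python version here is kept only so the sketch runs without third-party packages.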

