Biclustering and coclustering: concepts, algorithms and viability for text mining

Research initiative, treatment action : RITA Pub Date : 2019-08-03 DOI:10.22456/2175-2745.89063

Alexandra Katiuska Ramos Diaz, S. M. Peres

Journal: Research initiative, treatment action : RITA, volume 105 1, pages 81-117. Publication type: Journal Article. Source platform: Semanticscholar.

Citations: 4

Abstract

Biclustering and coclustering are data mining tasks that extract relevant information from data by applying similarity criteria simultaneously to the rows and columns of a data matrix. Algorithms for these tasks cluster objects and attributes at the same time, enabling the discovery of biclusters or coclusters. Although similar, the two tasks differ in nature and aim, and coclustering can be seen as a generalization of biclustering. A careful study of biclustering and coclustering algorithms is essential for applying them effectively to real-world problems. Choosing appropriate values for the parameters of these algorithms is even more difficult when complex real-world data are analyzed. For example, when biclustering or coclustering is applied to textual data (i.e., in text mining), the data must be represented through a vector space model. Such a representation usually produces vector spaces with a high number of dimensions and high sparsity, which affects the performance of many algorithms. This tutorial presents, in a didactic manner, the concepts related to the biclustering and coclustering tasks and how two basic algorithms address them. In addition, experiments are reported on data with a high number of dimensions and high sparsity, namely a synthetic dataset and a corpus of real-world news. In general and comparative terms, the results indicate that the coclustering algorithm (NBVD) is the most appropriate for the context of the experiments. Although the biclustering algorithm (Cheng and Church) produced less relevant results than NBVD on textual data, applying it to high-dimensional, highly sparse data provided a suitable setting for studying how it operates.
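To make the abstract's two central ideas concrete, the sketch below builds a tiny bag-of-words vector space model, measures its sparsity, and scores candidate biclusters with the mean squared residue, the coherence measure commonly associated with Cheng and Church's algorithm. The abstract gives no formulas or code, so everything here is an assumption for illustration: the four toy documents, the whitespace tokenizer, and the function names `term_document_matrix`, `sparsity` and `mean_squared_residue` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy corpus; the paper's news corpus is not reproduced here.
docs = [
    "stocks fall as markets slide",
    "markets rally as stocks rise",
    "team wins final match",
    "match ends as team wins",
]

def term_document_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # a Counter yields 0 for terms absent from the document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by `rows` and `cols`.

    Each residue is a_ij - (row mean) - (column mean) + (overall mean);
    a score of 0 means the submatrix is perfectly coherent.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

vocab, matrix = term_document_matrix(docs)
print("terms:", len(vocab), "sparsity:", round(sparsity(matrix), 3))

# Two finance documents over two finance terms: a coherent candidate bicluster.
finance_cols = [vocab.index("markets"), vocab.index("stocks")]
print("finance block:", mean_squared_residue(matrix, [0, 1], finance_cols))

# One finance and one sports document over unrelated terms: less coherent.
mixed_cols = [vocab.index("markets"), vocab.index("team")]
print("mixed block:", mean_squared_residue(matrix, [0, 2], mixed_cols))
```

Even with four short documents most cells of the matrix are zero, which hints at why real corpora yield the high-dimensional, highly sparse matrices the abstract describes. A block of zeros is also perfectly coherent under this score, which is one reason sparse text data is a demanding setting for residue-based biclustering.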
