Authors: Alexandra Katiuska Ramos Diaz, S. M. Peres
DOI: 10.22456/2175-2745.89063 (https://doi.org/10.22456/2175-2745.89063)
Journal: Research initiative, treatment action : RITA
Published: 2019-08-03
Publication type: Journal Article
Biclustering and coclustering: concepts, algorithms and viability for text mining
Biclustering and coclustering are data mining tasks that extract relevant information from data by applying similarity criteria simultaneously to the rows and columns of a data matrix. Algorithms for these tasks cluster objects and attributes at the same time, enabling the discovery of biclusters or coclusters. Although similar, the two tasks differ in nature and aim, and coclustering can be seen as a generalization of biclustering. A careful study of biclustering and coclustering algorithms is essential to solving real-world problems effectively. Determining appropriate parameter values for these algorithms is especially difficult when complex real-world data are analyzed. For example, when biclustering or coclustering is applied to textual data (i.e., in text mining), the text must first be represented through a vector space model. Such a representation usually yields vector spaces with a high number of dimensions and high sparsity, which affects the performance of many algorithms. This tutorial aims to present, in a didactic way, the concepts related to the biclustering and coclustering tasks and how two basic algorithms address them. In addition, experiments are presented on data with a high number of dimensions and high sparsity, represented by both a synthetic dataset and a corpus of real-world news. In general and comparative terms, the results show the coclustering algorithm (i.e., NBVD) to be the most appropriate for the context of these experiments. Although the biclustering algorithm (i.e., Cheng and Church) produced less relevant results on textual data than NBVD, applying it to data with a high number of dimensions and high sparsity provided a suitable study environment for understanding its operation.
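To make the vector space representation and its sparsity concrete, the following is a minimal sketch in plain Python. It is not the authors' experimental setup: the toy corpus, the whitespace tokenizer, raw term counts as weights, and the helper names `term_document_matrix`, `sparsity` and `mean_squared_residue` are all assumptions made for illustration. The last function computes the mean squared residue, the coherence score that Cheng and Church's algorithm seeks to keep low for a bicluster.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a bag-of-words matrix: one row per document, one column per term.

    Tokenization by whitespace and raw counts are simplifying assumptions.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of matrix entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the submatrix given by rows I and columns J.

    Each residue is a_ij - rowmean_i - colmean_j + overall_mean, computed
    inside the submatrix; H is the average of the squared residues.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy corpus: two "finance" and two "sports" documents.
docs = [
    "market stocks rise",
    "stocks market fall",
    "team wins match",
    "match team loses",
]
vocabulary, X = term_document_matrix(docs)

print(vocabulary)
print("sparsity:", sparsity(X))

# Documents 0-1 restricted to the terms "market" and "stocks": a coherent block.
finance_cols = [vocabulary.index("market"), vocabulary.index("stocks")]
print("coherent block:", mean_squared_residue(X, [0, 1], finance_cols))

# Documents 0 and 2 restricted to "market" and "team": an incoherent block.
mixed_cols = [vocabulary.index("market"), vocabulary.index("team")]
print("mixed block:", mean_squared_residue(X, [0, 2], mixed_cols))
```

Even in this tiny corpus most entries are zero, and the proportion of zeros typically grows as the vocabulary grows. Note also that any constant submatrix, including one made entirely of zeros, has a mean squared residue of zero, which hints at why highly sparse data can be a demanding setting for a residue-based biclustering algorithm.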