Oussama Chabih, Sara Sbai, Mohammed Reda Chbihi Louhdi, Hicham Behja
International Journal of Advanced Trends in Computer Science and Engineering, published 2024-06-08. DOI: 10.30534/ijatcse/2024/021332024 (https://doi.org/10.30534/ijatcse/2024/021332024)
Optimizing Text Clustering: A Methodological Approach for Determining the Optimal Number of Clusters
Determining the optimal number of clusters is a central problem in text clustering, where the sheer volume of variation in the data makes the choice difficult. Our study addresses this problem in the setting of unsupervised text analysis. We propose an approach that combines the K-means algorithm with a Bregman distance, chosen to suit the characteristics of textual data. The method is iterative and has two aims: to mitigate the effects of noise and to ensure the stability of the resulting clusters, using the Kullback-Leibler divergence as the underlying metric. In our experiments, the method segmented texts into coherent clusters and outperformed an initial categorization, giving a more representative picture of the topics present in the corpus. The study thus offers a promising direction for unsupervised text analysis and for further work in this field.
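The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: K-means in which the squared Euclidean distance is replaced by a Bregman divergence, here the Kullback-Leibler divergence between term distributions. It is not the authors' implementation and does not reproduce their iterative procedure for choosing the number of clusters or for handling noise. The function names (`to_distributions`, `kl_kmeans`), the smoothing constant `alpha`, the optional `init` argument, and the toy term-count matrix are all assumptions made for illustration. The sketch relies on one known property of Bregman divergences: the centroid that minimizes the total divergence from the cluster members is their arithmetic mean, so the update step is the same as in ordinary K-means.

```python
import numpy as np


def to_distributions(counts, alpha=1e-3):
    """Turn term counts into smoothed probability vectors (rows sum to 1).

    alpha is an assumed additive smoothing constant; it keeps every
    entry strictly positive so that the KL divergence stays finite.
    """
    smoothed = np.asarray(counts, dtype=float) + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def kl_kmeans(P, k, n_iter=100, seed=0, init=None):
    """Hard clustering of distributions P (n_docs x n_terms) under KL divergence.

    init: optional list of k row indices used as starting centroids
    (hypothetical convenience parameter); random rows are used otherwise.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(P), size=k, replace=False) if init is None else np.asarray(init)
    centroids = P[idx].copy()
    labels = np.full(len(P), -1)
    log_P = np.log(P)
    for _ in range(n_iter):
        # KL(p || c) for every document/centroid pair, shape (n_docs, k).
        div = np.sum(
            P[:, None, :] * (log_P[:, None, :] - np.log(centroids)[None, :, :]),
            axis=2,
        )
        new_labels = div.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for j in range(k):
            members = P[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy term-count matrix (made-up data): six documents over four terms.
counts = np.array([
    [5, 4, 0, 0],
    [6, 3, 0, 0],
    [4, 5, 1, 0],
    [0, 0, 5, 4],
    [0, 1, 6, 3],
    [0, 0, 4, 5],
])
P = to_distributions(counts)
labels, centroids = kl_kmeans(P, k=2, init=[0, 3])
print(labels)
```

Because KL(p || c) is asymmetric, the order of the arguments matters: with the document as the first argument, the mean is the optimal centroid. Reversing the order would require a different update rule.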