Text categorization using the semi-supervised fuzzy c-means algorithm

18th International Conference of the North American Fuzzy Information Processing Society - NAFIPS (Cat. No.99TH8397). Pub Date: 1999-06-10. DOI: 10.1109/NAFIPS.1999.781756

M. Benkhalifa


Citations: 41


Text categorization using the semi-supervised fuzzy c-means algorithm

Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has become very important in information retrieval, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. We compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM) algorithm (M. Amine et al., 1996). This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments use the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC. Finally, ssFCM yields improved performance and faster execution time as more weight is given to training documents.
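The idea of combining labeled and unlabeled documents in one fuzzy clustering can be illustrated with a minimal sketch. This is not the formulation from the cited ssFCM work. It assumes that labeled documents keep fixed crisp memberships, that a single scalar `weight` scales their influence on the cluster centres, and that every class has at least one labeled example. The function name `ssfcm_sketch` and all parameters are hypothetical.

```python
import math


def _memberships(points, centres, m):
    """Standard FCM membership update for each point given the centres."""
    result = []
    for x in points:
        # Guard against a zero distance when a point coincides with a centre.
        dists = [max(math.dist(x, v), 1e-12) for v in centres]
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        result.append([value / total for value in inv])
    return result


def ssfcm_sketch(labeled, labels, unlabeled, n_clusters,
                 weight=1.0, m=2.0, n_iter=50, tol=1e-6):
    """Simplified semi-supervised fuzzy c-means (illustrative assumption only).

    labeled   : list of feature vectors with known class
    labels    : class index (0..n_clusters-1) for each labeled vector
    unlabeled : list of feature vectors to be partitioned
    weight    : influence of each labeled vector on the centres
    """
    dim = len(labeled[0])

    # Initialise each centre as the mean of the labeled vectors of its class.
    centres = []
    for c in range(n_clusters):
        members = [x for x, y in zip(labeled, labels) if y == c]
        centres.append([sum(col) / len(members) for col in zip(*members)])

    for _ in range(n_iter):
        u = _memberships(unlabeled, centres, m)

        new_centres = []
        for c in range(n_clusters):
            num = [0.0] * dim
            den = 0.0
            # Labeled vectors contribute with crisp membership times weight.
            for x, y in zip(labeled, labels):
                if y == c:
                    for j in range(dim):
                        num[j] += weight * x[j]
                    den += weight
            # Unlabeled vectors contribute with fuzzy membership u^m.
            for x, ux in zip(unlabeled, u):
                w = ux[c] ** m
                for j in range(dim):
                    num[j] += w * x[j]
                den += w
            new_centres.append([value / den for value in num])

        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break

    # Recompute memberships so they match the returned centres.
    return centres, _memberships(unlabeled, centres, m)


if __name__ == "__main__":
    centres, u = ssfcm_sketch(
        labeled=[[0.0, 0.0], [10.0, 10.0]], labels=[0, 1],
        unlabeled=[[1.0, 0.0], [9.0, 10.0]], n_clusters=2, weight=5.0)
    print(centres)
    print(u)
```

In this sketch, raising `weight` pulls each centre toward its labeled documents, which mirrors the abstract's remark about giving more weight to training documents. Whether it reproduces the reported speed or accuracy gains is not something the sketch can show.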

