A novel approach for feature selection method TF-IDF in document clustering

2013 3rd IEEE International Advance Computing Conference (IACC)
Pub Date: 2013-05-13
DOI: 10.1109/IADCC.2013.6514339

L. Patil, M. Atique


Citations: 42

Abstract



Text documents on the internet, in e-mail, and on web pages are growing rapidly and are stored in electronic database formats, which makes them difficult to organize and browse. To address this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important steps in data mining. This paper proceeds in stages. First, documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain the relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. In addition, the terms of every document are preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the number of attributes and to find an effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on the Reuters Transcription Subsets wheat, trade, money grain, and ship.
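The stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact tf-idf, tf-df, or tf2 formulas, so the example assumes the common definition tf-idf = tf × ln(N / df) and covers only that one selection method. Porter stemming and the WordNet step are omitted to keep the example dependency-free, and the stop-word list, function names, sample documents, and threshold values are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the count of term j in document i."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows


def select_terms_tfidf(vocab, rows, min_score):
    """Keep terms whose highest tf-idf score over all documents reaches min_score.

    Assumes tf-idf = tf * ln(N / df); the paper may use a different variant.
    """
    n_docs = len(rows)
    selected = []
    for j, term in enumerate(vocab):
        df = sum(1 for row in rows if row[j] > 0)  # document frequency
        idf = math.log(n_docs / df)
        best = max(row[j] * idf for row in rows)
        if best >= min_score:
            selected.append(term)
    return selected


# Made-up sample documents, not taken from the Reuters data.
docs = [
    "wheat prices rise in the grain market",
    "the ship carries wheat and grain",
    "trade talks on money and trade policy",
]
vocab, rows = tf_matrix(docs)
selected = select_terms_tfidf(vocab, rows, min_score=1.0)
```

In this toy corpus, "wheat" and "grain" appear in two of the three documents, so their tf-idf scores fall below the threshold and they are dropped, while terms confined to a single document are kept. Raising `min_score` shrinks the attribute set further, which is the attribute-reduction effect the abstract describes.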
