Modified TF-IDF Term Weighting Strategies for Text Categorization
R. Roul, J. Sahoo, Kushagr Arora
2017 14th IEEE India Council International Conference (INDICON), December 2017
DOI: https://doi.org/10.1109/INDICON.2017.8487593
Text mining is a well-known technique in information retrieval that derives high-quality information from text. Developing strategies for such text processing requires an appropriate domain representation. The vectorized Term Frequency and Inverse Document Frequency (TF-IDF) representation of documents is one of the strategies currently in use. Traditional TF-IDF combines term frequencies and document frequencies into a weight for each term, and these weights are used to represent the document. This method works sufficiently well, but it is quite simplistic and overlooks details that should ideally be relevant when processing text, such as document length and frequency distribution. To handle those shortcomings, this paper proposes four vector representations of documents, each a modified version of traditional TF-IDF. To check the performance of the proposed techniques, different state-of-the-art classifiers are used to classify a corpus of documents. Experimental results on different benchmark datasets show that the classifiers perform better with the proposed techniques than with traditional TF-IDF.
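
For orientation, the traditional weighting that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration that assumes the common textbook form (raw term count multiplied by log(N / df)); the abstract does not state which TF-IDF variant the authors use, and the four modified schemes are not described there, so none of them is reproduced here. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed form: weight = raw term count * log(N / df), where N is the
    number of documents and df is the number of documents containing
    the term. This is an illustrative baseline, not the paper's method.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document it occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical), already split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vectors = tf_idf(docs)
```

Because this form uses the raw count, the weight of a term grows with the length of the document it appears in, which is one example of the kind of detail (document length) that the abstract says traditional TF-IDF overlooks.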