Modified TF-IDF Term Weighting Strategies for Text Categorization

R. Roul, J. Sahoo, Kushagr Arora
Published in: 2017 14th IEEE India Council International Conference (INDICON)
Publication date: 2017-12-01
DOI: 10.1109/INDICON.2017.8487593 (https://doi.org/10.1109/INDICON.2017.8487593)
Citations: 18

Abstract

Text mining is a well-known technique in information retrieval for deriving high-quality information from text. Developing strategies for such text processing requires an appropriate representation of the domain. The vectorized Term Frequency and Inverse Document Frequency (TF-IDF) representation of documents is one of the strategies currently in use. Traditional TF-IDF combines term frequencies and document frequencies into a weight for each term, and these weights are used to represent the document. This method works reasonably well, but it is simplistic and overlooks details that should ideally matter when processing text, such as document length and frequency distribution. To address these shortcomings, this paper proposes four vector representations of documents, each a modified version of traditional TF-IDF. To evaluate the proposed techniques, different state-of-the-art classifiers are used to classify a corpus of documents. Experimental results on different benchmark datasets show that the classifiers perform better with the proposed techniques than with traditional TF-IDF.
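
For orientation, the traditional baseline that the abstract refers to can be sketched as follows. This is only the textbook TF-IDF weighting, not any of the four modified representations proposed in the paper, which the abstract does not specify. The function name `tf_idf`, the length-normalized term frequency, the unsmoothed logarithmic IDF, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook variant: tf = count / document length,
    idf = ln(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cats and dogs".split(),
]
w = tf_idf(docs)
```

In this sketch a term that occurs in every document (here "the") receives weight zero, while a term confined to one document (here "cat") receives the largest IDF factor. Document length enters only through the simple division by `total`.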