Algorithm for updating n-grams word dictionary for web classification

T. Abidin, R. Ferdhiana
2016 International Conference on Informatics and Computing (ICIC). DOI: 10.1109/IAC.2016.7905758 (https://doi.org/10.1109/IAC.2016.7905758)
Citations: 5

Abstract

In this paper, we examine an algorithm for updating an n-gram word dictionary (thesaurus) and evaluate its effectiveness on a binary classification problem. The thesaurus serves as the reference from which the numerical feature attributes of web pages are generated. Typically, an n-gram word dictionary is built once from a set of training data and its content is never updated. Its content is therefore static, and its coverage is limited to the n-grams found in the initial training set. Because the dictionary is used repeatedly as the reference for generating the numerical feature attributes of web pages, its content should instead be dynamic. We argue that a dynamic thesaurus is better than a static one in the long term, so the n-gram word dictionary should be updated frequently with new data without degrading classification accuracy. We validate the proposed algorithm on several test sets, each of which contains one hundred web pages, except for the last one. The experimental results show that the proposed algorithm works well. On average, the accuracy obtained with the feature dataset generated from the existing (old) dictionary is 57.75%, while the accuracy obtained with the feature dataset generated from the updated (new) dictionary is 76.75%. This is a gain of 19 percentage points, or a relative increase in classification accuracy of about 32.90%.
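The abstract does not give the update procedure itself, so the following is only a minimal Python sketch of the general idea: a dictionary of n-grams is extended with n-grams from new pages, and a page's numerical feature attributes are its n-gram counts over the dictionary entries. The function names, the whitespace tokenizer, the `n_max` limit, and the `min_count` admission threshold are all assumptions made for illustration and are not taken from the paper.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def extract_ngrams(text: str, n_max: int = 2) -> list[str]:
    """Return all word n-grams of length 1..n_max (assumed whitespace tokenizer)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def update_dictionary(dictionary: Counter, pages: Iterable[str],
                      n_max: int = 2, min_count: int = 2) -> Counter:
    """Return a new dictionary extended with n-grams from new pages.

    Hypothetical rule: existing entries always have their counts refreshed;
    a previously unseen n-gram is admitted only if it occurs at least
    min_count times in the new pages.
    """
    new_counts = Counter()
    for page in pages:
        new_counts.update(extract_ngrams(page, n_max))

    updated = Counter(dictionary)  # copy, so the old dictionary is untouched
    for gram, count in new_counts.items():
        if gram in updated or count >= min_count:
            updated[gram] += count
    return updated


def feature_vector(page: str, dictionary: Counter, n_max: int = 2) -> list[int]:
    """Count each dictionary n-gram in the page, in sorted dictionary order."""
    counts = Counter(extract_ngrams(page, n_max))
    return [counts[gram] for gram in sorted(dictionary)]


old_dictionary = Counter({"cheap flights": 3})
new_pages = [
    "book cheap flights today",
    "cheap hotel deals",
    "cheap hotel offers",
]
new_dictionary = update_dictionary(old_dictionary, new_pages)

page = "cheap hotel in town"
old_features = feature_vector(page, old_dictionary)  # no dictionary n-gram matches
new_features = feature_vector(page, new_dictionary)  # new entries now match
```

In this toy run the old dictionary yields an all-zero feature vector for the page, while the updated dictionary yields non-zero counts, which is the coverage effect the abstract describes. The reported relative improvement can be checked directly from the two averages: (76.75 - 57.75) / 57.75 ≈ 0.3290.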