Algorithm for updating n-grams word dictionary for web classification
T. Abidin, R. Ferdhiana
2016 International Conference on Informatics and Computing (ICIC)
DOI: 10.1109/IAC.2016.7905758 (https://doi.org/10.1109/IAC.2016.7905758)
Citations: 5
Abstract
In this paper, we examine an algorithm for updating an n-gram word dictionary (thesaurus) and evaluate its effectiveness on a binary classification problem. The thesaurus serves as the reference for generating the numerical feature attributes of web pages. Typically, an n-gram word dictionary is built once from a set of training data and its content is never updated. Its content is therefore static, and its coverage is limited to the n-grams found in the initial training set. Because the dictionary is used repeatedly as a reference when generating the numerical feature attributes of web pages, its content should instead be dynamic. We argue that a dynamic thesaurus is better than a static one in the long term, and thus that the n-gram word dictionary should be updated frequently with new data without degrading classification accuracy. We validate the proposed algorithm on several test sets, each containing one hundred web pages except for the last one. The experimental results show that the proposed algorithm works well: on average, the accuracy on the feature dataset generated with the existing (old) dictionary is 57.75%, while the accuracy on the feature dataset generated with the updated (new) dictionary is 76.75%. This is a gain of 19 percentage points, or a relative increase in classification accuracy of about 32.90%.
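The abstract does not describe the update algorithm itself, so the following is only a minimal sketch of the general idea: a static n-gram dictionary cannot represent pages whose vocabulary was absent from the initial training set, whereas an updated one can. Everything here is an assumption made for illustration, not the authors' method: the function names (`ngrams`, `build_dictionary`, `update_dictionary`, `features`), whitespace tokenization, the use of unigrams and bigrams, document-frequency counts, and the toy page texts.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the word n-grams (n = 1..n_max) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_dictionary(pages, n_max=2):
    """Map each n-gram to the number of pages that contain it (assumed weighting)."""
    dictionary = Counter()
    for page in pages:
        dictionary.update(set(ngrams(page, n_max)))
    return dictionary


def update_dictionary(dictionary, new_pages, n_max=2):
    """Return a copy of the dictionary extended with the n-grams of new pages."""
    updated = Counter(dictionary)
    for page in new_pages:
        updated.update(set(ngrams(page, n_max)))
    return updated


def features(page, dictionary, n_max=2):
    """Count-based feature vector with one entry per dictionary n-gram, in sorted order."""
    counts = Counter(ngrams(page, n_max))
    return [counts[gram] for gram in sorted(dictionary)]


# Hypothetical toy data: the initial training pages and one later batch.
old = build_dictionary(["cheap flights online", "book cheap hotels"])
new = update_dictionary(old, ["compare travel insurance online"])

page = "travel insurance deals"
old_vec = features(page, old)  # no overlap with the old dictionary
new_vec = features(page, new)  # overlaps with the newly added n-grams
```

In this toy setup the page shares no n-grams with the old dictionary, so `old_vec` is all zeros and carries no signal for a classifier, while `new_vec` has three non-zero entries ("travel", "insurance", and "travel insurance"). A practical update step would likely also need pruning or weighting rules to keep the dictionary from growing without bound; the abstract does not say how the paper handles this.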