An efficient scheme for automatic web pages categorization using the support vector machine

IF 1.4; Zone 4, Computer Science; JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS
V. Bhalla, N. Kumar
Journal: New Review of Hypermedia and Multimedia
Volume: 22 1
Pages: 223 - 242
Publication date: 2016-07-01
Publication type: Journal Article
DOI: https://doi.org/10.1080/13614568.2016.1152316
Citations: 13

Abstract

Over the past few years, as the Internet and related technologies have evolved, the number of Internet users has grown exponentially. These users expect access to relevant web pages within a fraction of a second. Meeting this expectation requires efficient categorization of web page content. Manually categorizing billions of web pages with high accuracy is a challenging task, and most existing techniques reported in the literature are semi-automatic and cannot achieve a high level of accuracy. To address these limitations, this paper proposes a scheme for automatically categorizing web pages into domain categories. The scheme is based on identifying specific and relevant features of web pages: features are first extracted and evaluated, and the feature set is then filtered for the categorization of domain web pages. A feature extraction tool based on the HTML Document Object Model (DOM) of the web page is developed as part of the scheme. Feature extraction and weight assignment rely on a domain-specific keyword list compiled from various domain pages. The keyword list is further reduced on the basis of the IDs of the keywords in the list, and keywords and tag text are stemmed to improve accuracy. An extensive feature set is generated to support a robust classification technique. The scheme was evaluated using a machine learning method that combines feature extraction and statistical analysis, with a support vector machine kernel as the classification tool. The results confirm the effectiveness of the proposed scheme in terms of its accuracy across different categories of web pages.
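The abstract gives no implementation details, so the following is only a minimal sketch of the kind of DOM-based keyword feature extraction it describes, under stated assumptions. The tag weights, the keyword list, the suffix-stripping stemmer, and all function and class names are hypothetical illustrations, not the authors' tool. The keyword-list reduction by keyword ID is not modelled here.

```python
from html.parser import HTMLParser
import re

# Hypothetical tag weights: text inside more prominent tags counts more.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0

# Hypothetical domain keyword list, stored in stemmed form.
KEYWORDS = ["cours", "lectur", "exam"]


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class KeywordFeatureParser(HTMLParser):
    """Walks the HTML and accumulates a weighted score per keyword."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = list(keywords)
        self.scores = dict.fromkeys(self.keywords, 0.0)
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the highest weight among the enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for token in re.findall(r"[a-z]+", data.lower()):
            stem = crude_stem(token)
            if stem in self.scores:
                self.scores[stem] += weight


def extract_features(html, keywords=KEYWORDS):
    """Return one weighted score per keyword, in keyword-list order."""
    parser = KeywordFeatureParser(keywords)
    parser.feed(html)
    parser.close()
    return [parser.scores[k] for k in parser.keywords]


SAMPLE_PAGE = (
    "<html><head><title>Course catalogue</title></head>"
    "<body><h1>Exam dates</h1>"
    "<p>Lectures start soon; see <a href='#'>past exams</a>.</p>"
    "</body></html>"
)

print(extract_features(SAMPLE_PAGE))  # [3.0, 1.0, 3.5]
```

Each page is reduced to a fixed-length numeric vector, one entry per keyword, which is the shape of input a support vector machine classifier expects. Training and kernel selection are outside the scope of this sketch.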
Source journal
New Review of Hypermedia and Multimedia (COMPUTER SCIENCE, INFORMATION SYSTEMS)
CiteScore: 3.40
Self-citation rate: 0.00%
Articles published: 4
Review time: >12 weeks
About the journal: The New Review of Hypermedia and Multimedia (NRHM) is an interdisciplinary journal providing a focus for research covering practical and theoretical developments in hypermedia, hypertext, and interactive multimedia.