Web-Page Content Classification on Entropy Classifiers using Machine Learning

S. Siddiqha, M. Islabudeen
Published in: 2023 International Conference for Advancement in Technology (ICONAT)
Publication date: 2023-01-24
DOI: https://doi.org/10.1109/ICONAT57137.2023.10080462

Abstract

In recent years, the World Wide Web (WWW) has become a global data center that lets people store and distribute information. The information in Web pages may be personal, official, commercial, or business-related, and Web users want to access it for their own needs. To use Web data for a specific purpose, techniques are therefore needed that classify Web pages so that the relevant data in each page can be provided to users. This paper proposes a new technique for classifying Web pages using level-based classification and a hierarchical indexing model built on predefined domains: Sports, Politics, and Education. The method works in two phases: a training phase and a testing phase. During the training phase, dynamic feature extraction and knowledge representation are performed. During the testing phase, the features extracted from the Web pages are used for content matching for classification. The technique comprises three steps for randomly distributed Web pages: dynamic feature extraction, knowledge representation, and classification. During feature extraction, important keywords are extracted from the headers and paragraphs of Web pages. The frequency of occurrence of each keyword is determined, and the frequency values are multiplied by weights so as to segregate the keywords into different priority levels. The represented knowledge is then used for content matching to classify Web pages. The percentage of belongingness of a Web page to each category is calculated using a Maximum Entropy Classifier, which is chosen for its advantage in search-based optimizations. The method is evaluated on Web pages from three categories: Sports, Politics, and Education. The technique achieves a classification accuracy of 91%, which is reported to be higher than that of conventional classification techniques.
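
The pipeline described in the abstract (weighted keyword frequencies from headers and paragraphs, followed by a maximum entropy classifier that yields a percentage of belongingness per category) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the weight values, the stopword list, the toy training pages, and all function names are assumptions made for illustration, and the maximum entropy model is written here as plain multinomial logistic regression trained by gradient ascent.

```python
import math
import re
from collections import Counter

# Assumed weights: the abstract says frequencies are multiplied by weights
# but gives no values; headers are assumed to carry more priority.
HEADER_WEIGHT = 3.0
PARAGRAPH_WEIGHT = 1.0
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def weighted_features(headers, paragraphs):
    """Return {keyword: weighted frequency} for one page."""
    features = Counter()
    for text, weight in ((headers, HEADER_WEIGHT), (paragraphs, PARAGRAPH_WEIGHT)):
        for token, count in Counter(tokenize(text)).items():
            features[token] += weight * count
    return dict(features)


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def predict(weights, features):
    """Maximum entropy (multinomial logistic) class probabilities."""
    scores = {
        c: sum(w.get(k, 0.0) * v for k, v in features.items())
        for c, w in weights.items()
    }
    return softmax(scores)


def train(pages, labels, classes, epochs=200, lr=0.1):
    """Fit per-class keyword weights by gradient ascent on the log-likelihood."""
    weights = {c: {} for c in classes}
    for _ in range(epochs):
        for features, label in zip(pages, labels):
            probs = predict(weights, features)
            for c in classes:
                # Gradient of the log-likelihood: (indicator - probability) * feature
                error = (1.0 if c == label else 0.0) - probs[c]
                for k, v in features.items():
                    weights[c][k] = weights[c].get(k, 0.0) + lr * error * v
    return weights


# Toy training data (hypothetical pages, one per category).
training = [
    ("Football league final", "The team scored a late goal to win the match.", "Sports"),
    ("Election results", "The party won the vote and the minister resigned.", "Politics"),
    ("University admissions", "Students enroll in courses and attend the exam.", "Education"),
]
classes = ["Sports", "Politics", "Education"]
pages = [weighted_features(h, p) for h, p, _ in training]
labels = [label for _, _, label in training]
model = train(pages, labels, classes)

# Percentage of belongingness of an unseen page to each category.
query = weighted_features("Match report", "The team scored a goal.")
belonging = {c: round(100 * p, 1) for c, p in predict(model, query).items()}
print(belonging)
```

The header weight is applied before classification, so a keyword that appears in a header contributes more to every class score than the same keyword in a paragraph. The softmax output is what allows the result to be read as a percentage per category rather than a single hard label.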