Identification and classification of emerging genres in WebPages

K. Kumari, A. Reddy
DOI: 10.1109/ICCCT2.2014.7066692 (https://doi.org/10.1109/ICCCT2.2014.7066692)
Venue: 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), vol. 48 (1), pp. 1-6
Published: 2014-12-01

Abstract

Information on the World Wide Web is dynamic and growing rapidly. Existing topic-based search engines are not adequate for retrieving the information users require, so genre-based search engines are needed. Developing them first requires identifying web genres. At present, a few genre corpora exist, covering web genres such as articles, online news, and journalistic content. The dynamic nature of the web allows new genres to come into existence; these are called emerging genres. This paper proposes two novel algorithms: the Identification of Emerging Genres (IEG) algorithm and the Adjustable Centroid Classification (ACC) algorithm. The IEG algorithm identifies emerging genres in web pages collected randomly from the web, and the ACC algorithm evaluates the performance of a genre corpus. Starting from a balanced 7-genre corpus for the single-label case and an unbalanced 20-genre corpus for the multi-label case, the IEG algorithm identified three emerging genres in 339 randomly selected web pages. The resulting datasets (10-genre single-label and 23-genre multi-label) are evaluated using the ACC algorithm, which is compared with an SVM classifier and a random forest classifier for single-label classification, and with binary relevance random forest and binary relevance SVM classifiers for multi-label classification. The classification results show that the ACC algorithm gave better results than these existing classification algorithms.
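The abstract does not spell out how IEG or ACC work, so the following is only a minimal sketch of the general idea behind centroid-based genre classification, not the authors' algorithms. It assumes each page is already a numeric feature vector, represents each known genre by the mean of its training vectors, and assigns a page to the most similar centroid. Pages that resemble no known genre are returned as None, which is one plausible way to surface candidates for an emerging genre. The function names, the toy vectors, and the `min_similarity` threshold are all hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(samples):
    """Build one centroid per genre from (feature_vector, genre) pairs."""
    by_genre = defaultdict(list)
    for vector, genre in samples:
        by_genre[genre].append(vector)
    return {genre: centroid(vectors) for genre, vectors in by_genre.items()}


def classify(vector, centroids, min_similarity=0.5):
    """Return the most similar genre, or None if no centroid is close enough.

    min_similarity is a hypothetical threshold chosen for illustration;
    it does not come from the paper.
    """
    best_genre, best_similarity = max(
        ((genre, cosine_similarity(vector, c)) for genre, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return best_genre if best_similarity >= min_similarity else None


# Toy three-feature vectors standing in for real page features (assumed data).
training = [
    ([1.0, 0.0, 0.0], "news"),
    ([0.9, 0.1, 0.0], "news"),
    ([0.0, 1.0, 0.0], "article"),
    ([0.1, 0.9, 0.0], "article"),
]
centroids = fit_centroids(training)

known = classify([0.8, 0.2, 0.0], centroids)    # close to the "news" centroid
unknown = classify([0.0, 0.0, 1.0], centroids)  # resembles no known genre
```

In this sketch a page is rejected only when its best cosine similarity falls below the threshold; a real system would need a principled way to choose that value and to group the rejected pages into new genres.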