A Template Independent Approach for Web News and Blog Content Extraction

Xueyang Ma, Hongli Zhang, Xiangzhan Yu, Yingjun Li
{"title":"A Template Independent Approach for Web News and Blog Content Extraction","authors":"Xueyang Ma, Hongli Zhang, Xiangzhan Yu, Yingjun Li","doi":"10.1109/ICISCE.2016.36","DOIUrl":null,"url":null,"abstract":"The Web has become a large platform for information publishing and consuming. Web news and blog are both representative information sources providing convenient ways to keep informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles etc. Effectively extracting news and blog content and filtering these noises is necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable to different languages and various domains. Our extensive case studies shows that characters which are not anchor texts but contain stop words are more likely to be the genuine content. Our method first traverses the entire DOM tree and count these valid characters attached to each DOM node. Then we step into the most representative child node based on valid characters recursively. And we finally stop at the main content node with a predefined criterion. To validate the approach, we conduct experiments by using online news and blog files randomly selected from well-known Chinese and English websites. Experimental result shows that our method achieves 96% F1-measure on average and outperforms CETR.","PeriodicalId":6882,"journal":{"name":"2016 3rd International Conference on Information Science and Control Engineering (ICISCE)","volume":"53 1","pages":"120-125"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 3rd International Conference on Information Science and Control Engineering (ICISCE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICISCE.2016.36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The Web has become a large platform for information publishing and consuming. Web news and blog are both representative information sources providing convenient ways to keep informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles etc. Effectively extracting news and blog content and filtering these noises is necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable to different languages and various domains. Our extensive case studies shows that characters which are not anchor texts but contain stop words are more likely to be the genuine content. Our method first traverses the entire DOM tree and count these valid characters attached to each DOM node. Then we step into the most representative child node based on valid characters recursively. And we finally stop at the main content node with a predefined criterion. To validate the approach, we conduct experiments by using online news and blog files randomly selected from well-known Chinese and English websites. Experimental result shows that our method achieves 96% F1-measure on average and outperforms CETR.
一种独立于模板的网络新闻和博客内容提取方法
网络已经成为信息发布和消费的大平台。网络新闻和博客都是具有代表性的信息来源,提供了方便的获取信息的方式。除了主要内容外,大多数网页还包含导航面板、广告、推荐文章等。有效地提取新闻和博客内容并过滤这些噪音是必要的,也是具有挑战性的。在本文中,我们提出了一种可移植到不同语言和不同领域的新闻和博客内容提取方法。我们广泛的案例研究表明,那些不是锚文本但包含停顿词的字符更有可能是真正的内容。我们的方法首先遍历整个DOM树,并对附加到每个DOM节点的有效字符进行计数。然后根据有效字符递归进入最具代表性的子节点。最后,我们在带有预定义标准的主内容节点处停下来。为了验证这一方法,我们从知名的中英文网站中随机选取了在线新闻和博客文件进行实验。实验结果表明,该方法平均达到96%的f1度量,优于ctr。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信