A Template Independent Approach for Web News and Blog Content Extraction

2016 3rd International Conference on Information Science and Control Engineering (ICISCE) · Pub Date: 2016-07-08 · DOI: 10.1109/ICISCE.2016.36

Xueyang Ma, Hongli Zhang, Xiangzhan Yu, Yingjun Li

Pages: 120-125

Citations: 2

Abstract

The Web has become a large platform for publishing and consuming information. Web news and blogs are representative information sources that provide convenient ways to stay informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles, and other elements. Effectively extracting news and blog content while filtering out this noise is both necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable across languages and domains. Our extensive case studies show that text which is not anchor text but contains stop words is more likely to be genuine content; we call the characters of such text valid characters. Our method first traverses the entire DOM tree and counts the valid characters attached to each DOM node. It then recursively steps into the most representative child node, judged by valid characters, and finally stops at the main content node according to a predefined criterion. To validate the approach, we conduct experiments on online news and blog files randomly selected from well-known Chinese and English websites. Experimental results show that our method achieves a 96% F1-measure on average and outperforms CETR.
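
The abstract's three steps (count valid characters per DOM node, descend into the most representative child, stop at a criterion) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the tiny English `STOP_WORDS` list, whitespace tokenization (which would not work for Chinese), counting all characters of a non-anchor text run that contains at least one stop word, and the hypothetical `ratio` stopping rule, since the abstract does not state the paper's actual criterion.

```python
from html.parser import HTMLParser

# Assumption: a tiny illustrative stop-word list, not the paper's.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for", "on"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.valid = 0  # valid characters in this node's whole subtree


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree and counts valid characters per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.anchor_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        self.anchor_depth += tag == "a"
        self.skip_depth += tag in SKIP_TAGS

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag; ignore stray closers.
        target = self.current
        while target is not self.root and target.tag != tag:
            target = target.parent
        if target is self.root:
            return
        # Close every element from the current one up to the target.
        node = self.current
        while True:
            self.anchor_depth -= node.tag == "a"
            self.skip_depth -= node.tag in SKIP_TAGS
            if node is target:
                break
            node = node.parent
        self.current = target.parent

    def handle_data(self, data):
        if self.anchor_depth or self.skip_depth:
            return  # anchor text and script/style content never count
        text = data.strip()
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        if words & STOP_WORDS:
            # Credit the characters to this node and all of its ancestors.
            node = self.current
            while node is not None:
                node.valid += len(text)
                node = node.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_main_content(root, ratio=0.6):
    """Descend into the child holding the most valid characters.

    Hypothetical stopping rule: stop when the best child holds less than
    `ratio` of the current node's valid characters.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.valid)
        if node.valid == 0 or best.valid < ratio * node.valid:
            break
        node = best
    return node


page = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News of the day</a></div>
<div id="main"><p>The council voted on the budget in a long session.</p>
<p>Members said that the plan is good for the city.</p></div>
<div id="footer">Copyright 2016</div>
</body></html>"""

main_node = find_main_content(build_tree(page))
print(main_node.tag, main_node.attrs.get("id"), main_node.valid)
```

In this toy page the navigation links contain stop words but are anchor text, and the footer contains no stop words, so only the two paragraphs contribute valid characters. The descent stops at their shared parent because neither paragraph alone dominates it.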
