qRead:使用分区特征优化从网页中快速准确地提取文章的方法

2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K) Pub Date : 2015-11-12 DOI:10.5220/0005613603640371

Jingwen Wang, Jie Wang

{"title":"qRead:使用分区特征优化从网页中快速准确地提取文章的方法","authors":"Jingwen Wang, Jie Wang","doi":"10.5220/0005613603640371","DOIUrl":null,"url":null,"abstract":"We present a new method called qRead to achieve real-time content extractions from web pages with high accuracy. Early approaches to content extractions include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods, while having met with certain success, may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM-tree on a complex web page is time-consuming, and using machine learning models could make things unnecessarily more complicated. Different from previous approaches, qRead uses segment densities and similarities to identify main contents. In particular, qRead first filters obvious junk contents, eliminates HTML tags, and partitions the remaining text into natural segments. It then uses the highest ratio of words over the number of lines in a segment combined with similarity between the segment and the title to identify main contents. We show that, through extensive experiments, qRead achieves a 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and a 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, providing substantial improvements on accuracy over previous approaches and meeting the real-time extraction requirement.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"qRead: A fast and accurate article extraction method from web pages using partition features optimizations\",\"authors\":\"Jingwen Wang, Jie Wang\",\"doi\":\"10.5220/0005613603640371\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a new method called qRead to achieve real-time content extractions from web pages with high accuracy. Early approaches to content extractions include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods, while having met with certain success, may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM-tree on a complex web page is time-consuming, and using machine learning models could make things unnecessarily more complicated. Different from previous approaches, qRead uses segment densities and similarities to identify main contents. In particular, qRead first filters obvious junk contents, eliminates HTML tags, and partitions the remaining text into natural segments. It then uses the highest ratio of words over the number of lines in a segment combined with similarity between the segment and the title to identify main contents. We show that, through extensive experiments, qRead achieves a 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and a 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, providing substantial improvements on accuracy over previous approaches and meeting the real-time extraction requirement.\",\"PeriodicalId\":102743,\"journal\":{\"name\":\"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5220/0005613603640371\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5220/0005613603640371","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

提出了一种新的方法qRead，实现了网页内容的实时提取，提取精度高。早期的内容提取方法包括经验过滤规则、文档对象模型(DOM)树和机器学习模型。这些方法虽然取得了一定的成功，但可能无法满足实时、高精度的提取要求。例如，在复杂的网页上构建dom树非常耗时，而使用机器学习模型可能会使事情变得不必要地更加复杂。与之前的方法不同，qRead使用片段密度和相似度来识别主要内容。特别是，qRead首先过滤明显的垃圾内容，消除HTML标记，并将剩余的文本划分为自然片段。然后，它使用一个片段中字数与行数的最高比率，结合片段与标题之间的相似性来识别主要内容。我们通过大量的实验表明，qRead在中文网页上的准确率为96.8%，平均提取时间为13.20毫秒，在英文网页上的准确率为93.6%，平均提取时间为11.37毫秒，比以前的方法有了很大的提高，满足了实时提取的要求。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

qRead: A fast and accurate article extraction method from web pages using partition features optimizations

We present a new method called qRead to achieve real-time content extractions from web pages with high accuracy. Early approaches to content extractions include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods, while having met with certain success, may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM-tree on a complex web page is time-consuming, and using machine learning models could make things unnecessarily more complicated. Different from previous approaches, qRead uses segment densities and similarities to identify main contents. In particular, qRead first filters obvious junk contents, eliminates HTML tags, and partitions the remaining text into natural segments. It then uses the highest ratio of words over the number of lines in a segment combined with similarity between the segment and the title to identify main contents. We show that, through extensive experiments, qRead achieves a 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and a 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, providing substantial improvements on accuracy over previous approaches and meeting the real-time extraction requirement.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)

自引率

0.00%

发文量