Content extraction based on statistic and position relationship between title and content

2014 IEEE/CIC International Conference on Communications in China (ICCC). Pub Date: 2014-10-01. DOI: 10.1109/ICCCHINA.2014.7008295

Mingdong Li, P. Xu, Chencheng Yang


Citations: 1

Abstract

Web page content extraction is a fundamental step in data mining applications because it supplies a clean data source with little noise. A raw web page is full of content-irrelevant material such as JavaScript and advertisements, and this noise lowers the purity of the data that downstream applications depend on. This paper therefore proposes a web information extraction model based on the statistical and positional relationships between the title and the content. Locating the title exactly improves the precision of content extraction and, conversely, accurately extracted content provides positive feedback that helps ensure the right title is extracted. First, each text node is compared with the text taken from the title tag to obtain a similarity score. The final score of each node is then obtained by summing its node attribute scores. The node with the highest score is taken as the current best title. Because the main content appears after the title, the title's position is used to narrow the search scope. Using statistical information about the page, the DOM tree is then traversed and the content of the node with the maximal weight is extracted. Experimental results show that the algorithm performs much better than previous extraction rules and is suitable for extracting the main content of web pages.
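The two stages described in the abstract (title localization, then content selection after the title) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formulas, so the per-tag bonus `TAG_BONUS`, the use of `difflib.SequenceMatcher` as the similarity measure, and text length as the node weight are all assumptions chosen for illustration, and the sample page `PAGE` is invented.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: a per-tag bonus stands in for the paper's "node attribute score".
TAG_BONUS = {"h1": 0.3, "h2": 0.2, "h3": 0.1}
SKIP = {"script", "style"}                      # markup treated as noise
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class TextNodes(HTMLParser):
    """Collect (enclosing tag, text) pairs in document order, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []
        self.page_title = ""

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag = self.stack[-1]
        if tag == "title":
            self.page_title = text
        elif tag not in SKIP:
            self.nodes.append((tag, text))


def locate_title(nodes, page_title):
    """Return the index of the text node with the highest title score."""
    def score(i):
        tag, text = nodes[i]
        similarity = SequenceMatcher(None, text.lower(), page_title.lower()).ratio()
        return similarity + TAG_BONUS.get(tag, 0.0)
    return max(range(len(nodes)), key=score)


def extract(html):
    """Return (title text, main content text) for one page."""
    parser = TextNodes()
    parser.feed(html)
    nodes = parser.nodes
    if not nodes:
        return "", ""
    t = locate_title(nodes, parser.page_title)
    # Only nodes after the title are candidates.
    # Assumption: text length is the node weight.
    candidates = nodes[t + 1:]
    content = max(candidates, key=lambda n: len(n[1]), default=("", ""))
    return nodes[t][1], content[1]


# Invented sample page with navigation, a script, and a footer link as noise.
PAGE = """<html><head><title>Quantum chips arrive - Example News</title>
<script>var ad = "buy now";</script></head>
<body><div><a>Home</a><a>Sports</a></div>
<h1>Quantum chips arrive</h1>
<p>Researchers announced a new generation of quantum processors that run at room temperature.</p>
<div><a>Subscribe to our newsletter</a></div></body></html>"""

title, content = extract(PAGE)
print(title)
print(content)
```

On this sample page the heading is chosen as the title because it closely matches the `<title>` text and receives the heading bonus, and the paragraph that follows it is chosen as the content because it is the longest text node after the title. A fuller version would weight whole DOM subtrees rather than individual text nodes, as the abstract's DOM traversal suggests.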
