Extraction of Relevant Snippets from Web Pages Using Hybrid Features

2012 IIAI International Conference on Advanced Applied Informatics | Pub Date: 2012-09-20 | DOI: 10.1109/IIAI-AAI.2012.50

Jun Zeng, Qingyu Xiong, Junhao Wen, S. Hirokawa


Citations: 1

Abstract

As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web page documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content rather than look for a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query entered into the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.
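
The abstract names three feature groups but does not list the nine parameters, so the following is only a minimal sketch of what per-snippet feature extraction might look like. The function name `snippet_features` and the four parameters it returns (`text_length`, `query_coverage`, `tag_count`, `link_text_ratio`) are illustrative assumptions, not the parameters used in the paper.

```python
import re
from html.parser import HTMLParser


class _SnippetStats(HTMLParser):
    """Collects visible text, a tag count, and anchor-text length from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # non-empty text chunks, whitespace-stripped
        self.tag_count = 0     # number of opening tags seen
        self.link_chars = 0    # characters of text that sit inside <a> elements
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._link_depth:
            self.link_chars += len(stripped)


def snippet_features(html_fragment, query):
    """Return hypothetical parameters for the three feature groups named in the abstract."""
    parser = _SnippetStats()
    parser.feed(html_fragment)
    parser.close()

    text_chars = sum(len(part) for part in parser.text_parts)
    words = set(re.findall(r"\w+", " ".join(parser.text_parts).lower()))
    query_terms = set(re.findall(r"\w+", query.lower()))

    return {
        # Quantity of text (assumed parameter: visible character count).
        "text_length": text_chars,
        # Snippet/query correlation (assumed parameter: share of query terms present).
        "query_coverage": len(query_terms & words) / len(query_terms) if query_terms else 0.0,
        # HTML structure (assumed parameters: tag count and share of text inside links).
        "tag_count": parser.tag_count,
        "link_text_ratio": parser.link_chars / text_chars if text_chars else 0.0,
    }


example = snippet_features(
    '<div><p>Hybrid features for snippet extraction</p><a href="#">more</a></div>',
    "snippet extraction",
)
```

A dictionary like `example` could be flattened into a numeric vector and passed to a classifier such as the SVM mentioned in the abstract; that training step is omitted here because the paper's setup is not described on this page.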

