Extraction of Relevant Snippets from Web Pages Using Hybrid Features

Jun Zeng, Qingyu Xiong, Junhao Wen, S. Hirokawa
Published in: 2012 IIAI International Conference on Advanced Applied Informatics
Publication date: 2012-09-20
DOI: 10.1109/IIAI-AAI.2012.50 (https://doi.org/10.1109/IIAI-AAI.2012.50)
Citations: 1

Abstract

As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web page documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, so no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query entered into the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.
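
As a rough illustration of the three feature families named in the abstract, the sketch below computes one or two hypothetical measurements for each from a single HTML snippet and a query. The abstract does not list the nine parameters the authors actually use, so the names `snippet_features`, `text_length`, `query_overlap`, `tag_count`, and `link_ratio` are assumptions made for this example only, not the paper's definitions.

```python
from html.parser import HTMLParser


class _SnippetStats(HTMLParser):
    """Collects visible text, a tag count, and link-text length for one snippet."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_count = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._link_depth:
            self.link_chars += len(stripped)


def snippet_features(snippet_html, query):
    """Return a dict of hypothetical features for one snippet and one query."""
    stats = _SnippetStats()
    stats.feed(snippet_html)
    stats.close()

    text = " ".join(stats.text_parts)
    words = text.lower().split()  # whitespace tokens; punctuation is not stripped
    query_terms = set(query.lower().split())

    # Quantity of text: number of visible characters in the snippet.
    text_length = len(text)

    # Snippet-query correlation: share of query terms that occur in the text.
    matched = sum(1 for term in query_terms if term in words)
    query_overlap = matched / len(query_terms) if query_terms else 0.0

    # HTML structure: number of opening tags and fraction of text inside links.
    link_ratio = stats.link_chars / text_length if text_length else 0.0

    return {
        "text_length": text_length,
        "query_overlap": query_overlap,
        "tag_count": stats.tag_count,
        "link_ratio": link_ratio,
    }


if __name__ == "__main__":
    html = "<div><p>Web data extraction with SVM</p><a href='#'>more</a></div>"
    print(snippet_features(html, "data extraction"))
```

Feature dictionaries like these could be converted to numeric vectors and passed, together with relevance labels, to an SVM classifier to mirror the kind of experiment the abstract describes. The feature choices here are deliberately simple and would need to be replaced with the paper's own parameters to reproduce its results.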