Automatic labeling of hidden web data using Multi-Heuristics Annotator

Umamageswari Baskaran, Kalpana Ramanujam
Future Computing and Informatics Journal, Volume 3, Issue 2, Pages 417-423. Published 2018-12-01. DOI: 10.1016/j.fcij.2018.11.004
https://www.sciencedirect.com/science/article/pii/S2314728818300394

Abstract

The hidden web contains a huge amount of high-quality data that is not indexed by search engines. It consists of web pages generated dynamically by embedding backend data that match the search keywords in server-side templates. These pages are created for human consumption, which makes automated processing cumbersome because the structured data is embedded within unstructured HTML tags. To enable machine processing, the structured data must be detected, extracted, and annotated. Many heuristic-based approaches to automatic annotation, such as DeLa [1] and MSAA [2], are available in the literature. Most of these techniques fail if a label is not present as part of the attribute value itself or is not explicitly available in the form interface or the query response pages. The proposed technique addresses this issue by collecting domain keywords from multiple websites belonging to the business domain of interest and then capturing the patterns in the form of semantic rules. Experimental results show that a single heuristic is not sufficient to label all the data value groups. The annotators are therefore applied one after the other, according to their capability of assigning the most appropriate label. Experiments show that this technique improves precision and recall compared with existing annotation techniques.
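
The idea of applying several annotators one after the other can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three heuristics (`prefix_annotator`, `rule_annotator`, `keyword_annotator`), their ordering, and the contents of `SEMANTIC_RULES` and `DOMAIN_KEYWORDS` are all hypothetical, chosen only to show how a data value group falls through to the next annotator when an earlier one cannot label it.

```python
import re
from typing import Callable, Optional, Sequence

# An annotator receives one data value group (e.g. one column of extracted
# records) and returns a label, or None when it cannot decide.
Annotator = Callable[[Sequence[str]], Optional[str]]

# Hypothetical semantic rules: a label applies if every value fits the pattern.
SEMANTIC_RULES = {
    "price": re.compile(r"^\s*[$€£]\s?\d+(\.\d{2})?\s*$"),
    "isbn": re.compile(r"^\s*(97[89])?\d{9}[\dXx]\s*$"),
}

# Hypothetical domain keywords, standing in for terms collected from
# several websites of the same business domain.
DOMAIN_KEYWORDS = {
    "format": {"paperback", "hardcover", "ebook"},
}


def prefix_annotator(values: Sequence[str]) -> Optional[str]:
    """Use a 'Label: value' prefix if all values share the same one."""
    labels = set()
    for value in values:
        match = re.match(r"\s*([A-Za-z ]+?)\s*:\s*\S", value)
        if not match:
            return None
        labels.add(match.group(1).strip().lower())
    return labels.pop() if len(labels) == 1 else None


def rule_annotator(values: Sequence[str]) -> Optional[str]:
    """Return the first rule label whose pattern matches every value."""
    for label, pattern in SEMANTIC_RULES.items():
        if values and all(pattern.match(v) for v in values):
            return label
    return None


def keyword_annotator(values: Sequence[str]) -> Optional[str]:
    """Return a label whose keyword set contains every value."""
    for label, vocabulary in DOMAIN_KEYWORDS.items():
        if values and all(v.strip().lower() in vocabulary for v in values):
            return label
    return None


def annotate(values: Sequence[str], annotators: Sequence[Annotator]) -> Optional[str]:
    """Apply annotators in order; the first one that yields a label wins."""
    for annotator in annotators:
        label = annotator(values)
        if label is not None:
            return label
    return None


# Assumed ordering for this sketch only.
PIPELINE = [prefix_annotator, rule_annotator, keyword_annotator]

if __name__ == "__main__":
    print(annotate(["Author: J. Smith", "Author: A. Rao"], PIPELINE))
    print(annotate(["$12.99", "$8.50"], PIPELINE))
    print(annotate(["Paperback", "Hardcover"], PIPELINE))
```

In this sketch a group that no annotator can label is left as `None`, so adding another heuristic to the list extends the set of groups that receive a label.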
