heteroHarvest: Harvesting information from heterogeneous sources

Abdul Rasool Qureshi, N. Memon, U. Wiil, P. Karampelas, Jose Ignacio Nieto Sancheze
Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics, published 2011-07-10. DOI: 10.1109/ISI.2011.5984780 (https://doi.org/10.1109/ISI.2011.5984780)
Citation count: 0

Abstract

The abundance of information on almost any topic makes the Internet a valuable resource. Although searching the Internet is easy, automating information extraction from online sources remains difficult because the content lacks structure and is shared in many different ways. Information is often stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. This paper presents an information harvesting tool, heteroHarvest, that aims to address these problems by filtering out the useful information and then normalizing it into a single non-hypertext format. Finally, we describe the results of an experimental evaluation. The results are promising, with an overall error rate of 6.5% across heterogeneous formats.
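The abstract does not say how heteroHarvest filters or normalizes its input, so the following is only an illustrative sketch of the general idea of mapping several input formats onto one plain-text representation. Every name in it (`normalize`, `NORMALIZERS`, `html_to_text`, `plain_to_text`) is hypothetical and not taken from the paper, and only two formats are handled, using the Python standard library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse runs of whitespace.
        # (Simplification: inline tags inside a word would split it.)
        return " ".join(" ".join(self._chunks).split())


def html_to_text(raw):
    """Strip markup from an HTML string and return plain text."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def plain_to_text(raw):
    """Plain text only needs whitespace normalization."""
    return " ".join(raw.split())


# Hypothetical registry: one handler per supported source format.
NORMALIZERS = {"html": html_to_text, "txt": plain_to_text}


def normalize(raw, fmt):
    """Dispatch to the handler for `fmt`; reject unknown formats."""
    try:
        handler = NORMALIZERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}")
    return handler(raw)
```

A registry keyed by format name keeps each handler independent, so supporting another source type in this sketch would mean adding one function and one dictionary entry.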