heteroHarvest: Harvesting information from heterogeneous sources

Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics
Pub Date: 2011-07-10
DOI: 10.1109/ISI.2011.5984780

Abdul Rasool Qureshi, N. Memon, U. Wiil, P. Karampelas, Jose Ignacio Nieto Sancheze

Abstract

The abundance of information on almost any topic makes the Internet a very good resource. Although searching the Internet is easy, automating the extraction of information from online sources remains difficult because of the lack of structure and the diversity of sharing methods. Most of the time, information is stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. This paper presents an information harvesting tool (heteroHarvest) that aims to address these problems by filtering the useful information and then normalizing it into a single non-hypertext format. Finally, we describe the results of an experimental evaluation. The results are promising, with an overall error rate of 6.5% across heterogeneous formats.
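The abstract does not describe how heteroHarvest filters or normalizes its inputs, so the following is only an illustrative sketch of the general idea of reducing heterogeneous sources to one plain-text form. Everything in it is an assumption rather than the authors' method: the names `TextExtractor`, `normalize_html`, and `normalize`, the `kind` labels, and the choice to drop `script` and `style` content as "not useful".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    SKIPPED = {"script", "style"}  # assumed definition of "not useful" content

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0:
            self._words.extend(data.split())

    def text(self):
        return " ".join(self._words)


def normalize_html(html: str) -> str:
    """Strip markup and return whitespace-collapsed plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def normalize(document: str, kind: str) -> str:
    """Dispatch on a hypothetical source-type label to one plain-text form."""
    if kind == "html":
        return normalize_html(document)
    if kind == "text":
        return " ".join(document.split())
    raise ValueError(f"unsupported source type: {kind}")
```

A real harvester would need one such handler per source format (for example PDF or office documents); this sketch covers only HTML and plain text to show the dispatch-then-normalize shape.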
