Abdul Rasool Qureshi, N. Memon, U. Wiil, P. Karampelas, Jose Ignacio Nieto Sancheze
heteroHarvest: Harvesting information from heterogeneous sources
Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics, 2011-07-10
DOI: 10.1109/ISI.2011.5984780 (https://doi.org/10.1109/ISI.2011.5984780)
Citations: 0
Abstract
The abundance of information on any topic makes the Internet a valuable resource. Although searching the Internet is easy, automating information extraction from online sources remains difficult because the content lacks structure and is shared in diverse ways. Most of the time, information is stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. This paper presents an information harvesting tool, heteroHarvest, that aims to address these problems by filtering the useful information and then normalizing it into a single non-hypertext format. Finally, we describe the results of an experimental evaluation. The results are promising, with an overall error rate of 6.5% across heterogeneous formats.