N. Bourbakis, W. Meng, Zonghuan Wu, J. Salerno, S. Borek
Title: Removal of redundancy in documents retrieved from different resources
Published in: Proceedings Tenth IEEE International Conference on Tools with Artificial Intelligence (Cat. No.98CH36294)
Publication date: 1998-11-10
DOI: 10.1109/TAI.1998.744799
Citation count: 2
Abstract
This paper describes a methodology for removing (partially or totally) redundant information received from different documents in an effort to synthesize new documents. In particular, information retrieved from different databases may take various forms, such as images, natural language text, and data. These pieces of information may be parts of one or more documents related to a specific subject. As a result, a number of text paragraphs and images may occur (or be retrieved) more than once, creating redundancy in the storage space. Thus, in order to create a new redundancy-free document, the duplicated parts of information have to be removed. The methodology presented analyzes text paragraphs and images received from different databases (DBs) using a set of similarity criteria in order to decide which duplicates to remove. Illustrative examples are provided.
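
The abstract does not state which similarity criteria the authors use, so the following is only a minimal sketch of the general idea for text paragraphs, not the paper's method. It assumes a simple stand-in criterion (Jaccard overlap of word sets) and a hypothetical threshold of 0.8; the names `tokens`, `jaccard`, and `remove_redundant` are illustrative and do not come from the paper. Image comparison is not covered.

```python
import re


def tokens(paragraph: str) -> frozenset[str]:
    """Return the lowercase word set of a paragraph (assumed normalization)."""
    return frozenset(re.findall(r"[a-z0-9]+", paragraph.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two word sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_redundant(paragraphs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each paragraph and drop later near-duplicates.

    The threshold is an assumed value chosen for illustration only.
    """
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for paragraph in paragraphs:
        current = tokens(paragraph)
        # Drop the paragraph if it is similar enough to one already kept.
        if any(jaccard(current, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(paragraph)
        kept_tokens.append(current)
    return kept


if __name__ == "__main__":
    retrieved = [
        "The cat sat on the mat.",
        "the cat sat on the mat",
        "Dogs bark loudly at night.",
    ]
    for paragraph in remove_redundant(retrieved):
        print(paragraph)
```

Lowering the threshold removes partially overlapping paragraphs as well, which loosely corresponds to the "partially or totally" redundant cases mentioned in the abstract; a real system would likely combine several criteria instead of relying on one score.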