{"title":"A Near-Duplicate Detection Algorithm to Facilitate Document Clustering","authors":"Lavanya Pamulaparty, D. C. V. G. Rao, D. S. Rao","doi":"10.5121/IJDKP.2014.4604","DOIUrl":null,"url":null,"abstract":"Web Ming faces huge problems due to Duplicate and Near Duplicate Web pages. Detecting Near Duplicates is very difficult in large collection of data like ”internet”. The presence of these web pages plays an important role in the performance degradation while integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting these pages has many potential applications for example may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing duplicate and near duplicate documents which are used to perform clustering of documents .We demonstrated our approach in web news articles domain. The experimental results show that our algorithm outperforms in terms of similarity measures. The near duplicate and duplicate document identification has resulted reduced memory in repositories.","PeriodicalId":131153,"journal":{"name":"International Journal of Data Mining & Knowledge Management Process","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Data Mining & Knowledge Management Process","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5121/IJDKP.2014.4604","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
Web Ming faces huge problems due to Duplicate and Near Duplicate Web pages. Detecting Near Duplicates is very difficult in large collection of data like ”internet”. The presence of these web pages plays an important role in the performance degradation while integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting these pages has many potential applications for example may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing duplicate and near duplicate documents which are used to perform clustering of documents .We demonstrated our approach in web news articles domain. The experimental results show that our algorithm outperforms in terms of similarity measures. The near duplicate and duplicate document identification has resulted reduced memory in repositories.