Manu Agrawal, Kartik Manchanda, Ribhav Soni, A. Lal, C. R. Chowdary
2017 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). Published 2017-12-01. DOI: 10.1109/PDCAT.2017.00025 (https://doi.org/10.1109/PDCAT.2017.00025)
Parallel Implementation of Local Similarity Search for Unstructured Text Using Prefix Filtering
Identifying partially duplicated text segments among documents is an important research problem with applications in plagiarism detection and near-duplicate web page detection. We investigate the problem of local similarity search for finding partially replicated text, focusing on its parallel implementation. Our aim is to find text windows that are approximately similar in two documents, using a filter-and-verification framework. We present various parallel approaches to the problem, of which input data partitioning along with the reduction of individual index maps was found to be most suitable. We analyzed the effect of varying the similarity threshold and the number of processes on speedup, and also performed a cost analysis. Experimental results show that the proposed method achieves up to 13x speedup on a 24-core processor.
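The abstract does not give implementation details, so the following is only a minimal sketch of how prefix filtering with partitioned index construction and a reduction of partial index maps might look. Everything in it is an assumption made for illustration: the function names (`windows`, `prefix`, `build_partial_index`, `merge_indexes`, `similar_windows`), the choice of Jaccard similarity over token sets, the lexicographic global token order, and the use of a thread pool in place of whatever process-level parallelism the authors used.

```python
import math
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def windows(tokens, w):
    """Return (start, token_set) for every sliding window of w tokens."""
    return [(i, frozenset(tokens[i:i + w])) for i in range(len(tokens) - w + 1)]


def prefix(token_set, t):
    """Prefix of the set under a fixed global order (assumed: lexicographic).

    If Jaccard(x, y) >= t, the prefixes of x and y of length
    |s| - ceil(t * |s|) + 1 must share at least one token.
    """
    ordered = sorted(token_set)
    # Small epsilon guards against float overshoot such as 0.7 * 10.
    keep = len(ordered) - math.ceil(t * len(ordered) - 1e-9) + 1
    return ordered[:keep]


def build_partial_index(chunk, t):
    """Filter step, per partition: map prefix token -> window starts."""
    index = defaultdict(set)
    for start, token_set in chunk:
        for tok in prefix(token_set, t):
            index[tok].add(start)
    return index


def merge_indexes(a, b):
    """Reduction step: fold one partial index map into another."""
    for tok, starts in b.items():
        a[tok] |= starts
    return a


def jaccard(x, y):
    return len(x & y) / len(x | y)


def similar_windows(doc_a, doc_b, w=4, t=0.6, workers=2):
    """Return (start_a, start_b) pairs whose windows have Jaccard >= t."""
    wins_a = windows(doc_a.split(), w)
    wins_b = windows(doc_b.split(), w)

    # Partition the input windows and index each partition independently.
    chunks = [wins_a[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: build_partial_index(c, t), chunks))
    index = reduce(merge_indexes, partials, defaultdict(set))

    sets_a = dict(wins_a)
    results = []
    for start_b, set_b in wins_b:
        # Candidates are windows sharing at least one prefix token.
        candidates = set()
        for tok in prefix(set_b, t):
            candidates |= index.get(tok, set())
        # Verification step: exact similarity on the surviving candidates.
        for start_a in sorted(candidates):
            if jaccard(sets_a[start_a], set_b) >= t:
                results.append((start_a, start_b))
    return results
```

Calling `similar_windows("the quick brown fox jumps over the lazy dog", "a quick brown fox leaps over a lazy dog")` returns window-start pairs that include `(0, 0)`, since those two windows share three of five distinct tokens. Threads are used here only to show the partition-then-reduce structure; in CPython they would not deliver the kind of speedup the abstract reports, and a process-based or native implementation would be needed for that.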