Detecting Text Similarity on a Scalable No-SQL Database Platform
S. Butakov, S. Murzintsev, A. Tskhai
2016 International Conference on Platform Technology and Service (PlatCon), published 2016-02-01
DOI: https://doi.org/10.1109/PLATCON.2016.7456789
Citations: 4
Abstract
The paper addresses the platform scalability problem in near-to-similar document detection. Application areas for the proposed approach include plagiarism detection and text filtering in data leak prevention systems. The paper reviews the limitations of current solutions based on relational DBMSs and proposes a data structure suitable for implementation in NoSQL databases on highly scalable clustered platforms. The proposed data structure is based on the "key-value" model and does not depend on the shingling method used to encode the text. The model was implemented on a clustered MongoDB platform and tested with a large dataset while the platform was scaled out horizontally during the experiment. The experiments indicated that the proposed approach is applicable to near-to-similar document detection.
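The abstract does not give the exact schema, so the following is only a minimal sketch of how a key-value structure that is independent of the shingling method might look. It assumes an inverted index in which each shingle is a key and the value is the set of documents containing it. The names `word_shingles` and `ShingleIndex`, the MD5 hashing, the 3-word shingle size, and the Jaccard score are all illustrative assumptions rather than the authors' design, and a plain in-memory dictionary stands in for the clustered MongoDB store.

```python
import hashlib
from collections import defaultdict


def word_shingles(text, k=3):
    """Return the set of hashed k-word shingles of `text`.

    Assumption: word-level shingles hashed with MD5; any other encoding works
    as long as it maps a text to a set of hashable keys.
    """
    words = text.lower().split()
    if len(words) < k:
        return set()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }


class ShingleIndex:
    """In-memory stand-in for a key-value store: shingle -> ids of documents."""

    def __init__(self, shingler=word_shingles):
        self.shingler = shingler          # pluggable: the index never inspects the keys
        self.postings = defaultdict(set)  # key: shingle, value: set of document ids
        self.sizes = {}                   # document id -> number of distinct shingles

    def add(self, doc_id, text):
        """Index a document under every shingle it contains."""
        shingles = self.shingler(text)
        self.sizes[doc_id] = len(shingles)
        for shingle in shingles:
            self.postings[shingle].add(doc_id)

    def query(self, text):
        """Return {doc_id: Jaccard similarity} for documents sharing a shingle."""
        shingles = self.shingler(text)
        shared = defaultdict(int)
        for shingle in shingles:
            # One key lookup per shingle; .get avoids creating empty entries.
            for doc_id in self.postings.get(shingle, ()):
                shared[doc_id] += 1
        return {
            doc_id: count / (len(shingles) + self.sizes[doc_id] - count)
            for doc_id, count in shared.items()
        }


if __name__ == "__main__":
    index = ShingleIndex()
    index.add("a", "the quick brown fox jumps over the lazy dog")
    index.add("b", "completely unrelated sentence about database clusters here")
    print(index.query("the quick brown fox jumps over the lazy cat"))
```

Because every lookup touches a single key, a structure of this kind can in principle be partitioned by shingle across cluster nodes, which is the property that makes the key-value model attractive for horizontal scaling. Swapping in a different `shingler` function changes the encoding without changing the index.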