Authors: S. Butakov, S. Murzintsev, A. Tskhai
Published in: 2016 International Conference on Platform Technology and Service (PlatCon), 2016-02-01
DOI: 10.1109/PLATCON.2016.7456789 (https://doi.org/10.1109/PLATCON.2016.7456789)
Detecting Text Similarity on a Scalable No-SQL Database Platform
This paper addresses platform scalability for near-similar document detection. Application areas for the proposed approach include plagiarism detection and text filtering in data leak prevention systems. The paper reviews the limitations of current solutions based on relational DBMSs and proposes a data structure suited to NoSQL databases running on highly scalable clustered platforms. The data structure follows the key-value model and does not depend on the shingling method used to encode the text. The model was implemented on a clustered MongoDB platform and tested with a large dataset while the platform was scaled out horizontally during the experiment. The experiments indicate that the proposed approach is applicable to near-similar document detection.
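To make the key-value idea concrete, the following is a minimal sketch of a shingle index, not the authors' schema. It assumes that each hashed shingle serves as a key and that the value is the set of documents containing it. The names `word_shingles`, `shingle_key`, and `ShingleIndex` are hypothetical, an in-memory dictionary stands in for the MongoDB collection, and Jaccard similarity is an assumed scoring choice that the abstract does not specify.

```python
import hashlib
from collections import defaultdict


def word_shingles(text, k=3):
    """Return the set of k-word shingles of `text` (one possible shingling method)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_key(shingle):
    """Hash a shingle to a fixed-size key, so the store never depends on raw text."""
    return hashlib.sha1(shingle.encode("utf-8")).hexdigest()


class ShingleIndex:
    """Hypothetical key-value index: hashed shingle -> set of document ids."""

    def __init__(self, shingler=word_shingles):
        # The shingling method is pluggable; the index only ever sees keys.
        self.shingler = shingler
        self.postings = defaultdict(set)  # key -> ids of documents containing it
        self.sizes = {}                   # doc id -> number of distinct shingles

    def add(self, doc_id, text):
        keys = {shingle_key(s) for s in self.shingler(text)}
        self.sizes[doc_id] = len(keys)
        for key in keys:
            self.postings[key].add(doc_id)

    def query(self, text):
        """Return {doc_id: Jaccard similarity} for documents sharing a shingle."""
        keys = {shingle_key(s) for s in self.shingler(text)}
        shared = defaultdict(int)
        for key in keys:
            for doc_id in self.postings.get(key, ()):
                shared[doc_id] += 1
        return {
            doc_id: n / (len(keys) + self.sizes[doc_id] - n)
            for doc_id, n in shared.items()
        }


index = ShingleIndex()
index.add("a", "the quick brown fox jumps over the lazy dog")
index.add("b", "completely unrelated text about clustered databases")
scores = index.query("the quick brown fox jumps over the lazy cat")
print(scores)  # {'a': 0.75}: 6 of the 8 distinct shingles are shared
```

Because every lookup is a single-key read, a layout of this kind can in principle be partitioned across nodes by key. Swapping in a different shingling function changes only the `shingler` argument, not the stored structure.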