Web-service for finding cloned files using b-bit minwise hashing
Kaoru Ito, T. Ishio, Katsuro Inoue
2017 IEEE 11th International Workshop on Software Clones (IWSC), published 2017-02-21
DOI: 10.1109/IWSC.2017.7880504 (https://doi.org/10.1109/IWSC.2017.7880504)
Citation count: 7
Abstract
Source code reuse is a common practice in software development. Because industrial developers may inadvertently reuse source files from open source software, clone detection tools are used to find open source files in closed source projects. Running clone detection requires a database of existing open source software. A web service that provides clone detection against a centralized database would be useful, but industrial developers are not allowed to submit their source code to a public server on the Internet. To solve this problem, we employ b-bit minwise hashing, a technique that estimates the similarity of documents using only their hash values. Using this method, we implemented a file-clone detection web service: it takes the hash value of a source file as input and returns a list of similar source files in existing open source software. Our hash comparison method is efficient, although the estimated similarity is subject to a margin of error.
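To make the idea concrete, the following is a minimal Python sketch of b-bit minwise hashing as it might be used in such a service. It is not the authors' implementation. The tokenizer, the 3-gram shingling, the number of hash functions `K`, the bit width `B`, the use of seeded BLAKE2b as the hash family, and the names `signature`, `estimate_similarity`, and `find_similar` are all assumptions made for illustration. The estimator uses the common simplification that two unrelated b-bit values collide with probability about 1/2^b, which holds when the token sets are small relative to the hash space.

```python
import hashlib
import re

K = 256  # number of hash functions (assumed value)
B = 1    # bits kept from each min-hash (assumed value)


def ngrams(text, n=3):
    """Split text into crude tokens and return the set of token n-grams."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    count = max(len(tokens) - n + 1, 1)
    return {" ".join(tokens[i:i + n]) for i in range(count)}


def _hash(item, seed):
    """64-bit hash of item under the seed-th hash function (seeded BLAKE2b)."""
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(text, k=K, b=B):
    """Compute k min-hashes of the n-gram set, keeping only the low b bits."""
    grams = ngrams(text)
    mask = (1 << b) - 1
    return [min(_hash(g, seed) for g in grams) & mask for seed in range(k)]


def estimate_similarity(sig_a, sig_b, b=B):
    """Estimate Jaccard similarity from two b-bit signatures.

    Observed match rate E is roughly C + (1 - C) * R with C = 2**-b,
    so R is recovered as (E - C) / (1 - C), clamped at zero.
    """
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    e = matches / len(sig_a)
    c = 1 / (1 << b)
    return max(0.0, (e - c) / (1 - c))


def find_similar(query_sig, database, threshold=0.8):
    """Server side: rank stored files by estimated similarity to the query.

    `database` is a hypothetical dict mapping file names to signatures.
    """
    scored = ((name, estimate_similarity(query_sig, sig))
              for name, sig in database.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    original = "int add(int a, int b) { return a + b; }"
    database = {"oss/add.c": signature(original)}
    # Only the signature leaves the client; the source text stays local.
    query = signature(original)
    print(find_similar(query, database))
```

In this sketch the client computes `signature` locally and submits only the resulting list of b-bit values, so the source text never reaches the server. With `B = 1` a signature occupies `K` bits, which keeps storage and comparison cheap, but a smaller `K` or `B` widens the margin of error of the estimate.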