{"title":"Increase the Speed of Search Index in the Duplicate Text Detection Systems","authors":"E. Sharapova","doi":"10.1109/SYNCHROINFO49631.2020.9166107","DOIUrl":null,"url":null,"abstract":"The work is devoted to the organization of the search index of the duplicate text detection systems Author.NET. The paper considers the structure of the search index. The search index can be divided into several groups of files - terms, documents, TF*IDF index, signatures, shingles. In the article the ways of organizing the search index are considered - index storage in RAM, compression of index file, tree index, index table of contents. It is given a proposal for organizing a search index for duplicate text detection system. During the experiments, it was found that the most profitable option is to use compressed index files stored on the SSD.","PeriodicalId":255578,"journal":{"name":"2020 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNCHROINFO49631.2020.9166107","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
This work addresses the organization of the search index in the duplicate text detection system Author.NET. The paper first examines the structure of the search index, which can be divided into several groups of files: terms, documents, the TF*IDF index, signatures, and shingles. It then considers several ways of organizing the index: storing the index in RAM, compressing the index files, using a tree index, and adding an index table of contents. A proposal is then given for organizing the search index of a duplicate text detection system. Experiments showed that the most effective option is to use compressed index files stored on an SSD.
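To make two of the organization options named in the abstract more concrete (compression of the index file and an index table of contents), the following is a minimal sketch. It is not the Author.NET implementation, whose file formats the abstract does not describe. The function names `build_index` and `lookup`, the use of zlib, and the fixed 4-byte document ids are all assumptions chosen for illustration.

```python
import struct
import zlib


def build_index(postings):
    """Pack posting lists into one compressed blob plus a table of contents.

    postings: dict mapping term -> list of non-negative document ids.
    Returns (blob, toc), where toc maps term -> (offset, length) in blob.
    Hypothetical layout for illustration only.
    """
    blob = bytearray()
    toc = {}
    for term in sorted(postings):
        doc_ids = postings[term]
        # Serialize ids as little-endian unsigned 32-bit integers.
        raw = struct.pack(f"<{len(doc_ids)}I", *doc_ids)
        # Compress each posting list separately so it can be read alone.
        packed = zlib.compress(raw)
        toc[term] = (len(blob), len(packed))
        blob += packed
    return bytes(blob), toc


def lookup(blob, toc, term):
    """Return the document ids for a term, decompressing only its slice."""
    if term not in toc:
        return []
    offset, length = toc[term]
    raw = zlib.decompress(blob[offset:offset + length])
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


postings = {"index": [1, 5, 9], "shingle": [2, 5]}
blob, toc = build_index(postings)
print(lookup(blob, toc, "index"))
```

In this sketch the table of contents is small enough to keep in memory, while the compressed blob could live in a file on disk. A lookup then needs one seek and one short read, which is the kind of access pattern where fast storage such as an SSD helps.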