A Static Dictionary-Based Approach To Compressing Short Texts
Murat Aslanyürek, A. Mesut
2021 6th International Conference on Computer Science and Engineering (UBMK)
Published: 2021-09-15
DOI: 10.1109/UBMK52708.2021.9559035 (https://doi.org/10.1109/UBMK52708.2021.9559035)
Citation count: 0
Abstract
This study proposes Static Dictionary Compression (SDC), a method developed for compressing short texts. Its word-based static dictionaries are obtained from clusters formed by running a clustering method repeatedly until certain criteria are met. A short text is compressed with the dictionary that shares the largest number of words with it. Tests on datasets containing short texts in six different languages show that the proposed method compresses better than the general-purpose compression methods Gzip, Bzip2, Zstd, and PPMd. In tests on a dataset containing only English short texts, SDC also compresses better than smaz, shoco, and b64pack, which are designed for short texts, and better than Brotli, which performs well on short texts because it uses a static dictionary.
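
The dictionary-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the whitespace tokenizer, the function names (`pick_dictionary`, `encode`, `decode`), the toy dictionaries, and the choice to leave out-of-dictionary words as literals are all assumptions made for illustration. The abstract does not specify how SDC encodes indices or handles unknown words.

```python
from typing import List, Tuple, Union

Token = Union[int, str]  # dictionary index, or a literal word not in the dictionary


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization; the paper may tokenize differently.
    return text.lower().split()


def pick_dictionary(text: str, dictionaries: List[List[str]]) -> int:
    # Choose the dictionary sharing the most distinct words with the text.
    words = set(tokenize(text))
    return max(range(len(dictionaries)),
               key=lambda i: len(words & set(dictionaries[i])))


def encode(text: str, dictionaries: List[List[str]]) -> Tuple[int, List[Token]]:
    # Replace each known word by its index in the selected dictionary.
    dict_id = pick_dictionary(text, dictionaries)
    index = {word: i for i, word in enumerate(dictionaries[dict_id])}
    tokens: List[Token] = [index.get(word, word) for word in tokenize(text)]
    return dict_id, tokens


def decode(dict_id: int, tokens: List[Token], dictionaries: List[List[str]]) -> str:
    # Map indices back to words; literal words pass through unchanged.
    words = dictionaries[dict_id]
    return " ".join(words[t] if isinstance(t, int) else t for t in tokens)


# Hypothetical toy dictionaries standing in for the cluster-derived ones.
dictionaries = [
    ["the", "cat", "sat"],
    ["stock", "market", "rose", "the"],
]
dict_id, tokens = encode("the stock market rose today", dictionaries)
restored = decode(dict_id, tokens, dictionaries)
```

A real compressor would additionally pack the dictionary identifier and word indices into a compact bit stream. The sketch only shows why a dictionary with high word overlap leaves few literals to store.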