Efficient inverted index with n-gram sampling for string matching in Arabic documents
Authors: El Moatez Billah Nagoudi, A. Khorsi, H. Cherroun
Published in: 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), 2016-11-01
DOI: https://doi.org/10.1109/AICCSA.2016.7945743
Text search is the basis of countless applications and techniques. It is constrained by the space and time limitations inherent in different contexts and scenarios. A common approach to minimizing the cost of the general search task is to start from the characteristics that are particular to the targeted entity. In this paper, we propose an approximate index-based text search algorithm whose performance can be customized with respect to the user's time and memory constraints. The main idea is to exploit the uneven distribution of the frequencies of letters and n-grams in natural language text: we store only the less frequent letters and n-grams, which reduces both the index size and the search time. Moreover, our technique gives the user the flexibility to choose the tradeoff between index size and query performance. Search time and index size can be balanced by varying three parameters, which makes our approach adaptable to different settings. The tests described in this paper were conducted on an Arabic collection of more than 450 documents and more than 20 million words. Experimental results show that the size of our n-gram inverted index is reduced by 40% to 85% (with tolerable performance penalties) compared with that of the full n-gram inverted index. Generalization to other languages should be straightforward as long as the underlying statistical property applies.
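The idea of indexing only the less frequent n-grams can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `n` and `keep_ratio`, and the sequential-scan fallback for queries that contain no indexed n-gram are all assumptions made for illustration, and they do not correspond to the three parameters mentioned in the abstract, which are not specified here.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield (position, n-gram) pairs for every character n-gram in text."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]


def build_sampled_index(docs, n=2, keep_ratio=0.5):
    """Index only the least frequent n-grams (keep_ratio is a hypothetical knob)."""
    counts = Counter(g for doc in docs for _, g in ngrams(doc, n))
    # Rank distinct n-grams from rarest to most frequent; ties broken by the n-gram.
    ranked = sorted(counts, key=lambda g: (counts[g], g))
    kept = set(ranked[:max(1, int(len(ranked) * keep_ratio))])
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for pos, g in ngrams(doc, n):
            if g in kept:
                index[g].append((doc_id, pos))
    return dict(index)


def search(pattern, docs, index, n=2):
    """Return sorted (doc_id, start) pairs where pattern occurs in docs."""
    # Query n-grams that survived sampling, with their offset inside the pattern.
    indexed = [(off, g) for off, g in ngrams(pattern, n) if g in index]
    if not indexed:
        # Assumed fallback: no indexed n-gram in the query, so scan sequentially.
        return sorted(
            (doc_id, i)
            for doc_id, doc in enumerate(docs)
            for i in range(len(doc) - len(pattern) + 1)
            if doc.startswith(pattern, i)
        )
    # Use the indexed n-gram with the shortest posting list as the filter.
    off, g = min(indexed, key=lambda item: len(index[item[1]]))
    hits = set()
    for doc_id, pos in index[g]:
        start = pos - off  # candidate start of the whole pattern
        if start >= 0 and docs[doc_id].startswith(pattern, start):
            hits.add((doc_id, start))  # verified by direct comparison
    return sorted(hits)


docs = ["abracadabra", "cadabra abra"]
index = build_sampled_index(docs, n=2, keep_ratio=0.5)
print(search("cad", docs, index))  # → [(0, 4), (1, 0)]
```

In this sketch a smaller `keep_ratio` shrinks the index but sends more queries to the slower fallback path, which mirrors the size versus query-time tradeoff the abstract describes. The Latin-script toy strings are only for readability; the same code operates on Arabic text, since Python strings are sequences of Unicode code points.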