Efficient inverted index with n-gram sampling for string matching in Arabic documents

2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA). Pub Date: 2016-11-01. DOI: 10.1109/AICCSA.2016.7945743

El Moatez Billah Nagoudi, A. Khorsi, H. Cherroun


Citations: 1

Abstract

Text search is the basis of countless applications and techniques. It is constrained by the space and time limitations inherent in different contexts and scenarios. A common way to reduce the cost of the general search task is to start from the characteristics that are particular to the targeted entity. In this paper, we propose an approximate index-based text search algorithm whose performance can be tuned to the user's time and memory constraints. The main idea is to exploit the uneven distribution of letter and n-gram frequencies in natural language text: by storing only the less frequent letters and n-grams, we reduce both the index size and the search time. Our technique also lets the user choose the tradeoff between index size and query performance, which can be balanced by varying three parameters. This makes the approach adaptable to different settings. The tests described in this paper were conducted on an Arabic collection of more than 450 documents and more than 20 million words. Experimental results show that our n-gram inverted index is 40%–85% smaller than the full n-gram inverted index, with tolerable performance penalties. Generalization to other languages should be straightforward as long as the underlying statistical property holds.
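
The main idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not name the three tuning parameters, so `n` and `keep_fraction` are hypothetical knobs, and the helper names (`ngrams`, `build_sampled_index`, `search`) are invented for illustration. The sketch indexes only the rarest character n-grams, uses them to narrow the candidate documents, and confirms each candidate with a direct substring check. A query that contains no indexed n-gram falls back to scanning every document, which is where the performance penalty of a smaller index would show up.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return every contiguous character n-gram of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_sampled_index(docs, n=2, keep_fraction=0.5):
    """Index only the rarest n-grams (n and keep_fraction are hypothetical knobs)."""
    counts = Counter(g for doc in docs for g in ngrams(doc, n))
    # Rank from rarest to most frequent; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda g: (counts[g], g))
    kept = set(ranked[:max(1, int(len(ranked) * keep_fraction))])
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for g in ngrams(doc, n):
            if g in kept:
                index[g].add(doc_id)
    return dict(index)


def search(query, docs, index, n=2):
    """Return the ids of documents that contain query as a substring."""
    indexed = [g for g in ngrams(query, n) if g in index]
    if indexed:
        # Every true match must contain each indexed n-gram of the query.
        candidates = set.intersection(*(index[g] for g in indexed))
    else:
        # No usable n-gram: fall back to scanning all documents.
        candidates = range(len(docs))
    # Verify candidates directly so the returned ids are exact.
    return sorted(d for d in candidates if query in docs[d])


docs = ["كتاب جديد", "كتب قديمة", "باب البيت"]
index = build_sampled_index(docs, n=2, keep_fraction=0.3)
print(len(index), search("كت", docs, index))
```

Lowering `keep_fraction` shrinks the index but sends more queries to the full-scan fallback, which mirrors the index size versus query time tradeoff described in the abstract.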
