Spotting Urdu Stop Words By Zipf's Statistical Approach

Nuzhat Khan, Muhammad Paend Bakht, Muhammad Junaid Khan, Abdul Samad, Gul Sahar
DOI: 10.1109/MACS48846.2019.9024817
Published in: 2019 13th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS)
Publication date: December 2019
Citation count: 5

Abstract

This paper presents an innovative method to extract stop words from large Urdu texts. Stop words are low-information words in natural language that slow down language processing and negatively affect language analysis. For language analysis, stop words are removed first to ensure fast data processing, but for the Urdu language there is no reliable stop-word removal method. In this work, we apply Zipf's law of two-factor dependency, together with the least-effort principle, to spot stop words in an Urdu corpus created specifically for this research. All Urdu text processing and investigation is carried out in Python 3.4. Previous work on stop-word removal is also investigated and shown to be less helpful. Using the Zipfian approach, 358 of the 500 highest-frequency words are identified as stop words. It is observed that by focusing on only 0.01% of a large corpus, almost all stop words can be spotted, yielding a stop-word list with minimal manual effort. Furthermore, statistical patterns in stop words and content words, the stop-word to content-word ratio in data samples, and the dependency of stop words and content words on data size are also examined. In terms of data size, frequency, and rank, Zipf's law and Heaps' law coexist in Urdu stop words. Stop words tend to follow predictable, measurable patterns that can lead to reliable probabilistic methods for Urdu processing. This deterministic approach provides a strong research ground for exploring stop words in Urdu text statistically.
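The Zipfian idea the abstract describes, that stop words concentrate in the high-frequency head of the rank-frequency distribution, can be sketched as follows. This is an illustrative sketch only, not the authors' actual procedure: the function name, the `top_n` and `coverage` parameters, and the toy token list are all assumptions introduced here for demonstration.

```python
from collections import Counter

def zipf_stopword_candidates(tokens, top_n=500, coverage=0.6):
    """Rank words by frequency (Zipf's rank-frequency view) and flag
    the high-frequency head as stop-word candidates, stopping once the
    candidates cover `coverage` of all tokens. Hypothetical thresholds;
    the paper's selection criteria may differ."""
    counts = Counter(tokens)
    total = sum(counts.values())
    candidates = []
    cumulative = 0
    for word, freq in counts.most_common(top_n):
        cumulative += freq
        candidates.append(word)
        if cumulative / total >= coverage:
            break
    return candidates

# Toy English corpus standing in for Urdu text: function words repeat
# heavily, content words appear once, mimicking a Zipfian head and tail.
tokens = ("the of and a to the of in the and a of " * 10).split()
tokens += "linguistics corpus zipf analysis".split()
print(zipf_stopword_candidates(tokens))
```

On a real corpus, such a list would still need a manual pass, which is consistent with the paper's observation that the Zipfian head (0.01% of the corpus vocabulary) captures almost all stop words with little manual effort.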