Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset

2020 International Conference on Computational Intelligence (ICCI) Pub Date : 2020-10-08 DOI:10.1109/ICCI51257.2020.9247814

Faisal Baseer, J. Jaafar, I. Aziz, Asad Habib



Abstract

Urdu is among the most widely used languages in the world for verbal and written communication. Because many platforms lack optimized, user-friendly support for the native Urdu script, Urdu in digital form is mostly written in Romanized script. In our research, we have developed a refined Urdu lexicon using the tokens with the highest frequency of occurrence in the data set. This data set is a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used this language as a mode of communication on the Internet and in text messaging. The raw corpus is passed through a series of steps such as preprocessing, tokenization and annotation before it is handed to the computationally intensive subsequent steps. Edit distance and K-means clustering techniques are used to identify candidate tokens and to decide on their potential selection for inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens and other linguistic attributes from the data collected. Based on this analysis, we have proposed a computational model for refined colloquial Romanized Urdu lexicon development.
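As a rough illustration of how token frequency and edit distance can work together when building such a lexicon, the following is a minimal sketch and not the authors' pipeline. It assumes that the most frequent spelling of a word is taken as the canonical form and that any rarer token within a Levenshtein distance of 1 is treated as a spelling variant of it. The function names, the `max_distance` threshold and the sample messages are all hypothetical, and the K-means clustering step mentioned in the abstract is not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def build_lexicon(messages, max_distance=1):
    """Map each canonical spelling to the rarer variants close to it.

    Assumption for this sketch: the most frequent spelling is canonical.
    """
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    lexicon = {}
    # Visit tokens from most to least frequent; ties are broken alphabetically.
    for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for canonical in lexicon:
            if edit_distance(token, canonical) <= max_distance:
                lexicon[canonical].append(token)
                break
        else:
            # No existing entry is close enough, so start a new one.
            lexicon[token] = []
    return lexicon


# Hypothetical Romanized Urdu messages with spelling variation.
messages = [
    "zindagi bohat mushkil",
    "zindgi bohot mushkil",
    "zindagi bohat achi",
]
lexicon = build_lexicon(messages)
print(lexicon)
```

With these sample messages, `bohot` is grouped under `bohat` and `zindgi` under `zindagi`, while `mushkil` and `achi` remain entries without variants. A fixed threshold of 1 tends to merge unrelated short words (for example, two-letter or three-letter tokens that differ by one character), so a real system would likely need a length-aware threshold or an additional clustering step.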
