MorphUz: Morphological Analyzer for the Uzbek Language

2022 7th International Conference on Computer Science and Engineering (UBMK). Pub Date: 2022-09-14. DOI: 10.1109/UBMK55850.2022.9919579

N. Abdurakhmonova, Ismailov Alisher, Rano Sayfulleyeva


Citations: 1

Abstract

Uzbek is an agglutinative language: words are derived from stems (roots) by concatenating affixes. This property yields a large number of morpheme combinations and greatly increases the vocabulary size. Words are therefore split into sub-word units for use in text and speech processing applications. Well-chosen sub-word units not only provide high coverage and a smaller lexicon, but also carry the semantic and syntactic information that downstream applications need. This paper presents MorphUz, a morphological analysis tool for natural language processing and machine learning that splits a text into a sequence of morphemes. Morphological analysis is one of the main components of natural language processing. MorphUz is an open-source morphological analyzer for the Uzbek language and is available as a website for exploration. It implements Uzbek morphology following a two-level approach that combines stemming and suffix analysis. MorphUz is implemented with PHP and JavaScript scripts and a MySQL database.
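
The stemming-plus-suffix idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (which the abstract says uses PHP, JavaScript, and MySQL); it is a toy Python example under stated assumptions: the stem list, the suffix inventory, and the names `STEMS`, `SUFFIXES`, and `segment` are hypothetical and chosen only for illustration. It strips known suffixes from the right until a known stem remains.

```python
from typing import Optional

# Toy data for illustration only; NOT the MorphUz lexicon or suffix table.
STEMS = {"kitob", "maktab", "uy"}
SUFFIXES = {
    "lar": "plural",
    "im": "1sg possessive",
    "imiz": "1pl possessive",
    "da": "locative",
    "dan": "ablative",
    "ga": "dative",
    "ni": "accusative",
    "ning": "genitive",
}

# Try longer suffixes first so that e.g. "imiz" is preferred over "im".
_BY_LENGTH = sorted(SUFFIXES, key=len, reverse=True)


def segment(word: str) -> Optional[list[str]]:
    """Split a word into [stem, suffix, ...], or return None if no split is found."""
    word = word.lower()
    if word in STEMS:
        return [word]
    for suffix in _BY_LENGTH:
        # Peel one suffix off the right edge, then analyze the remainder.
        if word.endswith(suffix) and len(word) > len(suffix):
            head = segment(word[: -len(suffix)])
            if head is not None:
                return head + [suffix]
    return None
```

For example, `segment("kitoblarimizda")` ("in our books") returns `["kitob", "lar", "imiz", "da"]`, and a word whose remainder never matches a known stem returns `None`. A real analyzer would also need to handle phonological alternations and ambiguous splits, which this sketch ignores.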


