Automatic POS tagging of Arabic words using the YAMCHA machine learning tool

2022 20th International Conference on Language Engineering (ESOLEC). Pub Date: 2022-10-12. DOI: 10.1109/ESOLEC54569.2022.10009473

Alaa Elnily, Ahmed Abdelghany


Citations: 0

Abstract



Automatic POS tagging is the process of automatically assigning the appropriate part-of-speech (POS) tag to each word in a text based on its context. It is a crucial step in the majority of NLP applications. This study proposes a machine learning-based Arabic POS tagger built with the YAMCHA tool, which uses Support Vector Machines (SVMs) as its learning algorithm. SVMs classify data with high accuracy while relying on part of the training data (the support vectors) to define the classifier. Training the system therefore requires a substantial amount of data annotated at the POS level. A corpus of 100,039 words was used in this study, divided into a training part of 64,608 words and a testing part of 35,431 words. A tag set of 48 morphological tags was used in training and testing. To obtain the best tagging results, the system was trained multiple times, varying the range of linguistic information used in training, and was then tested and evaluated on new texts. The lowest error rate achieved was 11.4%. This rate was reached when the word preceding the target word was considered in training without considering its POS tag (F:-1..0:0..).
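The feature range reported above, F:-1..0:0.., can be read as "use every feature column of the preceding row and the current row, and no previously predicted tags". The following Python sketch illustrates that reading only; it is not the authors' code and does not call YAMCHA. The function name `window_features`, the `<PAD>` placeholder, the feature-key format, and the toy transliterated tokens are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical placeholder for positions that fall outside the sentence.
BOUNDARY = "<PAD>"


def window_features(
    columns: Sequence[Sequence[str]],
    position: int,
    row_range: Tuple[int, int] = (-1, 0),
) -> Dict[str, str]:
    """Collect the static features for one token.

    `columns` holds one list of feature columns per token (here only the
    word form). `row_range` (-1, 0) mirrors the assumed reading of the
    template F:-1..0:0..: all columns of the preceding and current rows.
    No tag of a neighbouring word is included.
    """
    first, last = row_range
    width = len(columns[position])
    features: Dict[str, str] = {}
    for offset in range(first, last + 1):
        index = position + offset
        inside = 0 <= index < len(columns)
        for col in range(width):
            # Pad with BOUNDARY when the window reaches past the sentence edge.
            value = columns[index][col] if inside else BOUNDARY
            features[f"F:{offset}:{col}"] = value
    return features


# Toy sentence with illustrative transliterated tokens (not from the paper's corpus).
sentence = [["ktb"], ["Alwld"], ["Aldrs"]]
features_per_token = [window_features(sentence, i) for i in range(len(sentence))]
```

Each resulting dictionary could then be passed to any classifier. Widening `row_range` to, say, `(-2, 2)` would correspond to the kind of change in "range of linguistic information" that the abstract says was varied between training runs.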
