Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

British Journal of Applied Science & Technology. Pub Date: 2017-01-10. DOI: 10.9734/bjast/2017/29754

F. Ba-Alwi, M. Albared, Tareq Al-Moslmi


Citations: 1

Abstract


Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

Original Research Article. Ba-Alwi et al.; BJAST, 19(1): 1-10, 2017; Article no.BJAST.29754

As a morphologically rich language, Arabic poses special challenges for Part-of-Speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of segmentation level, or input unit (word-based or morpheme-based), is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language, and composite POS tags are assigned to them. Large amounts of training data are therefore required to ensure statistical significance, and these approaches suffer from data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units. This, however, results in a high ambiguity rate and a small context for resolving that ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with prefix guessing models, the Arabic HMM POS tagger with linear interpolation guessing models, and the TnT tagger, given training data from both morpheme-based and word-based tokenization levels. It also studies the influence of each choice on the tagging performance of the Arabic POS tagging models in terms of tagging accuracy and time complexity. In addition, it evaluates the tagging performance of several stochastic models given training data from both segmentation levels. Results show that the morpheme-based POS tagging strategy is more adequate for training statistical POS tagging models, as it provides better overall tagging accuracy and much faster training and tagging times.
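The sparseness argument in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or data: the transliterated segments, the tag names, and the helper functions `word_view`, `morpheme_view`, and `stats` are all hypothetical, chosen only to show how the same toy corpus looks at the two segmentation levels. At the word level each token carries a composite tag, so unseen word forms and unseen tag combinations appear quickly; at the morpheme level the same segments recur and the tagset stays simple.

```python
# Toy corpus (hypothetical, NOT taken from the paper or any Quranic corpus).
# Each word is a list of (segment, tag) pairs in a rough transliteration.
TRAIN = [
    [("wa", "CONJ"), ("kitab", "NOUN"), ("hu", "PRON")],
    [("bi", "PREP"), ("qalam", "NOUN")],
    [("al", "DET"), ("kitab", "NOUN")],
    [("wa", "CONJ"), ("al", "DET"), ("qalam", "NOUN")],
]
TEST = [
    [("bi", "PREP"), ("kitab", "NOUN"), ("hu", "PRON")],
    [("wa", "CONJ"), ("qalam", "NOUN")],
]


def word_view(words):
    """Word-based units: one token per word, carrying a composite tag."""
    return [
        ("+".join(seg for seg, _ in word), "+".join(tag for _, tag in word))
        for word in words
    ]


def morpheme_view(words):
    """Morpheme-based units: one token per segment, carrying a simple tag."""
    return [pair for word in words for pair in word]


def stats(train, test):
    """Count vocabulary, tagset size, and how much of the test set is unseen."""
    vocab = {form for form, _ in train}
    tagset = {tag for _, tag in train}
    unknown_forms = sum(1 for form, _ in test if form not in vocab)
    unseen_tags = sum(1 for _, tag in test if tag not in tagset)
    return {
        "train_tokens": len(train),
        "vocab": len(vocab),
        "tags": len(tagset),
        "unknown_rate": unknown_forms / len(test),
        "unseen_tag_rate": unseen_tags / len(test),
    }


word_stats = stats(word_view(TRAIN), word_view(TEST))
morph_stats = stats(morpheme_view(TRAIN), morpheme_view(TEST))

print("word-based:    ", word_stats)
print("morpheme-based:", morph_stats)
```

On this toy data every test word is an unknown form with an unseen composite tag, while every test morpheme and every simple tag was already observed in training. The sketch only illustrates the sparseness side of the trade-off; it does not model the higher ambiguity and shorter effective n-gram context that the abstract attributes to morpheme-based tagging.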
