Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

F. Ba-Alwi, M. Albared, Tareq Al-Moslmi
British Journal of Applied Science & Technology, 19(1): 1-10, 2017. Article no. BJAST.29754. Published 2017-01-10.
DOI: https://doi.org/10.9734/bjast/2017/29754
Citations: 1

Abstract

As a morphologically rich language, Arabic poses special challenges to Part-of-Speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of segmentation level, that is, whether the input unit is the word or the morpheme, is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language, and composite POS tags are assigned to them. Word-based approaches therefore require large amounts of training data to ensure statistical significance, and they suffer from data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units. This, however, results in a high ambiguity rate and a small context for resolving that ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with prefix guessing models, the Arabic HMM POS tagger with linear interpolation guessing models, and the TnT tagger, given training data from both the morpheme-based and the word-based tokenization levels. It also studies the influence of each choice on the tagging performance of the Arabic POS tagging models, in terms of tagging accuracy and time complexity. In addition, the paper evaluates the tagging performance of several stochastic models, given training data from both segmentation levels. Results show that the morpheme-based POS tagging strategy is better suited to training statistical POS tagging models, as it provides better overall tagging accuracy and much faster training and tagging.
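The contrast between the two input units can be made concrete with a minimal sketch. It assumes a tiny invented corpus in a Buckwalter-style transliteration. The segmentations, the tags, and the helper names `to_word_token` and `oov_rate` are hypothetical illustrations and are not taken from the paper or its data. The sketch shows only the unknown-word effect described in the abstract: a held-out word that is unseen as a whole can still be fully covered by morphemes seen in training.

```python
# Toy data invented for illustration only (not from the paper).
# Each word is a list of (morpheme, tag) pairs.
train = [
    [("wa", "CONJ"), ("kitAb", "N"), ("hu", "PRON")],
    [("fa", "CONJ"), ("kitAb", "N"), ("hum", "PRON")],
    [("bi", "P"), ("kitAb", "N")],
    [("kitAb", "N"), ("hu", "PRON")],
]
held_out = [("wa", "CONJ"), ("kitAb", "N"), ("hum", "PRON")]


def to_word_token(word):
    """Word-based unit: concatenated surface form with a composite tag."""
    form = "".join(m for m, _ in word)
    tag = "+".join(t for _, t in word)
    return form, tag


def oov_rate(vocab, forms):
    """Fraction of forms that never occurred in the training vocabulary."""
    unseen = [f for f in forms if f not in vocab]
    return len(unseen) / len(forms)


# Word-based level: one token per word.
word_train = [to_word_token(w) for w in train]
# Morpheme-based level: one token per morpheme.
morph_train = [pair for w in train for pair in w]

word_vocab = {form for form, _ in word_train}
morph_vocab = {form for form, _ in morph_train}

word_oov = oov_rate(word_vocab, [to_word_token(held_out)[0]])
morph_oov = oov_rate(morph_vocab, [form for form, _ in held_out])

print("word-level tokens:", len(word_train), "OOV rate:", word_oov)
print("morpheme-level tokens:", len(morph_train), "OOV rate:", morph_oov)
```

The same four words yield more tokens at the morpheme level, so a fixed n-gram window there covers fewer words. That is the reduced-context cost the abstract attributes to morpheme-based tagging.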