Impacts of Homophone Normalization on Semantic Models for Amharic

2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA). Pub Date: 2021-11-22. DOI: 10.1109/ict4da53266.2021.9672229

Tadesse Destaw Belay, A. Ayele, G. Gelaye, Seid Muhie Yimam, Chris Biemann


Citations: 5

Abstract

Amharic is the second-most spoken Semitic language after Arabic and serves as the official working language of the government of Ethiopia. In Amharic writing, different characters can have the same sound; these are called homophones. The current trend in Amharic NLP research is to normalize homophones into a single representation. This means that instead of the characters […], […], and […], the character […] is used; instead of […] and […], the character […] is used; and so on. (Footnote 1: We have used the IPA notation for Amharic character transliteration.) This is done on the assumption that such characters are redundant letters because they have the same sound. However, the impact of homophone normalization on Amharic NLP applications is not well studied. When one homophone character is substituted for another, the meaning changes, and the substitution violates Amharic writing regulations. For example, the word […] means “poverty” while […] means “salvage”. These two words are homophones, but they have different meanings. To study the impacts of homophone normalization, we develop different general-purpose pre-trained embedding models for Amharic using regular and normalized homophone characters. We fine-tune the pre-trained models and build several Amharic NLP applications. For PoS tagging, a model that uses a regular FLAIR embedding performs better, achieving an F1-score of 77%. For sentiment analysis, the model built on the regular RoBERTa embedding outperforms the other models with an F1-score of 60%. For IR systems, we achieve an F1-score of 90% using normalized documents. The results show that the effect of normalization depends heavily on the NLP application. For sentiment analysis and PoS tagging, normalization has negative impacts, while it is essential for IR. Our research indicates that normalization should be applied with caution and that more effort should be directed toward standardization.
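To make the normalization step concrete, the following is a minimal Python sketch of character-level homophone normalization. The specific groups and the choice of canonical character are assumptions based on commonly cited Amharic homophone sets, not the mapping table used in the paper, and the names `HOMOPHONE_SERIES`, `build_table`, and `normalize_homophones` are hypothetical. Each Ethiopic consonant series is addressed by its Unicode base code point, and only the seven basic vowel orders are mapped.

```python
# Assumed homophone groups (illustrative, not the paper's table).
# Each pair is (base code point of variant series, base of canonical series).
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # HHA series (ሐ) -> HA series (ሀ)
    (0x1280, 0x1200),  # XA series (ኀ) -> HA series (ሀ)
    (0x1220, 0x1230),  # SZA series (ሠ) -> SA series (ሰ)
    (0x12D0, 0x12A0),  # PHARYNGEAL A series (ዐ) -> GLOTTAL A series (አ)
    (0x1340, 0x1338),  # TZA series (ፀ) -> TSA series (ጸ)
]

# Only the seven basic vowel orders of each series are handled here.
ORDERS = 7


def build_table(series=HOMOPHONE_SERIES):
    """Build a code-point translation table usable with str.translate."""
    table = {}
    for variant_base, canonical_base in series:
        for order in range(ORDERS):
            table[variant_base + order] = canonical_base + order
    return table


NORMALIZE = build_table()


def normalize_homophones(text):
    """Replace every variant homophone character with its canonical form."""
    return text.translate(NORMALIZE)


# Two strings that differ only in one homophone character (U+1205 vs U+1285).
regular_a = "\u12f5\u1205\u1290\u1275"
regular_b = "\u12f5\u1285\u1290\u1275"

# Before normalization the strings are distinct tokens; afterwards they merge.
print(regular_a == regular_b)
print(normalize_homophones(regular_a) == normalize_homophones(regular_b))
```

The two strings at the end illustrate the trade-off reported in the abstract: merging spelling variants helps a retrieval system match a query against documents written with a different homophone, but it also erases a distinction that a tagger or sentiment model could otherwise use.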
