Informal-to-Formal Word Conversion for Persian Language Using Natural Language Processing Techniques

Proceedings of the 2021 2nd International Conference on Computing, Networks and Internet of Things. Pub Date: 2021-05-20. DOI: 10.1145/3468691.3468710

A. Naemi, Marjan Mansourvar, Mostafa Naemi, Bahman Damirchilu, A. Ebrahimi, U. Wiil


Citations: 3

Abstract



Widespread use of social media has made a vast amount of text data available on the Internet. Natural language processing can extract valuable information from this data, but extraction is difficult because the texts are often written informally. This paper addresses that challenge by converting Persian informal words to formal words using a spell-checking approach. Two datasets, one of formal words and one of informal words, were extracted from the four most visited Persian news websites. Informal words were then divided into categories based on the level of change required to build their formal equivalents, and converted to formal forms according to their features. Statistical analyses combined with correction rules produced a “candidate list” from which the best formal equivalent is selected. The conversion system was evaluated on reader comments extracted from the four most visited Persian (Farsi) news agencies. Results show that the proposed system detects approximately 94% of Persian informal words and can detect 85% of the best equivalent formal words. In a comparison with two well-known Persian spell-checkers, Virastyar and Vafa, the proposed system performs significantly better in both detection and correction. Further analysis shows that the time complexity of the proposed system is linear.
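The abstract does not give the actual correction rules, categories, or statistics, so the following is only a minimal sketch of the general idea it describes: flag a word as informal when it is missing from a formal lexicon, apply rewrite rules to generate a candidate list, and rank candidates by frequency. The lexicon, its counts, the two rules, and every function name here are hypothetical illustrations, not the authors' data or method.

```python
from collections import Counter

# Hypothetical formal lexicon with made-up frequency counts (not from the paper).
FORMAL_FREQ = Counter({"نان": 120, "خانه": 950, "تمام": 400, "است": 5000})

# Illustrative rewrite rules (informal substring -> formal substring).
# They mimic the common colloquial shift of "ان"/"ام" to "ون"/"وم";
# the paper's real rule set is not stated in the abstract.
RULES = [("ون", "ان"), ("وم", "ام")]


def is_informal(word: str) -> bool:
    """Treat any word missing from the formal lexicon as informal."""
    return word not in FORMAL_FREQ


def candidates(word: str) -> list[str]:
    """Apply each rule at each matching position; keep variants found in the lexicon."""
    found = set()
    for informal, formal in RULES:
        start = word.find(informal)
        while start != -1:
            variant = word[:start] + formal + word[start + len(informal):]
            if variant in FORMAL_FREQ:
                found.add(variant)
            start = word.find(informal, start + 1)
    # Rank by descending frequency, breaking ties alphabetically.
    return sorted(found, key=lambda w: (-FORMAL_FREQ[w], w))


def formalize(word: str) -> str:
    """Return the top-ranked formal candidate, or the word unchanged if none exists."""
    if not is_informal(word):
        return word
    ranked = candidates(word)
    return ranked[0] if ranked else word
```

With this toy lexicon, `formalize("خونه")` yields `"خانه"`, while a word already in the lexicon or one with no valid candidate is returned unchanged. Each word is processed with a fixed number of rule passes, which is consistent in spirit with the linear time complexity the abstract reports, though the sketch says nothing about the real system's cost.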
