Challenges in Developing Persian Corpora from Online Resources

Masood Ghayoomi, S. Momtazi
Published in: 2009 International Conference on Asian Language Processing, 2009-12-07
DOI: 10.1109/IALP.2009.31 (https://doi.org/10.1109/IALP.2009.31)
Citations: 9

Abstract

Persian is an Indo-European language that borrowed its script from Arabic, a member of the Semitic language family. Because the Persian and Arabic scripts are so similar, problems arise when processing electronic text. This paper discusses some of the common problems encountered in practice when developing a Persian corpus from online materials. The problems stem from the Persian script itself, its mixture with the Arabic script, Persian orthography, typists' typing styles, and the mixing of Persian and Arabic code pages in operating systems.
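One of the problem sources named in the abstract, the mixture of Arabic and Persian characters, can be illustrated with a minimal Python sketch. It is not taken from the paper: the mapping table and the function name `normalize_persian` are assumptions made for illustration, and they cover only a few well-known look-alike code points (Arabic Yeh and Kaf, and the Arabic-Indic digits) that a Persian corpus pipeline might plausibly want to unify.

```python
# Illustrative sketch only; the table and function name are hypothetical
# and are not described in the paper.

# Arabic code points that visually resemble Persian letters.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Arabic-Indic digits (U+0660..U+0669) -> Extended Arabic-Indic digits
# (U+06F0..U+06F9), the digit forms used in Persian text.
for i in range(10):
    ARABIC_TO_PERSIAN[chr(0x0660 + i)] = chr(0x06F0 + i)

# Build the translation table once so it can be reused for every document.
_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_persian(text: str) -> str:
    """Replace the Arabic look-alike characters listed above with their
    Persian counterparts; all other characters pass through unchanged."""
    return text.translate(_TABLE)
```

Applying such a step before tokenization would make two spellings of the same word that differ only in these code points compare as equal. Whether additional characters should be mapped depends on the corpus and is left open here.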