WikiDocsAligner:一个现成的维基百科文档对齐工具

Motaz Saad, B. Alijla
{"title":"WikiDocsAligner:一个现成的维基百科文档对齐工具","authors":"Motaz Saad, B. Alijla","doi":"10.1109/PICICT.2017.27","DOIUrl":null,"url":null,"abstract":"Wikipedia encyclopedia is an attractive source for comparable corpora in many languages. Most researchers develop their own script to perform document alignment task, which requires efforts and time. In this paper, we present WikiDocsAligner, an off-the-shelf Wikipedia Articles alignment handy tool. The implementation of WikiDocsAligner does not require the researchers to import/export of interlanguage links databases. The user just need to download Wikipedia dumps (interlanguage links and articles), then provide them to the tool, which performs the alignment. This software can be used easily to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia. So we shed the light on Wikipedia as a source of Arabic dialects language resources. The produced resources is interesting and useful as the demand on Arabic/dialects language resources increased in the last decade.","PeriodicalId":259869,"journal":{"name":"2017 Palestinian International Conference on Information and Communication Technology (PICICT)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool\",\"authors\":\"Motaz Saad, B. Alijla\",\"doi\":\"10.1109/PICICT.2017.27\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Wikipedia encyclopedia is an attractive source for comparable corpora in many languages. Most researchers develop their own script to perform document alignment task, which requires efforts and time. In this paper, we present WikiDocsAligner, an off-the-shelf Wikipedia Articles alignment handy tool. The implementation of WikiDocsAligner does not require the researchers to import/export of interlanguage links databases. The user just need to download Wikipedia dumps (interlanguage links and articles), then provide them to the tool, which performs the alignment. This software can be used easily to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia. So we shed the light on Wikipedia as a source of Arabic dialects language resources. The produced resources is interesting and useful as the demand on Arabic/dialects language resources increased in the last decade.\",\"PeriodicalId\":259869,\"journal\":{\"name\":\"2017 Palestinian International Conference on Information and Communication Technology (PICICT)\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 Palestinian International Conference on Information and Communication Technology (PICICT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PICICT.2017.27\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Palestinian International Conference on Information and Communication Technology (PICICT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PICICT.2017.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

摘要

维基百科是许多语言中具有可比性的语料库的有吸引力的来源。大多数研究人员开发自己的脚本来执行文档对齐任务,这需要时间和精力。在本文中,我们介绍了WikiDocsAligner,一个现成的维基百科文章对齐工具。WikiDocsAligner的实现不需要研究人员导入/导出语言间链接数据库。用户只需要下载维基百科转储(跨语言链接和文章),然后将它们提供给执行对齐的工具。这个软件可以很容易地用任何语言对对齐维基百科文档。最后,我们使用WikiDocsAligner来对齐来自阿拉伯语维基百科和埃及语维基百科的可比文档。所以我们把维基百科作为阿拉伯方言语言资源的来源。随着过去十年对阿拉伯语/方言语言资源的需求增加,所生产的资源是有趣和有用的。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool
Wikipedia encyclopedia is an attractive source for comparable corpora in many languages. Most researchers develop their own script to perform document alignment task, which requires efforts and time. In this paper, we present WikiDocsAligner, an off-the-shelf Wikipedia Articles alignment handy tool. The implementation of WikiDocsAligner does not require the researchers to import/export of interlanguage links databases. The user just need to download Wikipedia dumps (interlanguage links and articles), then provide them to the tool, which performs the alignment. This software can be used easily to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia. So we shed the light on Wikipedia as a source of Arabic dialects language resources. The produced resources is interesting and useful as the demand on Arabic/dialects language resources increased in the last decade.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信