Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers

Q2 Computer Science

International Journal of Computers and Applications. Pub Date: 2023-05-04. DOI: 10.1080/1206212X.2023.2218236

Mohamed Raouf Kanfoud, Abdelkrim Bouramoul

Pages: 391-402. Source record: Semantic Scholar, https://doi.org/10.1080/1206212X.2023.2218236

Citations: 0

Abstract


The Web has become one of the most important data sources, and the content shared is most often multilingual, as users belong to different cultures and speak different languages. Multilingual content (document) is not suitable for many people who only need content in one language. Furthermore, dividing a multilingual document into monolingual documents helps researchers extract only the text of the desired language to use in different tasks such as training or model testing. Therefore, it is challenging to clean and divide the raw content manually. This paper presents an automatic approach to dividing a multilingual document and reassembling it into monolingual documents by examining three existing state-of-the-art tools for Language Identification (LI). We prepared different corpora with different heterogeneity characteristics for the evaluation and evaluated their code-switching pattern using three different code-switching metrics. The proposed approach reached 99% as the best accuracy result for the long segment (long text) and 90% for the mixed segment. In addition, a good correlation was found between the I-Index and accuracy with Pearson’s r = −0.998.
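The divide-and-reassemble idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `toy_identifier` is a hypothetical stand-in for a pre-trained language identifier (the abstract does not name the three tools), the sentence splitter is a simple regular expression, and `i_index` computes the fraction of adjacent label pairs that differ, applied to segment labels rather than to individual tokens.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Callable, Dict, List


def toy_identifier(segment: str) -> str:
    """Hypothetical stand-in for a pre-trained language identifier.

    It only looks at the dominant Unicode script, so it cannot separate
    languages that share a script (for example English and French).
    """
    counts: Dict[str, int] = defaultdict(int)
    for ch in segment:
        if ch.isalpha():
            # The first word of a Unicode character name is its script,
            # e.g. "LATIN", "CYRILLIC", "ARABIC", "CJK".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    if not counts:
        return "und"  # undetermined: no alphabetic characters
    return max(counts, key=counts.get).lower()


def split_segments(text: str) -> List[str]:
    """Split after sentence-final punctuation or at line breaks (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p and p.strip()]


def regroup_by_language(
    text: str, identify: Callable[[str], str] = toy_identifier
) -> Dict[str, str]:
    """Label every segment, then reassemble one monolingual document per label."""
    docs: Dict[str, List[str]] = defaultdict(list)
    for segment in split_segments(text):
        docs[identify(segment)].append(segment)
    return {lang: "\n".join(parts) for lang, parts in docs.items()}


def i_index(labels: List[str]) -> float:
    """Fraction of adjacent label pairs that switch language (0 = none, 1 = all)."""
    if len(labels) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


if __name__ == "__main__":
    sample = "Hello world. Привет, мир. How are you? Как дела?"
    labels = [toy_identifier(s) for s in split_segments(sample)]
    print(regroup_by_language(sample))
    print("I-Index over segment labels:", i_index(labels))
```

In practice the `identify` argument would be replaced by a call into a real language identification tool; the regrouping and the switch-rate computation do not depend on which one is used.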


Source journal

International Journal of Computers and Applications (Computer Science: Computer Graphics and Computer-Aided Design)

CiteScore

4.70

Self-citation rate

0.00%

Articles published

About the journal: The International Journal of Computers and Applications (IJCA) is a unique platform for publishing novel ideas, research outcomes and fundamental advances in all aspects of Computer Science, Computer Engineering, and Computer Applications. This is a peer-reviewed international journal with a vision to provide the academic and industrial community a platform for presenting original research ideas and applications. IJCA welcomes four special types of papers in addition to the regular research papers within its scope:

(a) Papers for which all results are easily reproducible. For such papers, the authors will be asked to upload "instructions for reproduction", possibly with the source code or stable URLs from which the code can be downloaded.

(b) Papers with negative results. For such papers, the experimental setting and negative results must be presented in detail, and the importance of the negative results for the research community must be explained clearly. The rationale behind this kind of paper is that it helps researchers choose the correct approaches to solve problems and avoid approaches that have already been tried and failed.

(c) Detailed reports, case studies and literature review articles about innovative software or hardware, new technology, high-impact computer applications and future development, with sufficient background and subject coverage.

(d) Special issue papers focussing on a particular theme of significant importance, or papers selected from a relevant conference with sufficient improvement and new material to differentiate them from the papers published in the conference proceedings.