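Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages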

IF 0.3 0 ASIAN STUDIES

Korean Studies Pub Date : 2023-01-01 DOI:10.1353/ks.2023.a908624

Benoit Berthelier


Citations: 0


Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages

Abstract: The digital world is marked by large asymmetries in the volume of content available between different languages. As a direct corollary, this inequality also exists, amplified, in the number of resources (labeled and unlabeled datasets, pretrained models, academic research) available for the computational analysis of these languages or what is generally called natural language processing (NLP). NLP literature divides languages between high- and low-resource languages. Thanks to early private and public investment in the field, the Korean language is generally considered to be a high-resource language. Yet, the good fortunes of Korean in the age of machine learning obscure the divided state of the language, as recensions of available resources and research solely focus on the standard language of South Korea, thus making it the sole representant of an otherwise diverse linguistic family that includes the Northern standard language as well as regional and diasporic dialects. This paper shows that the resources developed for the South Korean language do not necessarily transfer to the North Korean language. However, it also argues that this does not make North Korean a low-resource language. On one hand, South Korean resources can be augmented with North Korean data to achieve better performance. On the other, North Korean has more resources than commonly assumed. Retracing the long history of NLP research in North Korea, the paper shows that a large number of datasets and research exists for the North Korean language even if they are not easily available. The paper concludes by exploring the possibility of "unified" language models and underscoring the need for active NLP research collaboration across the Korean peninsula.


Source journal

Korean Studies (ASIAN STUDIES)

CiteScore

0.50

Self-citation rate

0.00%

Articles published