Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages

Benoit Berthelier
Journal: Korean Studies (Asian Studies; impact factor 0.3)
Publication date: 2023-01-01
Publication type: Journal Article
DOI: 10.1353/ks.2023.a908624 (https://doi.org/10.1353/ks.2023.a908624)

Abstract

The digital world is marked by large asymmetries in the volume of content available in different languages. As a direct corollary, this inequality also exists, in amplified form, in the number of resources (labeled and unlabeled datasets, pretrained models, academic research) available for the computational analysis of these languages, generally called natural language processing (NLP). NLP literature divides languages into high- and low-resource languages. Thanks to early private and public investment in the field, Korean is generally considered a high-resource language. Yet the good fortunes of Korean in the age of machine learning obscure the divided state of the language: surveys of available resources and research focus solely on the standard language of South Korea, making it the sole representative of an otherwise diverse linguistic family that includes the Northern standard language as well as regional and diasporic dialects. This paper shows that the resources developed for the South Korean language do not necessarily transfer to the North Korean language. However, it also argues that this does not make North Korean a low-resource language. On the one hand, South Korean resources can be augmented with North Korean data to achieve better performance. On the other, North Korean has more resources than commonly assumed. Retracing the long history of NLP research in North Korea, the paper shows that a large body of datasets and research exists for the North Korean language, even if it is not easily available. The paper concludes by exploring the possibility of "unified" language models and underscoring the need for active NLP research collaboration across the Korean peninsula.