The Shortcomings of Language Tags for Linked Data When Modeling Lesser-Known Languages

International Conference on Language, Data, and Knowledge. DOI: 10.4230/OASIcs.LDK.2019.4

Frances Gillis-Webber, Sabine Tittel


Citations: 8

Abstract

In recent years, modeling data from linguistic resources with the Resource Description Framework (RDF), following the Linked Data paradigm and using the OntoLex-Lemon vocabulary, has become a prevalent method for creating datasets for a multilingual web of data. An important aspect of this modeling is the use of language tags to mark the lexicons, lexemes, word senses, etc. of a linguistic dataset. However, attempts to model data from lesser-known languages reveal significant shortcomings in the authoritative list of language codes, ISO 639: for many lesser-known languages spoken by minorities, and also for historical stages of languages, language codes, which are the basis of language tags, are simply not available. This paper discusses these shortcomings using three such languages as examples, namely two varieties of click languages of Southern Africa and Old French, and suggests solutions for the issues identified.

2012 ACM Subject Classification: Computing methodologies → Language resources; Information systems → Dictionaries; Information systems → Semantic web description languages; Information systems → Graph-based database models; Information systems → Resource Description Framework (RDF); Software and its engineering → Interoperability; Information systems → Multilingual and cross-lingual retrieval; Computing methodologies → Information extraction; Computing methodologies → Artificial intelligence
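As a rough illustration of the language tags the abstract refers to, the sketch below builds a tag and attaches it to an RDF literal in Turtle syntax. It is not taken from the paper. The helper names `private_use_tag` and `turtle_literal` and the subtag `variety1` are hypothetical, and the use of the BCP 47 private-use singleton `x` is only one possible workaround for a missing ISO 639 code, not necessarily the solution the authors propose. The sketch also assumes a simplified check that accepts only 2- or 3-letter primary subtags.

```python
import re

# Simplifying assumption: accept only 2- or 3-letter primary language subtags.
_PRIMARY = re.compile(r"^[A-Za-z]{2,3}$")
# Private-use subtags are limited to 1-8 ASCII alphanumeric characters.
_PRIVATE = re.compile(r"^[A-Za-z0-9]{1,8}$")


def private_use_tag(primary, *private_subtags):
    """Build a language tag, optionally extended with private-use subtags.

    Hypothetical helper: joins the primary subtag, the singleton "x",
    and the given private-use subtags with hyphens.
    """
    if not _PRIMARY.match(primary):
        raise ValueError(f"unsupported primary subtag: {primary!r}")
    for subtag in private_subtags:
        if not _PRIVATE.match(subtag):
            raise ValueError(f"invalid private-use subtag: {subtag!r}")
    parts = [primary.lower()]
    if private_subtags:
        parts.append("x")
        parts.extend(s.lower() for s in private_subtags)
    return "-".join(parts)


def turtle_literal(text, tag):
    """Serialize a language-tagged string as a Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{tag}'


# "fro" is the ISO 639-3 code for Old French.
old_french = turtle_literal("chevalier", private_use_tag("fro"))

# "mis" is the ISO 639-3 code for uncoded languages; "variety1" is a
# placeholder private-use subtag, not a real identifier.
uncoded = turtle_literal("example", private_use_tag("mis", "variety1"))

print(old_french)
print(uncoded)
```

A private-use subtag is meaningful only by agreement between the parties exchanging the data, so a tag such as `mis-x-variety1` does not by itself identify a language to other consumers of the dataset.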
