A New Corpus-Driven Lexical Database for Lithuanian as a Foreign Language

Q3 Arts and Humanities
J. Kovalevskaite, Erika Rimkute
DOI: 10.2478/sm-2022-0007 (https://doi.org/10.2478/sm-2022-0007)
Journal: Sustainable Multilingualism, vol. 20, no. 1, pp. 154-193
Published: 2022-06-01
Publication type: Journal Article
Citations: 0

A New Corpus-Driven Lexical Database for Lithuanian as a Foreign Language
Summary. In this paper, we describe a new lexicographic resource for advanced learners of Lithuanian, the Lexical Database of Lithuanian Language Usage. It is the first attempt in Lithuanian lexicography to describe vocabulary on the basis of word usage analysis in a particular corpus. The written subpart of the Lithuanian Pedagogic Corpus (approx. 620,000 tokens) was used to develop headword lists and to collect word usage information in the form of corpus patterns. The database contains 3,700 lexical items: words and multi-word units (compounds, idioms or sayings). For the approx. 700 most frequent words from a shared vocabulary (words that appear in texts assigned to the A1, A2, B1 and B2 levels and occur 100 or more times in the whole corpus), we prepared full-record entries, which include sense-related corpus patterns with grammatical, semantic and lexical information, together with examples illustrating all pattern components. Short-record entries (no patterns, only examples) are prepared for the less frequent words from the shared vocabulary that are derivationally related to the most frequent headwords. Users are provided with 2,542 derivatives, which are linked to 940 headwords. In the database, 28,550 encoding examples were manually selected for all 3,000 headwords and 700 phrases. We discuss the features of the database and, in particular, the adopted semi-automated procedure of Corpus Pattern Analysis, which was used to describe word usage. We evaluate the approach, discuss its advantages for users, and suggest future improvements of the resource. The database can serve as an additional resource in the classroom of Lithuanian as a foreign language and, together with the available corpora, fill a gap in usage information in the existing (learner) dictionaries.
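
The selection criterion for full-record entries stated in the summary (words found in texts of the A1, A2, B1 and B2 levels with a corpus frequency of 100 or more) could be sketched roughly as follows. This is only an illustration, not the authors' procedure: the input format (a mapping from level to a list of lemmas), the function name `full_record_candidates`, and the reading of "shared vocabulary" as "attested at all four levels" are assumptions made for the example.

```python
from collections import Counter

LEVELS = ("A1", "A2", "B1", "B2")
MIN_FREQ = 100  # frequency threshold mentioned in the summary


def full_record_candidates(lemmas_by_level):
    """Return lemmas attested at every level whose total frequency is >= MIN_FREQ.

    lemmas_by_level: dict mapping a level name to a list of lemmas
    (a hypothetical input format chosen for this sketch).
    """
    # Count lemma occurrences separately for each level.
    counts = {lvl: Counter(lemmas_by_level.get(lvl, [])) for lvl in LEVELS}

    # Sum the per-level counts to get the frequency in the whole corpus.
    total = Counter()
    for level_counts in counts.values():
        total.update(level_counts)

    # Assumption: "shared vocabulary" = lemmas present at all four levels.
    shared = set.intersection(*(set(c) for c in counts.values()))

    return sorted(w for w in shared if total[w] >= MIN_FREQ)


if __name__ == "__main__":
    sample = {
        "A1": ["būti"] * 40 + ["namas"] * 5,
        "A2": ["būti"] * 30 + ["namas"] * 5,
        "B1": ["būti"] * 20 + ["namas"] * 5,
        "B2": ["būti"] * 10 + ["namas"] * 5 + ["retas"] * 200,
    }
    print(full_record_candidates(sample))
```

In this toy input, "namas" is shared across levels but too rare, and "retas" is frequent but attested at one level only, so neither would qualify under the assumed reading of the criterion.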
Source journal
Sustainable Multilingualism (Social Sciences: Linguistics and Language)
CiteScore: 0.50
Self-citation rate: 0.00%
Articles published: 10
Review time: 39 weeks