Pedagogic Corpus of Lithuanian: A New Resource for Learning and Teaching Lithuanian as a Foreign Language

Q3 Arts and Humanities

Sustainable Multilingualism. Pub Date: 2020-11-01. DOI: 10.2478/sm-2020-0019

J. Kovalevskaite, Erika Rimkute

Volume 17, Issue 1, pp. 197-230. Publication type: Journal Article.

Citations: 1

Abstract

This paper presents the first pedagogic corpus of Lithuanian, i.e., a monolingual specialized corpus prepared for learning and teaching Lithuanian in a foreign language classroom. The corpus was compiled as part of the project “Lithuanian Academic Scheme for International Cooperation in Baltic Studies”. It is motivated by the need for a resource that is representative, authentic, and relevant to learning and teaching Lithuanian, since the language represented in other existing corpora of Lithuanian (e.g., the Corpus of Contemporary Lithuanian, 140 million tokens) is known to be too complex for learning activities. The pedagogic corpus includes authentic Lithuanian texts selected by criteria such as learner-relevant communicative function and genre. Both spoken and written language are represented. The corpus contains 669,000 tokens: 111,000 tokens from texts and spoken language for levels A1–A2 and 558,000 tokens from texts and spoken language for levels B1–B2 (according to the CEFR, the Common European Framework of Reference for Languages). This paper discusses in detail the written subpart of the corpus (620,000 tokens), which includes levelled texts from coursebooks and unlevelled texts from other sources. Level labels were assigned automatically to the texts from other sources, and this text classification procedure is presented in the paper. The texts from coursebooks and other sources can be classified into 29 text types (dialogs, narratives, information, etc.) and 4 groups according to communicative aim: informational texts, educational texts, advertising, and fiction. Informational texts make up the largest part of the corpus. The three most frequent text types differ between coursebook texts and other sources: the most common coursebook texts are informational texts, narratives, and dialogs (approx. 78% of all coursebook texts). Texts from other sources are more diverse: approx. 73% of all texts in this subpart fall into 5 text types, namely subtitles, informational texts, educational texts, fiction, and advisory texts. Future work on making the pedagogic corpus available to learners, and its possible applications, are outlined in the closing remarks.


