Error Tagging in CroLTeC (a Computer Learner Corpus of Croatian as a Foreign Language)

IF 0.2 · LANGUAGE & LINGUISTICS

Rasprave Pub Date : 2020-10-30 DOI:10.31724/rihjj.46.2.24

Nives Mikelić Preradović

Rasprave 46/2 (2020), pp. 899–920


Abstract

The paper describes the error-tagging scheme developed for the CroLTeC learner corpus (http://nlp.ffzg.hr/resources/corpora/croltec/), the first electronic learner corpus of Croatian as a foreign language. CroLTeC contains essays collected from 755 students with 36 different mother tongues, the most prominent of which were Spanish, English, German, Polish, Chinese, French, and Arabic. It consists of 4,747 essays, of which 1,217 were born digital, while 3,530 were scanned, transcribed in RTF format, and converted into XML format. CroLTeC has a total of 1,054,287 tokens, and the essays were collected at all six levels of the Common European Framework of Reference for Languages (CEFR) at Croaticum – Center for Croatian as Second and Foreign Language at the Faculty of Humanities and Social Sciences in Zagreb, Department of Information Sciences, Natural Language Processing group. All CroLTeC essays contain metadata about the title, number, and type of essay (homework, part of an exam or field class, etc.). The data were lemmatized and annotated with morphosyntactic tags using the ReLDI tagger (Ljubešić et al., 2016). The corpus is also searchable by the learner's age, sex, language proficiency level, and mother tongue. The error-tagging scheme is partially based on Šolar (the scheme of the Developmental Corpus of Slovene) and on the error coding of the Cambridge Learner Corpus, and it is further tailored to the Croatian language. The goal of developing the error-tagging scheme is to build a sub-corpus that will serve as a repository of authentic data about the learner's interlanguage.

It should enable researchers and teachers of Croatian as a foreign language to explore the interlanguage, to discover the aspects of the grammar that are the most difficult to master, and to tailor teaching materials to different groups of learners (not only according to their Croatian language proficiency level but also according to their first language). Finally, the error-tagged sub-corpus should also serve as a starting point for designing computer-aided tools to correct lexical errors and the misuse of verb tenses, phrasal verbs, and collocations.

Keywords: learner corpora, CroLTeC, error tagging, error correction, normalization
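
As a rough illustration of how an error-tagged XML sub-corpus with learner metadata might be queried, the sketch below counts error categories, optionally restricted by proficiency level or first language. The abstract does not specify the CroLTeC XML schema, so the element and attribute names (`essay`, `err`, `level`, `l1`, `type`, `corr`), the error categories, the sample sentences, and the function `error_counts` are all assumptions invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data. The element and attribute names below are
# invented for illustration and do not reproduce the real CroLTeC schema.
SAMPLE = """<corpus>
  <essay id="e1" level="A2" l1="Spanish" type="homework">
    <s>Idem <err type="case" corr="u školu">u školi</err>.</s>
  </essay>
  <essay id="e2" level="B1" l1="English" type="exam">
    <s>Vidim <err type="case" corr="knjigu">knjiga</err>.</s>
    <s>To je nova <err type="spelling" corr="riječ">rijeć</err>.</s>
  </essay>
</corpus>"""


def error_counts(xml_text, level=None, l1=None):
    """Count error tags by category, optionally filtered by essay metadata."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for essay in root.iter("essay"):
        # Skip essays that do not match the requested metadata filters.
        if level is not None and essay.get("level") != level:
            continue
        if l1 is not None and essay.get("l1") != l1:
            continue
        for err in essay.iter("err"):
            counts[err.get("type")] += 1
    return counts


if __name__ == "__main__":
    print(error_counts(SAMPLE))                # all essays
    print(error_counts(SAMPLE, l1="Spanish"))  # one first-language group
```

Grouping counts by first language or CEFR level in this way is one simple means of comparing which error categories dominate for different learner groups, which is the kind of exploration the abstract describes.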

