Analysis of incorrect POS-tagging in student texts with linguistic errors in German

RESEARCH RESULT Theoretical and Applied Linguistics. Pub Date: 2022-09-30. DOI: 10.18413/2313-8912-2022-8-3-0-6

I. Kotiurova, L. Shchegoleva


Abstract

PACT, an electronic learner corpus of student texts in German, includes part-of-speech (POS) tagging, which is produced automatically with RFTagger. Because the corpus texts are written by students, they may contain errors of various kinds: grammatical, spelling, stylistic, and others. Sentences may be ill-formed and violate the rules and accepted norms of the language. Such errors can affect programs that process texts automatically and lead to incorrect tagging that has to be verified manually. The purpose of the study is to determine how strongly different kinds of errors in learner (non-authentic) texts influence the results of automatic part-of-speech tagging. Based on the expert error markup in the corpus texts, 11 types of errors that affect part-of-speech tagging quality were identified. For each type of error, ten sentences containing that error were selected from the corpus. The resulting pool of sentences was processed by two part-of-speech taggers, RFTagger and TreeTagger, and the parts of speech they assigned were compared with those determined manually by experts. The comparison revealed the following patterns: the taggers make mistakes when a student writes the undeclined form of an adjective instead of the declined one; writes a single word as separate parts; omits the suffix "-er" in possessive adjectives formed from geographical names; writes a noun with a lowercase initial letter; or writes a verb with a capital initial letter. For each case, the article analyzes the forms and causes of incorrect POS tagging, as well as the differences in how the two taggers behave. Taking these patterns into account will make the verification of POS tagging in the German learner corpus more efficient. The results will also be useful to developers of part-of-speech taggers.
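
The comparison step described in the abstract (tagger output checked against expert tags, grouped by error type) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the record layout, the error-type labels, the placeholder tagger names `tagger_a` and `tagger_b`, and the coarse tag set are all hypothetical, and the toy records are invented purely to show the computation. The sketch also assumes that the taggers' native tags have already been mapped to one shared coarse tag set before comparison.

```python
from collections import defaultdict


def mismatch_rates(samples):
    """Share of target tokens whose predicted tag differs from the expert tag,
    grouped by (error_type, tagger)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for sample in samples:
        for tagger, predicted in sample["predicted"].items():
            cell = counts[(sample["error_type"], tagger)]
            cell[0] += int(predicted != sample["expert"])
            cell[1] += 1
    return {key: mismatches / total for key, (mismatches, total) in counts.items()}


# Hypothetical toy records, NOT data from the study. Each record holds the
# erroneous token, the expert tag, and the tag assigned by each tagger.
samples = [
    {"error_type": "noun_lowercase", "token": "haus", "expert": "NOUN",
     "predicted": {"tagger_a": "ADJ", "tagger_b": "NOUN"}},
    {"error_type": "noun_lowercase", "token": "buch", "expert": "NOUN",
     "predicted": {"tagger_a": "NOUN", "tagger_b": "NOUN"}},
    {"error_type": "verb_capitalized", "token": "Gehen", "expert": "VERB",
     "predicted": {"tagger_a": "NOUN", "tagger_b": "NOUN"}},
]

rates = mismatch_rates(samples)
for (error_type, tagger), rate in sorted(rates.items()):
    print(f"{error_type:18s} {tagger:9s} {rate:.2f}")
```

Grouping by error type and tagger in one pass keeps the per-type rates directly comparable between the two taggers, which is the kind of contrast the abstract reports.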
