Distractor Generation for Lexical Questions Using Learner Corpus Data

Nikita Login
Journal of Linguistics/Jazykovedný casopis, vol. 9, no. 1, pp. 345–356
DOI: 10.2478/jazcas-2023-0051
Published: 2023-06-01
Citations: 0

Abstract

Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In the case of multiple-choice gap-fill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of this paper, is to generate distractors for the retrieved questions. The presented approach (called DisSelector) is based on supervised learning over specially annotated learner corpus data. For each sentence, a list of distractor candidates was retrieved, and each candidate was manually labelled as a plausible or implausible distractor. The derived set of examples was additionally filtered by a set of lexical and grammatical rules and then split into training and testing subsets in a 4:1 ratio. Several classification models, including classical machine learning algorithms and gradient boosting implementations, were trained on the data. Word and sentence vectors from language models, together with corpus word frequencies, were used as input features for the classifiers. The highest F1-score (0.72) was attained by an XGBoost model. Various configurations of DisSelector showed improvements over the unsupervised baseline in both automatic and expert evaluation. DisSelector was integrated into the open-source language testing platform LangExBank as a microservice with a REST API.
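The data-preparation pipeline described in the abstract can be sketched in a few lines: each labelled distractor candidate becomes a feature vector combining word and sentence embeddings with a corpus frequency, and the example set is split 4:1 into training and testing subsets. This is a minimal illustrative sketch only; the function names, feature dimensions, and toy data are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the DisSelector data pipeline. All names and
# dimensions are illustrative, not taken from the paper's code.
import random

def build_example(word_vec, sent_vec, corpus_freq, label):
    """Concatenate word/sentence embeddings with a frequency feature."""
    return (word_vec + sent_vec + [corpus_freq], label)

def split_4_to_1(examples, seed=0):
    """Shuffle and split labelled examples into train/test subsets (4:1)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

# Toy data: 10 candidates, 3-dim word vectors, 3-dim sentence vectors,
# one frequency feature, binary plausibility label.
examples = [
    build_example([0.1 * i] * 3, [0.2 * i] * 3, i / 10.0, i % 2)
    for i in range(10)
]
train_set, test_set = split_4_to_1(examples)
print(len(train_set), len(test_set))  # 8 2
```

On top of feature vectors like these, any binary classifier (the abstract mentions classical algorithms and gradient boosting, with XGBoost performing best) can be trained to score candidates as plausible or implausible distractors.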