Down-sampling from hierarchically structured corpus data

IF 1.6 · CAS Zone 2 (Literature) · LANGUAGE & LINGUISTICS

International Journal of Corpus Linguistics · Pub Date: 2024-03-25 · DOI: 10.1075/ijcl.23079.son

Lukas Sönning

Citations: 0

Abstract

Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
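The two selection schemes described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the field names `author` and `verb`, the toy data, and both function names are hypothetical. It also assumes one possible reading of "inversely proportional to the author- and verb-specific token count", namely a weight of 1 / (tokens by that author × tokens of that verb). Weighted sampling without replacement is done here with exponential sort keys, so inclusion probabilities follow the weights only approximately.

```python
import random
from collections import Counter


def simple_downsample(tokens, n, seed=None):
    """Simple down-sampling: every hit has the same selection probability."""
    return random.Random(seed).sample(tokens, n)


def structured_downsample(tokens, n, seed=None):
    """Structured down-sampling (illustrative reading, see assumptions above).

    Each token gets the weight 1 / (author count * verb count), so hits from
    prolific authors and frequent verbs are less likely to be selected.
    """
    rng = random.Random(seed)
    per_author = Counter(t["author"] for t in tokens)
    per_verb = Counter(t["verb"] for t in tokens)

    keyed = []
    for i, t in enumerate(tokens):
        weight = 1.0 / (per_author[t["author"]] * per_verb[t["verb"]])
        # An Exp(1) draw divided by the weight; the n smallest keys form a
        # weighted sample without replacement.
        keyed.append((rng.expovariate(1.0) / weight, i))

    keyed.sort()
    return [tokens[i] for _, i in keyed[:n]]


# Hypothetical toy data: author A contributes 900 hits, author B only 100.
tokens = (
    [{"author": "A", "verb": "have"} for _ in range(900)]
    + [{"author": "B", "verb": "have"} for _ in range(100)]
)

simple = simple_downsample(tokens, 100, seed=1)
structured = structured_downsample(tokens, 100, seed=1)

print("B hits, simple:    ", sum(t["author"] == "B" for t in simple))
print("B hits, structured:", sum(t["author"] == "B" for t in structured))
```

In this toy setting author B supplies a tenth of all hits, so simple down-sampling should return roughly that share, whereas the structured scheme draws a noticeably larger share from the less prolific author.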


Source journal

International Journal of Corpus Linguistics Multiple-

CiteScore: 3.30

Self-citation rate: 0.00%

About the journal: The International Journal of Corpus Linguistics (IJCL) publishes original research covering methodological, applied and theoretical work in any area of corpus linguistics. Through its focus on empirical language research, IJCL provides a forum for the presentation of new findings and innovative approaches in any area of linguistics (e.g. lexicology, grammar, discourse analysis, stylistics, sociolinguistics, morphology, contrastive linguistics), applied linguistics (e.g. language teaching, forensic linguistics), and translation studies. Based on its interest in corpus methodology, IJCL also invites contributions on the interface between corpus and computational linguistics.