Down-sampling from hierarchically structured corpus data

Lukas Sönning
International Journal of Corpus Linguistics. Published 2024-03-25. DOI: 10.1075/ijcl.23079.son (https://doi.org/10.1075/ijcl.23079.son)

Abstract

Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
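
The two selection schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not say how the author-specific and verb-specific counts are combined, so the weight used here, 1 / (author token count × verb token count), is an assumption, as are the field names "author" and "verb" and the toy data. The weighted draw uses exponential keys, a standard device for weighted sampling without replacement; under it, inclusion probabilities follow the weights only approximately.

```python
import random
from collections import Counter


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every hit has the same selection probability."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Weighted draw of k hits without replacement.

    ASSUMPTION (not from the paper): a hit's weight is
    1 / (token count of its author * token count of its verb).
    """
    n_author = Counter(t["author"] for t in tokens)
    n_verb = Counter(t["verb"] for t in tokens)
    keyed = []
    for i, t in enumerate(tokens):
        w = 1.0 / (n_author[t["author"]] * n_verb[t["verb"]])
        # Exponential key with rate w: hits with larger weights tend to get smaller keys.
        keyed.append((rng.expovariate(1.0) / w, i))
    keyed.sort()
    # Keep the k hits with the smallest keys.
    return [tokens[i] for _, i in keyed[:k]]


def minor_hits(sample):
    """Count sampled hits that do not come from the prolific author A0."""
    return sum(t["author"] != "A0" for t in sample)


# Hypothetical toy data: one prolific author (1,000 hits), 20 authors with one hit each.
verbs = ["have", "do"]
tokens = [{"author": "A0", "verb": verbs[i % 2]} for i in range(1000)]
tokens += [{"author": f"A{j}", "verb": verbs[j % 2]} for j in range(1, 21)]

rng = random.Random(1)
simple = simple_downsample(tokens, 20, rng)
structured = structured_downsample(tokens, 20, rng)
print(minor_hits(simple), minor_hits(structured))
```

In this toy setting, the simple scheme mostly returns hits from the prolific author, while the structured scheme spreads the sample across the authors with few hits.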

About the journal: The International Journal of Corpus Linguistics (IJCL) publishes original research covering methodological, applied and theoretical work in any area of corpus linguistics. Through its focus on empirical language research, IJCL provides a forum for the presentation of new findings and innovative approaches in any area of linguistics (e.g. lexicology, grammar, discourse analysis, stylistics, sociolinguistics, morphology, contrastive linguistics), applied linguistics (e.g. language teaching, forensic linguistics), and translation studies. Based on its interest in corpus methodology, IJCL also invites contributions on the interface between corpus and computational linguistics.