EMTeC：机器生成文本的眼球运动语料库。

IF 4.6 2区心理学 Q1 PSYCHOLOGY, EXPERIMENTAL

Behavior Research Methods Pub Date : 2025-06-03 DOI:10.3758/s13428-025-02677-4

Lena S Bolliger, Patrick Haller, Isabelle C R Cretton, David R Reich, Tannon Kew, Lena A Jäger

{"title":"EMTeC：机器生成文本的眼球运动语料库。","authors":"Lena S Bolliger, Patrick Haller, Isabelle C R Cretton, David R Reich, Tannon Kew, Lena A Jäger","doi":"10.3758/s13428-025-02677-4","DOIUrl":null,"url":null,"abstract":"The Eye movements on Machine-generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text-type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing, and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/ .","PeriodicalId":8717,"journal":{"name":"Behavior Research Methods","volume":"57 7","pages":"189"},"PeriodicalIF":4.6000,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12134054/pdf/","citationCount":"0","resultStr":"{\"title\":\"EMTeC: A corpus of eye movements on machine-generated texts.\",\"authors\":\"Lena S Bolliger, Patrick Haller, Isabelle C R Cretton, David R Reich, Tannon Kew, Lena A Jäger\",\"doi\":\"10.3758/s13428-025-02677-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Eye movements on Machine-generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text-type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing, and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/ .\",\"PeriodicalId\":8717,\"journal\":{\"name\":\"Behavior Research Methods\",\"volume\":\"57 7\",\"pages\":\"189\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-06-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12134054/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Behavior Research Methods\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://doi.org/10.3758/s13428-025-02677-4\",\"RegionNum\":2,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PSYCHOLOGY, EXPERIMENTAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Behavior Research Methods","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.3758/s13428-025-02677-4","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, EXPERIMENTAL","Score":null,"Total":0}

引用次数: 0

摘要

机器生成文本语料库（EMTeC）是107个母语为英语的人阅读机器生成文本时的自然眼动语料库。文本由三个大型语言模型使用五种不同的解码策略生成，它们分为六种不同的文本类型类别。EMTeC需要预处理所有阶段的眼动数据，即2000 Hz采样的原始坐标数据、注视序列和阅读测量。它进一步提供了原始和修正版本的固定序列，考虑到垂直校准漂移。此外，语料库还包括语言模型的内部结构，这些结构构成了刺激文本生成的基础：过渡分数、注意分数和隐藏状态。在文本和词的层次上对刺激进行了一系列的语言特征注释。我们预计EMTeC将用于各种用例，例如，但不限于，研究机器生成文本的阅读行为和不同解码策略的影响；不同文本类型下的阅读行为；开发新的预处理、数据滤波和漂移校正算法；语言模型的认知可解释性及其增强以及惊喜和熵对人类阅读时间的预测能力的评估。所有预处理阶段的数据，模型内部，以及重现刺激生成，数据预处理和分析的代码都可以通过https://github.com/DiLi-Lab/EMTeC/访问。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

EMTeC: A corpus of eye movements on machine-generated texts.

The Eye movements on Machine-generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text-type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing, and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/ .

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Behavior Research Methods Multiple-

CiteScore

10.30

自引率

9.30%

发文量

266

期刊介绍： Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.