中文分词协议语料库。

IF 4.6 2区心理学 Q1 PSYCHOLOGY, EXPERIMENTAL

Behavior Research Methods Pub Date : 2024-12-28 DOI:10.3758/s13428-024-02528-8

Yiu-Kei Tsang, Ming Yan, Jinger Pan, Megan Yin Kan Chan

{"title":"中文分词协议语料库。","authors":"Yiu-Kei Tsang, Ming Yan, Jinger Pan, Megan Yin Kan Chan","doi":"10.3758/s13428-024-02528-8","DOIUrl":null,"url":null,"abstract":"The absence of explicit word boundaries is a distinctive characteristic of Chinese script, setting it apart from most alphabetic scripts, leading to word boundary disagreement among readers. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. In order to advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., \"-\"), modifier-head type (e.g., \"-\"), and others (e.g., \"-\"). Finally, the agreement scores also significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.","PeriodicalId":8717,"journal":{"name":"Behavior Research Methods","volume":"57 1","pages":"25"},"PeriodicalIF":4.6000,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11682008/pdf/","citationCount":"0","resultStr":"{\"title\":\"A corpus of Chinese word segmentation agreement.\",\"authors\":\"Yiu-Kei Tsang, Ming Yan, Jinger Pan, Megan Yin Kan Chan\",\"doi\":\"10.3758/s13428-024-02528-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The absence of explicit word boundaries is a distinctive characteristic of Chinese script, setting it apart from most alphabetic scripts, leading to word boundary disagreement among readers. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. In order to advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., \\\"-\\\"), modifier-head type (e.g., \\\"-\\\"), and others (e.g., \\\"-\\\"). Finally, the agreement scores also significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.\",\"PeriodicalId\":8717,\"journal\":{\"name\":\"Behavior Research Methods\",\"volume\":\"57 1\",\"pages\":\"25\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-12-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11682008/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Behavior Research Methods\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://doi.org/10.3758/s13428-024-02528-8\",\"RegionNum\":2,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PSYCHOLOGY, EXPERIMENTAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Behavior Research Methods","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.3758/s13428-024-02528-8","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, EXPERIMENTAL","Score":null,"Total":0}

引用次数: 0

摘要

没有明确的词边界是汉字的一个显著特征，使其与大多数字母文字不同，导致读者对词边界的分歧。之前的研究已经研究了这一特征如何影响阅读表现。然而，需要进一步的调查，以产生更多的生态有效和可推广的发现。为了进一步了解词边界对汉语阅读的影响，我们引入了汉语分词协议语料库。该语料库由500个句子组成，包含9813个字符标记和1590个字符类型，并提供每个字符位置的分词协议数据。数据显示，整体细分一致性很高（92%）。然而，在8.96%的情况下，参与者不同意单词边界的位置。此外，大约85%的句子至少包含一个模棱两可的词边界。将歧义程度较高的字符串初步分为三类，即形态句法类型（如“-”）、修饰语头部类型（如“-”）和其他类型（如“-”）。最后，通过对已发表的眼动数据的分析可以证明，一致性得分也显著影响了阅读行为。具体来说，高度的不一致与较长的单次注视持续时间有关。我们讨论了这些结果的意义，并强调了CWSA语料库如何促进汉语阅读中的分词研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A corpus of Chinese word segmentation agreement.

The absence of explicit word boundaries is a distinctive characteristic of Chinese script, setting it apart from most alphabetic scripts, leading to word boundary disagreement among readers. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. In order to advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., "-"), modifier-head type (e.g., "-"), and others (e.g., "-"). Finally, the agreement scores also significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Behavior Research Methods Multiple-

CiteScore

10.30

自引率

9.30%

发文量

266

期刊介绍： Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.