Language Universals in Sentence Length: Comparing Sentence Length Distributions of 10 Languages

IF 2.4, Zone 2 (Psychology), Q2 PSYCHOLOGY, EXPERIMENTAL

Cognitive Science, Pub Date: 2025-09-23, DOI: 10.1111/cogs.70115

Yikai Zhou, Jingyang Jiang, Haitao Liu

Cognitive Science, vol. 49, no. 9. Journal Article. https://onlinelibrary.wiley.com/doi/10.1111/cogs.70115

Citations: 0

Abstract


Language Universals in Sentence Length: Comparing Sentence Length Distributions of 10 Languages


Sentence length reflects cognitive constraints and stylistic decisions about speech and text segmentation for effective communication, but whether sentence length distributions follow universal patterns across languages and genres remains unclear. This study investigates whether sentence lengths and sub-sentence lengths—defined as the number of words between sentence-ending punctuation marks and between adjacent punctuation marks—follow a unified probabilistic distribution across languages, whether this reflects linguistic genealogy, and whether the distribution is affected by genre. Given the links between sentence length, cognitive constraints, and stylistic decisions, we predicted that sentence and sub-sentence lengths would follow a unified probabilistic distribution across languages, modulated by linguistic genealogy and genre. Analyzing news texts in 10 languages, we found that sentence and sub-sentence length distributions both conform to a probabilistic model, the Extended Positive Negative Binomial distribution, which was previously shown to capture sentence length distributions in certain languages. To assess whether these differences align with linguistic typology, we performed cluster analysis based on mean length and distribution parameters, with results mirroring known linguistic genealogical relationships. To examine the genre effects, we analyzed sentence and sub-sentence length distributions across three written genres in English and Chinese. Generalized linear models revealed systematic influences of both genre and language, but with varying results on different linguistic levels: genre accounted for more variance in sentence-level metrics, whereas language exerted stronger effects at the sub-sentence level. Sentence and sub-sentence length distributions reflect a universal probabilistic pattern in punctuation-based sentence segmentation, influenced by cognitive constraints and genre-driven adaptability across languages.
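The two length measures defined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the punctuation sets `SENTENCE_END` and `ANY_PUNCT`, the helper names, and the whitespace tokenization are all assumptions made for illustration. Whitespace splitting in particular would not work for Chinese, which needs a word segmenter.

```python
import re
from collections import Counter

# Assumed punctuation inventories (illustrative only; the abstract does not list them).
SENTENCE_END = ".!?。！？"
ANY_PUNCT = SENTENCE_END + ",;:，；：、"


def segment_lengths(text, delimiters):
    """Return the word count of each non-empty segment between delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    segments = re.split(pattern, text)
    # Whitespace tokenization is an assumption that only suits space-delimited scripts.
    return [len(seg.split()) for seg in segments if seg.split()]


def length_distribution(lengths):
    """Map each observed length to its relative frequency."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


sample = (
    "The cat sat on the mat, and the dog barked. It rained! "
    "We stayed home, read books, and slept."
)

# Sentence length: words between sentence-ending punctuation marks.
sentence_lengths = segment_lengths(sample, SENTENCE_END)
# Sub-sentence length: words between any two adjacent punctuation marks.
sub_sentence_lengths = segment_lengths(sample, ANY_PUNCT)

print(sentence_lengths)       # [10, 2, 7]
print(sub_sentence_lengths)   # [6, 4, 2, 3, 2, 2]
print(length_distribution(sub_sentence_lengths))
```

The empirical frequencies returned by `length_distribution` are the kind of input one would then compare against a candidate probability model, such as the distribution named in the abstract.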


Source journal

Cognitive Science (PSYCHOLOGY, EXPERIMENTAL)

CiteScore: 4.10

Self-citation rate: 8.00%

Articles published: 139

About the journal: Cognitive Science publishes articles in all areas of cognitive science, covering such topics as knowledge representation, inference, memory processes, learning, problem solving, planning, perception, natural language understanding, connectionism, brain theory, motor control, intentional systems, and other areas of interdisciplinary concern. Highest priority is given to research reports that are specifically written for a multidisciplinary audience. The audience is primarily researchers in cognitive science and its associated fields, including anthropologists, education researchers, psychologists, philosophers, linguists, computer scientists, neuroscientists, and roboticists.