Language Universals in Sentence Length: Comparing Sentence Length Distributions of 10 Languages
Authors: Yikai Zhou, Jingyang Jiang, Haitao Liu
Journal: Cognitive Science, 49(9), published 2025-09-23
DOI: 10.1111/cogs.70115
URL: https://onlinelibrary.wiley.com/doi/10.1111/cogs.70115
Sentence length reflects cognitive constraints and stylistic decisions about speech and text segmentation for effective communication, but whether sentence length distributions follow universal patterns across languages and genres remains unclear. This study investigates whether sentence lengths and sub-sentence lengths—defined as the number of words between sentence-ending punctuation marks and between adjacent punctuation marks—follow a unified probabilistic distribution across languages, whether this reflects linguistic genealogy, and whether the distribution is affected by genre. Given the links between sentence length, cognitive constraints, and stylistic decisions, we predicted that sentence and sub-sentence lengths would follow a unified probabilistic distribution across languages, modulated by linguistic genealogy and genre. Analyzing news texts in 10 languages, we found that sentence and sub-sentence length distributions both conform to a probabilistic model, the Extended Positive Negative Binomial distribution, which was previously shown to capture sentence length distributions in certain languages. To assess whether these differences align with linguistic typology, we performed cluster analysis based on mean length and distribution parameters, with results mirroring known linguistic genealogical relationships. To examine the genre effects, we analyzed sentence and sub-sentence length distributions across three written genres in English and Chinese. Generalized linear models revealed systematic influences of both genre and language, but with varying results on different linguistic levels: genre accounted for more variance in sentence-level metrics, whereas language exerted stronger effects at the sub-sentence level. Sentence and sub-sentence length distributions reflect a universal probabilistic pattern in punctuation-based sentence segmentation, influenced by cognitive constraints and genre-driven adaptability across languages.
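The abstract defines sentence length as the number of words between sentence-ending punctuation marks, and sub-sentence length as the number of words between adjacent punctuation marks. The following is a minimal sketch of that counting step, not the authors' pipeline. The punctuation sets, the whitespace tokenization, and the names `SENTENCE_END`, `ALL_PUNCT`, `segment_lengths`, and `length_distribution` are all assumptions made for illustration. A language written without spaces, such as Chinese, would need a word segmenter in place of `str.split`.

```python
import re
from collections import Counter

# Assumed punctuation inventories; the abstract does not list the exact marks used.
SENTENCE_END = ".!?。！？"
ALL_PUNCT = SENTENCE_END + ",;:，；：、"


def segment_lengths(text, delimiters):
    """Return the word count of each non-empty segment between delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    segments = re.split(pattern, text)
    # Whitespace tokenization is a simplification (hypothetical choice):
    # abbreviations and decimals such as "3.5" would be split incorrectly.
    return [len(seg.split()) for seg in segments if seg.split()]


def length_distribution(lengths):
    """Return the empirical relative frequency of each observed length."""
    total = len(lengths)
    counts = Counter(lengths)
    return {n: counts[n] / total for n in sorted(counts)}


sample = "The cat sat, and the dog barked. Nobody moved! Why, then, did it rain?"

# Sentence lengths: split only at sentence-ending marks.
sentence_lengths = segment_lengths(sample, SENTENCE_END)
# Sub-sentence lengths: split at every punctuation mark.
sub_sentence_lengths = segment_lengths(sample, ALL_PUNCT)

print(sentence_lengths)       # [7, 2, 5]
print(sub_sentence_lengths)   # [3, 4, 2, 1, 1, 3]
print(length_distribution(sub_sentence_lengths))
```

The empirical frequencies returned by `length_distribution` are the kind of input to which a candidate model, such as the Extended Positive Negative Binomial distribution named in the abstract, would then be fitted. The fitting step is omitted here because the abstract does not give the model's parameterization.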
About the journal:
Cognitive Science publishes articles in all areas of cognitive science, covering such topics as knowledge representation, inference, memory processes, learning, problem solving, planning, perception, natural language understanding, connectionism, brain theory, motor control, intentional systems, and other areas of interdisciplinary concern. Highest priority is given to research reports that are specifically written for a multidisciplinary audience. The audience is primarily researchers in cognitive science and its associated fields, including anthropologists, education researchers, psychologists, philosophers, linguists, computer scientists, neuroscientists, and roboticists.