Word Length in Chinese: The Menzerath-Altmann Law is Valid After All

IF 0.7; Zone 2 (Literature); JCR 0; LANGUAGE & LINGUISTICS
Tereza Motalová, Ján Mačutek, Radek Čech
Journal of Quantitative Linguistics. Journal Article. Published 2023-11-06. DOI: 10.1080/09296174.2023.2259937 (https://doi.org/10.1080/09296174.2023.2259937)
Citations: 0

Abstract

According to the Menzerath-Altmann law, longer language constructs consist, on average, of shorter constituents. It is most often studied at the level of words and syllables (the mean syllable length decreases as word length increases). Its validity at this level has been corroborated in several languages. However, it has been claimed that Chinese is an exception with respect to the validity of the Menzerath-Altmann law. We show that the law is valid if word types are considered, while the behaviour of word tokens is different. This difference can be explained by the fact that the Zipf law of abbreviation is valid not only for words but also for syllables (shorter syllables are used more frequently).

KEYWORDS: word length; Menzerath-Altmann law; Chinese; syllable; Chinese characters

Acknowledgments

The work was supported by the European Regional Development Fund Project “Sinophone Borderlands – Interaction at the Edges”, CZ.02.1.01/0.0/0.0/16_019/0000791 (T. Motalová), VEGA 2/0096/21 (J. Mačutek), APVV-21-0216 (J. Mačutek), and Operational Programme Integrated Infrastructure (OPII) for the project 313011BWH2: “InoCHF – Research and development in the field of innovative technologies in the management of patients with CHF”, co-financed by the European Regional Development Fund (J. Mačutek).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. A more general formula with an additional parameter c, $y(x) = a x^{b} e^{cx}$, is sometimes used; see e.g. Mačutek et al. (2019).
2. The MAL has also found its place in research areas outside of human language, e.g. music (Boroda & Altmann, 1991), animal communication (Gustison et al., 2016), and genome structure (Ferrer-I-Cancho et al., 2014). The ‘common denominator’ of these branches of science is that they study information flow (in a very general sense).
3. Syllable length was measured in moras, not in phonemes.
4. In some of the papers cited in this paragraph, the mean syllable length is expressed in the number of graphemes rather than phonemes. The mean syllable length is quite similar for both choices in languages with shallow orthographies (Coulmas, 2002).
5. Erization is the addition of the r-suffix (儿) to a syllable, e.g. 花 huā becomes 花儿 huār (‘flower’). Moreover, there are a few isolated exceptions of polysyllabic characters in Chinese. Qiu (2000, pp. 26, 406) mentions 瓩 qiānwǎ ‘kilowatt’, 浬 hǎilǐ ‘nautical mile’, and 哩 yīnglǐ ‘English mile’ (none of these words occurs in our language material).
6. Xin Han-Da cidian – Das neue Chinesisch-Deutsche Wörterbuch, 1985. Commercial Press, Beijing.
7. In fact, one can speak about phonological words here; see e.g. Hall (1999) or Zsiga (2013, pp. 342–346). Thus, this approach can be considered a study of the MAL on the level of words, albeit from a slightly different perspective.
8. Lengths of stress units ranged between 1 and 18 syllables, while lengths of rhythmic segments ranged between 1 and 7 syllables (Ščigulinská & Schusterová, 2014, pp. 70–72, 77).
9. Kovaľová and Schusterová (2016, pp. 122–133) reported lengths of stress units between 1 and 21 syllables, similarly to Rothe-Neves et al. (2017, p. 6), who reported lengths of utterances between 2 and 29 syllables. On the other hand, Geršić and Altmann (1980, pp. 115–123) tested the law on word lengths only up to 5 syllables.
10. https://www.fon.hum.uva.nl/praat/ (accessed 1 June 2023).
11. Recall that Stave et al. (2021) study the relation between word length in morphemes and the mean morpheme length in graphemes.
12. https://www.wordproject.org/ (accessed 1 June 2023).
13. International Biblical Association. Wordproject®: Sheng Jing: Xīnyuē Quán Shū [Holy Bible. New Testament]. Available at https://www.wordproject.org/bibles/pn/index.htm (accessed 1 June 2023).
14. International Biblical Association. Wordproject®: 圣经. 新约全书 [Holy Bible. New Testament]. Available at https://www.wordproject.org/bibles/gb_cat/index.htm (accessed 1 June 2023).
15. Available at https://github.com/tsroten/pynlpir (accessed 1 June 2023).
16. Available at https://github.com/NLPIR-team/NLPIR (accessed 1 June 2023).
17. Available at http://bcc.blcu.edu.cn/downloads/resources/%E6%B1%89%E5%AD%97%E4%BF%A1%E6%81%AF%E8%AF%8D%E5%85%B8.zip (accessed 1 June 2023).
18. Available at https://github.com/mozillazg/python-pinyin (accessed 23 July 2023).
19. http://www.nlreg.com (accessed June 2023).
20. Naturally, this requirement is another rule of thumb. See e.g. Mačutek and Rovenchak (2011) and Mačutek et al. (2021) for similar, but slightly different approaches to the problem of word length categories with too low frequencies.
21. If, e.g., we measure word length in syllables, and lengths from 1 to 5 occur more than 10 times, length 6 has frequency 12, and length 7 has frequency 1, we pool the last two lengths into one category. The weighted mean word length in this category is $\frac{12 \times 6 + 1 \times 7}{12 + 1} \approx 6.08$; see data in Table 1.
22. We also obtained comparable results for the relation between word length and the mean syllable length for Pīnyīn Rìjì Duǎnwén, a diary written by Zhang Qiling (available at http://www.pinyin.info/readings/pinyin_riji_duanwen.html, accessed 1 June 2023), and for a sample containing Press reportage (text category A) and Science academic prose (text category J) from The Lancaster Corpus of Mandarin Chinese (McEnery et al., 2003). Similarly to Table 1 and Figure 1, there is a decreasing tendency of the mean syllable length, with a slight increase for the longest words.
23. We also obtained comparable results for the relation between word length in Chinese characters and the mean character size in components and strokes, respectively, for the short story 我为什么要结婚 [Why do I want to get married] from the short story collection 黄昏里的男孩 [The boy in the dusk] written by Yu Hua (2012), as well as for a sample containing Press reportage (text category A) and Science academic prose (text category J) from The Lancaster Corpus of Mandarin Chinese (McEnery et al., 2003).
24. Words consisting of one, two, and three syllables make up 99.7% of all word tokens in the Chinese translation of the New Testament; see Table 1.
25. Given the wide scope of the least effort principle (see Zipf, 1949), easier-to-pronounce tones probably occur more frequently (see Zhang, 2002). Tone characteristics can also interact with other word properties, e.g. longer words can have a higher proportion of simpler tones than shorter ones.
26. According to Berdicevskis (2021, p. 27), ‘clauses are not repeated in languages often enough to enable frequency estimates’.

Additional information

Funding

This work was supported by the Agentúra na Podporu Výskumu a Vývoja [APVV-21-0216]; European Regional Development Fund [CZ.02.1.01/0.0/0.0/16_019/0000791]; Operational Programme Integrated Infrastructure (OPII) [313011BWH2]; Vedecká Grantová Agentúra MŠVVaŠ SR a SAV [2/0096/21].
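The pooling rule in note 21 and the extended formula in note 1 can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `pool_tail` and `mal`, the threshold `min_freq=10`, and the frequencies for lengths 1 to 5 are assumptions made for illustration; only the frequencies 12 and 1 for lengths 6 and 7 come from note 21.

```python
import math


def pool_tail(freq_by_length, min_freq=10):
    """Pool the longest word-length categories until the last one is frequent enough.

    freq_by_length maps a word length to its frequency.
    min_freq is an assumed threshold, not a value confirmed by the source.
    Returns a list of (weighted_mean_length, frequency) pairs ordered by length.
    """
    cats = [(float(length), freq) for length, freq in sorted(freq_by_length.items())]
    while len(cats) > 1 and cats[-1][1] < min_freq:
        (len_a, freq_a), (len_b, freq_b) = cats[-2], cats[-1]
        total = freq_a + freq_b
        # Frequency-weighted mean length of the merged category.
        merged_length = (len_a * freq_a + len_b * freq_b) / total
        cats[-2:] = [(merged_length, total)]
    return cats


def mal(x, a, b, c=0.0):
    """Evaluate y(x) = a * x**b * exp(c * x); c = 0 gives the two-parameter form."""
    return a * x ** b * math.exp(c * x)


# Frequencies for lengths 1 to 5 are hypothetical; lengths 6 and 7 follow note 21.
frequencies = {1: 500, 2: 900, 3: 300, 4: 80, 5: 25, 6: 12, 7: 1}
pooled = pool_tail(frequencies)
print(round(pooled[-1][0], 2), pooled[-1][1])  # 6.08 13
```

The pooled categories, with their weighted mean lengths as x values, would then serve as the data points to which a curve such as `mal` is fitted; the fitting step itself is left out of this sketch.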
Source journal metrics: CiteScore 2.90; self-citation rate 7.10%; articles published: 7
About the journal: The Journal of Quantitative Linguistics is an international forum for the publication and discussion of research on the quantitative characteristics of language and text in an exact mathematical form. This approach, which is of growing interest, opens up important and exciting theoretical perspectives, as well as solutions for a wide range of practical problems such as machine learning or statistical parsing, by introducing into linguistics the methods and models of advanced scientific disciplines such as the natural sciences, economics, and psychology.