Complex systems approach to natural language

IF 23.9 · Zone 1, Physics and Astrophysics · JCR Q1, PHYSICS, MULTIDISCIPLINARY
Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień
Physics Reports, volume 1053, pages 1-84. Published 2023-12-22.
DOI: 10.1016/j.physrep.2023.12.002
Source: https://www.sciencedirect.com/science/article/pii/S0370157323004076
Citations: 0

Abstract

The science of complexity aims to answer the question of what rules nature chooses when assembling the basic constituents of matter and energy into structures and dynamical patterns that cascade through the entire hierarchy of scales in the Universe. A related phenomenon, natural language, can successfully mirror such structures, as reflected by its ability to encode and transmit information about them and among them. It is thus legitimate to expect that natural language carries the essence of complexity. And indeed, in human speech and writing it is particularly true that more is different. Natural language thus deserves a central place in the related quantitative study within the science of complexity.

With this in mind, the present review summarizes the main methodological concepts used in this domain and documents their applicability and utility in identifying universal as well as system-specific features of natural language in its written representation in several major Western languages. In particular, three main complexity-related current research trends in quantitative linguistics are exhaustively covered. The first part addresses the issue of word frequencies in texts and, in particular, demonstrates that taking punctuation into consideration largely restores the scaling whose violation in Zipf’s law for the most frequent words is commonly modelled by the so-called Mandelbrot correction. The second part introduces methods inspired by time series analysis, used in studying various kinds of long-range correlations in written texts. The related time series are generated by partitioning the text into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems: the presence of long-range correlations along with fractal or even multifractal structures. Moreover, it appears that the distances between consecutive punctuation marks comply, quite universally across languages, with the discrete variant of the Weibull distribution, which often appears in survival analysis. In the third part, the application of the network formalism to natural language is reviewed, particularly in the context of word-adjacency networks whose structure reflects word co-occurrence in texts. Various parameters characterizing the topology of such networks can be used for classification of texts, for example, from a stylometric perspective. The network approach can also be applied in semantic analysis to represent a hierarchy of words and associations between them based on their meaning. The structure of such networks turns out to be significantly different from that observed in random networks, revealing genuine properties of language. Finally, punctuation appears to have a significant impact not only on the language’s information-carrying ability but also on its key statistical properties; hence it seems advisable to consider punctuation marks on a par with words.
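
The three kinds of data described in the abstract (a rank-frequency list that counts punctuation marks as tokens, a series of distances between consecutive punctuation marks, and a word-adjacency network) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the punctuation set, the tokenizing regular expression, and the choice to measure distance as the number of words preceding each mark are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed punctuation set for illustration; the review may use a different one.
PUNCTUATION = set(".,;:!?")


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"[^\W\d_]+|[.,;:!?]", text.lower())


def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first.

    Punctuation marks are ranked together with words.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tok, n) for rank, (tok, n) in enumerate(ordered, start=1)]


def punctuation_distances(tokens):
    """Return the number of words preceding each punctuation mark.

    The count restarts after every mark (and at the start of the text),
    so the result is a series of phrase lengths measured in words.
    """
    distances = []
    run = 0
    for tok in tokens:
        if tok in PUNCTUATION:
            distances.append(run)
            run = 0
        else:
            run += 1
    return distances


def adjacency_edges(tokens):
    """Count directed edges between consecutive tokens.

    Each distinct token is a node; the weight of an edge is the number of
    times the second token directly follows the first one in the text.
    """
    return Counter(zip(tokens, tokens[1:]))


sample = "The cat sat, and the dog ran. The cat slept."
tokens = tokenize(sample)
ranks = rank_frequency(tokens)
distances = punctuation_distances(tokens)
edges = adjacency_edges(tokens)
```

The rank-frequency list is the input for checking Zipf-type scaling, the distance series is the kind of data one would compare with a discrete Weibull distribution or pass to a correlation analysis, and the edge counts define a weighted, directed word-adjacency network. Fitting and significance testing are deliberately left out of this sketch.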

Source journal: Physics Reports (Physics: Multidisciplinary)
CiteScore: 56.10
Self-citation rate: 0.70%
Articles published: 102
Review time: 9.1 weeks
Journal description: Physics Reports keeps the active physicist up-to-date on developments in a wide range of topics by publishing timely reviews which are more extensive than just literature surveys but normally less than a full monograph. Each report deals with one specific subject and is generally published in a separate volume. These reviews are specialist in nature but contain enough introductory material to make the main points intelligible to a non-specialist. The reader will not only be able to distinguish important developments and trends in physics but will also find a sufficient number of references to the original literature.