Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień
Complex systems approach to natural language
Physics Reports, volume 1053, pages 1-84. Published 2023-12-22.
DOI: 10.1016/j.physrep.2023.12.002
https://www.sciencedirect.com/science/article/pii/S0370157323004076
Abstract
The science of complexity asks what rules nature follows when assembling the basic constituents of matter and energy into structures and dynamical patterns that cascade through the entire hierarchy of scales in the Universe. A related phenomenon, natural language, can mirror such structures, as reflected by its ability to encode and transmit information about them and among them. It is thus legitimate to expect that natural language carries the essence of complexity. Indeed, in human speech and writing it is particularly true that "more is different". Natural language therefore deserves a central place in quantitative studies within the science of complexity.
With this in mind, the present review summarizes the main methodological concepts used in this domain and documents their applicability and utility in identifying universal as well as system-specific features of natural language in its written representation in several major Western languages. In particular, three main complexity-related research trends in quantitative linguistics are covered in depth. The first part addresses word frequencies in texts and, in particular, demonstrates that taking punctuation into account largely restores the scaling whose violation in Zipf’s law for the most frequent words is commonly modelled by the so-called Mandelbrot correction. The second part introduces methods inspired by time series analysis, used to study various kinds of long-range correlations in written texts. The related time series are generated by partitioning a text into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems: long-range correlations along with fractal or even multifractal structures. Moreover, the distances between consecutive punctuation marks appear to follow, quite universally across languages, the discrete variant of the Weibull distribution, which often appears in survival analysis. The third part reviews the application of the network formalism to natural language, particularly in the context of word-adjacency networks, whose structure reflects word co-occurrence in texts. Various parameters characterizing the topology of such networks can be used to classify texts, for example from a stylometric perspective. The network approach can also be applied in semantic analysis to represent a hierarchy of words and the associations between them based on their meaning. The structure of such networks turns out to differ significantly from that observed in random networks, revealing genuine properties of language. Finally, punctuation appears to have a significant impact not only on the information-carrying ability of language but also on its key statistical properties; hence it seems advisable to treat punctuation marks on a par with words.
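The three kinds of analysis summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the tokenizer, the set of punctuation marks, the way distances are counted, and all function names are assumptions chosen for illustration. It treats punctuation marks as tokens alongside words, builds a rank-frequency list of the kind used in Zipf-type analyses, measures the number of words between consecutive punctuation marks, and counts neighbouring token pairs as the weighted edges of a word-adjacency network.

```python
import re
from collections import Counter

# Assumed set of punctuation marks; the review's exact choice may differ.
PUNCTUATION = set(".,;:!?")


def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    # Words are runs of letters; each listed punctuation mark is its own token.
    return re.findall(r"[^\W\d_]+|[.,;:!?]", text.lower())


def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first."""
    counts = Counter(tokens)
    return [(rank, tok, n)
            for rank, (tok, n) in enumerate(counts.most_common(), start=1)]


def punctuation_distances(tokens):
    """Number of words in each interval closed by a punctuation mark.

    Assumption: empty intervals (two marks in a row) are skipped, and
    trailing words with no closing mark are ignored.
    """
    distances, run = [], 0
    for tok in tokens:
        if tok in PUNCTUATION:
            if run > 0:
                distances.append(run)
            run = 0
        else:
            run += 1
    return distances


def adjacency_edges(tokens):
    """Weighted directed edges: counts of each pair of neighbouring tokens."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    sample = "The cat sat, and the dog ran. The cat slept."
    toks = tokenize(sample)
    print(rank_frequency(toks)[:3])
    print(punctuation_distances(toks))
    print(adjacency_edges(toks).most_common(3))
```

On a real corpus, the rank-frequency list would be plotted on log-log axes to inspect scaling, the distance sequence would be compared against a fitted discrete distribution, and the edge counts would be loaded into a graph library to compute topological parameters. None of these steps is shown here.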
About the journal:
Physics Reports keeps the active physicist up-to-date on developments in a wide range of topics by publishing timely reviews which are more extensive than just literature surveys but normally less than a full monograph. Each report deals with one specific subject and is generally published in a separate volume. These reviews are specialist in nature but contain enough introductory material to make the main points intelligible to a non-specialist. The reader will not only be able to distinguish important developments and trends in physics but will also find a sufficient number of references to the original literature.