Subword symmetry in natural languages
Olga Pelloni, Rob van der Goot, Peter Ranacher, Ivan Vulic, Tanja Samardzic
Royal Society Open Science 12(8): 250295. Published 2025-08-21.
DOI: https://doi.org/10.1098/rsos.250295
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12370235/pdf/
Citation count: 0
Abstract
Symmetric patterns are found in the orderly arrangements of natural structures, from proteins to the symmetry in animals' bodies. Symmetric structures are more stable and easier to describe and compress, which is why they may have been preferred as building blocks in natural selection. The idea that natural languages undergo an evolutionary process akin to the evolution of species has been pervasive in the study of language. This process might result in symmetric patterns as in other natural structures, but the notion of symmetry is rarely associated with the study of natural language. In this study, we look for symmetric patterns in text data, considering the length of subword units under a range of possible subword analyses. We study the length of subword units in 32 languages and discover that the splits of long words tend to be symmetric regardless of the segmentation method and that some automatic methods give symmetric splits at all word lengths. These results include natural language in the set of phenomena that can be described in terms of symmetry, opening a new research avenue for the empirical study of text data as a structure comparable to various other structures in the natural world.
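To make the idea of a "symmetric split" concrete, the following is a minimal sketch of how the lengths of subword units could be compared for a word segmented into two parts. The function names `subword_lengths` and `split_asymmetry`, the normalised length-difference score, and the example segmentations are all illustrative assumptions; the abstract does not specify the measure or the segmentation methods used in the study.

```python
def subword_lengths(segments):
    """Return the character length of each subword unit."""
    return [len(s) for s in segments]


def split_asymmetry(segments):
    """Hypothetical asymmetry score for a two-way split (an assumption,
    not the measure defined in the paper).

    0.0 means both units have the same length; values close to 1.0 mean
    one unit takes up nearly the whole word.
    """
    if len(segments) != 2:
        raise ValueError("expected exactly two subword units")
    left, right = subword_lengths(segments)
    total = left + right
    if total == 0:
        raise ValueError("empty word")
    return abs(left - right) / total


# Hand-made example segmentations, for illustration only.
examples = {
    "note+book": ["note", "book"],
    "walk+ing": ["walk", "ing"],
    "un+believable": ["un", "believable"],
}

for label, segments in examples.items():
    print(label, subword_lengths(segments), round(split_asymmetry(segments), 3))
```

Under this assumed score, a split such as "note" + "book" counts as perfectly symmetric, while "un" + "believable" is strongly asymmetric. Aggregating such scores by word length would be one way to check whether longer words tend towards symmetric splits.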
About the journal:
Royal Society Open Science is a new open journal publishing high-quality original research across the entire range of science on the basis of objective peer-review.
The journal covers the entire range of science and mathematics and will allow the Society to publish all the high-quality work it receives without the usual restrictions on scope, length or impact.