{"title":"Word Length Distribution in German Texts during the 17th-19th Century","authors":"Fei Lian, Y. Li","doi":"10.1080/09296174.2019.1662536","DOIUrl":"https://doi.org/10.1080/09296174.2019.1662536","url":null,"abstract":"Word length in German texts has been a frequently discussed issue in quantitative linguistics. Most existing research, however, focuses on literary texts and private letters, and the corpus used in each study is relatively small. This paper provides a time- and genre-based analysis of word length distribution in German using 360 texts originating between the 17th and 19th centuries, aiming to find a probability distribution that captures the German word length distribution well from a diachronic perspective and to reveal the relationship between the word length distribution and boundary conditions such as the genre and the creation time of a text. Results indicate that the word length distribution in German texts written in different eras abides by the 1-displaced hyper-Poisson distribution, whose parameters (a, b) are interconnected with boundary conditions. This study corroborates that the word length distribution of a given language is consistent, due to the constraint of the cognitive mechanism. In addition, the parameters of the probability distribution can be good indicators of the writing style as well as the creation time of a text.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"28 1","pages":"117 - 137"},"PeriodicalIF":1.4,"publicationDate":"2019-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2019.1662536","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42201015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistics in Corpus Linguistics: A Practical Guide","authors":"Cunxin Han","doi":"10.1080/09296174.2019.1646069","DOIUrl":"https://doi.org/10.1080/09296174.2019.1646069","url":null,"abstract":"Corpus linguistics is a powerful quantitative methodology, which heavily relies on frequency data and statistical procedures. It is difficult to talk about corpus linguistics without mentioning sta...","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"27 1","pages":"379 - 383"},"PeriodicalIF":1.4,"publicationDate":"2019-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2019.1646069","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46755901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Coding and the Origins of Zipfian Laws","authors":"R. Ferrer-i-Cancho, C. Bentz","doi":"10.1080/09296174.2020.1778387","DOIUrl":"https://doi.org/10.1080/09296174.2020.1778387","url":null,"abstract":"The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding – under an arbitrary coding scheme – and show that it predicts Zipf’s law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter. We apply this result to investigate optimal coding also under so-called non-singular coding, a scheme where unique segmentation is not warranted but codes stand for a distinct number. Optimal non-singular coding predicts that the length of a word should grow approximately as the logarithm of its frequency rank, which is again consistent with Zipf’s law of abbreviation. Optimal non-singular coding in combination with the maximum entropy principle also predicts Zipf’s rank-frequency distribution. Furthermore, our findings on optimal non-singular coding challenge common beliefs about random typing. It turns out that random typing is in fact an optimal coding process, in stark contrast with the common assumption that it is detached from cost cutting considerations. Finally, we discuss the implications of optimal coding for the construction of a compact theory of Zipfian laws more generally as well as other linguistic laws.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"29 1","pages":"165 - 194"},"PeriodicalIF":1.4,"publicationDate":"2019-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2020.1778387","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47778723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlations and Potential Cross-Linguistic Indicators of Writing Style","authors":"P. Juola, George K. Mikros, Sean Vinsick","doi":"10.1080/09296174.2018.1458395","DOIUrl":"https://doi.org/10.1080/09296174.2018.1458395","url":null,"abstract":"In this paper, we present preliminary results on how an individual’s writing style persists even across languages. In other words, what aspects of an individual’s writing will persist irrespective of the language in which he or she writes? We argue that cognitive and social traits are likely to persist and demonstrate this by two separate analyses of bilingual corpora using the same individuals. We show that for various measures of linguistic complexity (which we consider to be a cognitive variable) and participation in specific social conventions (a social one), the correlation between scores on the two languages studied is significantly higher than would be expected by chance. We argue that this type of correlation may permit cross-linguistic authorship attribution.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"26 1","pages":"146 - 171"},"PeriodicalIF":1.4,"publicationDate":"2019-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2018.1458395","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46316258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calculation of Semantic Distances Between Words: From Synonymy to Antonymy","authors":"M. Vakulenko","doi":"10.1080/09296174.2018.1452524","DOIUrl":"https://doi.org/10.1080/09296174.2018.1452524","url":null,"abstract":"A new approach to numerically measuring the semantic distances between lexical units (words and collocations), based on geometric analogies and analytical calculation, is put forward. Having considered the cases of equal and different weights of semes, we obtained exact algebraic formulas describing different levels of meaning proximity, ranging from absolute synonymy to full antonymy. It was emphasized that absolute synonymy arises when the compared units contain equal numbers of semes that fully coincide and have equal weights in the corresponding pairs. Calculation of the seme weights helps to locate the unit more precisely on the semantic sphere. It was shown that the level of synonymy and antonymy decreases if different semes are accentuated, while the semantic distance between units without identical semes cannot be influenced by seme boosting. It was observed that, depending on the context, a word may wander over this sphere, thus modifying its lexical semantic relations with other units. As the proposed approach contributes to formalization of the unit comparison procedure, it is suitable for incorporation into relevant automatic tools, particularly into WordNet and FrameNet. The obtained results may be useful for various linguistic and associated studies including automatic text analysis and processing, computer lexicography, information search and retrieval, machine translation and other NLP applications that are related to the artificial intelligence problem.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"26 1","pages":"116 - 128"},"PeriodicalIF":1.4,"publicationDate":"2019-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2018.1452524","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47278957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency Effect and Neutralization of Tones in Mandarin Chinese","authors":"Huifang Kong, Shengyi Wu","doi":"10.1080/09296174.2018.1452140","DOIUrl":"https://doi.org/10.1080/09296174.2018.1452140","url":null,"abstract":"Tonal neutralization in Mandarin has long been thought to be connected with lexical frequency, but this has never been investigated quantitatively because of the methodological challenge. In this study, a production experiment was run with speakers reading disyllabic words in neutral tones, with frequency estimates derived from a Frequency Dictionary. The dependent measures were three acoustic correlates: duration, F0 contour and intensity. Independent measures included lexical frequency at three levels (low, middle and high). Regression analysis showed that neutralization of tones is directly correlated with lexical frequency independent of other factors. A regularity governs the neutralization of tones in reduced syllables: the more frequent, the shorter in duration; the more frequent, the lower in pitch; the more frequent, the weaker in intensity. However, the exact shape of this effect differs across frequency ranges: only high frequency words display a significant difference from low frequency words. Last but not least, an exemplar representation is proposed to express a neutral tone’s observed frequency effect naturally.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"26 1","pages":"95 - 115"},"PeriodicalIF":1.4,"publicationDate":"2019-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2018.1452140","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45619077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models","authors":"A. Dobó, J. Csirik","doi":"10.1080/09296174.2019.1570897","DOIUrl":"https://doi.org/10.1080/09296174.2019.1570897","url":null,"abstract":"Measuring the semantic similarity and relatedness of words can play a vital role in many natural language processing tasks. Distributional semantic models computing these measures can have many different parameters, such as different weighting schemes, vector similarity measures, feature transformation functions and dimensionality reduction techniques. Despite their importance, there is no truly comprehensive study simultaneously evaluating the numerous parameters of such models, while also considering the interaction of these parameters with each other. We would like to address this gap with our systematic study. Taking the necessary distributional information extracted from the chosen dataset as already granted, we evaluate all important aspects of the creation and comparison of feature vectors in distributional semantic models. Testing altogether 10 parameters simultaneously, we try to find the best combination of parameter settings, with a large number of settings examined in the case of some of the parameters. Besides evaluating the conventionally used settings for the parameters, we also propose numerous novel variants, as well as novel combinations of parameter settings, some of which significantly outperform the combinations of settings in general use, thus achieving state-of-the-art results.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"27 1","pages":"244 - 271"},"PeriodicalIF":1.4,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2019.1570897","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48204474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Systemic Dynamics Model of Text Production","authors":"Giacomo P. Figueredo, G. Figueredo","doi":"10.1080/09296174.2019.1567301","DOIUrl":"https://doi.org/10.1080/09296174.2019.1567301","url":null,"abstract":"This paper introduces a quantitative model of text as it unfolds in time. The model conceptualizes text as a functional unit of language. This organization can be difficult to identify because it forms complex patterns of linguistic laws, probability and dynamics. These patterns are covert configurations and need complex methods to be investigated. One such method is to draw from qualitative frameworks derived from the quantitative properties of language. Previous studies have shown that covert configurations can be obtained through qualitative frameworks. When dynamics is considered, however, a model of text production including the variable time is needed. This paper therefore aims at addressing this research gap by proposing a dynamics model of text unfolding. It draws from systemic theory and models its categories quantitatively. Time is introduced as variation of choice. The model is applied to a sample of text. Results show how individual choices contribute to text unfolding – describing the amount of meanings at any given moment in text time. In addition, the dynamic accumulation indicates core characteristics of a text, which can be further explored in text behaviour and typology.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"27 1","pages":"291 - 320"},"PeriodicalIF":1.4,"publicationDate":"2019-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2019.1567301","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"59838178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Approaches to the Russian Language","authors":"E. Kelih","doi":"10.1080/09296174.2018.1558834","DOIUrl":"https://doi.org/10.1080/09296174.2018.1558834","url":null,"abstract":"The omnibus volume under review comprises 10 individual chapters by 22 authors, thus most of the chapters are co-authored. This seems to reflect the overall interdisciplinary approach focus of the ...","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"27 1","pages":"80 - 83"},"PeriodicalIF":1.4,"publicationDate":"2019-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2018.1558834","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43913979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Analysis of the Tables in Mahadevan’s Concordance of the Indus Valley Script","authors":"M. Oakes","doi":"10.1080/09296174.2017.1406294","DOIUrl":"https://doi.org/10.1080/09296174.2017.1406294","url":null,"abstract":"The Indus Script originates from the culture known as the Indus Valley Civilization, which flourished from approximately 2600 to 1900 BC. Several thousand objects bearing these signs have been found over a wide area of Northern India and Pakistan. In 1977, Iravatham Mahadevan published a concordance of all of the scripts that had been discovered so far. Accompanying the concordance is a set of nine tables showing the distribution of individual signs by position, archaeological site, object type, field symbol (accompanying image), and direction of writing. Analysis of the frequencies of the signs found so far using Large Numbers of Rare Events (LNRE) models estimated the total vocabulary of the language, including signs not yet found, to be about 857. All the tables were analysed using Pearson’s residuals, and it was found that the signs were not randomly distributed, but some showed statistically significant associations with position, object, field symbol or direction of writing. A more detailed analysis of the relation between signs and field symbols was made using correspondence analysis, which showed that certain signs were associated with the unicorn symbol, while others were associated with the gharial and dotted circle symbols.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"26 1","pages":"22 - 47"},"PeriodicalIF":1.4,"publicationDate":"2019-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/09296174.2017.1406294","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41996848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}