Unifying Models for Word Length Distributions Based on Types and Tokens


Journal of Quantitative Linguistics. Pub Date: 2023-04-03. DOI: 10.1080/09296174.2023.2202061

Peter Zörnig, T. Berg


Abstract

Word length has long been one of the central topics in Quantitative Linguistics. Most models were constructed for very specific purposes: an individual model applies only to a specific language, only to token counts, or only to type counts. The present paper takes up the challenge of developing unifying models that account for both type and token frequencies in a moderately large sample of languages (eight Indo-European and two non-Indo-European). We introduce three models that fit all our data well: the exponentiated Hyper-Poisson distribution, the generalized gamma distribution, and the Sichel distribution. We also discuss the possibility of interpreting the model parameters linguistically.
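The distinction between type and token counts can be made concrete with a small sketch. This is an illustration only, not the authors' procedure: the function name `length_distributions` and the sample sentence are hypothetical, and word length is measured in characters purely for convenience (the abstract does not state which unit the paper uses).

```python
from collections import Counter


def length_distributions(words, length=len):
    """Return (token_dist, type_dist): word length -> relative frequency.

    Tokens are all running words; types are the distinct word forms.
    `length` is an assumed, swappable length measure (characters by default).
    """
    tokens = [w.lower() for w in words]
    types = set(tokens)

    token_counts = Counter(length(w) for w in tokens)
    type_counts = Counter(length(w) for w in types)

    def normalise(counts):
        # Convert raw counts into relative frequencies, sorted by length.
        total = sum(counts.values())
        return {k: counts[k] / total for k in sorted(counts)}

    return normalise(token_counts), normalise(type_counts)


sample = "a rose is a rose is a rose".split()
token_dist, type_dist = length_distributions(sample)
print(token_dist)  # {1: 0.375, 2: 0.25, 4: 0.375}
print(type_dist)   # three types, each length with relative frequency 1/3
```

The same text yields two different length distributions: frequent short words weigh heavily in the token distribution but count only once in the type distribution. A unifying model in the sense of the abstract would have to fit both.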



About the journal: The Journal of Quantitative Linguistics is an international forum for the publication and discussion of research on the quantitative characteristics of language and text in an exact mathematical form. This approach, which is of growing interest, opens up important and exciting theoretical perspectives, as well as solutions for a wide range of practical problems such as machine learning or statistical parsing, by introducing into linguistics the methods and models of advanced scientific disciplines such as the natural sciences, economics, and psychology.