{"title":"Linguistic and computational modeling in language science","authors":"E. Teich, Péter Fankhauser","doi":"10.4324/9781315552941-12","DOIUrl":null,"url":null,"abstract":"historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantidosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics. The predominant domain of corpus linguistics is language variation, aiming at statements on relative differences/similarities between linguistic varieties (time periods, registers, genres). Corpus analysis is thus comparative by nature; technically, this involves comparing probability distributions of (sets of) linguistic features (e.g., the relative frequency of passive vs. active voice in narrative vs. expository genres) and assessing whether they are significantly different or not. Here, descriptive statistical techniques come into play but also language modeling and machine learning methods (e.g., clustering, latent semantic analysis, or Bayesian modeling). Similarly, corpus processing—that is, preparing text material for analysis—relies on computational models, for example, for annotation. What is important to note here is that processing and analysis are broken up into different steps, each using a different computational micro-model that takes care of a specific task (e.g., labeling linguistic units in annotation) and consists of a descriptive component (set of allowed labels) and an analytic or algorithmic component (procedure by which labels are assigned). Linguistic and computational modeling in language science","PeriodicalId":200326,"journal":{"name":"The Shape of Data in the Digital Humanities","volume":"86 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Shape of Data in the Digital Humanities","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4324/9781315552941-12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantadosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained?

This new empiricism is characterized by an interest in language use in context, accompanied by a commitment to computational modeling. That commitment is probably most developed in psycholinguistics, giving rise to the field of "computational psycholinguistics" (cf. Crocker, 2010), but it has recently been gaining strength in corpus linguistics as well. The predominant domain of corpus linguistics is language variation, aiming at statements about relative differences and similarities between linguistic varieties (time periods, registers, genres). Corpus analysis is thus comparative by nature; technically, this involves comparing probability distributions of (sets of) linguistic features (e.g., the relative frequency of passive vs. active voice in narrative vs. expository genres) and assessing whether they are significantly different or not (a minimal example of such a comparison is sketched below). Here, descriptive statistical techniques come into play, but so do language modeling and machine learning methods (e.g., clustering, latent semantic analysis, or Bayesian modeling). Similarly, corpus processing (that is, preparing text material for analysis) relies on computational models, for example for annotation.

What is important to note here is that processing and analysis are broken up into different steps, each using a different computational micro-model that takes care of a specific task (e.g., labeling linguistic units in annotation) and consists of a descriptive component (the set of allowed labels) and an analytic or algorithmic component (the procedure by which labels are assigned), as sketched in the second example below.
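The abstract describes corpus analysis as comparing probability distributions of linguistic features across varieties and testing whether the differences are significant. The following is a minimal sketch of that idea in Python, assuming invented counts of passive vs. active voice in a narrative and an expository genre and using a chi-squared test of independence; the numbers and the choice of test are illustrative assumptions, not taken from the chapter.

# Minimal sketch: compare the voice distribution in two genres and test
# whether the difference is statistically significant. All counts are
# invented for illustration.
from scipy.stats import chi2_contingency

# Rows: genres; columns: (passive, active) token counts.
counts = [
    [120, 880],   # narrative
    [310, 690],   # expository
]

chi2, p, dof, expected = chi2_contingency(counts)

for genre, (passive, active) in zip(["narrative", "expository"], counts):
    total = passive + active
    print(f"{genre}: relative frequency of passive = {passive / total:.3f}")

print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
# A small p-value suggests the voice distribution differs between the genres.

The same pattern generalizes to any feature whose occurrences can be counted per variety; only the contingency table changes.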
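The closing remark distinguishes, within each annotation micro-model, a descriptive component (the set of allowed labels) from an algorithmic component (the procedure that assigns labels). The toy annotator below is a minimal sketch of that split; the label set, lexicon, and function names are invented for illustration and do not reflect any particular tool discussed in the chapter.

# Minimal sketch of an annotation "micro-model":
#   descriptive component  = the set of allowed labels
#   algorithmic component  = the procedure that assigns a label to each token
# Labels and lexicon entries are invented for illustration.

ALLOWED_LABELS = {"NOUN", "VERB", "DET", "OTHER"}   # descriptive component

LEXICON = {                                          # used by the algorithmic component
    "the": "DET", "a": "DET",
    "linguist": "NOUN", "corpus": "NOUN",
    "annotates": "VERB", "compares": "VERB",
}

def assign_label(token: str) -> str:
    """Algorithmic component: look the token up, fall back to OTHER."""
    label = LEXICON.get(token.lower(), "OTHER")
    assert label in ALLOWED_LABELS                   # output constrained by the label set
    return label

def annotate(sentence: str):
    """Label every whitespace-separated token of the sentence."""
    return [(tok, assign_label(tok)) for tok in sentence.split()]

print(annotate("the linguist annotates a corpus"))
# [('the', 'DET'), ('linguist', 'NOUN'), ('annotates', 'VERB'), ('a', 'DET'), ('corpus', 'NOUN')]

In a realistic pipeline the assignment procedure would be a statistical or neural tagger rather than a lexicon lookup, but the division of labor between label inventory and assignment procedure stays the same.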