{"title":"Linguistic and computational modeling in language science","authors":"E. Teich, Péter Fankhauser","doi":"10.4324/9781315552941-12","DOIUrl":null,"url":null,"abstract":"historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantidosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics. The predominant domain of corpus linguistics is language variation, aiming at statements on relative differences/similarities between linguistic varieties (time periods, registers, genres). Corpus analysis is thus comparative by nature; technically, this involves comparing probability distributions of (sets of) linguistic features (e.g., the relative frequency of passive vs. active voice in narrative vs. expository genres) and assessing whether they are significantly different or not. Here, descriptive statistical techniques come into play but also language modeling and machine learning methods (e.g., clustering, latent semantic analysis, or Bayesian modeling). Similarly, corpus processing—that is, preparing text material for analysis—relies on computational models, for example, for annotation. What is important to note here is that processing and analysis are broken up into different steps, each using a different computational micro-model that takes care of a specific task (e.g., labeling linguistic units in annotation) and consists of a descriptive component (set of allowed labels) and an analytic or algorithmic component (procedure by which labels are assigned). Linguistic and computational modeling in language science","PeriodicalId":200326,"journal":{"name":"The Shape of Data in the Digital Humanities","volume":"86 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Shape of Data in the Digital Humanities","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4324/9781315552941-12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantadosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained?

This new empiricism is characterized by an interest in language use in context, accompanied by a commitment to computational modeling. That commitment is probably most developed in psycholinguistics, giving rise to the field of "computational psycholinguistics" (cf. Crocker, 2010), but it has recently been gaining strength in corpus linguistics as well. The predominant domain of corpus linguistics is language variation, aiming at statements about relative differences and similarities between linguistic varieties (time periods, registers, genres). Corpus analysis is thus comparative by nature; technically, this involves comparing probability distributions of (sets of) linguistic features (e.g., the relative frequency of passive vs. active voice in narrative vs. expository genres) and assessing whether they are significantly different or not (a minimal example of such a comparison is sketched below). Here, descriptive statistical techniques come into play, but so do language modeling and machine learning methods (e.g., clustering, latent semantic analysis, or Bayesian modeling). Similarly, corpus processing (that is, preparing text material for analysis) relies on computational models, for example for annotation.

What is important to note here is that processing and analysis are broken up into different steps, each using a different computational micro-model that takes care of a specific task (e.g., labeling linguistic units in annotation) and consists of a descriptive component (the set of allowed labels) and an analytic or algorithmic component (the procedure by which labels are assigned), as sketched in the second example below.
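The abstract describes corpus analysis as comparing probability distributions of linguistic features across varieties and testing whether the differences are significant. The following is a minimal sketch of that idea in Python, assuming invented counts of passive vs. active voice in a narrative and an expository genre and using a chi-squared test of independence; the numbers and the choice of test are illustrative assumptions, not taken from the chapter.

# Minimal sketch: compare the voice distribution in two genres and test
# whether the difference is statistically significant. All counts are
# invented for illustration.
from scipy.stats import chi2_contingency

# Rows: genres; columns: (passive, active) token counts.
counts = [
    [120, 880],   # narrative
    [310, 690],   # expository
]

chi2, p, dof, expected = chi2_contingency(counts)

for genre, (passive, active) in zip(["narrative", "expository"], counts):
    total = passive + active
    print(f"{genre}: relative frequency of passive = {passive / total:.3f}")

print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
# A small p-value suggests the voice distribution differs between the genres.

The same pattern generalizes to any feature whose occurrences can be counted per variety; only the contingency table changes.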
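The closing remark distinguishes, within each annotation micro-model, a descriptive component (the set of allowed labels) from an algorithmic component (the procedure that assigns labels). The toy annotator below is a minimal sketch of that split; the label set, lexicon, and function names are invented for illustration and do not reflect any particular tool discussed in the chapter.

# Minimal sketch of an annotation "micro-model":
#   descriptive component  = the set of allowed labels
#   algorithmic component  = the procedure that assigns a label to each token
# Labels and lexicon entries are invented for illustration.

ALLOWED_LABELS = {"NOUN", "VERB", "DET", "OTHER"}   # descriptive component

LEXICON = {                                          # used by the algorithmic component
    "the": "DET", "a": "DET",
    "linguist": "NOUN", "corpus": "NOUN",
    "annotates": "VERB", "compares": "VERB",
}

def assign_label(token: str) -> str:
    """Algorithmic component: look the token up, fall back to OTHER."""
    label = LEXICON.get(token.lower(), "OTHER")
    assert label in ALLOWED_LABELS                   # output constrained by the label set
    return label

def annotate(sentence: str):
    """Label every whitespace-separated token of the sentence."""
    return [(tok, assign_label(tok)) for tok in sentence.split()]

print(annotate("the linguist annotates a corpus"))
# [('the', 'DET'), ('linguist', 'NOUN'), ('annotates', 'VERB'), ('a', 'DET'), ('corpus', 'NOUN')]

In a realistic pipeline the assignment procedure would be a statistical or neural tagger rather than a lexicon lookup, but the division of labor between label inventory and assignment procedure stays the same.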