Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.

IF 1.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering Pub Date : 2022-05-12 DOI:10.1017/S1351324922000201

J. Wen, Lan Yi

Volume: 29 1 Pages: 842-845

Citations: 1

Abstract

Corpus linguistics is essentially the computer-based empirical analysis of naturally occurring language and its use, based on a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). Its techniques enable the analysis of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has seemingly become disconnected from recent advances in artificial intelligence, even as the computing power and corpus data available for linguistic analysis have continued to grow over the past decades. More sophisticated methods are therefore needed to update and expand the arsenal of corpus linguistics research.

As its name suggests, this monograph focuses exclusively on using NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodology for corpus linguistic research, followed by a discussion of the potential ethical issues pertinent to its application. Five corpus-based case studies demonstrate how and why a particular computational method is used for linguistic analysis.

Given the methodological orientation of the book, it is not surprising that it contains substantial technical detail on the implementation of these methods, which is usually daunting for readers without any background in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These supporting materials are an invaluable complement to the book because they not only relieve readers of the need to write code but also make every result and graph in the book readily reproducible. To give readers a better hands-on experience, a quick walkthrough on accessing the online materials is presented before the main chapters. With just a few clicks, readers can run the code and replicate the case studies in interactive code notebooks. Readers who are familiar with Python are encouraged to explore the corpus data further and extend the scripts to serve their own research purposes.

Chapter 1 provides a general overview of computational analysis in corpus linguistics research and outlines the key issues to be addressed. It first defines the major problems in corpus analysis that NLP models can solve (namely, categorization and comparison) and explains why computational linguistic analysis is needed for corpus linguistic research (namely, reproducibility and scalability). The author then introduces all five case studies to be presented in the subsequent chapters. These studies, ranging from usage-based grammar to corpus-based sociolinguistics, demonstrate how NLP methods can be applied to investigate real-world linguistic phenomena. As for the key issues, the categorization problems and comparison problems are discussed in two
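
The two perspectives named in the abstract, quantitative (word frequencies) and qualitative (concordances), can be illustrated with a minimal Python sketch. This is not taken from the book or its online materials: the function names `tokenize`, `word_frequencies` and `concordance`, the naive regular-expression tokenizer, and the sample sentence are all assumptions made for illustration, using only the Python standard library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, n=3):
    """Quantitative view: return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, keyword, window=2):
    """Qualitative view: list each occurrence of keyword with `window` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, invented for this illustration.
sample = (
    "The corpus is large. A large corpus supports frequency counts, "
    "and a corpus supports concordances."
)
tokens = tokenize(sample)

print(word_frequencies(tokens))
for line in concordance(tokens, "corpus"):
    print(line)
```

A real study would replace the naive tokenizer with a proper NLP pipeline and run over a full corpus; the sketch only shows how the same token stream feeds both a frequency table and a keyword-in-context listing.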




Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 5.90

Self-citation rate: 12.00%

Articles published:

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.