Natural Language Engineering最新文献_第9页

Comparison of text preprocessing methods 文本预处理方法比较

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-06-13 DOI: 10.1017/S1351324922000213

Christine P. Chai

{"title":"Comparison of text preprocessing methods","authors":"Christine P. Chai","doi":"10.1017/S1351324922000213","DOIUrl":"https://doi.org/10.1017/S1351324922000213","url":null,"abstract":"Abstract Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"509 - 553"},"PeriodicalIF":2.5,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49277409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task 误入歧途还是误入歧途：通过NLI任务探索多语言BERT的语言学知识

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-06-09 DOI: 10.1017/S1351324922000225

M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina

{"title":"Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task","authors":"M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina","doi":"10.1017/S1351324922000225","DOIUrl":"https://doi.org/10.1017/S1351324922000225","url":null,"abstract":"Abstract Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"554 - 583"},"PeriodicalIF":2.5,"publicationDate":"2022-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47433843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Emerging trends: General fine-tuning (gft) 新兴趋势:通用微调(gft)

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-05-23 DOI: 10.1017/S1351324922000237

Kenneth Ward Church, Xingyu Cai, Yibiao Ying, Zeyu Chen, Guangxu Xun, Yuchen Bian

引用次数: 2

Turkish abstractive text summarization using pretrained sequence-to-sequence models 使用预训练序列到序列模型的土耳其语抽象文本摘要

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-05-13 DOI: 10.1017/S1351324922000195

Batuhan Baykara, Tunga Güngör

{"title":"Turkish abstractive text summarization using pretrained sequence-to-sequence models","authors":"Batuhan Baykara, Tunga Güngör","doi":"10.1017/S1351324922000195","DOIUrl":"https://doi.org/10.1017/S1351324922000195","url":null,"abstract":"Abstract The tremendous amount of increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study by gaining significant attention from the researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements have addressed certain shortcomings in neural summarization and have improved upon challenges such as saliency, fluency, and semantics which enable generating higher quality summaries. Unfortunately, these research attempts were mostly limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on the two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. Then, we utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the input to the models has a substantial amount of importance for the success of such tasks. Additionally, we provide extensive analysis of the models including cross-dataset evaluations, various text generation options, and the effect of preprocessing in ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all the datasets. Lastly, qualitative evaluations of the generated summaries and titles of the models are provided.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1275 - 1304"},"PeriodicalIF":2.5,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46982077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages. 《语料库语言学中的自然语言处理》，作者:Jonathan Dunn。剑桥:剑桥大学出版社，2022。ISBN 9781009070447 (PB)， ISBN 9781009070447 (OC)， vi+88页。

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-05-12 DOI: 10.1017/S1351324922000201

J. Wen, Lan Yi

{"title":"Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.","authors":"J. Wen, Lan Yi","doi":"10.1017/S1351324922000201","DOIUrl":"https://doi.org/10.1017/S1351324922000201","url":null,"abstract":"Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analyzing of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against the proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis continue to grow in the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion on the potential ethical issues that are pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for those readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only ease readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough on the accessing of online materials is presented prior to the beginning of the main chapters. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of the computational analysis in corpus linguistics research and outlines the key issues to be addressed. It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"842 - 845"},"PeriodicalIF":2.5,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42332455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Abstract meaning representation of Turkish 土耳其语的抽象意义表征

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-04-28 DOI: 10.1017/s1351324922000183

Elif Oral, Ali Acar, Gülşen Eryiğit

{"title":"Abstract meaning representation of Turkish","authors":"Elif Oral, Ali Acar, Gülşen Eryiğit","doi":"10.1017/s1351324922000183","DOIUrl":"https://doi.org/10.1017/s1351324922000183","url":null,"abstract":"\u0000 Abstract meaning representation (AMR) is a graph-based sentence-level meaning representation that has become highly popular in recent years. AMR is a knowledge-based meaning representation heavily relying on frame semantics for linking predicate frames and entity knowledge bases such as DBpedia for linking named entity concepts. Although it is originally designed for English, its adaptation to non-English languages is possible by defining language-specific divergences and representations. This article introduces the first AMR representation framework for Turkish, which poses diverse challenges for AMR due to its typological differences compared to English; agglutinative, free constituent order, morphologically highly rich resulting in fewer word surface forms in sentences. The introduced solutions to these peculiarities are expected to guide the studies for other similar languages and speed up the construction of a cross-lingual universal AMR framework. Besides this main contribution, the article also presents the construction of the first AMR corpus of 700 sentences, the first AMR parser (i.e., a tree-to-graph rule-based AMR parser) used for semi-automatic annotation, and the evaluation of the introduced resources for Turkish.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46085536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding 揭示和克服数据驱动的自然语言理解弱点的方法综述

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-04-22 DOI: 10.1017/s1351324922000171

Viktor Schlegel, G. Nenadic, R. Batista-Navarro

引用次数: 4

NLE volume 28 issue 3 Cover and Front matter NLE第28卷第3期封面和封面问题

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-04-08 DOI: 10.1017/s1351324922000158

R. Mitkov, B. Boguraev

引用次数: 0

NLE volume 28 issue 3 Cover and Back matter NLE第28卷第3期封面和封底

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-04-08 DOI: 10.1017/s135132492200016x

引用次数: 0

The voice synthesis business: 2022 update 语音合成业务：2022年更新

IF 2.5 3区计算机科学

Natural Language Engineering Pub Date : 2022-04-08 DOI: 10.1017/S1351324922000146

R. Dale

引用次数: 6