Natural Language Engineering: Latest Publications

Comparison of text preprocessing methods
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-06-13. DOI: 10.1017/S1351324922000213
Christine P. Chai
{"title":"Comparison of text preprocessing methods","authors":"Christine P. Chai","doi":"10.1017/S1351324922000213","DOIUrl":"https://doi.org/10.1017/S1351324922000213","url":null,"abstract":"Abstract Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"509 - 553"},"PeriodicalIF":2.5,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49277409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-06-09. DOI: 10.1017/S1351324922000225
M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina
{"title":"Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task","authors":"M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina","doi":"10.1017/S1351324922000225","DOIUrl":"https://doi.org/10.1017/S1351324922000225","url":null,"abstract":"Abstract Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"554 - 583"},"PeriodicalIF":2.5,"publicationDate":"2022-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47433843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4

Emerging trends: General fine-tuning (gft)
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-05-23. DOI: 10.1017/S1351324922000237
Kenneth Ward Church, Xingyu Cai, Yibiao Ying, Zeyu Chen, Guangxu Xun, Yuchen Bian
{"title":"Emerging trends: General fine-tuning (gft)","authors":"Kenneth Ward Church, Xingyu Cai, Yibiao Ying, Zeyu Chen, Guangxu Xun, Yuchen Bian","doi":"10.1017/S1351324922000237","DOIUrl":"https://doi.org/10.1017/S1351324922000237","url":null,"abstract":"Abstract This paper describes gft (general fine-tuning), a little language for deep nets, introduced at an ACL-2022 tutorial. gft makes deep nets accessible to a broad audience including non-programmers. It is standard practice in many fields to use statistics packages such as R. One should not need to know how to program in order to fit a regression or classification model and to use the model to make predictions for novel inputs. With gft, fine-tuning and inference are similar to fit and predict in regression and classification. gft demystifies deep nets; no one would suggest that regression-like methods are “intelligent.”","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"519 - 535"},"PeriodicalIF":2.5,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45341971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2

Turkish abstractive text summarization using pretrained sequence-to-sequence models
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-05-13. DOI: 10.1017/S1351324922000195
Batuhan Baykara, Tunga Güngör
{"title":"Turkish abstractive text summarization using pretrained sequence-to-sequence models","authors":"Batuhan Baykara, Tunga Güngör","doi":"10.1017/S1351324922000195","DOIUrl":"https://doi.org/10.1017/S1351324922000195","url":null,"abstract":"Abstract The tremendous amount of increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study by gaining significant attention from the researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements have addressed certain shortcomings in neural summarization and have improved upon challenges such as saliency, fluency, and semantics which enable generating higher quality summaries. Unfortunately, these research attempts were mostly limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on the two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. Then, we utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the input to the models has a substantial amount of importance for the success of such tasks. Additionally, we provide extensive analysis of the models including cross-dataset evaluations, various text generation options, and the effect of preprocessing in ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all the datasets. Lastly, qualitative evaluations of the generated summaries and titles of the models are provided.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1275 - 1304"},"PeriodicalIF":2.5,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46982077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1

Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-05-12. DOI: 10.1017/S1351324922000201
J. Wen, Lan Yi
{"title":"Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.","authors":"J. Wen, Lan Yi","doi":"10.1017/S1351324922000201","DOIUrl":"https://doi.org/10.1017/S1351324922000201","url":null,"abstract":"Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analyzing of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against the proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis continue to grow in the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion on the potential ethical issues that are pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for those readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only ease readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough on the accessing of online materials is presented prior to the beginning of the main chapters. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of the computational analysis in corpus linguistics research and outlines the key issues to be addressed. 
It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"842 - 845"},"PeriodicalIF":2.5,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42332455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1

Abstract meaning representation of Turkish
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-04-28. DOI: 10.1017/s1351324922000183
Elif Oral, Ali Acar, Gülşen Eryiğit
{"title":"Abstract meaning representation of Turkish","authors":"Elif Oral, Ali Acar, Gülşen Eryiğit","doi":"10.1017/s1351324922000183","DOIUrl":"https://doi.org/10.1017/s1351324922000183","url":null,"abstract":"\u0000 Abstract meaning representation (AMR) is a graph-based sentence-level meaning representation that has become highly popular in recent years. AMR is a knowledge-based meaning representation heavily relying on frame semantics for linking predicate frames and entity knowledge bases such as DBpedia for linking named entity concepts. Although it is originally designed for English, its adaptation to non-English languages is possible by defining language-specific divergences and representations. This article introduces the first AMR representation framework for Turkish, which poses diverse challenges for AMR due to its typological differences compared to English; agglutinative, free constituent order, morphologically highly rich resulting in fewer word surface forms in sentences. The introduced solutions to these peculiarities are expected to guide the studies for other similar languages and speed up the construction of a cross-lingual universal AMR framework. Besides this main contribution, the article also presents the construction of the first AMR corpus of 700 sentences, the first AMR parser (i.e., a tree-to-graph rule-based AMR parser) used for semi-automatic annotation, and the evaluation of the introduced resources for Turkish.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46085536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4

A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-04-22. DOI: 10.1017/s1351324922000171
Viktor Schlegel, G. Nenadic, R. Batista-Navarro
{"title":"A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding","authors":"Viktor Schlegel, G. Nenadic, R. Batista-Navarro","doi":"10.1017/s1351324922000171","DOIUrl":"https://doi.org/10.1017/s1351324922000171","url":null,"abstract":"Abstract Recent years have seen a growing number of publications that analyse Natural Language Understanding (NLU) datasets for superficial cues, whether they undermine the complexity of the tasks underlying those datasets and how they impact those models that are optimised and evaluated on this data. This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. We hope that it will be a useful resource for researchers who propose new datasets to assess the suitability and quality of their data to evaluate various phenomena of interest, as well as those who propose novel NLU approaches, to further understand the implications of their improvements with respect to their model’s acquired capabilities.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1 - 31"},"PeriodicalIF":2.5,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44296997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4

NLE volume 28 issue 3 Cover and Front matter
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-04-08. DOI: 10.1017/s1351324922000158
R. Mitkov, B. Boguraev
{"title":"NLE volume 28 issue 3 Cover and Front matter","authors":"R. Mitkov, B. Boguraev","doi":"10.1017/s1351324922000158","DOIUrl":"https://doi.org/10.1017/s1351324922000158","url":null,"abstract":"trans-lation, computer science or engineering. Its is to the computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing original research articles on a broad range of topics - from text analy- sis, machine translation, information retrieval, speech processing and generation to integrated systems and multi-modal interfaces - it also publishes special issues on specific natural language processing methods, tasks or applications. The journal welcomes survey papers describing the state of the art of a specific topic. The Journal of Natural Language Engineering also publishes the popular Industry Watch and Emerging Trends columns as well as book reviews.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"f1 - f2"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45341304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

NLE volume 28 issue 3 Cover and Back matter
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-04-08. DOI: 10.1017/s135132492200016x
{"title":"NLE volume 28 issue 3 Cover and Back matter","authors":"","doi":"10.1017/s135132492200016x","DOIUrl":"https://doi.org/10.1017/s135132492200016x","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"b1 - b2"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46428676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

The voice synthesis business: 2022 update
IF 2.5, CAS Tier 3, Computer Science
Natural Language Engineering. Pub Date: 2022-04-08. DOI: 10.1017/S1351324922000146
R. Dale
{"title":"The voice synthesis business: 2022 update","authors":"R. Dale","doi":"10.1017/S1351324922000146","DOIUrl":"https://doi.org/10.1017/S1351324922000146","url":null,"abstract":"Abstract In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of commercially viable use cases. We take a look at the technology features and use cases that have attracted attention and investment in the past few years, identifying the major players and recent start-ups in the space.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"401 - 408"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44725390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6