Natural Language Engineering: latest articles, page 10

Real-world sentence boundary detection using multitask learning: A case study on French

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-04-06 DOI: 10.1017/s1351324922000134

Kyungtae Lim, Jungyeul Park

Citations: 1

Gender bias in legal corpora and debiasing it

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-03-30 DOI: 10.1017/s1351324922000122

Nurullah Sevim, Furkan Şahinuç, Aykut Koç

Abstract: Word embeddings have become important building blocks that are used profoundly in natural language processing (NLP). Despite their several advantages, word embeddings can unintentionally accommodate some gender- and ethnicity-based biases that are present within the corpora they are trained on. Therefore, ethical concerns have been raised since word embeddings are extensively used in several high-level algorithms. Studying such biases and debiasing them have recently become an important research endeavor. Various studies have been conducted to measure the extent of bias that word embeddings capture and to eradicate them. Concurrently, as another subfield that has started to gain traction recently, the applications of NLP in the field of law have started to increase and develop rapidly. As law has a direct and utmost effect on people’s lives, the issues of bias for NLP applications in legal domain are certainly important. However, to the best of our knowledge, bias issues have not yet been studied in the context of legal corpora. In this article, we approach the gender bias problem from the scope of legal text processing domain. Word embedding models that are trained on corpora composed by legal documents and legislation from different countries have been utilized to measure and eliminate gender bias in legal documents. Several methods have been employed to reveal the degree of gender bias and observe its variations over countries. Moreover, a debiasing method has been used to neutralize unwanted bias. The preservation of semantic coherence of the debiased vector space has also been demonstrated by using high-level tasks. Finally, overall results and their implications have been discussed in the scope of NLP in legal domain.

Pages: 449-482

Citations: 10

Generating Arabic TAG for syntax-semantics analysis

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-03-24 DOI: 10.1017/s1351324922000109

Chérifa Ben Khelil, C. Zribi, D. Duchier, Y. Parmentier

Citations: 0

In-depth analysis of the impact of OCR errors on named entity recognition and linking

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-03-18 DOI: 10.1017/s1351324922000110

Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, A. Doucet

Abstract: Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) version which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous works were conducted to evaluate the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.

Pages: 425-448

Citations: 5

MHeTRep: A multilingual semantically tagged health terms repository

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-25 DOI: 10.1017/s1351324922000055

J. Vivaldi, H. Rodríguez

Abstract: This paper presents MHeTRep, a multilingual medical terminology and the methodology followed for its compilation. The multilingual terminology is organised into one vocabulary for each language. All the terms in the collection are semantically tagged with a tagset corresponding to the top categories of Snomed-CT ontology. When possible, the individual terms are linked to their equivalent in the other languages. Even though many NLP resources and tools claim to be domain independent, their application to specific tasks can be restricted to specific domains, otherwise their performance degrades notably. As the accuracy of NLP resources drops heavily when applied in environments different from which they were built, a tuning to the new environment is needed. Usually, having a domain terminology facilitates and accelerates the adaptation of general domain NLP applications to a new domain. This is particularly important in medicine, a domain living moments of great expansion. The proposed method takes Snomed-CT as starting point. From this point and using 13 multilingual resources, covering the most relevant medical concepts such as drugs, anatomy, clinical findings and procedures, we built a large resource covering seven languages totalling more than two million semantically tagged terms. The resulting collection has been intensively evaluated in several ways for the involved languages and domain categories. Our hypothesis is that MHeTRep can be used advantageously over the original resources for a number of NLP use cases and likely extended to other languages.

Pages: 1364-1401

Citations: 0

An empirical study of cyclical learning rate on neural machine translation

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-09 DOI: 10.1017/s135132492200002x

Weixuan Wang, Choon Meng Lee, Jianfeng Liu, Talha Çolakoğlu, Wei Peng

Citations: 3

Emerging Trends: SOTA-Chasing

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-08 DOI: 10.1017/S1351324922000043

Kenneth Ward Church, Valia Kordoni

Citations: 23

NLE volume 28 issue 2 Cover and Front matter

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-08 DOI: 10.1017/s1351324922000067

R. Mitkov, B. Boguraev

Citations: 0

NLE volume 28 issue 2 Cover and Back matter

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-08 DOI: 10.1017/s1351324922000079

Citations: 0

Towards improving the robustness of sequential labeling models against typographical adversarial examples using triplet loss

IF 2.5, Zone 3 (Computer Science)

Natural Language Engineering Pub Date : 2022-02-04 DOI: 10.1017/s1351324921000486

Can Udomcharoenchaikit, P. Boonkwan, P. Vateekul

Abstract: Many fundamental tasks in natural language processing (NLP) such as part-of-speech tagging, text chunking, and named-entity recognition can be formulated as sequence labeling problems. Although neural sequence labeling models have shown excellent results on standard test sets, they are very brittle when presented with misspelled texts. In this paper, we introduce an adversarial training framework that enhances the robustness against typographical adversarial examples. We evaluate the robustness of sequence labeling models with an adversarial evaluation scheme that includes typographical adversarial examples. We generate two types of adversarial examples without access (black-box) or with full access (white-box) to the target model’s parameters. We conducted a series of extensive experiments on three languages (English, Thai, and German) across three sequence labeling tasks. Experiments show that the proposed adversarial training framework provides better resistance against adversarial examples on all tasks. We found that we can further improve the model’s robustness on the chunking task by including a triplet loss constraint.

Pages: 287-315

Citations: 0