Natural Language Engineering: Latest Articles

How you describe procurement calls matters: Predicting outcome of public procurement using call descriptions
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-08-10 | DOI: 10.1017/s135132492300030x
U. Acikalin, Mustafa Kaan Gorgun, Mucahid Kutlu, B. Tas
Abstract: A competitive and cost-effective public procurement (PP) process is essential for the effective use of public resources. In this work, we explore whether descriptions of procurement calls can be used to predict their outcomes. In particular, we focus on predicting four well-known economic metrics: (i) the number of offers, (ii) whether only a single offer is received, (iii) whether a foreign firm is awarded the contract, and (iv) whether the contract price exceeds the expected price. We extract the European Union's multilingual PP notices, covering 22 different languages. We investigate fine-tuning multilingual transformer models and propose two approaches: (1) multilayer perceptron (MLP) models with transformer embeddings for each business sector, in which the training data are filtered based on the procurement category, and (2) a k-nearest neighbor (KNN)-based approach fine-tuned using triplet networks. The fine-tuned MBERT model outperforms all other models in predicting calls with a single offer and foreign contract awards, whereas our MLP-based filtering approach yields state-of-the-art results in predicting contracts in which the contract price exceeds the expected price. Furthermore, our KNN-based approach outperforms all the baselines in all tasks and our other proposed models in predicting the number of offers. Moreover, we investigate cross-lingual and multilingual training for our tasks and observe that multilingual training improves prediction accuracy in all our tasks. Overall, our experiments suggest that notice descriptions play an important role in the outcomes of PP calls.
Citations: 0
SSL-GAN-RoBERTa: A robust semi-supervised model for detecting Anti-Asian COVID-19 hate speech on social media
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-08-03 | DOI: 10.1017/s1351324923000396
Xuanyu Su, Yansong Li, Paula Branco, D. Inkpen
Abstract: Anti-Asian speech during the COVID-19 pandemic has been a serious problem with severe consequences. A hate speech wave swept social media platforms. The timely detection of Anti-Asian COVID-19-related hate speech is of utmost importance, not only to allow the application of preventive mechanisms but also to anticipate and possibly prevent other similar discriminatory situations. In this paper, we address the problem of detecting Anti-Asian COVID-19-related hate speech from social media data. Previous approaches that tackled this problem used a transformer-based model, BERT/RoBERTa, trained on a homologous annotated dataset and achieved good performance on this task. However, this requires extensive, annotated datasets with a strong connection to the topic. Both goals are difficult to meet without employing reliable, vast, and costly resources. In this paper, we propose a robust semi-supervised model, SSL-GAN-RoBERTa, that learns from a limited heterogeneous dataset and whose performance is further enhanced by using vast amounts of unlabeled data from another related domain. Compared with the RoBERTa baseline model, the experimental results show that the model achieves substantial performance gains in terms of Accuracy and Macro-F1 score in different scenarios that use data from different domains. Our proposed model achieves state-of-the-art performance results while efficiently using unlabeled data, showing promising applicability to other complex classification tasks where large amounts of labeled examples are difficult to obtain.
Citations: 0
Masked transformer through knowledge distillation for unsupervised text style transfer
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-07-25 | DOI: 10.1017/s1351324923000323
Arthur Scalercio, A. Paes
Abstract: Text style transfer (TST) aims at automatically changing a text's stylistic features, such as formality, sentiment, authorial style, humor, and complexity, while still trying to preserve its content. Although the scientific community has investigated TST since the 1980s, it has recently regained attention through the adoption of deep unsupervised strategies that address the challenge of training without parallel data. In this manuscript, we investigate how relying on sequence-to-sequence pretraining models affects the performance of TST when the pretraining step leverages pairs of paraphrase data. Furthermore, we propose a new technique to enhance the sequence-to-sequence model by distilling knowledge from masked language models. We evaluate our proposals on three unsupervised style transfer tasks with widely used benchmarks: author imitation, formality transfer, and polarity swap. The evaluation relies on quantitative and qualitative analyses and comparisons with the results of state-of-the-art models. For the author imitation and formality transfer tasks, we show that the proposed techniques improve all measured metrics and lead to state-of-the-art (SOTA) results in content preservation and overall score in the author imitation domain. In the formality transfer domain, we match the SOTA method on the style control metric. In the polarity swap domain, we show that the knowledge distillation component improves all measured metrics, while the paraphrase pretraining increases content preservation at the expense of style control. Based on the results in these domains, we also discuss whether the tasks we address have the same nature and should all equally be treated as TST tasks.
Citations: 0
Assessment of the E3C corpus for the recognition of disorders in clinical texts
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-07-18 | DOI: 10.1017/s1351324923000335
Roberto Zanoli, A. Lavelli, Daniel Verdi do Amarante, Daniele Toti
Abstract: Disorder named entity recognition (DNER) is a fundamental task of biomedical natural language processing which has attracted plenty of attention. This task consists in extracting named entities of disorders, such as diseases, symptoms, and pathological functions, from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. The entities of type disorder in the clinical cases are annotated at both mention and concept level. At mention level, the annotation identifies the entity text spans, for example, "abdominal pain". At concept level, the entity text spans are associated with their concept identifiers in the Unified Medical Language System, for example, C0000737. This corpus can be exploited as a benchmark for training and assessing information extraction systems. In the present work, multiple experiments were conducted to test the appropriateness of the mention-level annotation of the E3C corpus for training DNER models. In these experiments, traditional machine learning models, such as conditional random fields, and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. The multilingual pre-trained models were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. Results show the appropriateness of the E3C corpus for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as baselines for this corpus to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.
Citations: 0
Describe the house and I will tell you the price: House price prediction with textual description data
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-07-18 | DOI: 10.1017/s1351324923000360
Han Zhang, Yansong Li, Paula Branco
Abstract: House price prediction is an important problem that could benefit home buyers and sellers. Traditional models for house price prediction use numerical attributes such as the number of rooms but disregard the house description text. Recent developments in text processing suggest these can be valuable attributes, which motivated us to use house descriptions. This paper focuses on the house asking/advertising price and studies the impact of using house description texts to predict the final house price. To achieve this, we collected a large and diverse set of attributes on house postings, including the house advertising price. Then, we compare the performance of three scenarios: using only the house description, only numeric attributes, or both. We processed the description text through three word embedding techniques: TF-IDF, Word2Vec, and BERT. Four regression algorithms are trained using only textual data, non-textual data, or both. Our results show that by using exclusively the description data with Word2Vec and a deep learning model, we can achieve good performance. However, the best overall performance is obtained when using both textual and non-textual features. An R² of 0.7904 is achieved by the deep learning model using only description data on the testing data. This clearly indicates that the house description text alone is a strong predictor of the house price. However, when observing the RMSE on the test data, the best model was gradient boosting using both numeric and description data. Overall, we observe that combining the textual and non-textual features improves the learned model and provides performance benefits when compared against using only one of the feature types. We also provide a freely available application for house price prediction, which is solely based on a house text description and uses our final developed model with Word2Vec and deep learning to predict the house price.
Citations: 0
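The abstract above describes predicting a price from description text alone. As a rough illustration of that idea — not the authors' pipeline, which uses TF-IDF/Word2Vec/BERT embeddings with trained regressors — here is a minimal, stdlib-only sketch that vectorizes made-up listing descriptions as bags of words and estimates a price from the most textually similar listing:

```python
# Toy sketch with hypothetical data: bag-of-words text similarity
# used for a 1-nearest-neighbor price estimate.
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Made-up (description, price) listings standing in for real postings.
listings = [
    ("sunny two-bedroom condo near downtown", 300_000),
    ("spacious four-bedroom house with large garden", 550_000),
    ("compact studio apartment recently renovated", 180_000),
]

def estimate_price(description):
    """Return the price of the most textually similar listing."""
    query = vectorize(description)
    return max(listings, key=lambda item: cosine(query, vectorize(item[0])))[1]

print(estimate_price("four-bedroom house with garden"))  # → 550000
```

A real system would replace the count vectors with learned embeddings and the nearest-neighbor lookup with a trained regressor, but the featurize-then-predict shape is the same.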
Navigating the text generation revolution: Traditional data-to-text NLG companies and the rise of ChatGPT
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-07-01 | DOI: 10.1017/S1351324923000347
R. Dale
Abstract: Since the release of ChatGPT at the end of November 2022, generative AI has been talked about endlessly in both the technical press and the mainstream media. Large language model technology has been heralded as many things: the disruption of the search engine, the end of the student essay, the bringer of disinformation … but what does it mean for commercial providers of earlier iterations of natural language generation technology? We look at how the major players in the space are responding, and where things might go in the future.
Citations: 0
Korean named entity recognition based on language-specific features
CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-06-29 | DOI: 10.1017/s1351324923000311
Yige Chen, KyungTae Lim, Jungyeul Park
Abstract: In this paper, we propose a novel way of improving named entity recognition (NER) in the Korean language using its language-specific features. While the field of NER has been studied extensively in recent years, the mechanism of efficiently recognizing named entities (NEs) in Korean has hardly been explored, because the Korean language has distinct linguistic properties that present challenges for modeling. We therefore propose an annotation scheme for Korean corpora that adopts the CoNLL-U format, decomposing Korean words into morphemes and reducing the ambiguity of NEs in the original segmentation, which may contain functional morphemes such as postpositions and particles. We investigate how the NE tags are best represented in this morpheme-based scheme and implement an algorithm to convert word-based and syllable-based Korean corpora with NEs into the proposed morpheme-based format. Analyses of the results of traditional and neural models show that the proposed morpheme-based format is feasible, and demonstrate how various additional language-specific features affect the models' performance. We also varied extrinsic conditions to observe how the proposed models perform given different types of data, including the original segmentation and different tagging formats.
Citations: 0
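The core idea in the abstract above — after morpheme decomposition, NE tags should attach to content morphemes while functional morphemes such as postpositions fall outside the NE span — can be sketched as follows. The postposition list and the retagging rule here are simplified assumptions for illustration, not the paper's conversion algorithm:

```python
# Toy sketch: push a word-level NE tag down to morphemes, excluding
# functional morphemes (postpositions/particles) from the NE span.
# The postposition inventory below is a tiny, simplified sample.
POSTPOSITIONS = {"에서", "은", "는", "이", "가", "을", "를"}

def retag(morphemes, word_tag):
    """Assign the word-level tag to content morphemes; 'O' to postpositions."""
    return [
        "O" if word_tag == "O" or m in POSTPOSITIONS else word_tag
        for m in morphemes
    ]

# "서울에서" ("in Seoul") decomposes into 서울 + 에서 (locative postposition);
# only the content morpheme keeps the location tag.
print(retag(["서울", "에서"], "B-LOC"))  # → ['B-LOC', 'O']
```

In a CoNLL-U-style rendering, each morpheme would occupy its own row, so the NE span boundary no longer straddles a word-internal postposition.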
Linguistically aware evaluation of coreference resolution from the perspective of higher-level applications
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-06-19 | DOI: 10.1017/s1351324923000293
Voldemaras Žitkus, R. Butkienė, R. Butleris
Abstract: Coreference resolution is an important part of natural language processing used in machine translation, semantic search, and various other information retrieval and understanding systems. One of the challenges in this field is the evaluation of resolution approaches. Many different metrics have been proposed, but most of them rely on certain assumptions, like equivalence between different mentions of the same discourse-world entity, and do not account for the overrepresentation of certain types of coreferences in the evaluation data. In this paper, a new coreference evaluation strategy that focuses on linguistic and semantic information is presented to address some of these shortcomings. The evaluation model was developed in the broader context of developing coreference resolution capabilities for the Lithuanian language; therefore, the experiment was also carried out using Lithuanian language resources, but the proposed evaluation strategy is not language-dependent.
Citations: 0
A resampling-based method to evaluate NLI models
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-06-09 | DOI: 10.1017/s1351324923000268
Felipe Salvatore, M. Finger, R. Hirata, A. G. Patriota
Abstract: The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works use a similar evaluation method: they construct a new NLI test set based on sentences with known logic and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it on the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem: a machine learning model can perform poorly on a new test set because the text patterns presented in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model's performance on two equivalent datasets. As a case study, we apply the IE test to state-of-the-art NLI models using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs, and, more importantly, this deficiency cannot be solved by adding adversarial observations to the training data.
Citations: 0
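The invariance property the abstract above tests — a model should assign the same label to inputs that differ only by meaning-preserving synonym substitution — can be sketched with a hypothetical toy classifier (a keyword-overlap rule, not one of the paper's neural models; the synonym table is made up):

```python
# Toy sketch of an invariance-under-equivalence check via synonym substitution.
SYNONYMS = {"purchased": "bought", "vehicle": "car"}

def substitute(sentence):
    """Rewrite a sentence by replacing each known word with its synonym."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())

def toy_model(premise, hypothesis):
    """Hypothetical classifier: entailment iff every hypothesis word appears in the premise."""
    return "entailment" if set(hypothesis.split()) <= set(premise.split()) else "neutral"

premise = "mary purchased a vehicle"
hypothesis = "mary purchased a vehicle"

# The IE test compares the label on the original pair with the label on
# the equivalent, synonym-substituted pair; they should agree.
original = toy_model(premise, hypothesis)
equivalent = toy_model(substitute(premise), substitute(hypothesis))
print(original == equivalent)  # → True
```

The paper's finding is that real NLI models often fail this check: the two labels differ even though the inputs are semantically equivalent.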
Explainable Natural Language Processing, by Anders Søgaard. San Rafael, CA: Morgan & Claypool, 2021. ISBN 978-1-636-39213-4. XV+107 pages.
IF 2.5 | CAS Tier 3 | Computer Science
Natural Language Engineering | Pub Date: 2023-06-02 | DOI: 10.1017/s1351324923000281
Zihao Zhang
(Book review; no abstract.)
Citations: 0