GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring

IF 0.8 4区计算机科学 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Intelligent Data Analysis Pub Date : 2023-07-06 DOI:10.3233/ida-230040

R. Découpes, M. Roche, M. Teisseire

{"title":"GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring","authors":"R. Découpes, M. Roche, M. Teisseire","doi":"10.3233/ida-230040","DOIUrl":null,"url":null,"abstract":"Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explicability methods, we aim to highlight that fine-tuned BERT-like models on crisis corpora give too much importance to spatial information to make their predictions. This overfitting of spatial information limits their ability to generalize especially when the event which occurs in a place has evolved and changed since the training dataset has been built. To reduce this bias, we propose GeoNLPlify,1 a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of the spatial information for data augmentation for crisis-related text classification tasks. In order to evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":" ","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2023-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.3233/ida-230040","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explicability methods, we aim to highlight that fine-tuned BERT-like models on crisis corpora give too much importance to spatial information to make their predictions. This overfitting of spatial information limits their ability to generalize especially when the event which occurs in a place has evolved and changed since the training dataset has been built. To reduce this bias, we propose GeoNLPlify,1 a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of the spatial information for data augmentation for crisis-related text classification tasks. In order to evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.

查看原文本刊更多论文

GeoNLPlify：一种用于危机监测的增强文本分类的空间数据增强

自然灾害和突发公共卫生事件等危机产生了大量的文本数据，因此将信息分类为相关类别具有挑战性。获取此类场景的专家标记数据可能很困难，通过微调类似BERT的模型，导致文本分类的训练数据集有限。不幸的是，传统的数据扩充技术只能略微提高F1成绩。如何使用数据扩充在这个应用领域中获得更好的结果？在本文中，使用神经网络可解释性方法，我们旨在强调危机语料库上微调的类BERT模型过于重视空间信息，无法进行预测。这种空间信息的过拟合限制了它们的泛化能力，尤其是当自训练数据集建立以来，发生在一个地方的事件发生了演变和变化时。为了减少这种偏差，我们提出了GeoNLPlify，1这是一种新的数据增强技术，利用空间信息生成新的标记数据，用于与危机相关的文本分类。我们的方法旨在解决过拟合问题，而无需对底层模型架构进行修改，将其与其他用于对抗过拟合的流行方法区分开来。我们的结果表明，GeoNLPlify显著提高了F1分数，证明了空间信息在危机相关文本分类任务中用于数据增强的潜力。为了评估我们的方法的贡献，将GeoNLPlify应用于三个公共数据集（PADI-web、CrisisNLP和SST2），并与经典的自然语言处理数据增强进行了比较。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Intelligent Data Analysis 工程技术-计算机：人工智能

CiteScore

2.20

自引率

5.90%

发文量

审稿时长

3.3 months

期刊介绍： Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.