Fusion of spatiotemporal and thematic features of textual data for animal disease surveillance

IF 7.4 Q1 AGRICULTURE, MULTIDISCIPLINARY

Information Processing in Agriculture. Pub Date: 2023-09-01. DOI: 10.1016/j.inpa.2022.03.004

Sarah Valentin, Renaud Lancelot, Mathieu Roche

Volume 10, Issue 3, Pages 347-360. https://www.sciencedirect.com/science/article/pii/S2214317322000312

Citations: 1

Abstract

Several internet-based systems have been created to monitor the web for animal health surveillance. These systems collect a large amount of news dealing with animal disease outbreaks. Automatically identifying news articles that describe the same outbreak event is a key step in quickly detecting relevant epidemiological information while reducing the manual curation of news content. This paper addresses the task of retrieving news articles that are related in epidemiological terms. We tackle this issue using text mining and feature fusion methods. The main objective of this paper is to identify a textual representation in which two articles that share the same epidemiological content are close. We compared two types of features to represent the documents: (i) morphosyntactic features (i.e., selection and transformation of all terms from the news, based on classical text processing steps) and (ii) lexicosemantic features (i.e., selection, transformation and fusion of epidemiological terms including diseases, hosts, locations and dates). We compared two types of term weighting (i.e., Boolean and TF-IDF) for both representations. To combine and transform lexicosemantic features, we compared two data fusion techniques (i.e., early fusion and late fusion) and the effect of feature generalisation, while evaluating the relative importance of each type of feature. We conducted our analysis using a corpus composed of a subset of news articles in English related to animal disease outbreaks. Our results showed that combining relevant lexicosemantic (epidemiological) features using fusion methods improves on the classical morphosyntactic representation in the context of disease-related news retrieval. The lexicosemantic representation based on TF-IDF and feature generalisation (F-measure = 0.92, r-precision = 0.58) outperformed the morphosyntactic representation (F-measure = 0.89, r-precision = 0.45), while reducing the feature space. Converting the features into less granular features (i.e., generalisation) contributed to improving the results of the lexicosemantic representation. Our results showed no difference between the early and late fusion approaches. Temporal features performed poorly on their own. Conversely, spatial features were the most discriminative features, highlighting the need for robust methods for spatial entity extraction, disambiguation and representation in internet-based surveillance systems.
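The difference between the two fusion strategies named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: early fusion concatenates the per-type feature vectors before computing a single similarity, and late fusion computes one similarity per feature type and then combines the scores. The toy articles, the function names, the choice of cosine similarity and the equal default weights are all hypothetical; the abstract does not specify the authors' exact procedure. Boolean weighting is used for brevity; TF-IDF weights could be substituted in `boolean_vector`.

```python
import math

# Hypothetical toy articles: each feature type maps to the terms extracted for it.
DOCS = {
    "a": {"disease": ["avian influenza"], "host": ["poultry"],
          "location": ["france"], "date": ["2019-w03"]},
    "b": {"disease": ["avian influenza"], "host": ["duck"],
          "location": ["france"], "date": ["2019-w04"]},
    "c": {"disease": ["african swine fever"], "host": ["pig"],
          "location": ["china"], "date": ["2019-w03"]},
}
FEATURE_TYPES = ("disease", "host", "location", "date")


def boolean_vector(terms):
    """Boolean weighting: 1.0 if a term occurs, regardless of its frequency."""
    return {term: 1.0 for term in set(terms)}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def early_fusion_similarity(doc1, doc2):
    """Concatenate all feature types into one vector, then compare once."""
    def fused(doc):
        # Keys are (feature type, term) pairs so that types never collide.
        return {(ftype, term): weight
                for ftype in FEATURE_TYPES
                for term, weight in boolean_vector(doc[ftype]).items()}
    return cosine(fused(doc1), fused(doc2))


def late_fusion_similarity(doc1, doc2, weights=None):
    """Compare each feature type separately, then take a weighted sum."""
    if weights is None:
        # Assumption: equal importance for every feature type.
        weights = {ftype: 1.0 / len(FEATURE_TYPES) for ftype in FEATURE_TYPES}
    return sum(
        weights[ftype] * cosine(boolean_vector(doc1[ftype]),
                                boolean_vector(doc2[ftype]))
        for ftype in FEATURE_TYPES
    )


if __name__ == "__main__":
    for other in ("b", "c"):
        print("a vs", other,
              early_fusion_similarity(DOCS["a"], DOCS[other]),
              late_fusion_similarity(DOCS["a"], DOCS[other]))
```

The `weights` argument of `late_fusion_similarity` is one way to probe the relative importance of each feature type: setting all weight on `location`, for example, ranks articles by spatial overlap alone.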

Source journal

Information Processing in Agriculture Agricultural and Biological Sciences-Animal Science and Zoology

CiteScore

21.10

Self-citation rate

0.00%

About the journal: Information Processing in Agriculture (IPA) was established in 2013. It encourages the development of a science and technology of information processing in agriculture through the following aims:
• Promote the use of knowledge and methods from information processing technologies in agriculture;
• Illustrate the experiences and publications of institutes, universities and governments, as well as profitable technologies in agriculture;
• Provide opportunities and a platform for exchanging knowledge, strategies and experiences among researchers in information processing worldwide;
• Promote and encourage interactions among agricultural scientists, meteorologists and biologists (pathologists/entomologists) with IT professionals and other stakeholders to develop and implement methods, techniques, tools, and issues related to information processing technology in agriculture;
• Create and promote expert groups for the development of agro-meteorological databases, crop and livestock modelling, and applications for the development of crop-performance-based decision support systems.
Topics of interest include, but are not limited to:
• Smart Sensor and Wireless Sensor Network
• Remote Sensing
• Simulation, Optimization, Modeling and Automatic Control
• Decision Support Systems, Intelligent Systems and Artificial Intelligence
• Computer Vision and Image Processing
• Inspection and Traceability for Food Quality
• Precision Agriculture and Intelligent Instrument
• The Internet of Things and Cloud Computing
• Big Data and Data Mining