Parallel Network Analysis and Communities Detection (PANC) Pipeline for the Analysis and Visualization of COVID-19 Data

Parallel Process. Lett. Pub Date: 2021-09-22 DOI: 10.1142/s0129626421420020

Giuseppe Agapito, Marianna Milano, M. Cannataro


Citations: 3

Abstract

A new coronavirus, which causes a severe acute respiratory syndrome (COVID-19), emerged in Wuhan, China, in December 2019. The epidemic spread rapidly across the world and became a pandemic that, at the time of writing, has affected more than 70 million people and caused over 2 million deaths. To better understand how the COVID-19 pandemic spreads, we developed PANC (Parallel Network Analysis and Communities Detection), a new parallel preprocessing methodology for network-based analysis and community detection on Italian COVID-19 data. The goal of the methodology is to analyze a set of homogeneous datasets (i.e., COVID-19 data from several regions) using a statistical test to find similar or dissimilar behaviours, to map this similarity information onto a graph, and then to use a community detection algorithm to visualize and analyze the initial dataset. The methodology includes the following steps: (i) a parallel method to build similarity matrices that represent which regions are similar or dissimilar with respect to the data; (ii) an effective workload balancing function to improve performance; (iii) the mapping of similarity matrices into networks in which nodes represent Italian regions and edges represent similarity relationships; (iv) the discovery and visualization of communities of regions that show similar behaviour. The methodology is general and can be applied to worldwide COVID-19 data, as well as to any dataset in tabular or matrix format. To estimate scalability with increasing workloads, we analyzed three synthetic COVID-19 datasets of 90.0 MB, 180.0 MB, and 360.0 MB. The experiments showed that the amount of data that can be analyzed in a given amount of time increases almost linearly with the number of computing resources available. For community detection, by contrast, we used the real dataset.
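The four steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the abstract does not name the statistical test, so an absolute Welch t statistic with a hypothetical threshold of 2.0 stands in for it; the workload balancing is a plain round-robin split of region pairs; threads are used for brevity, whereas a process pool would be the usual choice for CPU-bound work; and connected components stand in for an actual community detection algorithm. The region names and counts are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic (assumed stand-in for the paper's test)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def balanced_chunks(items, n_workers):
    """Round-robin split so each worker gets a near-equal number of pairs."""
    return [items[i::n_workers] for i in range(n_workers)]


def similar_pairs(data, threshold=2.0, n_workers=4):
    """Steps (i)-(ii): test every region pair in parallel, keep similar ones."""
    pairs = list(combinations(sorted(data), 2))

    def work(chunk):
        return [(u, v) for u, v in chunk
                if welch_t(data[u], data[v]) < threshold]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(work, balanced_chunks(pairs, n_workers))
        return [edge for part in results for edge in part]


def communities(nodes, edges):
    """Steps (iii)-(iv): build the graph and return its connected components."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            n = stack.pop()
            if n not in group:
                group.add(n)
                stack.extend(adj[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy, made-up daily counts for four hypothetical regions (not real data).
data = {
    "A": [10, 12, 11, 13, 12],
    "B": [11, 12, 10, 13, 11],
    "C": [50, 55, 52, 54, 53],
    "D": [51, 54, 53, 55, 52],
}
edges = similar_pairs(data)
print(communities(data, edges))  # [['A', 'B'], ['C', 'D']]
```

Because each pair test is independent, the list of pairs can be split across workers in any order; the round-robin split keeps the chunk sizes within one pair of each other, which is the simplest form of the balancing idea mentioned in step (ii).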
