Scaling Hawkes processes to one million COVID-19 cases

arXiv - STAT - Computation | Pub Date: 2024-07-16 | DOI: arxiv-2407.11349

Seyoon Ko, Marc A. Suchard, Andrew J. Holbrook



Abstract

Hawkes stochastic point process models have emerged as valuable statistical tools for analyzing viral contagion. The spatiotemporal Hawkes process characterizes the speeds at which viruses spread within human populations. Unfortunately, likelihood-based inference using these models requires $O(N^2)$ floating-point operations, for $N$ the number of observed cases. Recent work responds to the Hawkes likelihood's computational burden by developing efficient graphics processing unit (GPU)-based routines that enable Bayesian analysis of tens of thousands of observations. We build on this work and develop a high-performance computing (HPC) strategy that divides 30 Markov chains between 4 GPU nodes, each of which uses multiple GPUs to accelerate its chain's likelihood computations. We use this framework to apply two spatiotemporal Hawkes models to the analysis of one million COVID-19 cases in the United States between March 2020 and June 2023. In addition to brute-force HPC, we advocate for two simple strategies as scalable alternatives to successful approaches proposed for small data settings. First, we use known county-specific population densities to build a spatially varying triggering kernel in a manner that avoids computationally costly nearest-neighbor search. Second, we use a cut-posterior inference routine that accounts for infections' spatial location uncertainty by iteratively sampling latent locations uniformly within their respective counties of occurrence, thereby avoiding full-blown latent variable inference for 1,000,000 infection locations.
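
To make the $O(N^2)$ cost mentioned in the abstract concrete, the following is a minimal sketch of a Hawkes log-likelihood in Python with NumPy. It rests on assumptions that are not taken from the paper: it is a purely temporal simplification with an exponential triggering kernel, not the authors' spatiotemporal models or their GPU implementation. The function name `hawkes_loglik` and the parameters `mu` (background rate), `alpha` (expected offspring per event), and `omega` (decay rate) are hypothetical names chosen for illustration.

```python
import numpy as np


def hawkes_loglik(times, T, mu, alpha, omega):
    """Log-likelihood of a temporal Hawkes process observed on [0, T].

    Assumed (illustrative) conditional intensity:
        lambda(t) = mu + sum_{t_j < t} alpha * omega * exp(-omega * (t - t_j))
    """
    t = np.asarray(times, dtype=float)

    # Pairwise lags t_i - t_j. This N x N array is where the O(N^2)
    # floating-point cost comes from.
    lags = t[:, None] - t[None, :]
    past = lags > 0  # only strictly earlier events can trigger event i

    # Mask the lags before exponentiating so non-past entries cannot overflow.
    safe_lags = np.where(past, lags, 0.0)
    excite = np.where(past, alpha * omega * np.exp(-omega * safe_lags), 0.0)

    # Intensity evaluated at each observed event time.
    intensity = mu + excite.sum(axis=1)

    # Compensator: the integral of the intensity over [0, T].
    compensator = mu * T + alpha * np.sum(1.0 - np.exp(-omega * (T - t)))

    return float(np.sum(np.log(intensity)) - compensator)
```

With $N$ equal to one million, the dense pairwise array above would hold $10^{12}$ entries, so this sketch only works for small $N$. It is meant to show why each likelihood evaluation needs to be tiled and accelerated at that scale.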


