Sameh Abdulah, Yuxiao Li, Jian Cao, Hatem Ltaief, David E. Keyes, Marc G. Genton, Ying Sun
{"title":"Large-scale environmental data science with ExaGeoStatR","authors":"Sameh Abdulah, Yuxiao Li, Jian Cao, Hatem Ltaief, David E. Keyes, Marc G. Genton, Ying Sun","doi":"10.1002/env.2770","DOIUrl":null,"url":null,"abstract":"<p>Parallel computing in exact Gaussian process (GP) calculations becomes necessary for avoiding computational and memory restrictions associated with large-scale environmental data science applications. The exact evaluation of the Gaussian log-likelihood function requires <math>\n <mi>O</mi>\n <mo>(</mo>\n <msup>\n <mrow>\n <mi>n</mi>\n </mrow>\n <mrow>\n <mn>2</mn>\n </mrow>\n </msup>\n <mo>)</mo></math> storage and <math>\n <mi>O</mi>\n <mo>(</mo>\n <msup>\n <mrow>\n <mi>n</mi>\n </mrow>\n <mrow>\n <mn>3</mn>\n </mrow>\n </msup>\n <mo>)</mo></math> operations, where <math>\n <mrow>\n <mi>n</mi>\n </mrow></math> is the number of geographical locations. Thus, exactly computing the log-likelihood function with a large number of locations requires exploiting the power of existing parallel computing hardware systems, such as shared-memory, possibly equipped with GPUs, and distributed-memory systems, to solve this exact computational complexity. In this article, we present <i>ExaGeoStatR</i>, a package for exascale geostatistics in <i>R</i> that supports a parallel computation of the exact maximum likelihood function on a wide variety of parallel architectures. Furthermore, the package allows scaling existing GP methods to a large spatial/temporal domain. Prohibitive exact solutions for large geostatistical problems become possible with <i>ExaGeoStatR</i>. Parallelization in <i>ExaGeoStatR</i> depends on breaking down the numerical linear algebra operations in the log-likelihood function into a set of tasks and rendering them for a task-based programming model. The package can be used directly through the <i>R</i> environment on parallel systems without the user needing any <i>C</i>, <i>CUDA</i>, or <i>MPI</i> knowledge. Currently, <i>ExaGeoStatR</i> supports several maximum likelihood computation variants such as exact, diagonal super tile and tile low-rank approximations, and mixed-precision. <i>ExaGeoStatR</i> also provides a tool to simulate large-scale synthetic datasets. These datasets can help assess different implementations of the maximum log-likelihood approximation methods. Herein, we show the implementation details of <i>ExaGeoStatR</i>, analyze its performance on various parallel architectures, and assess its accuracy using synthetic datasets with up to 250K observations. The experimental analysis covers the exact computation of <i>ExaGeoStatR</i> to demonstrate the parallel capabilities of the package. We provide a hands-on tutorial to analyze a sea surface temperature real dataset. The performance evaluation involves comparisons with the popular packages <i>GeoR</i>, <i>fields</i>, and <i>bigGP</i> for exact Gaussian likelihood evaluation. The approximation methods in <i>ExaGeoStatR</i> are not considered in this article since they were analyzed in previous studies.</p>","PeriodicalId":50512,"journal":{"name":"Environmetrics","volume":"34 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2022-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmetrics","FirstCategoryId":"93","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/env.2770","RegionNum":3,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENVIRONMENTAL SCIENCES","Score":null,"Total":0}
引用次数: 4
Abstract
Parallel computing in exact Gaussian process (GP) calculations becomes necessary for avoiding computational and memory restrictions associated with large-scale environmental data science applications. The exact evaluation of the Gaussian log-likelihood function requires storage and operations, where is the number of geographical locations. Thus, exactly computing the log-likelihood function with a large number of locations requires exploiting the power of existing parallel computing hardware systems, such as shared-memory, possibly equipped with GPUs, and distributed-memory systems, to solve this exact computational complexity. In this article, we present ExaGeoStatR, a package for exascale geostatistics in R that supports a parallel computation of the exact maximum likelihood function on a wide variety of parallel architectures. Furthermore, the package allows scaling existing GP methods to a large spatial/temporal domain. Prohibitive exact solutions for large geostatistical problems become possible with ExaGeoStatR. Parallelization in ExaGeoStatR depends on breaking down the numerical linear algebra operations in the log-likelihood function into a set of tasks and rendering them for a task-based programming model. The package can be used directly through the R environment on parallel systems without the user needing any C, CUDA, or MPI knowledge. Currently, ExaGeoStatR supports several maximum likelihood computation variants such as exact, diagonal super tile and tile low-rank approximations, and mixed-precision. ExaGeoStatR also provides a tool to simulate large-scale synthetic datasets. These datasets can help assess different implementations of the maximum log-likelihood approximation methods. Herein, we show the implementation details of ExaGeoStatR, analyze its performance on various parallel architectures, and assess its accuracy using synthetic datasets with up to 250K observations. The experimental analysis covers the exact computation of ExaGeoStatR to demonstrate the parallel capabilities of the package. We provide a hands-on tutorial to analyze a sea surface temperature real dataset. The performance evaluation involves comparisons with the popular packages GeoR, fields, and bigGP for exact Gaussian likelihood evaluation. The approximation methods in ExaGeoStatR are not considered in this article since they were analyzed in previous studies.
期刊介绍:
Environmetrics, the official journal of The International Environmetrics Society (TIES), an Association of the International Statistical Institute, is devoted to the dissemination of high-quality quantitative research in the environmental sciences.
The journal welcomes pertinent and innovative submissions from quantitative disciplines developing new statistical and mathematical techniques, methods, and theories that solve modern environmental problems. Articles must proffer substantive, new statistical or mathematical advances to answer important scientific questions in the environmental sciences, or must develop novel or enhanced statistical methodology with clear applications to environmental science. New methods should be illustrated with recent environmental data.