Parallel Computing最新文献

筛选
英文 中文
Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix 基于分布式共识的非负不可还原矩阵前导特征值估算
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-10-05 DOI: 10.1016/j.parco.2024.103113
Rahim Alizadeh , Shahriar Bijani , Fatemeh Shakeri
{"title":"Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix","authors":"Rahim Alizadeh ,&nbsp;Shahriar Bijani ,&nbsp;Fatemeh Shakeri","doi":"10.1016/j.parco.2024.103113","DOIUrl":"10.1016/j.parco.2024.103113","url":null,"abstract":"<div><div>This paper presents an algorithm to solve the problem of estimating the largest eigenvalue and its corresponding eigenvector for irreducible matrices in a distributed manner. The proposed algorithm utilizes a network of computational nodes that interact with each other, forming a strongly connected digraph where each node handles one row of the matrix, without the need for centralized storage or knowledge of the entire matrix. Each node possesses a solution space, and the intersection of all these solution spaces contains the leading eigenvector of the matrix. Initially, each node selects a random vector from its solution space, and then, while interacting with its neighbors, updates the vector at each step by solving a quadratically constrained linear program (QCLP). The updates are done so that the nodes reach a consensus on the leading eigenvector of the matrix. The numerical outcomes demonstrate the effectiveness of our proposed method.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103113"},"PeriodicalIF":2.0,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142424535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Parallel Pattern Compiler for Automatic Global Optimizations 自动全局优化的并行模式编译器
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-09-21 DOI: 10.1016/j.parco.2024.103112
Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller
{"title":"Parallel Pattern Compiler for Automatic Global Optimizations","authors":"Adrian Schmitz,&nbsp;Semih Burak,&nbsp;Julian Miller,&nbsp;Matthias S. Müller","doi":"10.1016/j.parco.2024.103112","DOIUrl":"10.1016/j.parco.2024.103112","url":null,"abstract":"<div><div>High-performance computing (HPC) systems enable scientific advances through simulation and data processing. The heterogeneity in HPC hardware and software increases the application complexity and reduces its maintainability and productivity. This work proposes a prototype implementation for a parallel pattern-based source-to-source compiler to address these challenges. The prototype limits the complexity of parallelism and heterogeneous architectures to parallel patterns that are optimized towards a given target architecture. By applying high-level optimizations and a mapping between parallel patterns and execution units during compile time, portability between systems is achieved. The compiler can address architectures with shared memory, distributed memory, and accelerator offloading.</div><div>The approach shows speedups for seven of the nine supported Rodinia benchmarks, reaching speedups of up to twelve times. Porting LULESH to the Parallel Pattern Language (PPL) shows a compression of code size by 65% (3.4 thousand lines of code) through a more concise expression and a higher level of abstraction. The tool’s limitations include dynamic algorithms that are challenging to analyze statically and overheads during the compile time optimization. This paper is an extended version of a previous PMAM publication (Schmitz et al., 2024).</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103112"},"PeriodicalIF":2.0,"publicationDate":"2024-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142323330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism 基于灰狼优化和新编码机制的云计算任务调度
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-09-17 DOI: 10.1016/j.parco.2024.103111
Xingwang Huang , Min Xie , Dong An , Shubin Su , Zongliang Zhang
{"title":"Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism","authors":"Xingwang Huang ,&nbsp;Min Xie ,&nbsp;Dong An ,&nbsp;Shubin Su ,&nbsp;Zongliang Zhang","doi":"10.1016/j.parco.2024.103111","DOIUrl":"10.1016/j.parco.2024.103111","url":null,"abstract":"<div><p>Task scheduling in the cloud computing still remains challenging in terms of performance. Several evolutionary-derived algorithms have been proposed to solve or alleviate this problem. However, evolutionary algorithms have good exploration ability, but the performance drops significantly in high dimensions. To address this issue, considering the characteristic of task scheduling in cloud computing (i.e. all task-VM mappings are 1-dimensional and have the same search range), we propose a task scheduling algorithm based on grey wolf optimization using a new encoding mechanism (GWOEM) in this work. Through this new encoding mechanism, greedy and evolutionary algorithms are rationally integrated in GWOEM. Besides, based on the new mechanism, the dimension of search space is reduced to 1 and the key parameter (i.e., the population size) is eliminated. We apply the proposed GWOEM to the Google Cloud Jobs dataset (GoCJ) and demonstrate better performance than the prior state of the art in terms of makespan.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103111"},"PeriodicalIF":2.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An automated OpenMP mutation testing framework for performance optimization 用于性能优化的自动 OpenMP 突变测试框架
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-08-21 DOI: 10.1016/j.parco.2024.103097
Dolores Miao , Ignacio Laguna , Giorgis Georgakoudis , Konstantinos Parasyris , Cindy Rubio-González
{"title":"An automated OpenMP mutation testing framework for performance optimization","authors":"Dolores Miao ,&nbsp;Ignacio Laguna ,&nbsp;Giorgis Georgakoudis ,&nbsp;Konstantinos Parasyris ,&nbsp;Cindy Rubio-González","doi":"10.1016/j.parco.2024.103097","DOIUrl":"10.1016/j.parco.2024.103097","url":null,"abstract":"<div><p>Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level thus preventing their portability across compilers. This paper describes <span>Muppet</span>, a new approach that identifies program modifications called <em>mutations</em> aimed at improving program performance. <span>Muppet</span>’s mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at intermediate representations (IR), <span>Muppet</span> uses the idea of source-level <em>mutation testing</em> to relax correctness constraints and automatically discover optimization opportunities that otherwise are not feasible using the IR. We demonstrate the <span>Muppet</span>’s concept in the OpenMP programming model. <span>Muppet</span> generates a list of OpenMP mutations that alter the program parallelism in various ways, and is capable of running a variety of optimization algorithms such as delta debugging, Bayesian Optimization and decision tree optimization to find a subset of mutations which, when applied to the original program, cause the most speedup while maintaining program correctness. When <span>Muppet</span> is evaluated against a diverse set of benchmark programs and proxy applications, it is capable of finding sets of mutations that induce speedup in 75.9% of the evaluated programs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103097"},"PeriodicalIF":2.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000358/pdfft?md5=139743a6196b36bc64bd1733300112aa&pid=1-s2.0-S0167819124000358-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142040335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Abstractions for C++ code optimizations in parallel high-performance applications 并行高性能应用程序中的 C++ 代码优化抽象
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-08-14 DOI: 10.1016/j.parco.2024.103096
Jiří Klepl, Adam Šmelko, Lukáš Rozsypal, Martin Kruliš
{"title":"Abstractions for C++ code optimizations in parallel high-performance applications","authors":"Jiří Klepl,&nbsp;Adam Šmelko,&nbsp;Lukáš Rozsypal,&nbsp;Martin Kruliš","doi":"10.1016/j.parco.2024.103096","DOIUrl":"10.1016/j.parco.2024.103096","url":null,"abstract":"<div><p>Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the order of data structure traversal. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily optimize for memory performance and employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows defining tuning parameters in one place and removes boilerplate code. The proposed solution was implemented as an extension of the Noarr library that simplifies a layout-agnostic design of regular data structures. It is implemented entirely using C<span>++</span> template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC or Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103096"},"PeriodicalIF":2.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000346/pdfft?md5=9cd8ac7a1eebfc9480655a05bba5ca50&pid=1-s2.0-S0167819124000346-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142012840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach 通过工作路径调动未充分利用的存储节点:作业感知文件条带化方法
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-08-10 DOI: 10.1016/j.parco.2024.103095
Gang Xian , Wenxiang Yang , Yusong Tan , Jinghua Feng , Yuqi Li , Jian Zhang , Jie Yu
{"title":"Mobilizing underutilized storage nodes via job path: A job-aware file striping approach","authors":"Gang Xian ,&nbsp;Wenxiang Yang ,&nbsp;Yusong Tan ,&nbsp;Jinghua Feng ,&nbsp;Yuqi Li ,&nbsp;Jian Zhang ,&nbsp;Jie Yu","doi":"10.1016/j.parco.2024.103095","DOIUrl":"10.1016/j.parco.2024.103095","url":null,"abstract":"<div><p>Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, leading to a negative impact on the overall performance of supercomputers. Therefore, exploring effective strategies for utilizing parallel I/O capabilities is imperative. In this regard, we conduct an analysis of the workload on two production supercomputers’ Object Storage Targets (OSTs) and study the potential inefficient I/O patterns for high performance computing jobs. Our research findings indicate that under the traditional stripe settings that most supercomputers use to ensure stability, the real-time load on OSTs is severely unbalanced. This imbalance results in I/O requests that fail to fully utilize the available OSTs. To tackle this issue, we propose a job-aware optimization approach, which includes static and dynamic file striping. Static file striping optimizes all user jobs, whereas dynamic file striping employs clustering of job names and job paths to extract similarities among jobs and predict partially stripe-optimizable jobs for users. Additionally, a stripe recovery mechanism is employed to mitigate the negative impact of stripe misconfigurations. This approach appropriately adjusts the file stripe layout based on the job’s I/O pattern, allowing for better mobilization of underutilized OSTs to enhance parallel I/O capabilities. Through experimental verification, the number of OSTs that jobs can use has been increased, effectively improving the parallel I/O performance of the job without significantly affecting operational stability.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103095"},"PeriodicalIF":2.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture NxtSPR:基于三核的多核架构的中继专用无死锁最短路径路由算法
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-07-24 DOI: 10.1016/j.parco.2024.103094
Chunfeng Li, Karim Soliman, Fei Yin, Jin Wei, Feng Shi
{"title":"NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture","authors":"Chunfeng Li,&nbsp;Karim Soliman,&nbsp;Fei Yin,&nbsp;Jin Wei,&nbsp;Feng Shi","doi":"10.1016/j.parco.2024.103094","DOIUrl":"10.1016/j.parco.2024.103094","url":null,"abstract":"<div><p>Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design as it affects the network’s latency, power consumption, and load balance, impacting the performance of multi-processor systems-on-chip. However, achieving deadlock-free routing will routinely result in expensive overhead as previous solutions either sacrifice performance or power efficiency to proactively avoid deadlocks or impose high hardware complexity to resolve deadlocks when they occur reactively. Utilizing the various characteristics of NoC to implement deadlock-free routing can be significantly more cost-effective with less impact on performance. This paper proposes a relay routing algorithm (NxtSPR) with a shortest path property and a deadlock prevention mechanism based on a synchronized Hamiltonian ring. The proposal is based on an in-depth study of the characteristics of a Triplet-Based many-core Architecture (TriBA) NoC. We establish various important topology-related theories and perform a formal verification (proof-based) for them. By utilizing the critical subgraph and apex of TriBA, NxtSPR can pre-calculate downstream nodes forwarding ports for packets by using a concise judgment strategy. This significantly reduces the computational overhead required for data transmission while optimizing the pipeline design of routers to decrease packet transmission latency and power consumption compared to other TriBA routing algorithms. We group the data transmissions according to the levels of maximum Hamiltonian edges a packet will traverse during its transmission life cycle. Independent data transmissions between groups can avoid mutual interference and resource competition, eliminating potential deadlocks. Gem5 simulation results show that, under the synthetic traffic patterns, compared to the representative (Table) and up-to-date (SPR4T) routing algorithms, NxtSPR achieves a 20.19%, 14.76%, and 5.54%, 4.66% reduction in average packet latency and per-packet power consumption, respectively. Moreover, it has an average of 18.50% and 4.34% improvement in throughput, as compared to them. PARSEC benchmark results show that NxtSPR reduces application runtime by up to a maximum of 22.30% and 12.82% compared to Table and SPR4T, and running the same applications with TriBA results in a maximum runtime reduction of 10.77% compared to 2D-Mesh.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103094"},"PeriodicalIF":2.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141785971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation 多 GPU 3D k 最近邻计算在 ICP、点云平滑和法线计算中的应用
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-07-02 DOI: 10.1016/j.parco.2024.103093
Alexander Agathos , Philip Azariadis
{"title":"Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation","authors":"Alexander Agathos ,&nbsp;Philip Azariadis","doi":"10.1016/j.parco.2024.103093","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103093","url":null,"abstract":"<div><p>The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference set R according to a query set of points Q under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on the utilization of multiple Graphical Processing Units for the acceleration of the k-Nearest Neighbors algorithm with large or very large sets of 3D points. With the proposed approach the space of the reference set is divided into a 3D grid which is used to facilitate the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner starting from a high-resolution grid and ending up in a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point algorithm, to the point cloud smoothing, and to the point cloud normal vectors computation and orientation problem. A series of tests and experiments have been conducted and discussed in the paper showing the merits of the proposed multi-GPU approach.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103093"},"PeriodicalIF":2.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel WBSP:利用工作繁忙同步并行技术解决分布式机器学习中的落后问题
IF 2 4区 计算机科学
Parallel Computing Pub Date : 2024-06-29 DOI: 10.1016/j.parco.2024.103092
Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You
{"title":"WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel","authors":"Duo Yang ,&nbsp;Bing Hu ,&nbsp;An Liu ,&nbsp;A-Long Jin ,&nbsp;Kwan L. Yeung ,&nbsp;Yang You","doi":"10.1016/j.parco.2024.103092","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103092","url":null,"abstract":"<div><p>Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103092"},"PeriodicalIF":2.0,"publicationDate":"2024-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer 用两种不同方法扩展 LR-TDDFT 的极限:数值算法和新型 Sunway 异构超级计算机
IF 1.4 4区 计算机科学
Parallel Computing Pub Date : 2024-05-04 DOI: 10.1016/j.parco.2024.103085
Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang
{"title":"Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer","authors":"Qingcai Jiang ,&nbsp;Zhenwei Cao ,&nbsp;Xinhui Cui ,&nbsp;Lingyun Wan ,&nbsp;Xinming Qin ,&nbsp;Huanqi Cao ,&nbsp;Hong An ,&nbsp;Junshi Chen ,&nbsp;Jie Liu ,&nbsp;Wei Hu ,&nbsp;Jinlong Yang","doi":"10.1016/j.parco.2024.103085","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103085","url":null,"abstract":"<div><p>First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback in TDDFT calculations is its ultrahigh computational cost <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>5</mn></mrow></msup><mo>∼</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>6</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and large memory usage <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>4</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> especially for plane-wave basis set, confining its applications to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different aspects: (1) numerical algorithms on the X86 supercomputer and (2) optimizations on the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different computation steps. By utilizing these two different methods, our implementation can gain an overall speedup of 10x and 80x and efficiently scales to large systems up to 4096 and 2744 atoms within dozens of seconds.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103085"},"PeriodicalIF":1.4,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信