{"title":"Parallel multi-view HEVC for heterogeneously embedded cluster system","authors":"Seo Jin Jang, Wei Liu, Wei Li, Yong Beom Cho","doi":"10.1016/j.parco.2022.102948","DOIUrl":"10.1016/j.parco.2022.102948","url":null,"abstract":"<div><p>In this paper, we present a computer cluster with heterogeneous computing components intended to provide concurrency and parallelism with embedded processors to achieve a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous multi-view video coding standard (MVC); however, it also has higher computational complexity. To this point, research using MV-HEVC has had to rely on the Central Processing Unit (CPU) of a Personal Computer (PC) or workstation for decompression, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC) and because decompressors need higher parallelism to decompress in real time. It is particularly difficult to encode/decode on an embedded device. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder based on a heterogeneously distributed embedded system. To this end, we use a parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node, at both the group-of-pictures (GOP) and coding-tree-unit (CTU) levels. To assign tasks appropriately to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to a single device (3D-HTM, single-threaded), the proposed distributed MV-HEVC decoder and encoder performance increased approximately 20.39 and 68.7 times, respectively, on 20 devices (multithreaded) at the CTU level for 1088p video. Further, at the proposed GOP level, decoder and encoder performance on 20 devices (multithreaded) increased approximately 20.78 and 77 times, respectively, for 1088p video with heterogeneously distributed computing compared to a single device (3D-HTM, single-threaded).</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112","pages":"Article 102948"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74144309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers","authors":"Daniel Bielich, Julien Langou, Stephen Thomas, Kasia Świrydowicz, Ichitaro Yamazaki, Erik G. Boman","doi":"10.1016/j.parco.2022.102940","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102940","url":null,"abstract":"<div><p>The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices, such as the GMRES and Krylov–Schur iterative methods. In the Arnoldi context, the <math><mrow><mi>Q</mi><mi>R</mi></mrow></math> factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112","pages":"Article 102940"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms","authors":"Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, Ole Schütt, Alfio Lazzaro, Hans Pabst, Stephan Mohr, Jürg Hutter, Thomas D. Kühne, Christian Plessl","doi":"10.1016/j.parco.2022.102920","DOIUrl":"10.1016/j.parco.2022.102920","url":null,"abstract":"<div><p>We push the boundaries of electronic structure-based <em>ab-initio</em> molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural-network and machine-learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low- and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision, corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102920"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000242/pdfft?md5=cb708fe8c83694714bb33b45ee473a37&pid=1-s2.0-S0167819122000242-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77029045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial- and time-division multiplexing in CNN accelerator","authors":"Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga","doi":"10.1016/j.parco.2022.102922","DOIUrl":"10.1016/j.parco.2022.102922","url":null,"abstract":"<div><p>With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention from the perspectives of low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To achieve such an accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the FPGA device memory controller transparently preloads and caches the CNN models in the FPGA device memory before the data arrive. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival, using time-division multiplexing of the FPGA device memory. In the latter case, the cost of switching between CNN models is non-negligible, so to achieve real-time performance and high device utilization the system integrates a new scheduling algorithm that accounts for the switching time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved by an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with their waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come first-served and achieves 99.5% resource efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102922"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000254/pdfft?md5=ffbf4f3879d04bf0d1e7a6b33de6606f&pid=1-s2.0-S0167819122000254-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90824046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight","authors":"Jianguo Liang, Rong Hua, Wenqiang Zhu, Yuxi Ye, You Fu, Hao Zhang","doi":"10.1016/j.parco.2022.102893","DOIUrl":"10.1016/j.parco.2022.102893","url":null,"abstract":"<div><p>The Silicon-Crystal application, based on molecular dynamics (MD), is used to simulate the thermal conductivity of a crystal; it adopts the Tersoff potential to simulate the trajectories of the silicon atoms. Building on the OpenACC version, task-pipeline optimization and an interval graph coloring scheduling method are proposed to better address discrete memory accesses and write dependencies. In addition, the code running on the CPEs is vectorized with SIMD instructions to further improve computational performance. After the collaborative OpenACC+Athread development, performance improves by a factor of 16.68 and achieves a 2.34× speedup over the OpenACC version. Moreover, the application scales to 66,560 cores and can simulate 268,435,456 silicon atoms.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102893"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75647516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance","authors":"Yassine Ramdane, Omar Boussaid, Doulkifli Boukraà, Nadia Kabachi, Fadila Bentayeb","doi":"10.1016/j.parco.2022.102918","DOIUrl":"10.1016/j.parco.2022.102918","url":null,"abstract":"<div><p>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that the star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load-balancing techniques have been proposed in the literature. However, some issues remain open, such as decreasing the number of Spark stages and the network I/O of an OLAP query executed on a distributed system. In preceding work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query optimizer can perform the star-join process locally, in a single Spark stage without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve group-by aggregation. To evaluate our approach, we conduct experiments on a 15-node cluster. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102918"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90453784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards scaling community detection on distributed-memory heterogeneous systems","authors":"Nitin Gawande, Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman","doi":"10.1016/j.parco.2022.102898","DOIUrl":"10.1016/j.parco.2022.102898","url":null,"abstract":"<div><p>In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as <em>communities</em> or <em>clusters</em>, such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using <em>modularity</em>. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the <em>Louvain</em> method. Owing to its speed and ability to yield high-quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.</p><p>Distributed multi-GPU systems pose significant challenges and opportunities for the efficient execution of parallel applications. Graph algorithms, in particular, are known to be harder to parallelize on such platforms, owing to irregular memory accesses, low computation-to-communication ratios, and load-balancing problems that are especially hard to address on multi-GPU systems.</p><p>In this paper, we present our ongoing work on a distributed-memory implementation of the Louvain method for heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate performance competitive with an existing distributed-memory CPU-based implementation, up to 3.2<math><mo>×</mo></math> speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19<math><mo>×</mo></math> speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU of a DGX-2 platform, while achieving solution quality comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102898"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000060/pdfft?md5=af2c328e8814f291f58460d2c8138c36&pid=1-s2.0-S0167819122000060-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88658806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-parallel tiled direct solver for dense symmetric indefinite systems","authors":"Zhongyu Shen, Jilin Zhang, Tomohiro Suzuki","doi":"10.1016/j.parco.2022.102900","DOIUrl":"10.1016/j.parco.2022.102900","url":null,"abstract":"<div><p>This paper proposes a direct solver for dense symmetric indefinite linear systems. The program is parallelized via the OpenMP task construct and outperforms existing programs. The proposed solver avoids pivoting, which requires substantial data movement, by preconditioning the factorization with the symmetric random butterfly transformation. The matrix data layout is tiled after the preconditioning to use cache memory more efficiently during factorization. When the input matrices have a low-rank property, an adaptive cross approximation is used to form a low-rank approximation before the update step to reduce the computational load. Iterative refinement is then used to improve the accuracy of the final result. Finally, the performance of the proposed solver is compared with that of various symmetric indefinite linear system solvers to show its superiority.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102900"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85415549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A coarse-grained multicomputer parallel algorithm for the sequential substring constrained longest common subsequence problem","authors":"Vianney Kengne Tchendji, Hermann Bogning Tepiele, Mathias Akong Onabid, Jean Frédéric Myoupo, Jerry Lacmou Zeutouo","doi":"10.1016/j.parco.2022.102927","DOIUrl":"10.1016/j.parco.2022.102927","url":null,"abstract":"<div><p>In this paper, we study the sequential substring constrained longest common subsequence (SSCLCS) problem, which is widely used in the bioinformatics field. Given two strings <math><mi>X</mi></math> and <math><mi>Y</mi></math> with respective lengths <math><mi>m</mi></math> and <math><mi>n</mi></math>, formed on an alphabet <math><mi>Σ</mi></math>, and a constraint sequence <math><mi>C</mi></math> formed by ordered strings <math><mrow><mo>(</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>1</mn></mrow></msup><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo><mo>…</mo><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mi>l</mi></mrow></msup><mo>)</mo></mrow></math> with total length <math><mi>r</mi></math>, the SSCLCS problem is to find the longest common subsequence <math><mi>D</mi></math> of <math><mi>X</mi></math> and <math><mi>Y</mi></math> such that <math><mi>D</mi></math> contains <math><mrow><msup><mrow><mi>c</mi></mrow><mrow><mn>1</mn></mrow></msup><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo><mo>…</mo><mo>,</mo><msup><mrow><mi>c</mi></mrow><mrow><mi>l</mi></mrow></msup></mrow></math> in order. To solve this problem, Tseng et al. proposed a dynamic-programming algorithm that runs in <math><mrow><mi>O</mi><mfenced><mrow><mi>m</mi><mi>n</mi><mi>r</mi><mo>+</mo><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mi>n</mi><mo>)</mo></mrow><mo>|</mo><mi>Σ</mi><mo>|</mo></mrow></mfenced></mrow></math> time. We rely on this work to propose a parallel algorithm for the SSCLCS problem on the Coarse-Grained Multicomputer (CGM) model. We design a three-dimensional partitioning technique for the corresponding dependency graph that reduces the latency time of processors by ensuring that, at each step, the subproblems performed by the processors are small. It also minimizes the number of communications between processors. Our solution requires <math><mrow><mi>O</mi><mfenced><mrow><mfrac><mrow><mi>n</mi><mi>m</mi><mi>r</mi><mo>+</mo><mrow><mo>(</mo><mi>m</mi><mo>+</mo><mi>n</mi><mo>)</mo></mrow><mo>|</mo><mi>Σ</mi><mo>|</mo></mrow><mrow><mi>p</mi></mrow></mfrac></mrow></mfenced></mrow></math> execution time with <math><mrow><mi>O</mi><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow></mrow></math> communication rounds on <math><mi>p</mi></math> processors. Experimental results show that our solution achieves speedups of up to 59.7 on 64 processors, outperforming the CGM-based parallel techniques that have been used to solve similar problems.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102927"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87832997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers","authors":"Kasia Świrydowicz, Eric Darve, Wesley Jones, Jonathan Maack, Shaked Regev, Michael A. Saunders, Stephen J. Thomas, Slaven Peleš","doi":"10.1016/j.parco.2021.102870","DOIUrl":"10.1016/j.parco.2021.102870","url":null,"abstract":"<div><p>The linear equations that arise in interior methods for constrained optimization are sparse, symmetric, and indefinite, and they become extremely ill-conditioned as the interior method converges. These linear systems present a challenge for existing solver frameworks based on sparse LU or <math><msup><mrow><mtext>LDL</mtext></mrow><mrow><mtext>T</mtext></mrow></msup></math> decompositions. We benchmark five well-known direct linear solver packages on CPU- and GPU-based hardware, using matrices extracted from power grid optimization problems. The achieved solution accuracy varies greatly among the packages. None of the tested packages delivers significant GPU acceleration for our test cases. For completeness of the comparison, we include results for MA57, which is one of the most efficient and reliable CPU solvers for this class of problem.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111","pages":"Article 102870"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80695625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}