Workshop Proceedings of the 49th International Conference on Parallel Processing: Latest Publications

Accelerating Forward-Backward Sweep Power Flow Computation on the GPU
Saumya Shah, M. Zarghami, Pınar Muyan-Özçelik
{"title":"Accelerating Forward-Backward Sweep Power Flow Computation on the GPU","authors":"Saumya Shah, M. Zarghami, Pınar Muyan-Özçelik","doi":"10.1145/3409390.3409397","DOIUrl":"https://doi.org/10.1145/3409390.3409397","url":null,"abstract":"In this study, we accelerate power flow computation used in modeling and analysis of electric power distribution systems utilizing the GPU. We use kernels and parallel computation patterns (i.e., segmented scan and reduction) running on the GPU to accelerate a common method that is used to perform power flow computation called “forward-backward sweep”. To evaluate our approach, we compare the GPU-accelerated parallel implementation of this method written in CUDA to the serial implementation that runs on the CPU. We perform our tests on binary power distribution trees that have number of nodes between 1K to 256K. Our results show that the parallel implementation brings up to 3.9x total speedup over the serial implementation. As expected, for the parts of the computation that entirely run on the GPU, larger speedups are achieved as the size of the distribution tree increases. We also provide a discussion on how the topology of the tree would affect the results.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134461910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
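The entry above parallelizes the forward-backward sweep with segmented scans and reductions in CUDA. As a point of reference, here is a minimal serial sketch of the sweep itself on a toy 7-node binary feeder; the tree, impedances, loads, and iteration count are illustrative assumptions, not the paper's data or implementation.

```python
# Serial reference sketch of the forward-backward sweep (not the paper's CUDA code).
# Node 0 is the substation (root); parent[i] gives the parent of node i.
import numpy as np

parent = np.array([-1, 0, 0, 1, 1, 2, 2])         # small binary distribution tree
z_line = np.full(7, 0.01 + 0.02j); z_line[0] = 0  # branch impedance to parent (p.u.)
s_load = np.full(7, 0.10 + 0.05j); s_load[0] = 0  # complex power demand per node (p.u.)
v = np.ones(7, dtype=complex)                     # initial voltage guess (1.0 p.u.)

for _ in range(20):                               # iterate until voltages settle
    # Backward sweep: accumulate branch currents from the leaves toward the root.
    i_branch = np.conj(s_load / v)
    for node in range(6, 1 - 1, -1):
        i_branch[parent[node]] += i_branch[node]
    # Forward sweep: propagate voltage drops from the root toward the leaves.
    for node in range(1, 7):
        v[node] = v[parent[node]] - z_line[node] * i_branch[node]

print(np.abs(v))                                  # per-node voltage magnitudes
```

In the paper's GPU version, the per-level accumulation in the backward sweep is what segmented scan and reduction replace.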
Rumor Has It: Optimizing the Belief Propagation Algorithm for Parallel Processing
Michael Trotter, Timothy Wood, H. H. Huang
{"title":"Rumor Has It: Optimizing the Belief Propagation Algorithm for Parallel Processing","authors":"Michael Trotter, Timothy Wood, H. H. Huang","doi":"10.1145/3409390.3409401","DOIUrl":"https://doi.org/10.1145/3409390.3409401","url":null,"abstract":"By modelling how the probability distributions of individuals’ states evolve as new information flows through a network, belief propagation has broad applicability ranging from image correction to virus propagation to even social networks. Yet, its scant implementations confine themselves largely to the realm of small Bayesian networks. Applications of the algorithm to graphs of large scale are thus unfortunately out of reach. To promote its broad acceptance, we enable belief propagation for both small and large scale graphs utilizing GPU processing. We therefore explore a host of optimizations including a new simple yet extensible input format enabling belief propagation to operate at massive scale, along with significant workload processing updates and meticulous memory management to enable our implementation to outperform prior works in terms of raw execution time and input size on a single machine. Utilizing a suite of parallelization technologies and techniques against a diverse set of graphs, we demonstrate that our implementations can efficiently process even massive networks, achieving up to nearly 121x speedups versus our control yet optimized single threaded implementations while supporting graphs of over ten million nodes in size in contrast to previous works’ support for thousands of nodes using CPU-based multi-core and host solutions. To assist in choosing the optimal implementation for a given graph, we provide a promising method utilizing a random forest classifier and graph metadata with a nearly 95% F1-score from our initial benchmarking and is portable to different GPU architectures to achieve over an F1-score of over 72% accuracy and a speedup of nearly 183x versus our control running in this new environment.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128349285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
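For readers unfamiliar with the algorithm the entry above accelerates, below is a compact sketch of loopy sum-product belief propagation on a tiny pairwise graph. The 4-node cycle, potentials, and synchronous schedule are illustrative assumptions; the paper's input format, memory management, and GPU optimizations are not shown.

```python
# Loopy belief propagation on a 4-node pairwise graph (toy sketch, not the paper's code).
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # undirected cycle of 4 nodes
n_states = 2
unary = np.random.default_rng(0).random((4, n_states)) + 0.1
pairwise = np.array([[0.9, 0.1], [0.1, 0.9]])     # neighboring states prefer to agree

# messages[(i, j)] is the message node i sends to node j, one value per state of j.
messages = {(i, j): np.ones(n_states) for a, b in edges for i, j in [(a, b), (b, a)]}

for _ in range(50):                               # synchronous message updates
    new = {}
    for (i, j) in messages:
        # Product of node i's unary potential and messages from i's other neighbors.
        others = [messages[(k, ii)] for (k, ii) in messages if ii == i and k != j]
        belief_i = unary[i] * np.prod(others, axis=0)
        msg = pairwise.T @ belief_i               # marginalize over the states of i
        new[(i, j)] = msg / msg.sum()             # normalize for numerical stability
    messages = new

beliefs = np.array([unary[i] * np.prod([messages[(k, ii)] for (k, ii) in messages
                                        if ii == i], axis=0) for i in range(4)])
beliefs /= beliefs.sum(axis=1, keepdims=True)
print(beliefs)                                    # estimated marginal per node
```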
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs
{"title":"Scheduling Task-parallel Applications in Dynamically Asymmetric Environments","authors":"J. Chen, Pirah Noor Soomro, M. Abduljabbar, M. Manivannan, M. Pericàs","doi":"10.1145/3409390.3409408","DOIUrl":"https://doi.org/10.1145/3409390.3409408","url":null,"abstract":"Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this work, we study how application-level scheduling techniques can leverage moldability (i.e. flexibility to work as either single-threaded or multithreaded task) and explicit knowledge on task criticality to handle scenarios in which system performance is not only unknown but also changing over time. Our proposed task scheduler dynamically learns the performance characteristics of the underlying platform and uses this knowledge to devise better schedules aware of dynamic performance asymmetry, hence reducing the impact of interference. Our evaluation shows that both criticality-aware scheduling and parallelism tuning are effective schemes to address interference in both shared and distributed memory applications.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"575 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123127486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
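To make the two ideas in the abstract above concrete, here is a toy allocation sketch: tasks are moldable (they can run on one or several cores) and critical-path tasks receive a larger core share. The task set, speedup model, and greedy policy are assumptions for illustration only and do not reflect the paper's scheduler.

```python
# Toy criticality-aware allocation of cores to moldable tasks (illustrative only).
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    work: float        # sequential execution time
    critical: bool     # lies on the critical path?

def runtime(task: Task, cores: int, efficiency: float = 0.85) -> float:
    # Simple moldability model: parallel speedup with diminishing efficiency.
    return task.work / (1 + efficiency * (cores - 1))

def allocate(tasks, total_cores: int) -> dict:
    # Non-critical tasks get one core each; critical tasks split the remaining cores.
    critical = [t for t in tasks if t.critical]
    noncritical = [t for t in tasks if not t.critical]
    spare = max(total_cores - len(noncritical), len(critical))
    per_critical = max(1, spare // max(1, len(critical)))
    return {t.name: (per_critical if t.critical else 1) for t in tasks}

tasks = [Task("lu_panel", 8.0, True), Task("update_a", 3.0, False),
         Task("update_b", 3.0, False), Task("solve", 6.0, True)]
alloc = allocate(tasks, total_cores=8)
for t in tasks:
    print(f"{t.name:10s} cores={alloc[t.name]}  est. time={runtime(t, alloc[t.name]):.2f}")
```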
Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation
Alexander Matz, J. Doerfert, H. Fröning
{"title":"Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation","authors":"Alexander Matz, J. Doerfert, H. Fröning","doi":"10.1145/3409390.3409403","DOIUrl":"https://doi.org/10.1145/3409390.3409403","url":null,"abstract":"GPUs are well-established in domains outside of computer graphics, including scientific computing, artificial intelligence, data warehousing, and other computationally intensive areas. Their execution model is based on a thread hierarchy and suggests that GPU workloads can generally be safely partitioned along the boundaries of thread blocks. However, the most efficient partitioning strategy is highly dependent on the application’s memory access patterns, and usually a tedious task for programmers in terms of decision and implementation. We leverage this observation for a concept that automatically compiles single-GPU code to multi-GPU applications. We present the idea and a prototype implementation of this concept and validate both on a selection of benchmarks. In particular, we illustrate our use of 1) polyhedral compilation to model memory accesses, 2) a runtime library to track GPU buffers and identify stale data, 3) IR transformations for the partitioning of GPU kernels, and 4) a custom preprocessor that rewrites CUDA host code to utilize multiple GPUs. This work focuses on applications with regular access patterns on global memory and the toolchain to fully automatically compile CUDA applications without requiring any user intervention. Our benchmarks compare single-device CUDA binaries produced by NVIDIA’s reference compiler to binaries produced for multiple GPUs using our toolchain. We report speedups of up to 12.4x for 16 Kepler-class GPUs.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
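The core idea in the entry above is splitting a kernel's thread-block grid across GPUs and deriving, for a regular access pattern, which slice of each global buffer every partition touches. The sketch below shows that bookkeeping for a simple linear access; the function names and the affine access it assumes are illustrative, whereas the paper derives this information automatically with polyhedral analysis.

```python
# Partition a kernel's block grid over GPUs and compute the buffer slice each part touches.
def partition_grid(num_blocks: int, block_size: int, num_gpus: int, elems_per_thread: int = 1):
    parts = []
    per_gpu = (num_blocks + num_gpus - 1) // num_gpus      # ceil-divide blocks over GPUs
    for gpu in range(num_gpus):
        first_block = gpu * per_gpu
        last_block = min(num_blocks, first_block + per_gpu)
        if first_block >= last_block:
            break
        # For an access A[blockIdx.x * blockDim.x + threadIdx.x], the touched index
        # range is exactly the partition's thread range (times elems_per_thread).
        lo = first_block * block_size * elems_per_thread
        hi = last_block * block_size * elems_per_thread
        parts.append({"gpu": gpu, "blocks": (first_block, last_block),
                      "buffer_range": (lo, hi)})
    return parts

for p in partition_grid(num_blocks=1024, block_size=256, num_gpus=4):
    print(p)
```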
Feature-preserving Lossy Compression for In Situ Data Analysis
I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson
{"title":"Feature-preserving Lossy Compression for In Situ Data Analysis","authors":"I. Yakushin, Kshitij Mehta, Jieyang Chen, M. Wolf, Ian T Foster, S. Klasky, T. Munson","doi":"10.1145/3409390.3409400","DOIUrl":"https://doi.org/10.1145/3409390.3409400","url":null,"abstract":"The traditional model of having simulations write data to disk for offline analysis can be prohibitively expensive on computers with limited storage capacity or I/O bandwidth. In situ data analysis has emerged as a necessary paradigm to address this issue and is expected to play an important role in exascale computing. We demonstrate the various aspects and challenges involved in setting up a comprehensive in situ data analysis pipeline that consists of a simulation coupled with compression and feature tracking routines, a framework for assessing compression quality, a middleware library for I/O and data management, and a workflow tool for composing and running the pipeline. We perform studies of compression mechanisms and parameters on two supercomputers, Summit at Oak Ridge National Laboratory and Theta at Argonne National Laboratory, for two example application pipelines. We show that the optimal choice of compression parameters varies with data, time, and analysis, and that periodic retuning of the in situ pipeline can improve compression quality. Finally, we discuss our perspective on the wider adoption of in situ data analysis and management practices and technologies in the HPC community.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130585927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
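As a stand-in for the "assess compression quality" stage of the pipeline described above, the sketch below compresses a synthetic field with a crude uniform-quantization plus zlib scheme and reports the compression ratio, PSNR, and maximum pointwise error. The paper evaluates real error-bounded compressors on simulation data; this scheme and the synthetic field are assumptions purely for illustration.

```python
# Crude error-bounded compression plus quality metrics (illustrative, not the paper's compressors).
import numpy as np, zlib

def compress(field: np.ndarray, abs_error: float):
    q = np.round(field / (2 * abs_error)).astype(np.int32)   # quantize within the bound
    return zlib.compress(q.tobytes()), q.shape

def decompress(blob, shape, abs_error: float):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int32).reshape(shape)
    return q.astype(np.float64) * (2 * abs_error)

x = np.linspace(0, 4 * np.pi, 512)
field = np.sin(np.add.outer(x, x))                 # smooth synthetic 512x512 field
blob, shape = compress(field, abs_error=1e-3)
recon = decompress(blob, shape, abs_error=1e-3)

ratio = field.nbytes / len(blob)
mse = np.mean((field - recon) ** 2)
psnr = 10 * np.log10((field.max() - field.min()) ** 2 / mse)
print(f"ratio {ratio:.1f}x, PSNR {psnr:.1f} dB, max error {np.abs(field - recon).max():.2e}")
```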
Developing Checkpointing and Recovery Procedures with the Storage Services of Amazon Web Services
Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond
{"title":"Developing Checkpointing and Recovery Procedures with the Storage Services of Amazon Web Services","authors":"Luan Teylo, R. Brum, L. Arantes, Pierre Sens, Lúcia M. A. Drummond","doi":"10.1145/3409390.3409407","DOIUrl":"https://doi.org/10.1145/3409390.3409407","url":null,"abstract":"In recent years, cloud computing has grown in popularity as they give users easy and almost instantaneous access to different computational resources. Some cloud providers, like Amazon, took advantage of the growing popularity and offered their VMs in some different hiring types: on-demand, reserved, and spot. The last type is usually offered at lower prices but can be terminated by the provider at any time. To deal with those failures, checkpoint and recovery procedures are typically used. In this context, we propose and analyze checkpoint and recovery procedures using three different storage services from Amazon: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS), considering spot VMs. These procedures were built upon the HADS framework, designed to schedule bag-of-tasks applications to spot and on-demand VMs. Our results showed that EBS outperformed the other approaches in terms of time spent on recording a checkpoint. But it required more time in the recovery procedure. EFS presented checkpointing and recovery times close to EBS but with higher monetary costs than the other services. S3 proved to be the best option in terms of monetary cost but required a longer time for recording a checkpoint, individually. However, when concurrent checkpoints were analysed, which can occur in a real application with lots of tasks, in our tests, S3 outperformed EFS in terms of execution time also.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115514201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
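In the spirit of the entry above, here is a minimal S3-based checkpoint/recovery pair using boto3. It requires AWS credentials and an existing bucket; the bucket name, key layout, and pickled task state are placeholder assumptions and not the HADS framework's interface.

```python
# Minimal S3 checkpoint/recovery sketch for a spot-VM task (placeholder names throughout).
import pickle
import boto3

s3 = boto3.client("s3")
BUCKET = "my-checkpoint-bucket"          # assumed to already exist

def checkpoint(task_id: str, state: dict) -> None:
    # One object per task; overwriting the key keeps only the latest checkpoint.
    s3.put_object(Bucket=BUCKET, Key=f"checkpoints/{task_id}.pkl",
                  Body=pickle.dumps(state))

def recover(task_id: str):
    # Called after a spot termination to resume the task from its last checkpoint.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"checkpoints/{task_id}.pkl")
    except s3.exceptions.NoSuchKey:
        return None                      # no checkpoint yet: start from scratch
    return pickle.loads(obj["Body"].read())

state = recover("task-42") or {"iteration": 0, "partial_result": []}
for it in range(state["iteration"], 100):
    state["partial_result"].append(it * it)       # stand-in for real work
    state["iteration"] = it + 1
    if state["iteration"] % 10 == 0:
        checkpoint("task-42", state)
```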
Devise Sparse Compression Schedulers to Enhance FastText Methods
Chen-Ting Chao, Wei-Hsu Chu, Chao-Lin Lee, Jenq-Kuen Lee, Ming-Yu Hung, Hsiang-Wei Sung
{"title":"Devise Sparse Compression Schedulers to Enhance FastText Methods","authors":"Chen-Ting Chao, Wei-Hsu Chu, Chao-Lin Lee, Jenq-Kuen Lee, Ming-Yu Hung, Hsiang-Wei Sung","doi":"10.1145/3409390.3409394","DOIUrl":"https://doi.org/10.1145/3409390.3409394","url":null,"abstract":"In natural language processing(NLP), the general way to understand the meaning of a word is via word embedding. The word embedding training model can convert words into multidimensional vectors and make the words that do not know “meaning” into vectors with “meaning”. Famous word embedding training models, include models such as FastText, Word2Vec, and GloVe. They can train words into vectors and then they are used for further semantic classifications. In this paper, we work on the efficient support for the FastText. FastText is an open source library created by Facebook(FAIR) lab that allows users to learn word embedding and text classification. We focus on the word representation application in FastText, in which general matrix-Vector multiplication(GEMV) is one of the most computationally intensive operations. In this paper, we adjust the software architecture of FastText, and pre-process the pre-trained model offline. In addition, we introduce a new accelerating method with sparse matrix compression in Halide, which improves performance by compressing the matrix. Our support with Halide sparse compression schedulers include hybrid compression schemes and re-ordering methods to improve the performance.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122866417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
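To illustrate what sparse compression buys for the GEMV workload named above, here is a minimal CSR (compressed sparse row) conversion and sparse matrix-vector product in plain Python/NumPy. The matrix, its sparsity, and the sizes are illustrative assumptions; the paper implements its compression schemes and schedules in Halide, not NumPy.

```python
# CSR compression and sparse GEMV sketch (illustrative sizes and sparsity).
import numpy as np

def to_csr(dense: np.ndarray):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz]); col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_gemv(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                       # each row touches only its nonzeros
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
    return y

rng = np.random.default_rng(0)
W = rng.random((1000, 300)) * (rng.random((1000, 300)) < 0.05)   # ~5% nonzeros
x = rng.random(300)
vals, cols, ptrs = to_csr(W)
assert np.allclose(csr_gemv(vals, cols, ptrs, x), W @ x)
print(f"stored {len(vals)} values instead of {W.size}")
```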
Characterizing the Cost-Accuracy Performance of Cloud Applications
Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo
{"title":"Characterizing the Cost-Accuracy Performance of Cloud Applications","authors":"Sunimal Rathnayake, Lavanya Ramapantulu, Y. M. Teo","doi":"10.1145/3409390.3409409","DOIUrl":"https://doi.org/10.1145/3409390.3409409","url":null,"abstract":"Emergence of applications that produce results with different accuracy allows cloud consumers to leverage the advantages of elastic cloud resources and pay-per-use pricing model. However, the trade-off between cost, accuracy and execution time of cloud applications has not been well studied due to multiple challenges. A key challenge faced by a cloud consumer is tuning the application and determining cloud resource configuration that achieves the desired application accuracy among the configuration space. This paper proposes an approach to improve the cost-accuracy performance of cloud applications for a given cost and accuracy. To illustrate our approach, we use two popular convolution neural networks’ (CNN) inference as examples with pruning as a tuning tool for changing the accuracy, and yield several insights. Firstly, we show the existence of multiple degrees of pruning as “sweet-spots”, where inference time and cost can be reduced without losing accuracy. Combining such sweet-spots can halve inference cost and time with one-tenth reduction in accuracy for Caffenet CNN. Secondly, we show that in the large resource configuration space, these “sweet-spots” form the cost-accuracy and time-accuracy Pareto-frontiers whereby a Pareto-optimal configuration can reduce cost and execution time by 55% and 50% respectively for achieving the highest possible inference accuracy. Lastly, to quantify the accuracy performance of cloud applications, we introduce Time Accuracy Ratio (TAR) and Cost Accuracy Ratio (CAR) metrics. We show that using TAR and CAR reduces the time complexity in determining cloud resource configurations from exponential to polynomial-time.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
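The sketch below shows the kind of analysis the entry above describes: given measured (cost, time, accuracy) points for candidate configurations, compute ratio-style metrics and the cost-accuracy Pareto frontier. The configuration data are made up, and the TAR/CAR formulas are plausible readings of the metric names rather than the paper's exact definitions.

```python
# Toy cost-accuracy trade-off analysis over hypothetical configurations.
configs = [  # (name, cost in $, time in s, accuracy in %)
    ("m5.large-prune0",   0.10, 120, 79.0),
    ("m5.large-prune50",  0.06,  70, 78.5),
    ("c5.xlarge-prune0",  0.16,  60, 79.0),
    ("c5.xlarge-prune80", 0.07,  35, 71.0),
]

def tar(time_s, acc):  return time_s / acc   # Time-Accuracy Ratio (assumed form): lower is better
def car(cost, acc):    return cost / acc     # Cost-Accuracy Ratio (assumed form): lower is better

def pareto_cost_accuracy(points):
    # Keep configurations not dominated in (lower cost, higher accuracy).
    front = []
    for name, cost, _, acc in points:
        dominated = any(c2 <= cost and a2 >= acc and (c2, a2) != (cost, acc)
                        for _, c2, _, a2 in points)
        if not dominated:
            front.append(name)
    return front

for name, cost, time_s, acc in configs:
    print(f"{name:20s} TAR={tar(time_s, acc):.3f}  CAR={car(cost, acc):.5f}")
print("cost-accuracy Pareto frontier:", pareto_cost_accuracy(configs))
```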
Exploiting Dynamism in HPC Applications to Optimize Energy-Efficiency
Madhura Kumaraswamy, M. Gerndt
{"title":"Exploiting Dynamism in HPC Applications to Optimize Energy-Efficiency","authors":"Madhura Kumaraswamy, M. Gerndt","doi":"10.1145/3409390.3409399","DOIUrl":"https://doi.org/10.1145/3409390.3409399","url":null,"abstract":"The growing need for computational performance is resulting in an increase in the energy consumption of HPC systems, which is a major challenge to reach Exascale computing. To overcome this challenge, we developed a tuning plugin that targets applications that exhibit dynamically changing characteristics between iterations of the time loop as well as change in the control flow within the time loop itself. To analyze the inter-loop dynamism, we propose features to characterize the behaviour of loops for clustering via DBSCAN and spectral clustering. To save tuning time and costs, we implemented a random search strategy with a Gaussian probability distribution model to test a large number of system configurations in a single application run. The goal is to select the best configurations of the CPU and uncore frequencies for groups of similarly behaving loops, as well as individual instances of regions called within these loops based on their unique computational characteristics. During production runs, the configurations are dynamically switched for different code regions. The results of our experiments for two highly dynamic real-world applications highlight the effectiveness of our methodology in optimizing energy-efficiency.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
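Two steps from the abstract above can be sketched compactly: cluster time-loop iterations by their runtime features with DBSCAN, then draw candidate frequency settings from a Gaussian around the current best for each cluster. The feature values, frequency range, and scoring function below are illustrative assumptions, not the plugin's actual models or measurements.

```python
# Toy DBSCAN clustering of loop features plus Gaussian random search over CPU frequencies.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Per-iteration features, e.g. (compute intensity, memory-bandwidth share).
features = np.vstack([rng.normal((0.8, 0.2), 0.05, (30, 2)),    # compute-bound phase
                      rng.normal((0.2, 0.9), 0.05, (30, 2))])   # memory-bound phase
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(features)

cpu_freqs = np.arange(1.2, 3.1, 0.1)        # GHz, assumed tunable range
def energy_score(freq, mem_share):          # stand-in for a measured energy metric
    return freq ** 2 * (1 - mem_share) + 1.0 / freq

for cluster in sorted(set(labels) - {-1}):
    mem_share = features[labels == cluster, 1].mean()
    best = min(cpu_freqs, key=lambda f: energy_score(f, mem_share))
    # Gaussian random search: sample configurations concentrated near the current best.
    samples = np.clip(rng.normal(best, 0.3, 10), cpu_freqs[0], cpu_freqs[-1])
    print(f"cluster {cluster}: mem share {mem_share:.2f}, best freq {best:.1f} GHz, "
          f"next trials {np.round(samples, 1)}")
```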
Enabling Android NNAPI Flow for TVM Runtime
Ming-Yi Lai, Chia-Yu Sung, Jenq-Kuen Lee, Ming-Yu Hung
{"title":"Enabling Android NNAPI Flow for TVM Runtime","authors":"Ming-Yi Lai, Chia-Yu Sung, Jenq-Kuen Lee, Ming-Yu Hung","doi":"10.1145/3409390.3409393","DOIUrl":"https://doi.org/10.1145/3409390.3409393","url":null,"abstract":"With machine learning on the rise, mobile platforms are striving to offer inference acceleration on edge devices so that related applications can achieve satisfiable performance. With this background, this work aims at interfacing inference on Android with TVM, an inference-focusing compiler for machine learning, and NNAPI, the official neural network API provided by Android. This work presents a flow to integrate NNAPI into TVM-generated inference model with a partition algorithm to determine which parts of the model should be computed on NNAPI and which should not. Conducted experiments show that properly partitioned models can achieve significant speedup using NNAPI when compared to pure TVM-generated CPU inference. In addition, our enable flow potentially benefits both frameworks by allowing them to leverage each other in AI model deployments.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125204430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
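The partition step described in the entry above can be illustrated with a simple greedy pass: walk an operator sequence, group maximal runs of NNAPI-supported operators into offload subgraphs, and leave everything else to the TVM-generated CPU code. The operator list, the supported set, and the minimum-group heuristic are assumptions for the sketch; the real flow operates on TVM's graph IR, not on a flat list.

```python
# Greedy operator partitioning between an NNAPI offload target and the CPU (illustrative).
NNAPI_SUPPORTED = {"conv2d", "depthwise_conv2d", "relu", "add", "avg_pool2d"}

def partition(ops, min_group_size=2):
    groups, current = [], []
    for op in ops:
        if op in NNAPI_SUPPORTED:
            current.append(op)
            continue
        # An unsupported op ends the current run; only offload runs worth the overhead.
        if len(current) >= min_group_size:
            groups.append(("nnapi", current))
        else:
            groups.extend(("cpu", [o]) for o in current)
        groups.append(("cpu", [op]))
        current = []
    if current:
        groups.append(("nnapi" if len(current) >= min_group_size else "cpu", current))
    return groups

model = ["conv2d", "relu", "conv2d", "relu", "softmax", "reshape",
         "depthwise_conv2d", "add", "relu", "argmax"]
for target, subgraph in partition(model):
    print(f"{target:5s} <- {subgraph}")
```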