2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

A Scalable Algorithm for Simulating the Structural Plasticity of the Brain
S. Rinke, Markus Butz-Ostendorf, Marc-André Hermanns, M. Naveau, F. Wolf
DOI: 10.1109/SBAC-PAD.2016.9 (https://doi.org/10.1109/SBAC-PAD.2016.9)
Published: 2016-10-01
Abstract: The neural network in the brain is not hard-wired. Even in the mature brain, new connections between neurons are formed and existing ones are deleted, which is called structural plasticity. The dynamics of the connectome is key to understanding how learning, memory, and healing after lesions such as stroke work. However, with current experimental techniques, even the creation of an exact static connectivity map, which is required for various brain simulations, is very difficult. One alternative is to use simulation based on network models to predict the evolution of synapses between neurons, based on their specified activity targets. This is particularly useful because experimental measurements of the spiking frequency of neurons are more easily accessible and reliable than biological connectivity data. The Model of Structural Plasticity (MSP) by Butz et al. is an example of this approach. However, to predict which neurons connect to each other, the current MSP model computes probabilities for all pairs of neurons, resulting in O(n²) complexity. For large-scale simulations with millions of neurons and beyond, this quadratic term is prohibitive. Inspired by hierarchical methods for solving n-body problems in particle physics, we propose a scalable approximation algorithm for MSP that reduces the complexity to O(n log² n) without any notable impact on the quality of the results. An MPI-based parallel implementation of our scalable algorithm can simulate neuron counts that exceed the state of the art by two orders of magnitude.
Citations: 12
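The hierarchical idea the abstract borrows from n-body methods can be sketched in one dimension. The code below is a generic Barnes-Hut-style approximation, not the authors' MSP algorithm: the Gaussian kernel, leaf size, and opening threshold theta are illustrative stand-ins. A far-away, compact cluster of points is collapsed to its centroid, so each target no longer touches all n sources.

```python
import math

def build_tree(pts, lo, hi, leaf=4):
    # Hierarchical grouping over sorted 1-D positions, Barnes-Hut style.
    seg = pts[lo:hi]
    node = {"com": sum(seg) / len(seg), "n": len(seg),
            "width": seg[-1] - seg[0], "kids": None, "pts": seg}
    if hi - lo > leaf:
        mid = (lo + hi) // 2
        node["kids"] = (build_tree(pts, lo, mid, leaf),
                        build_tree(pts, mid, hi, leaf))
    return node

def kernel(d):
    # Illustrative distance-decaying interaction strength (Gaussian).
    return math.exp(-d * d)

def exact_sum(pts, x):
    # O(n) per target, O(n^2) over all targets.
    return sum(kernel(abs(x - p)) for p in pts)

def approx_sum(node, x, theta=0.3):
    # Treat a far, compact cluster as a single point at its centroid.
    dist = abs(x - node["com"])
    if node["kids"] is None:
        return sum(kernel(abs(x - p)) for p in node["pts"])
    if dist > 0 and node["width"] / dist < theta:
        return node["n"] * kernel(dist)
    left, right = node["kids"]
    return approx_sum(left, x, theta) + approx_sum(right, x, theta)

pts = [i * 0.05 for i in range(200)]   # sorted 1-D "neuron positions"
tree = build_tree(pts, 0, len(pts))
x = 5.0
exact, approx = exact_sum(pts, x), approx_sum(tree, x)
```

Evaluating one target against the tree visits O(log n) clusters instead of all n points, which is how the paper's all-pairs cost drops below quadratic.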
Partitioning GPUs for Improved Scalability
Johan Janzen, D. Black-Schaffer, Andra Hugo
DOI: 10.1109/SBAC-PAD.2016.14 (https://doi.org/10.1109/SBAC-PAD.2016.14)
Published: 2016-10-01
Abstract: To port applications to GPUs, developers need to express computational tasks as highly parallel executions with tens of thousands of threads to fill the GPU's compute resources. However, filling the GPU's resources does not necessarily deliver the best efficiency, as a task may scale poorly when run with enough parallelism to occupy the whole GPU. In this work we investigate how we can improve throughput by co-scheduling poorly-scaling tasks on sub-partitions of the GPU to increase utilization efficiency. We first investigate the scalability of typical HPC tasks on GPUs, and then use this insight to improve throughput by extending the StarPU framework to co-schedule tasks on the GPU. We demonstrate that co-scheduling poorly-scaling GPU tasks accelerates the execution of the critical tasks of a Cholesky factorization and improves the overall performance of the application by 9% across a wide range of block sizes.
Citations: 13
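Why co-scheduling helps can be seen with a toy Amdahl-style makespan model. This is not StarPU code, and the serial fraction and SM counts below are invented: when only part of a task scales with the compute units it is given, running two such tasks concurrently on half the GPU each beats running them back to back on the whole GPU.

```python
def task_time(work, sms, serial_frac=0.6):
    # Amdahl-style model: only (1 - serial_frac) of the work scales
    # with the number of streaming multiprocessors (SMs) it is given.
    return work * (serial_frac + (1.0 - serial_frac) / sms)

FULL_GPU, HALF_GPU = 16, 8   # SM counts, purely illustrative

# Two identical poorly-scaling tasks:
sequential  = 2 * task_time(1.0, FULL_GPU)   # back to back on the whole GPU
coscheduled = task_time(1.0, HALF_GPU)       # concurrently, one half each
```

With these numbers the sequential schedule takes 1.25 time units while the co-scheduled one takes 0.65, since each task barely benefits from the second half of the machine.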
STOMP: Statistical Techniques for Optimizing and Modeling Performance of Blocked Sparse Matrix Vector Multiplication
S. Monteiro, F. Iandola, Daniel Wong
DOI: 10.1109/SBAC-PAD.2016.20 (https://doi.org/10.1109/SBAC-PAD.2016.20)
Published: 2016-10-01
Abstract: Sparse matrix-vector multiplication (SpMV) is the core compute routine of several scientific and commercial codebases. Because of its extremely irregular memory accesses (low temporal locality), indirect memory referencing (low spatial locality), low arithmetic intensity, and the varying non-zero pattern and density of the matrix, SpMV achieves a mere 10% of peak system performance. Because sparse matrices have extremely varied non-zero patterns and densities, the performance of SpMV is hard to predict. Blocking sparse matrices increases arithmetic intensity and spatial locality during SpMV operations, thereby improving SpMV performance. However, selecting an incorrect block size can degrade performance by as much as 70%. In this study, we describe the STOMP approach of using statistical techniques to predict the run time of SpMV in PETSc for new matrices with a mean accuracy of 93.52%. We use these statistical prediction models to guide block size selection and achieve up to 100% of optimal performance, comparable to that attained through exhaustive block size search. Our block size selection produces an average speedup of 55.56% over default SpMV options. On the same set of matrices used in the SPARSITY SpMV framework, STOMP yields a 54.46% speedup while SPARSITY yields a 31.62% speedup over the same default.
Citations: 2
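The blocked layout STOMP chooses among can be illustrated side by side with plain CSR. This sketch is not the PETSc implementation; the 4x4 matrix and block size 2 are hypothetical. In the blocked (BCSR-style) form, each nonzero b-by-b block is stored dense, so the inner loop reuses a contiguous slice of x, which is the locality gain the abstract describes.

```python
def csr_spmv(indptr, indices, data, x):
    # Baseline CSR y = A @ x, one indirect load of x per nonzero.
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

def bcsr_spmv(n, b, blocks, x):
    # Blocked layout: blocks maps (block_row, block_col) to a dense
    # b*b block stored row-major; x is reused across the whole block.
    y = [0.0] * n
    for (bi, bj), blk in blocks.items():
        for r in range(b):
            acc = 0.0
            for c in range(b):
                acc += blk[r * b + c] * x[bj * b + c]
            y[bi * b + r] += acc
    return y

# The same 4x4 matrix in both formats (two nonzero 2x2 blocks):
indptr  = [0, 2, 4, 5, 6]
indices = [0, 1, 0, 1, 2, 3]
data    = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
blocks  = {(0, 0): [1.0, 2.0, 3.0, 4.0], (1, 1): [5.0, 0.0, 0.0, 6.0]}
x = [1.0, 1.0, 1.0, 1.0]
```

Note the trade-off STOMP models: the (1,1) block stores two explicit zeros, so a badly chosen block size inflates both storage and flops.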
Empirical, Analytical Study of Hardware-Based Page Swap in Hybrid Main Memory System
J. Jung, R. Melhem
DOI: 10.1109/SBAC-PAD.2016.21 (https://doi.org/10.1109/SBAC-PAD.2016.21)
Published: 2016-10-01
Abstract: Emerging persistent memories (PM) such as PCM or STT-MRAM promise to make up for the shortcomings of DRAM, which faces scaling limits and wasteful refresh power consumption. Hence, future system memory is anticipated to be a hybrid of DRAM and PM. For such a system to achieve better performance, it is paramount to exploit the heterogeneity of memory access latencies with page swaps that place hot pages in faster, smaller DRAM and cold pages in slower, larger PM. The goal of this paper is to study the impact of a hardware-based page swap in a hybrid memory on application performance. To this end, we propose a simple analytical model that evaluates the profitability of a page swap by considering the distribution ratio of memory requests between the two memories and the varying access latency of each memory. By comparing the outcome of the model to architecture-simulation performance, we show that the proposed model is a useful tool for analyzing the behavior of a page swap. We also propose and evaluate a model-guided, hardware-driven page swap mechanism that regulates page swaps online. Our experimental results show that the model appraises the profitability of a page swap with an accuracy of 90.9% for the studied workloads. Meanwhile, the model-guided page swap improves IPC performance, on average, by 28.9% and 13.3% compared to no-page-swap and static page swap schemes, respectively. In addition, our model-guided page swap reduces the number of page swaps by up to 17.3× over static page swap schemes, further improving performance.
Citations: 1
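A minimal break-even version of such a profitability test might look like the following. The paper's actual model works from the request distribution ratio and varying latencies; the function and all constants here (latencies in cycles, swap cost) are invented for illustration only.

```python
def swap_profit(future_accesses, lat_fast, lat_slow, swap_cost):
    # Cycles saved by serving a hot page's remaining accesses from fast
    # DRAM instead of slow persistent memory, minus the one-time cost
    # of migrating the page pair. Swap only when this is positive.
    return future_accesses * (lat_slow - lat_fast) - swap_cost

# Hypothetical numbers: PM read ~3x DRAM read latency, made-up swap cost.
hot  = swap_profit(500, lat_fast=100, lat_slow=300, swap_cost=20000)
cold = swap_profit(50,  lat_fast=100, lat_slow=300, swap_cost=20000)
```

The point of an online, model-guided mechanism is exactly this gate: the cold page fails the break-even test, so the swap (and its cost) is avoided, which is how the paper cuts swap counts by up to 17.3× versus a static policy.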
Using Balanced Data Placement to Address I/O Contention in Production Environments
Sarah Neuwirth, Feiyi Wang, S. Oral, Sudharshan S. Vazhkudai, James H. Rogers, U. Brüning
DOI: 10.1109/SBAC-PAD.2016.10 (https://doi.org/10.1109/SBAC-PAD.2016.10)
Published: 2016-10-01
Abstract: Designed for capacity and capability, HPC I/O systems are inherently complex and shared among multiple concurrent jobs competing for resources. The lack of centralized coordination and control often renders end-to-end I/O paths vulnerable to load imbalance and contention. With the emergence of data-intensive HPC applications, storage systems face even greater pressure for performance and scalability. This paper proposes to unify two key approaches to tackle the imbalanced use of I/O resources and to achieve an end-to-end I/O performance improvement in the most transparent way. First, it utilizes a topology-aware Balanced Placement I/O method (BPIO) to mitigate resource contention. Second, it takes advantage of the platform-neutral ADIOS middleware, which provides a flexible I/O mechanism for scientific applications. By integrating BPIO with ADIOS, referred to as Aequilibro, we obtain an end-to-end, per-job I/O performance improvement for ADIOS-enabled HPC applications without requiring any code changes. Aequilibro can be applied to almost any HPC platform and is most suitable for systems that lack a centralized file system resource manager. We demonstrate the effectiveness of our integration on the Titan system at Oak Ridge National Laboratory. Our experiments with a synthetic benchmark and a real-world HPC workload show that, even in a noisy production environment, Aequilibro can improve large-scale application performance significantly.
Citations: 8
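A load-only caricature of balanced placement: BPIO is additionally topology-aware, and the targets, sizes, and pre-existing loads below are invented, but the core greedy idea of steering each new write to the least-loaded storage target can be sketched as:

```python
import heapq

def balanced_place(file_sizes, target_loads):
    # Greedy balanced placement: assign each new file to the storage
    # target with the least accumulated load (a min-heap of (load, id)).
    heap = [(load, t) for t, load in enumerate(target_loads)]
    heapq.heapify(heap)
    placement = []
    for size in file_sizes:
        load, t = heapq.heappop(heap)
        placement.append(t)
        heapq.heappush(heap, (load + size, t))
    return placement

# Target 0 is already congested by another job's traffic:
loads = [10, 0, 0]
placement = balanced_place([5, 5, 5], loads)
for t, size in zip(placement, [5, 5, 5]):
    loads[t] += size
```

The congested target is avoided (final loads [10, 10, 5], worst target 10), whereas oblivious round-robin starting at target 0 would push it to 15, the kind of hotspot the paper's transparent integration is meant to prevent.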
Optimisation of a Molecular Dynamics Simulation of Chromosome Condensation
T. Law, Jonny Hancox, Tammy M. K. Cheng, Raphael A. G. Chaleil, Steven A. Wright, P. Bates, S. Jarvis
DOI: 10.1109/SBAC-PAD.2016.24 (https://doi.org/10.1109/SBAC-PAD.2016.24)
Published: 2016-08-13
Abstract: We present optimisations applied to a bespoke biophysical molecular dynamics simulation designed to investigate chromosome condensation. Our primary focus is on domain-specific algorithmic improvements to determining short-range interaction forces between particles, as certain qualities of the simulation render traditional methods less effective. We implement tuned versions of the code for both traditional CPU architectures and the modern many-core architecture found in the Intel Xeon Phi coprocessor and compare their effectiveness. We achieve speed-ups starting at a factor of 10 over the original code, facilitating more detailed and larger-scale experiments.
Citations: 0
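For context on the short-range force problem, the standard baseline that such domain-specific improvements compete with is a cell-list neighbor search. The paper's own tweaks are not reproduced here; this is a textbook 2-D sketch with arbitrary points and cutoff, verified against the brute-force O(n²) pair search.

```python
import random

def cell_list_pairs(pts, cutoff):
    # Bin particles into square cells of side = cutoff, then compare each
    # particle only against particles in its own and the 8 adjacent cells.
    cells = {}
    for i, (x, y) in enumerate(pts):
        cells.setdefault((int(x // cutoff), int(y // cutoff)), []).append(i)
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j:
                            (xi, yi), (xj, yj) = pts[i], pts[j]
                            if (xi - xj) ** 2 + (yi - yj) ** 2 <= cutoff ** 2:
                                pairs.add((i, j))
    return pairs

def brute_force_pairs(pts, cutoff):
    # O(n^2) reference: every pair within the cutoff radius.
    return {(i, j) for i in range(len(pts)) for j in range(i + 1, len(pts))
            if (pts[i][0] - pts[j][0]) ** 2
             + (pts[i][1] - pts[j][1]) ** 2 <= cutoff ** 2}

random.seed(42)
pts = [(random.uniform(0, 5), random.uniform(0, 5)) for _ in range(100)]
```

Because interactions vanish beyond the cutoff, the cell list finds exactly the same pairs while touching only a constant-size neighborhood per particle.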
Parallel Pairwise Correlation Computation on Intel Xeon Phi Clusters
Yongchao Liu, Tony Pan, S. Aluru
DOI: 10.1109/SBAC-PAD.2016.26 (https://doi.org/10.1109/SBAC-PAD.2016.26)
Published: 2016-05-05
Abstract: Co-expression network construction is a critical technique for identifying inter-gene interactions, and usually relies on all-pairs correlation (or a similar measure) computation between gene expression profiles across multiple samples. Pearson's correlation coefficient (PCC) is one widely used technique for gene co-expression network construction. However, all-pairs PCC computation is computationally demanding for large numbers of gene expression profiles, motivating our acceleration of its execution using high-performance computing. In this paper, we present LightPCC, the first parallel and distributed all-pairs PCC computation on Intel Xeon Phi (Phi) clusters. It achieves high speed by exploiting SIMD-instruction-level and thread-level parallelism within each Phi as well as accelerator-level parallelism among multiple Phis. To facilitate balanced workload distribution, we propose a general framework for symmetric all-pairs computation that, for the first time, builds bijective functions between the job identifier and the coordinate space. We have evaluated LightPCC against two CPU-based counterparts: a sequential C++ implementation in ALGLIB and an implementation based on a parallel general matrix-matrix multiplication routine in the Intel Math Kernel Library (MKL), all in double precision, using a set of gene expression datasets. Performance evaluation revealed that with one 5110P Phi and with 16 Phis, LightPCC runs up to 20.6× and 218.2× faster than ALGLIB, and up to 6.8× and 71.4× faster than single-threaded MKL, respectively. In addition, LightPCC demonstrates good parallel scalability in the number of Phis. The source code of LightPCC is publicly available at http://lightpcc.sourceforge.net.
Citations: 14
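One standard bijection between a linear job identifier and the coordinate space of pairs (i, j) with i < j is the triangular-index mapping below; it is in the spirit of the framework the abstract describes, though not necessarily LightPCC's exact formula. Ranks 0 to n(n-1)/2 - 1 can then be striped evenly across workers, with no worker owning a lopsided share of the symmetric matrix.

```python
from math import isqrt

def pair_to_job(i, j, n):
    # Row-major rank of pair (i, j) with 0 <= i < j < n:
    # rows 0..i-1 contribute i*n - i*(i+1)/2 pairs, then the offset in row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def job_to_pair(k, n):
    # Closed-form inverse: count pairs from the end, recover the row i
    # via a triangular-number root, then the column j within that row.
    r = n * (n - 1) // 2 - 1 - k
    i = n - 2 - (isqrt(8 * r + 1) - 1) // 2
    j = k - (i * n - i * (i + 1) // 2) + i + 1
    return i, j
```

Because both directions are closed-form, a worker holding job identifier k can locate its (i, j) tile in O(1) with no shared scheduling state, which is what makes static balanced distribution cheap.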