Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (Latest Publications)

Tensor-matrix products with a compressed sparse tensor
Shaden Smith, G. Karypis
DOI: https://doi.org/10.1145/2833179.2833183 | Published: 2015-11-15
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating-point operations by storing a compressed tensor for each dimension of the data; these methods are fast but incur high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating-point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers operation reductions similar to existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state of the art while achieving 81% of its parallel performance on 16 threads.
Citations: 111
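The paper defines CSF precisely; as rough intuition only, the layout generalizes CSR's row-pointer idea to a tree of fibers with one level per tensor mode. The C sketch below is a hypothetical three-mode variant (field names such as fptr, fids, and csf3_mttkrp are mine, not the authors') showing how such a structure can drive the sparse tensor-times-dense-matrix kernel (MTTKRP) that the abstract identifies as the CPD bottleneck.

```c
#include <stddef.h>

/* A minimal CSF-like layout for a 3-mode sparse tensor (hypothetical names).
 * Level 0 = mode-i slices, level 1 = mode-j fibers, level 2 = mode-k nonzeros.
 * fptr[l][n] .. fptr[l][n+1] bound the children of node n at level l,
 * exactly as row_ptr does in CSR. */
typedef struct {
    size_t  nslices;     /* number of non-empty i-slices                     */
    size_t *fptr[2];     /* fptr[0]: slice -> fibers, fptr[1]: fiber -> nnz  */
    size_t *fids[3];     /* fids[0]: i per slice, fids[1]: j per fiber,
                            fids[2]: k per nonzero                           */
    double *vals;        /* one value per nonzero                            */
} csf3;

/* M(i,:) += sum_{j,k} X(i,j,k) * B(j,:) .* C(k,:)   (MTTKRP with rank r).
 * B, C, M are row-major dense matrices with r columns; M must be zeroed
 * by the caller. */
void csf3_mttkrp(const csf3 *X, const double *B, const double *C,
                 double *M, size_t r)
{
    for (size_t s = 0; s < X->nslices; s++) {
        size_t i = X->fids[0][s];
        for (size_t f = X->fptr[0][s]; f < X->fptr[0][s + 1]; f++) {
            size_t j = X->fids[1][f];
            for (size_t n = X->fptr[1][f]; n < X->fptr[1][f + 1]; n++) {
                size_t k = X->fids[2][n];
                double v = X->vals[n];
                for (size_t c = 0; c < r; c++)
                    M[i * r + c] += v * B[j * r + c] * C[k * r + c];
            }
        }
    }
}
```

A production kernel would accumulate v * C(k,:) per fiber and multiply by B(j,:) once per fiber rather than inside the innermost loop; that hoisting is roughly where the operation savings of compressed layouts come from.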
Improving graph partitioning for modern graphs and architectures
Dominique LaSalle, Md. Mostofa Ali Patwary, N. Satish, N. Sundaram, P. Dubey, G. Karypis
DOI: https://doi.org/10.1145/2833179.2833188 | Published: 2015-11-15
Abstract: Graph partitioning is an important preprocessing step in applications dealing with sparse, irregular data. As such, the ability to efficiently partition a graph in parallel is crucial to the performance of these applications. The number of compute cores in a compute node continues to increase, demanding ever more scalability from shared-memory graph partitioners. In this paper we present algorithmic improvements to the multithreaded graph partitioner mt-Metis. We experimentally evaluate our methods on a 36-core machine, using 20 different graphs from a variety of domains. Our improvements decrease the runtime by 1.5-11.7x and improve strong scaling by 82%.
Citations: 46
GAIL: the graph algorithm iron law
S. Beamer, K. Asanović, D. Patterson
DOI: https://doi.org/10.1145/2833179.2833187 | Published: 2015-11-15
Abstract: As new applications for graph algorithms emerge, there has been a great deal of research interest in improving graph processing. However, it is often difficult to understand how these new contributions improve performance. Execution time, the most commonly reported metric, distinguishes which alternative is the fastest but does not give any insight as to why. A new contribution may have an algorithmic innovation that allows it to examine fewer graph edges. It could also have an implementation optimization that reduces communication. It could even have optimizations that allow it to increase its memory bandwidth utilization. More interestingly, a new innovation may simultaneously affect all three of these factors (algorithmic work, communication volume, and memory bandwidth utilization). We present the Graph Algorithm Iron Law (GAIL) to quantify these tradeoffs and help understand graph algorithm performance.
Citations: 7
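The abstract names the three factors without showing how they compose. As a hedged reconstruction (my notation; the paper states the law precisely), an iron-law-style decomposition of execution time along those three axes looks like:

```latex
\text{execution time} \;=\;
\underbrace{\text{edges traversed}}_{\text{algorithmic work}}
\times
\underbrace{\frac{\text{memory requests}}{\text{edge traversed}}}_{\text{communication volume}}
\times
\underbrace{\frac{\text{time}}{\text{memory request}}}_{\text{inverse memory bandwidth utilization}}
```

Reporting all three terms rather than only their product is what lets such a law attribute a speedup to a better algorithm, to less communication, or to better use of memory bandwidth.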
Hybrid memory cube performance characterization on data-centric workloads
M. Gokhale, G. S. Lloyd, C. Macaraeg
DOI: https://doi.org/10.1145/2833179.2833184 | Published: 2015-11-15
Abstract: The Hybrid Memory Cube is an early commercial product embodying attributes of future stacked DRAM architectures, namely large capacity, high bandwidth, an on-package memory controller, and a high-speed serial interface. We study the performance and energy of a Gen2 HMC on data-centric workloads through a combination of emulation and execution on an HMC FPGA board. An in-house FPGA emulator has been used to obtain memory traces for a small collection of data-centric benchmarks. Our FPGA emulator is based on a 32-bit ARM processor and non-intrusively captures complete memory access traces at only 20x slowdown from real time. We have developed tools to run combined trace fragments from multiple benchmarks on the HMC board, giving a unique capability to characterize HMC performance and power usage under a data-parallel workload. We find that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads. Our benchmarks achieve between 66% and 80% of peak bandwidth (80 GB/s for 32-byte packets with a 50-50 read/write mix) on the HMC, suggesting that combined read/write channels might show higher utilization on these access patterns. Bandwidth scales linearly up to saturation with increased demand on highly concurrent application workloads with many independent memory requests. There is a corresponding increase in latency, ranging from 80 ns under an extremely light load to 130 ns at high bandwidth.
Citations: 32
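For concreteness, the quoted fractions of the 80 GB/s peak translate into absolute sustained bandwidth as follows (simple arithmetic on the numbers above):

```latex
0.66 \times 80\ \text{GB/s} \approx 52.8\ \text{GB/s},
\qquad
0.80 \times 80\ \text{GB/s} = 64\ \text{GB/s}
```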
PathFinder: a signature-search miniapp and its runtime characteristics
Aditya M. Deshpande, J. Draper, J. Rigdon, R. Barrett
DOI: https://doi.org/10.1145/2833179.2833190 | Published: 2015-11-15
Abstract: Graphs are widely used in data analytics applications in a variety of fields and are rapidly gaining attention in the computational science and engineering (CSE) application community. An important graph application is binary (executable) signature search, which addresses the possibility of a suspect binary evading signature detection through obfuscation. A control flow graph generated from a binary allows identification of a pattern of system calls, an ordered sequence of which can then be used as a signature in the search. An application proxy, named PathFinder, represents these properties, allowing examination of the performance characteristics of algorithms used in the search. In this work, we describe PathFinder, its signature search algorithm (a modified depth-first recursive search in which a node's adjacent labels are compared before recursing down its edges), and its general performance and cache characteristics. We highlight some important differences between PathFinder and traditional CSE applications. For example, the L2 cache hit ratio in PathFinder (less than 60%) is observed to be substantially lower than those observed for traditional CSE applications.
Citations: 2
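To make the search strategy concrete, here is a minimal, hypothetical C sketch of a label-sequence search over a control flow graph in the spirit described above (the graph layout, the CSR-style adjacency, and names such as signature_search are my assumptions, not PathFinder's actual code): each neighbor's label is compared against the next signature entry before the search recurses down that edge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal adjacency-list graph with one string label per node
 * (hypothetical layout; PathFinder's real data structures differ). */
typedef struct {
    size_t        nnodes;
    const size_t *adj_ptr;   /* CSR-style offsets into adj              */
    const size_t *adj;       /* neighbor node ids                       */
    const char  **label;     /* label (e.g., system-call name) per node */
} graph;

/* Does some path starting at `node` visit nodes whose labels match
 * sig[0..siglen-1] in order?  Each neighbor's label is compared against
 * the next signature entry before recursing down that edge. */
static bool match_from(const graph *g, size_t node,
                       const char **sig, size_t siglen)
{
    if (siglen == 0)
        return true;                               /* signature consumed */
    for (size_t e = g->adj_ptr[node]; e < g->adj_ptr[node + 1]; e++) {
        size_t next = g->adj[e];
        if (strcmp(g->label[next], sig[0]) == 0 &&  /* compare first...  */
            match_from(g, next, sig + 1, siglen - 1)) /* ...then recurse */
            return true;
    }
    return false;
}

/* Search the whole graph for the signature. */
bool signature_search(const graph *g, const char **sig, size_t siglen)
{
    for (size_t v = 0; v < g->nnodes; v++)
        if (siglen > 0 && strcmp(g->label[v], sig[0]) == 0 &&
            match_from(g, v, sig + 1, siglen - 1))
            return true;
    return false;
}
```

Because every recursive step consumes one signature entry, the recursion depth is bounded by the signature length even when the control flow graph contains cycles.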
PL2AP: fast parallel cosine similarity search
D. Anastasiu, G. Karypis
DOI: https://doi.org/10.1145/2833179.2833182 | Published: 2015-11-15
Abstract: Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high-dimensional sparse dataset that have a similarity value higher than a given threshold. The output from this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object after accessing only a few of their non-zero values. The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the problem 1.5x-238x faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.
Citations: 10
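As a reference point for the problem being solved (not for pL2AP's pruning or cache tiling, which the paper describes), the following hypothetical C sketch performs a naive AllPairs cosine search over a CSR-style sparse dataset; pL2AP's contribution is avoiding most of the work this quadratic loop performs.

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Sparse dataset in a CSR-like layout (hypothetical field names).
 * Each row is one vector; column ids are sorted and rows are non-empty. */
typedef struct {
    size_t        nrows;
    const size_t *ptr;   /* ptr[i]..ptr[i+1] bound row i's nonzeros */
    const size_t *idx;   /* sorted feature ids                      */
    const double *val;   /* nonzero values                          */
} csr;

/* Dot product of two sorted sparse rows (merge scan). */
static double spdot(const csr *d, size_t a, size_t b)
{
    size_t i = d->ptr[a], j = d->ptr[b];
    double s = 0.0;
    while (i < d->ptr[a + 1] && j < d->ptr[b + 1]) {
        if (d->idx[i] == d->idx[j])      s += d->val[i++] * d->val[j++];
        else if (d->idx[i] < d->idx[j])  i++;
        else                             j++;
    }
    return s;
}

/* Print every pair of rows whose cosine similarity is at least t. */
void allpairs_naive(const csr *d, double t)
{
    double *norm = malloc(d->nrows * sizeof *norm);
    if (!norm) return;
    for (size_t i = 0; i < d->nrows; i++)
        norm[i] = sqrt(spdot(d, i, i));           /* Euclidean row norms */

    for (size_t a = 0; a < d->nrows; a++)
        for (size_t b = a + 1; b < d->nrows; b++) {
            double sim = spdot(d, a, b) / (norm[a] * norm[b]);
            if (sim >= t)
                printf("(%zu, %zu) %.3f\n", a, b, sim);
        }
    free(norm);
}
```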
Betweenness centrality on multi-GPU systems
M. Bernaschi, Giancarlo Carbone, Flavio Vella
DOI: https://doi.org/10.1145/2833179.2833192 | Published: 2015-11-15
Abstract: Betweenness centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. Exact BC computation for a large-scale graph is extraordinarily challenging and requires high-performance computing techniques to produce results in a reasonable amount of time. Here, we present the techniques we developed to speed up the computation of BC on multi-GPU systems. Our approach combines a bi-dimensional (2-D) decomposition of the graph with multi-level parallelism. Experimental results show that the proposed techniques are well suited to computing BC scores on graphs that are too large to fit in a single GPU's memory. In particular, the computation time for a graph with 234 million edges is reduced to less than 2 hours.
Citations: 6
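For reference, the metric being computed is the standard betweenness centrality (this definition is textbook material, not specific to the paper):

```latex
BC(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},
```

where sigma_st is the number of shortest paths from s to t and sigma_st(v) is the number of those paths passing through v. Exact computation therefore requires a shortest-path search rooted at every vertex, which is what makes large graphs so expensive and motivates the multi-GPU decomposition.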
Generalised vectorisation for sparse matrix: vector multiplication
A. N. Yzelman
DOI: https://doi.org/10.1145/2833179.2833185 | Published: 2015-11-15
Abstract: This work generalises the various ways in which a sparse matrix-vector (SpMV) multiplication can be vectorised. It arrives at a novel data structure that generalises three earlier well-known data structures for sparse computations: the Blocked CRS format, the (sliced) ELLPACK format, and segmented-scan-based formats. The new data structure is relevant because efficient use of new hardware requires the use of increasingly wide vector registers. Normally, the use of vectorisation for sparse computations is limited by bandwidth constraints. In cases where computations are limited by memory latencies instead of memory bandwidth, however, vectorisation can still help performance. The Intel Xeon Phi, appearing as a component in several Top500 supercomputers, displays exactly this behaviour for SpMV multiplication. On this architecture the new generalised vectorisation scheme increases performance by up to 178 percent.
Citations: 10
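The paper's new structure unifies the three formats named above. As background for readers unfamiliar with them, the following hypothetical C sketch shows SpMV over one of those formats, sliced ELLPACK, where rows are grouped into slices of fixed height and each slice is padded and stored column-major so that one vector lane can handle one row of the slice (field names and layout details are my assumptions, not the paper's generalised structure).

```c
#include <stddef.h>

/* Sliced-ELLPACK storage (hypothetical names): rows are grouped into slices
 * of S consecutive rows; each slice is padded to its longest row and stored
 * column-major.  Every slice holds slice_len[s] * S entries, including
 * padding (value 0, any valid column index) and ghost rows in the final
 * slice. */
typedef struct {
    size_t        nrows, S;   /* matrix rows, slice height (= vector width) */
    size_t        nslices;    /* ceil(nrows / S)                            */
    const size_t *slice_ptr;  /* entry offset of each slice in val/col      */
    const size_t *slice_len;  /* padded row length of each slice            */
    const size_t *col;        /* column index per stored entry              */
    const double *val;        /* value per stored entry (0 for padding)     */
} sell;

/* y = A * x.  The caller must zero-initialize y. */
void sell_spmv(const sell *A, const double *x, double *y)
{
    for (size_t s = 0; s < A->nslices; s++) {
        size_t base = A->slice_ptr[s];
        for (size_t c = 0; c < A->slice_len[s]; c++) {
            for (size_t r = 0; r < A->S; r++) {
                size_t row = s * A->S + r;
                if (row >= A->nrows) break;          /* ghost rows in last slice */
                size_t e = base + c * A->S + r;      /* column-major in slice    */
                y[row] += A->val[e] * x[A->col[e]];
            }
        }
    }
}
```

The inner loop over r reads S contiguous entries of val and col, which is the stride-one pattern a compiler or vector intrinsics can map onto a single wide register; the paper's contribution is a structure that recovers this property across Blocked CRS, ELLPACK, and segmented-scan formats at once.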
Data-centric GPU-based adaptive mesh refinement
M. Wahib, N. Maruyama
DOI: https://doi.org/10.1145/2833179.2833181 | Published: 2015-11-15
Abstract: It has been demonstrated that explicit stencil computations for high-resolution schemes can benefit greatly from GPUs. This includes Adaptive Mesh Refinement (AMR), a model for locally adapting the resolution of a stencil grid. Unlike uniform-grid stencils, however, adapting the grid is typically done on the CPU side, which requires transferring the stencil data arrays to and from the CPU every time the grid is adapted. We propose a data-centric approach to GPU-based AMR: porting all the mesh adaptation operations that touch the data arrays to the GPU. This allows the stencil data arrays to reside in GPU memory for the entirety of the simulation, so the GPU code specializes on the data residing in its memory while the CPU specializes on the AMR metadata residing in CPU memory. We compare the performance of the proposed method to a basic GPU implementation and to an optimized GPU implementation that overlaps communication and computation. The performance of two GPU-based AMR applications is improved by 2.21x and 2.83x over the basic implementation.
Citations: 11
A scalable architecture for ordered irregular parallelism
Daniel Sánchez
DOI: https://doi.org/10.1145/2833179.2833193 | Published: 2015-11-15
Abstract: We present a new parallel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, called Swarm, programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Furthermore, Swarm sends tasks to run close to their data whenever possible, reducing data movement. We contribute several new techniques that allow Swarm to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered task commits.
Citations: 0