2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Performance Maximization via Frequency Oscillation on Temperature Constrained Multi-core Processors
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.67
Shi Sha, Wujie Wen, Ming Fan, Shaolei Ren, Gang Quan
Abstract: Multi-core architectures exploit thread/process-level parallelism and so help to lower the power/thermal barrier faced by single-core architectures. Even so, power/thermal issues are still the primary factors limiting high computing performance. In this paper, we study how to maximize the computing performance of multi-core platforms without violating their peak temperature constraint. Because different cores may exhibit different thermal behaviors, we propose to run each core at a different working frequency. For multi-core platforms, we develop a schedule based on two novel concepts: the step-up schedule and the m-Oscillating schedule. We formally prove that the proposed schedule can guarantee the peak temperature constraint for a given multi-core platform. Compared with the traditional exhaustive search-based approach, our approach can reduce the computation time by orders of magnitude and improve the throughput by up to 89%, with an average improvement of 11%.
Citations: 8
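The abstract does not give the thermal model or the exact schedule definitions. The sketch below builds some intuition for why oscillating between two frequencies helps. It assumes a single-core lumped RC thermal model, T' = (P - (T - T_amb)/R)/C, and a hypothetical helper `peak_temperature`. It splits each high/low period into m shorter sub-periods while keeping the same fraction of time at high power. This is our reading of "m-Oscillating", not the paper's formal definition. With the same average power, faster oscillation shrinks the temperature ripple and therefore the peak.

```python
def peak_temperature(p_high, p_low, m, hyperperiod=2.0, duty=0.5,
                     r=1.0, c=1.0, t_amb=25.0, dt=1e-3, n_periods=10):
    """Peak temperature of a toy single-core RC thermal model
    T' = (P - (T - t_amb) / r) / c under an m-oscillating schedule.

    Assumption: the hyperperiod is split into m identical sub-periods. Each
    sub-period runs at high power for `duty` of its length and at low power
    for the rest, so the share of time at high power (and therefore the
    throughput) does not depend on m.
    """
    sub = hyperperiod / m
    steps = int(round(n_periods * hyperperiod / dt))
    temp = peak = t_amb
    for k in range(steps):
        phase = ((k * dt) % sub) / sub           # position inside the sub-period, in [0, 1)
        power = p_high if phase < duty else p_low
        temp += dt * (power - (temp - t_amb) / r) / c  # forward-Euler integration step
        peak = max(peak, temp)
    return peak
```

Take 60 versus 20 power units with the default parameters. Going from m = 1 to m = 8 lowers the peak from roughly 74 to roughly 66 in these arbitrary units, while the time spent at high power stays the same. The real problem adds multiple cores with heterogeneous thermal coupling, and the paper handles that with its step-up construction.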
Sparse Matrix Format Selection with Multiclass SVM for SpMV on GPU
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.64
Akrem Benatia, Weixing Ji, Yizhuo Wang, Feng Shi
Abstract: The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many GPU implementations of this kernel, each based on a different sparse format, have been proposed recently. The performance of these formats varies significantly with the sparsity characteristics of the input matrix and with the hardware specifications, so no single format is the best choice for every sparse matrix. In this paper, we use a machine learning approach to select the best representation for a given sparse matrix on the GPU. First, we present some interesting, easy-to-compute features for characterizing sparse matrices on the GPU. Second, we use a multiclass Support Vector Machine (SVM) classifier to select the best format for each input matrix. We consider four popular formats (COO, CSR, ELL, and HYB), but our work can be extended to support more sparse representations. Experimental results on two different GPUs (Fermi GTX 580 and Maxwell GTX 980 Ti) show that our approach achieves more than 98% of the performance attainable with a perfect selection.
Citations: 53
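The abstract does not list its feature set. As an illustration of the kind of cheap, structure-only features that could feed a format classifier, the sketch below derives row-length statistics from a CSR row-pointer array. The feature names and the hypothetical `csr_features` helper are our assumptions. The ELL fill ratio is included because ELL pads every row to the longest one, which is a well-known weakness for irregular matrices.

```python
import math


def csr_features(row_ptr):
    """Cheap structural features of a CSR matrix, computed from its row-pointer array only.

    Illustrative feature set (an assumption, not the paper's exact list).
    """
    n_rows = len(row_ptr) - 1
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)]  # nonzeros per row
    nnz = row_ptr[-1]
    mean = nnz / n_rows
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n_rows)
    max_len = max(lengths)
    return {
        "n_rows": n_rows,
        "nnz": nnz,
        "mean_row_len": mean,
        "std_row_len": std,
        "max_row_len": max_len,
        # ELL stores n_rows * max_len slots; values well above 1.0 mean padding waste.
        "ell_fill_ratio": n_rows * max_len / nnz if nnz else float("inf"),
    }
```

A feature vector like this would then be passed to a multiclass classifier (an SVM in the paper) trained on matrices labelled with their fastest measured format.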
In Situ Storage Layout Optimization for AMR Spatio-temporal Read Accesses
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.53
Houjun Tang, S. Byna, Steve Harenberg, Wenzhao Zhang, Xiaocheng Zou, Daniel F. Martin, Bin Dong, D. Devendran, Kesheng Wu, D. Trebotich, S. Klasky, N. Samatova
Abstract: Analyses of large simulation data often concentrate on regions in space and time that contain important information. When simulations adopt Adaptive Mesh Refinement (AMR), the data records from a region of interest can be widely scattered on storage devices, and accessing such regions results in significantly reduced I/O performance. In this work, we study how block-structured AMR data is organized on storage in order to improve the performance of spatio-temporal data accesses. AMR has a complex, hierarchical, multi-resolution data structure that does not fit easily with existing approaches, which focus on uniform mesh data. To enable efficient AMR read accesses, we develop an in situ data layout optimization framework. Our framework uses a performance model to automatically select one layout from a set of candidates, and it reorganizes the data before writing it to storage. We evaluate this framework with three AMR datasets and with access patterns derived from scientific applications. Our performance model identifies the best layout scheme and yields up to a 3X read performance improvement compared to the original layout. Not all read accesses can be turned into contiguous reads, but the optimized layouts still achieve 90% of contiguous read throughput on average.
Citations: 4
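The abstract does not describe the candidate layouts or the cost model. The sketch below is a toy stand-in for "select a layout with a performance model". It assumes a uniform 2D grid of blocks instead of a real AMR hierarchy, two hypothetical candidate layouts (row-major and Z-order), and a cost equal to the number of contiguous read runs each query needs. The helper names `read_runs` and `pick_layout` are illustrative.

```python
def morton(x, y, bits=16):
    """Z-order (Morton) index: interleave the bits of x and y."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code


# Hypothetical candidate layouts: each maps a block (x, y) to a sort key.
LAYOUTS = {
    "row-major": lambda blk: (blk[1], blk[0]),
    "z-order": lambda blk: morton(*blk),
}


def read_runs(layout_key, blocks, query):
    """Contiguous runs needed to read `query` when `blocks` are stored in layout order."""
    order = sorted(blocks, key=layout_key)
    offset = {blk: i for i, blk in enumerate(order)}   # block -> position on storage
    positions = sorted(offset[blk] for blk in query)
    if not positions:
        return 0
    # Every gap between consecutive positions starts a new run (a new seek).
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)


def pick_layout(layouts, blocks, queries):
    """Choose the layout with the fewest total read runs over the expected queries."""
    return min(layouts, key=lambda name: sum(read_runs(layouts[name], blocks, q) for q in queries))
```

On an 8x8 grid, an aligned 4x4 tile reads as 1 run under Z-order but as 4 runs under row-major, while a full row behaves the opposite way. The best layout therefore depends on the access pattern, which is why the framework selects among candidates instead of fixing a single layout.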
RRect: A Novel Server-centric Data Center Network with High Availability
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.12
Zhenhua Li, Yuanyuan Yang
Abstract: In this paper, we propose RRect, a novel server-centric network for data centers. Compared to existing server-centric networks, RRect has a diameter that is linear in the network order and abundant parallel paths of near-equal length, so traffic in RRect enjoys short and predictable communication latency. We present an efficient routing algorithm that finds paths between any pair of servers in RRect. We also provide a complete addressing scheme and a recursive RRect construction procedure. To meet today's stringent high-availability requirements, RRect can be configured into a redundancy and failover scheme, which existing server-centric network structures do not support. In this scheme, a backup server can fully take the place of the corresponding malfunctioning server without losing topological advantages such as multiple near-equal parallel paths. Our comprehensive simulations show that RRect gives shorter average path lengths and a more balanced path distribution among all pairs of servers. RRect also matches BCube on many critical metrics, including short diameter and excellent aggregate throughput. All these features make RRect a very practical structure for enterprise data center network products.
Citations: 2
AppBag: Application-Aware Bandwidth Allocation for Virtual Machines in Cloud Environment
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.10
Dian Shen, Junzhou Luo, Fang Dong, Junxue Zhang
Abstract: Allocating network bandwidth to virtual machines (VMs) that host communication-intensive applications is challenging. Because the hosted applications vary over time and space, it is crucial to decide how much bandwidth to reserve for each VM and when to adjust it. Prior approaches typically predict the applications' network demands and then either place the VMs once and for all or migrate them periodically. However, recent works concede that the network demands of applications can only be accurately derived right before each execution phase. In this paper, we propose AppBag, an Application-aware Bandwidth guarantee framework that allocates bandwidth to VMs using only one-step-ahead information. We then propose an efficient VM migration algorithm that adjusts the bandwidth allocation and the corresponding VM placement as network demands vary in future execution phases. We further implement AppBag with OpenStack and deploy it on the testbed environment in our data center. Extensive evaluations using popular applications show that AppBag can handle bandwidth requests at run time while improving applications' performance and reducing the global traffic in the data center fabric.
Citations: 11
Fast RFID Polling Protocols
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.42
Jia Liu, Bin Xiao, Xuan Liu, Lijun Chen
Abstract: Polling is a widely used anti-collision protocol that interrogates RFID tags in a request-response manner. In conventional polling, the reader must broadcast 96-bit tag IDs to distinguish each tag from the others, which leads to long interrogation delay. This paper takes the first step toward fast polling protocols that work by shortening the polling vector. We first propose an efficient Hash Polling Protocol (HPP) that queries each tag using hash indices, rather than tag IDs, as the polling vector. This reduces the polling vector from 96 bits to no more than log(n) bits, where n is the number of tags. We then propose an enhanced HPP (EHPP) that is both more efficient and more stable with respect to the number of tags. To avoid redundant transmissions in both HPP and EHPP, we finally propose a Tree-based Polling Protocol (TPP). TPP constructs and broadcasts a polling tree, so it keeps the invariant portion of the polling vector and updates only the discrepancy. Theoretical analysis shows that the average length of the polling vector in TPP levels off at only 3.44 bits, about 28 times shorter than a 96-bit tag ID. We also apply our protocols to collecting tag information, and simulation results demonstrate that our best protocol, TPP, outperforms the state-of-the-art information collection protocol.
Citations: 17
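The bit-count argument in the abstract can be checked with a few lines of arithmetic. This sketch assumes that log(n) means log base 2 and that each of n tags is given a distinct, collision-free index. How HPP actually obtains such indices is not described in the abstract. The helpers `hash_index_bits` and `reader_bits` are hypothetical.

```python
import math

ID_BITS = 96  # tag ID length broadcast per poll in conventional polling


def hash_index_bits(n_tags):
    """Bits needed to give each of n tags a distinct index: ceil(log2 n), at least 1."""
    return max(1, math.ceil(math.log2(n_tags)))


def reader_bits(n_tags, vector_bits):
    """Total polling-vector bits the reader broadcasts to poll every tag once."""
    return n_tags * vector_bits
```

For 1,000 tags, this gives 10-bit indices, so the reader broadcasts 10,000 bits instead of 96,000. The TPP figure also checks out: 96 / 3.44 is about 27.9, which matches the "about 28 times" in the abstract.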
Declarative Tuning for Locality in Parallel Programs
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.58
S. Chatterjee, Nick Vrvilo, Zoran Budimlic, K. Knobe, Vivek Sarkar
Abstract: Optimized placement of data and computation for locality is critical for improving performance and reducing energy consumption on modern computing systems. However, in most programming models, changing where data and computation are placed typically requires rewriting large portions of the application. This poses a huge performance portability challenge in today's rapidly evolving architecture landscape. In this paper we present TunedCnC, a novel, declarative, and flexible CnC tuning framework. TunedCnC controls the spatial and temporal placement of data and computation through hierarchical affinity groups and distribution functions. It emphasizes a separation of concerns: the domain expert specifies a parallel application by defining data and control dependences, while the tuning expert specifies how the application should be executed on a given architecture, that is, when and where data and computation are placed. The application remains unchanged when tuned for a different platform or toward different performance goals. We evaluate the utility of TunedCnC on several applications and demonstrate that varying the tuning specification can have a significant impact on an application's performance. Our evaluation uses an implementation of the Concurrent Collections (CnC) declarative parallel programming model, but our results should also apply to tuning other data-flow task-parallel programming models.
Citations: 7
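To illustrate the separation of concerns described above, here is a minimal sketch that is deliberately not TunedCnC's syntax or API. The application step is written once, and interchangeable distribution functions (tag to place) play the role of the tuning specification. All names (`step`, `blocked`, `cyclic`, `run`) are hypothetical, and execution is simulated sequentially.

```python
from collections import defaultdict


# Domain expert: the application step, written once and never modified by tuning.
def step(tag):
    """Hypothetical step body: compute a value for one tag."""
    return tag * tag


# Tuning expert: interchangeable distribution functions mapping a tag to a place.
def blocked(n_tags, n_places):
    size = -(-n_tags // n_places)  # ceiling division: tags per place
    return lambda tag: tag // size


def cyclic(n_places):
    return lambda tag: tag % n_places


def run(tags, place_of):
    """Run every step (sequentially here), recording where the tuning spec placed it."""
    placement = defaultdict(list)
    results = {}
    for tag in tags:
        placement[place_of(tag)].append(tag)
        results[tag] = step(tag)
    return dict(placement), results
```

Swapping `blocked(8, 4)` for `cyclic(4)` changes only where each tag runs, while `step` and the computed results stay identical. This is the property the abstract highlights for retuning on a new platform.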
On the Impact of Widening Vector Registers on Sequence Alignment
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.65
J. Daily, A. Kalyanaraman, S. Krishnamoorthy, Bin Ren
Abstract: Vector extensions such as SSE have been part of the x86 architecture since the 1990s and are used in graphics, signal processing, and scientific applications. Many algorithms and applications benefit naturally from automatic vectorization. Many others remain difficult to vectorize because they depend on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint with complex data dependencies. In this paper, we demonstrate that the trend toward wider vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. We present a practically efficient SIMD implementation of a parallel-scan-based sequence alignment algorithm that can better exploit wider SIMD units. We conduct comprehensive workload and use-case analyses to characterize the relative behavior of the striped and scan approaches, and we identify the best choice of algorithm based on input length and SIMD width.
Citations: 2
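For context on the "striped data layouts" mentioned above, the sketch below shows the query layout popularized by Farrar's striped Smith-Waterman. It assumes one query residue per SIMD lane. With p lanes and a query of length L, the query is cut into segments of length ceil(L/p), and position i goes to vector i mod seg_len, lane i div seg_len. `striped_layout` is a hypothetical helper that only computes the index mapping, not an alignment kernel.

```python
import math


def striped_layout(query_len, lanes):
    """Map query positions to SIMD vectors in a Farrar-style striped layout.

    Returns seg_len vectors; entry [v][lane] is the query index stored there,
    or None for padding.
    """
    seg_len = math.ceil(query_len / lanes)  # number of vectors per DP column
    layout = [[None] * lanes for _ in range(seg_len)]
    for i in range(query_len):
        # Consecutive positions run down a segment rather than across lanes,
        # so neighbouring cells of a column land in different vectors.
        layout[i % seg_len][i // seg_len] = i
    return layout
```

Widening the registers shortens each segment. For example, a 100-residue query needs 13 vectors with 8 lanes but only 2 vectors with 64 lanes, and the second case leaves 28 padding slots. Farrar's correction pass for vertical gap values may also have to ripple values across lanes, which plausibly adds to the cost of wider units. The paper quantifies these effects, and this sketch does not.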
RCHC: A Holistic Runtime System for Concurrent Heterogeneous Computing
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.31
Jinsu Park, Woongki Baek
Abstract: Concurrent heterogeneous computing (CHC) is rapidly emerging as a promising solution for high-performance and energy-efficient computing. Efficient CHC poses two fundamental challenges. The first is how to partition the target application's workload across the devices in the underlying CHC system. The second is how to control each device's operating frequency so as to maximize overall efficiency. There is extensive prior work on system software techniques for CHC. However, efficient runtime support for CHC that robustly handles both functional and performance heterogeneity without extensive offline profiling remains unexplored. To bridge this gap, we propose RCHC, a holistic runtime system for concurrent heterogeneous computing. RCHC dynamically profiles the target application and builds performance and power estimation models from runtime information. Guided by these models, RCHC explores the system state space, determines the system state expected to maximize the efficiency of the target application, and executes the application in that state. Our experimental results demonstrate that RCHC significantly outperforms the baseline version that employs the GPU (e.g., 61.0% higher energy efficiency on average). RCHC also achieves efficiency comparable to that of the static best version, which requires extensive offline profiling.
Citations: 3
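As a rough illustration of "partition the workload, then pick the best system state from estimation models", the sketch below makes several simplifying assumptions that are not RCHC's actual models. Each device is characterized by a measured throughput (work units per second). The work is split so that both devices finish at the same time. Each candidate state has a single total power figure, and efficiency means work per joule. The helpers `balanced_split` and `best_state` are hypothetical.

```python
def balanced_split(work, cpu_rate, gpu_rate):
    """Split `work` so the CPU and GPU finish together; rates are work units per second.

    Returns (cpu_share, gpu_share, seconds).
    """
    cpu_share = work * cpu_rate / (cpu_rate + gpu_rate)
    return cpu_share, work - cpu_share, cpu_share / cpu_rate


def best_state(work, states):
    """Pick the state with the highest estimated work per joule.

    `states` holds (label, cpu_rate, gpu_rate, total_power_watts) tuples,
    e.g. one per frequency setting, with rates from runtime profiling (assumed).
    """
    best = None
    for label, cpu_rate, gpu_rate, power in states:
        _, _, seconds = balanced_split(work, cpu_rate, gpu_rate)
        efficiency = work / (power * seconds)  # energy = power * time
        if best is None or efficiency > best[1]:
            best = (label, efficiency)
    return best[0]
```

In this toy, a lower-frequency state can win even though it runs longer, because its lower power more than offsets the extra time. That trade-off is what searching the state space is meant to expose.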
A Scalability Comparison Study of Data Management Approaches for Smart Metering Systems
2016 45th International Conference on Parallel Processing (ICPP) | Pub Date: 2016-08-01 | DOI: 10.1109/ICPP.2016.61
Houssem-Eddine Chihoub, C. Collet
Abstract: Electrical smart grids generate and collect ever more data, most of it coming from smart meters and sensors deployed massively throughout the power grid. As data generation becomes more frequent and volumes keep growing, it is increasingly hard to manage and process these data at smart-grid scale with legacy systems. In this work, we investigate the scalability and performance of different data management approaches for meter data processing. To this end, we conduct a thorough experimental study of various systems: a parallel relational database system, MapReduce-based systems including Hadoop and Spark, and a NoSQL datastore. Our experiments ran on up to 140 nodes of Grid5000 with up to 1.4 TB of meter data. Our results demonstrate that parallel relational systems are better suited to most types of processing on smart meter data, but at the cost of very slow data loading. In contrast, we show that with the appropriate distribution model, data partitioning, and data modeling choices, we achieve very fast and scalable bill computation, the main complex processing task for utility providers.
Citations: 7
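The abstract credits fast bill computation to the choice of distribution model and data partitioning. The sketch below assumes a simple time-of-use tariff and shows why partitioning readings by meter ID suits billing: every meter's readings land in one partition, so partitions can be billed independently and their results merged without combining partial sums. The tariff, the record format `(meter_id, hour, kwh)`, and all helper names are hypothetical.

```python
from collections import defaultdict


def partition_by_meter(readings, n_parts):
    """Route each (meter_id, hour, kwh) reading to a partition chosen by meter ID."""
    parts = [[] for _ in range(n_parts)]
    for meter_id, hour, kwh in readings:
        parts[meter_id % n_parts].append((meter_id, hour, kwh))
    return parts


def bill_partition(part, peak_hours, peak_rate, offpeak_rate):
    """Time-of-use bill for every meter in one partition (assumed tariff model)."""
    totals = defaultdict(float)
    for meter_id, hour, kwh in part:
        rate = peak_rate if hour in peak_hours else offpeak_rate
        totals[meter_id] += kwh * rate
    return dict(totals)


def compute_bills(readings, n_parts, peak_hours, peak_rate, offpeak_rate):
    bills = {}
    for part in partition_by_meter(readings, n_parts):
        # Meter IDs never span partitions, so a plain dict update suffices.
        bills.update(bill_partition(part, peak_hours, peak_rate, offpeak_rate))
    return bills
```

In a distributed system, each partition would live on a different node. Partitioning by a key other than the meter ID, such as by time, would instead require a shuffle or merge step before the bills are final.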