2011 International Conference on Parallel Processing: Latest Publications

Efficient Energy Management Using Adaptive Reinforcement Learning-Based Scheduling in Large-Scale Distributed Systems

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.18

M. Hussin, Young Choon Lee, Albert Y. Zomaya

Abstract: Energy consumption in large-scale distributed systems, such as computational grids and clouds, has recently gained a lot of attention due to its significant performance, environmental, and economic implications. These systems consume a massive amount of energy not only for powering them but also for cooling them. More importantly, the explosive increase in energy consumption is not linear in resource utilization, as only a marginal percentage of energy is consumed for actual computational work. This energy problem becomes more challenging with the uncertainty and variability of workloads and the heterogeneous resources in those systems. This paper presents a dynamic scheduling algorithm incorporating reinforcement learning for good performance and energy efficiency. This incorporation helps the scheduler observe and adapt to various processing requirements (tasks) and different processing capacities (resources). The learning process of our scheduling algorithm develops an association between the best action (schedule) and the current state of the environment (parallel system). We have also devised a task-grouping technique to help the decision-making process of our algorithm. The grouping technique is adaptive in nature since it incorporates current workload and energy consumption for the best action. Results from our extensive simulations with varying processing capacities and a diverse set of tasks demonstrate the effectiveness of this learning approach.
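
The abstract above describes learning an association between the state of the system and the best scheduling action. The sketch below illustrates that general idea with a tabular Q-learning update and epsilon-greedy exploration. This is an assumed stand-in, not the paper's algorithm: the cost table, the workload classes, the resource names, and every parameter value are hypothetical.

```python
import random

# Hypothetical cost table (NOT from the paper): combined time-plus-energy cost
# of running a task of a given workload class on a given resource type.
COST = {
    ("light", "low_power_node"): 1.0,
    ("light", "high_power_node"): 3.0,
    ("heavy", "low_power_node"): 6.0,
    ("heavy", "high_power_node"): 4.0,
}
STATES = ["light", "heavy"]
ACTIONS = ["low_power_node", "high_power_node"]


def choose_action(q, state, rng, epsilon):
    """Epsilon-greedy: usually exploit the best known action, sometimes explore."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Tabular Q-learning step: move Q(s, a) toward reward + gamma * max Q(s', .)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def train(steps=2000, alpha=0.5, gamma=0.0, epsilon=0.1, seed=0):
    """Learn a state -> action table; gamma=0 judges each decision on its own cost."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = "light"
    for _ in range(steps):
        action = choose_action(q, state, rng, epsilon)
        reward = -COST[(state, action)]   # lower cost means higher reward
        next_state = rng.choice(STATES)   # assumed: workload class arrives at random
        q_update(q, state, action, reward, next_state, alpha, gamma)
        state = next_state
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


policy = train()
print(policy)
```

With these made-up costs the learned table sends light workloads to the low-power node and heavy workloads to the high-power node. Setting `gamma` above zero would make the update weigh future states as well as the immediate cost.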

Citations: 17

Gradient-Based Aggregation in Forest of Sensors (GrAFS)

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.64

R. Prakash, Ehsan Nourbakhsh

Abstract: In several sensing applications the parameter being sensed exhibits a high spatial correlation. For example, if the temperature of a region is being monitored, there are distinct hot and cold spots. The area close to the hot spots is usually warmer than average, with a temperature gradient between the hot and cold spots. We exploit this correlation of sensor data to form a forest of logical trees, with the trees collectively spanning all the sensor nodes. The root of a tree corresponds to a sensor reporting the local peak value. The tree nodes represent the value gradient: each node's sensed value is smaller than that of its parent and greater than that of its children. GrAFS provides a mechanism to maintain some information at the local peaks and the sink. Using this information, the sink can answer several queries either directly or by probing the region of the sensor field that holds the answer. Thus, queries can be answered in a time- and/or bandwidth-efficient manner. The GrAFS approach to data aggregation can easily adapt to changes in the spatial distribution of sensed values, and can also cope with message losses and sensor node failures. Implementation on MICA2 motes and simulation experiments conducted using TinyOS quantify the performance of GrAFS.
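
The forest structure described in the abstract above can be illustrated with a small sketch. It assumes that each sensor picks as its parent the neighbor with the largest reading that exceeds its own, and that readings are distinct. That parent-selection rule, the sensor names, and the readings are hypothetical, since the abstract states only the parent/child ordering and that roots are local peaks.

```python
def build_gradient_forest(values, neighbors):
    """Return {node: parent or None}; a node with no larger-valued neighbor is a root."""
    parent = {}
    for node, value in values.items():
        higher = [n for n in neighbors[node] if values[n] > value]
        # Assumed rule: follow the steepest ascent (neighbor with the largest value).
        parent[node] = max(higher, key=lambda n: values[n]) if higher else None
    return parent


# Hypothetical line of six temperature sensors, each hearing its direct neighbors.
values = {"a": 20, "b": 25, "c": 22, "d": 21, "e": 27, "f": 24}
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}

parent = build_gradient_forest(values, neighbors)
roots = sorted(n for n, p in parent.items() if p is None)
print(parent)
print(roots)  # ['b', 'e'] : the two local peaks
```

Because every child has a strictly smaller reading than its parent, following parent links can never form a cycle, so the result is a forest whose roots are exactly the local peaks.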

Citations: 1

Bloom Filter Performance on Graphics Engines

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.27

Lin Ma, R. Chamberlain, J. Buhler, M. Franklin

Citations: 29

Cache Accurate Time Skewing in Iterative Stencil Computations

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.47

R. Strzodka, Mohammed Shaheen, Dawid Pajak, H. Seidel

Abstract: We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-the-art automatic locality optimizer PluTo achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52% of the stencil peak benchmark). We present results for 2D and 3D domains in double precision, including problems with gigabyte-sized data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark, and results from the literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50% of the stencil peak, thus performing within a factor of two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again; however, our algorithm still gains a large improvement over the other schemes.
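
For readers unfamiliar with the baseline, the sketch below shows one sweep of a naive constant-weight 7-point stencil, the memory-bound kernel the abstract above starts from. It does not implement time skewing. The weights (0.4 and 0.1) and the tiny 3x3x3 grid are hypothetical, chosen only so the result is easy to check by hand. The last lines recompute the 52% share from the two GFLOPS figures quoted in the abstract.

```python
def stencil_sweep(src, c_center=0.4, c_neighbor=0.1):
    """One Jacobi sweep of a constant-weight 7-point stencil over interior cells."""
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    # Copy the grid so boundary cells carry over unchanged.
    dst = [[row[:] for row in plane] for plane in src]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # Each output cell reads itself and its six face neighbors.
                dst[z][y][x] = c_center * src[z][y][x] + c_neighbor * (
                    src[z - 1][y][x] + src[z + 1][y][x]
                    + src[z][y - 1][x] + src[z][y + 1][x]
                    + src[z][y][x - 1] + src[z][y][x + 1]
                )
    return dst


n = 3
grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
grid[1][1][1] = 1.0           # single hot cell in the middle
result = stencil_sweep(grid)
print(result[1][1][1])        # 0.4

# Share of the register-only peak reached by the paper's algorithm,
# using the two figures quoted in the abstract.
peak_share = round(100 * 13.0 / 25.1)
print(peak_share)             # 52
```

Every cell is read seven times per sweep but only a handful of arithmetic operations are done per read, which is why a straightforward implementation is limited by memory bandwidth rather than by compute.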

Citations: 74

Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.82

Kiran Kumar Matam, Kishore Kothapalli

Abstract: Multiplying a sparse matrix with a vector (spmv for short) is a fundamental operation in many linear algebra kernels. Having an efficient spmv kernel on modern architectures such as GPUs is therefore of principal interest. The computational challenges that spmv poses are significantly different from those of dense linear algebra kernels. Recent work in this direction has focused on designing data structures to represent sparse matrices so as to improve the efficiency of spmv kernels. However, as the nature of sparseness differs across sparse matrices, there is no clear answer as to which data structure to use for a given sparse matrix. In this work, we address this problem by devising techniques to understand the nature of the sparse matrix and then choose appropriate data structures accordingly. By using our technique, we are able to improve the performance of the spmv kernel on an Nvidia Tesla GPU (C1060) by up to 80% in some instances, and by about 25% on average, compared to the best results of Bell and Garland [3] on the standard dataset (cf. Williams et al. SC'07) used in recent literature. We also use our spmv in the conjugate gradient method and show an average 20% improvement compared to using the HYB spmv of [3], on the dataset obtained from The University of Florida Sparse Matrix Collection [9].
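
As a reference point for the operation discussed in the abstract above, here is a minimal CPU sketch of spmv over the common CSR (compressed sparse row) layout. It is not the paper's GPU kernel or its data structure. The `row_lengths` helper is a hypothetical example of a sparsity feature one might inspect when choosing a format; the abstract does not say which features the authors use.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += vals[k] * x[col_idx[k]]
    return y


def row_lengths(row_ptr):
    """Nonzeros per row; an assumed sparsity feature, not taken from the paper."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]

y = spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
print(y)                     # [7.0, 4.0, 18.0]
print(row_lengths(row_ptr))  # [2, 1, 2]
```

Rows with very different nonzero counts lead to uneven work per row, which is one reason a single storage format does not suit every matrix.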

Citations: 49

PC-Mesh: A Dynamic Parallel Concentrated Mesh

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.21

Jesús Camacho Villanueva, J. Flich, Antoni Roca, J. Duato

Abstract: We present a novel network-on-chip topology, PC-Mesh (Parallel Concentrated Mesh), suitable for tiled CMP systems. The topology is built using four concentrated mesh (C-Mesh) networks and a new network interface able to inject packets through different networks. The goal of the new combined topology is to minimize the power consumption of the network when running applications exhibiting low traffic rates and to maximize throughput when applications require high traffic rates. Thus, the topology is dynamically adjusted (switching network components on and off) with a proper injection algorithm, adapting itself to the network-on-chip traffic requirements. The PC-Mesh network performs as a C-Mesh network (using one sub-network) when the traffic is low, obtaining large savings in power consumption. When the network load increases, new sub-networks are opened and higher traffic rates are supported, thus providing results comparable to those of the mesh network. Additional benefits of the PC-Mesh network are its degree of fault tolerance and its lower latency in terms of hops. An alternative PC-Mesh version is provided to optimize the degree of fault tolerance. Comparative results with detailed evaluations (in area, power, and delay) are provided for both the network interface and the switches. Results demonstrate that PC-Mesh is able to dynamically adapt to current traffic conditions. Experimental results with a system-level simulation platform (including the application being run and the operating system) are provided. Results show that the PC-Mesh network achieves the same results as the C-Mesh topology, reducing both application execution time and energy consumption by 20% compared with the 2D-Mesh network topology. However, when challenged with higher traffic demands, PC-Mesh outperforms the C-Mesh network by achieving much lower application execution time and lower energy consumption. In some scenarios, execution time is reduced by a factor of 2 and power consumption by 50%.
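
The adaptive behaviour described in the abstract above (one sub-network under light traffic, more as load grows, four in total) can be sketched as a simple threshold rule. The rule itself, the `per_subnet_capacity` value, and the notion of a normalised injection rate are assumptions made for illustration. The abstract does not describe the actual injection algorithm.

```python
import math


def active_subnetworks(injection_rate, per_subnet_capacity=0.25, total_subnets=4):
    """Number of C-Mesh sub-networks to keep powered for a given traffic level."""
    # Hypothetical rule: open just enough sub-networks to carry the offered load.
    needed = math.ceil(injection_rate / per_subnet_capacity)
    # At least one sub-network stays on; never exceed the four that exist.
    return min(total_subnets, max(1, needed))


for rate in (0.0, 0.05, 0.5, 0.6, 1.5):
    print(rate, active_subnetworks(rate))
```

A real controller would also need hysteresis so that sub-networks are not switched on and off repeatedly when the load hovers around a threshold.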

Citations: 5

Virtual Topologies for Scalable Resource Management and Contention Attenuation in a Global Address Space Model on the Cray XT5

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.38

Weikuan Yu, V. Tipparaju, Xinyu Que, J. Vetter

Abstract: Global Address Space (GAS) programming models enable a convenient, shared-memory style addressing model and support completely asynchronous data movement. Their underlying runtime systems face critical challenges in (1) scalably managing resources (such as memory for communication buffers), and (2) gracefully handling unpredictable communication patterns and any associated contention. In this research, we investigate these challenges for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI), on large-scale Cray XT5 systems. We represent the management of communication resources as directed graphs, and propose two new scalable virtual topologies, Meshed Fully Connected Graphs (MFCG) and Cubic Fully Connected Graphs (CFCG), for scalable resource management and contention attenuation. To ensure deadlock-free communication in these multi-dimensional topologies, we design and develop Lowest Dimension First (LDF) forwarding to support fully or partially populated MFCG and CFCG on any number of nodes. We have extensively evaluated the benefits of these virtual topologies on the petascale Jaguar Cray XT5 system at Oak Ridge National Laboratory. Our experimental results demonstrate that MFCG is the most suitable virtual topology because of its benefits in resource management, contention mitigation, and the resulting benefit to scientific applications.
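
The abstract above names Lowest Dimension First (LDF) forwarding without defining it. The sketch below shows one plausible reading, modelled on ordinary dimension-order routing: nodes carry coordinates, all nodes that differ in a single coordinate are assumed to be directly linked, and a message corrects its coordinates starting from dimension 0. This interpretation, the coordinate scheme, and the fully populated grid are assumptions, not the paper's definition.

```python
def ldf_route(src, dst):
    """Nodes visited from src to dst, fixing coordinates from dimension 0 upward."""
    path = [src]
    current = list(src)
    for dim in range(len(src)):
        if current[dim] != dst[dim]:
            # Assumed: one hop suffices, since nodes differing only in this
            # dimension are directly connected.
            current[dim] = dst[dim]
            path.append(tuple(current))
    return path


print(ldf_route((0, 0), (2, 3)))        # [(0, 0), (2, 0), (2, 3)]
print(ldf_route((1, 2), (1, 5)))        # [(1, 2), (1, 5)]
print(ldf_route((0, 1, 2), (3, 1, 0)))  # [(0, 1, 2), (3, 1, 2), (3, 1, 0)]
```

Visiting dimensions in one fixed order is the usual argument for deadlock freedom in dimension-order schemes: a message never returns to a lower dimension, so no cyclic wait between dimensions can form.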

Citations: 3

Checkpoint and Run-Time Adaptation with Pluggable Parallelisation

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.83

Bruno Medeiros, J. Sobral

Abstract: Enabling applications for computational Grids requires new approaches to developing applications that can effectively cope with resource volatility. Applications must be resilient to resource faults, adapting their behaviour to the available resources. This paper describes an approach to application-level adaptation that efficiently supports application-level checkpointing. The key to this work is the concept of pluggable parallelisation, which localises parallelisation issues into multiple modules that can be (un)plugged to match resource availability. This paper shows how pluggable parallelisation can be extended to effectively support checkpointing and run-time adaptation. We present the developed pluggable mechanism that helps the programmer include checkpointing in the base (sequential) code. Based on these mechanisms and on previous work on pluggable parallelisation, our approach is able to automatically add support for checkpointing in parallel execution environments. Moreover, applications can adapt from a sequential execution to a multi-cluster configuration. Adaptation can be performed by checkpointing the application and restarting it in a different mode, or it can be performed at run-time. Pluggable parallelisation intrinsically promotes the separation of software functionality from fault-tolerance and adaptation issues, facilitating their analysis and evolution. The work presented in this paper reinforces this idea by showing the feasibility of the approach and the performance benefits that can be achieved.
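
Application-level checkpointing, as discussed in the abstract above, means the program itself saves the state it needs in order to resume. The sketch below shows the basic save-and-restart cycle for a sequential loop. It does not model pluggable parallelisation, and the file name, the JSON state layout, and the `stop_after` parameter used to simulate a lost resource are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoint(path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, saving progress each step; resume if a checkpoint exists."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)       # restart from the saved state
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return None                # simulated resource failure
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        with open(path, "w") as f:
            json.dump(state, f)        # checkpoint after every completed step
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    first = run_with_checkpoint(ckpt, 10, stop_after=4)  # interrupted after 4 steps
    second = run_with_checkpoint(ckpt, 10)               # resumes at step 4

print(first)   # None
print(second)  # 45
```

The second call repeats none of the first four steps, which is the property that lets an application restart in a different configuration without losing completed work.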

Citations: 1

WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.90

Zheming Xu, S. Di, Weida Zhang, Luwei Cheng, Cho-Li Wang

Citations: 13

Energy-Efficient Cache Coherence Protocols in Chip-Multiprocessors for Server Consolidation

2011 International Conference on Parallel Processing Pub Date : 2011-09-13 DOI: 10.1109/ICPP.2011.44

Antonio García-Guirado, Ricardo Fernández Pascual, Alberto Ros, José M. García

Abstract: As the number of cores in a chip increases, power consumption is becoming a major constraint in the design of chip multiprocessors. At the same time, server consolidation is gaining importance as a way to take advantage of such a number of cores. Our goal is to alleviate this constraint by reducing the power consumption of chip multiprocessors used for consolidated workloads by means of the cache coherence protocol. For this, we statically divide the chip into areas, which allows us to reduce the directory overhead needed to support coherence and to reduce the network traffic. This translates into lower power consumption without performance degradation. Cache coherence is maintained per area and pointers are used to link the areas, thereby achieving isolation among virtual machines and savings in memory requirements. Additionally, the coherence protocol dynamically selects one node per area as responsible for providing the data on a cache miss, thus lessening the average cache miss latency and the traffic among areas. Compared to a highly optimized directory implementation, the leakage power consumption is reduced by 54% and the dynamic power consumption of the caches and the network-on-chip by up to 38% for a 64-tile chip multiprocessor with 4 virtual machines, with no performance degradation.

Citations: 8