ACM International Conference on Computing Frontiers — Latest Publications

Scalable memory registration for high performance networks using helper threads
ACM International Conference on Computing Frontiers · Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016652
Dong Li, K. Cameron, Dimitrios S. Nikolopoulos, B. Supinski, M. Schulz

Abstract: Remote DMA (RDMA) enables high performance networks to reduce data copying between an application and the operating system (OS). However, RDMA operations in some high performance networks require communication memory to be explicitly registered with the network adapter and pinned by the OS. Memory registration and pinning limit the flexibility of the memory system and reduce the amount of memory that user processes can allocate. These issues become more significant on multicore platforms, since registered memory demand grows linearly with the number of processor cores. In this paper we propose a new memory registration/deregistration strategy to reduce registered memory on multicore architectures for HPC applications. We hide the cost of dynamic memory management by offloading all dynamic memory registration and deregistration requests to a dedicated memory management helper thread. We investigate design policies and performance implications of the helper thread approach. We evaluate our framework with the NAS parallel benchmarks, for which our registration scheme significantly reduces the registered memory (23.62% on average and up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. We show that our system enables the execution of problem sizes that could not complete under existing memory registration strategies.

Citations: 5
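The offloading idea described above can be sketched with a request queue drained by one dedicated thread, so that worker threads never block on registration cost. This is only an illustrative sketch: the `register_async`/`deregister_async` names and the `registered` set are stand-ins, not the paper's interface (real code would invoke something like the verbs-layer registration call inside the helper).

```python
import queue
import threading

class RegistrationHelper:
    """Sketch: offload memory (de)registration requests to a single
    helper thread, in the spirit of the paper's approach. All names
    and the in-memory 'registered' set are illustrative."""

    def __init__(self):
        self.requests = queue.Queue()
        self.registered = set()          # stand-in for pinned/registered regions
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            op, region = self.requests.get()
            if op == "stop":
                break
            if op == "register":
                self.registered.add(region)      # real code: adapter registration
            elif op == "deregister":
                self.registered.discard(region)  # real code: deregister + unpin
            self.requests.task_done()

    def register_async(self, region):
        # Caller returns immediately; the helper thread pays the cost.
        self.requests.put(("register", region))

    def deregister_async(self, region):
        self.requests.put(("deregister", region))

    def stop(self):
        self.requests.join()             # wait until all pending requests done
        self.requests.put(("stop", None))
        self.thread.join()
```

The communication path only enqueues; the latency of registration and deregistration is absorbed by the helper, which is what lets reused communication memory skip repeated registration costs.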
Parametrizing multicore architectures for multiple sequence alignment
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016642
S. Isaza, Friman Sánchez, F. Cabarcas, Alex Ramírez, G. Gaydadjiev

Abstract: Sequence alignment is one of the fundamental tasks in bioinformatics. Due to the exponential growth of biological data and the computational complexity of the algorithms used, high performance computing systems are required. Although multicore architectures have the potential of exploiting the task-level parallelism found in these workloads, efficiently harnessing systems with hundreds of cores requires deep understanding of the applications and the architecture. When incorporating large numbers of cores, performance scalability will likely saturate shared hardware resources like buses and memories. In this paper we evaluate the performance impact of various configurations of an accelerator-based multicore architecture with the aim of revealing and quantifying the bottlenecks. Then, we compare against a multicore using general-purpose processors and discuss the performance gap. Our target application is ClustalW, one of the most popular programs for Multiple Sequence Alignment. Different input data sets are characterized and we show how they influence performance. Simulation results show that due to the high computation-to-communication ratio and the transfer of data in large chunks, memory latency is well tolerated. However, bandwidth is critical to achieving maximum performance. Using a 32KB cache configuration with 4 banks can capture most of the memory traffic and therefore avoid expensive off-chip transactions. On the other hand, using a hardware queue for task synchronization allows us to handle a large number of cores. Finally, we show that using a simple load balancing strategy, we can increase performance of general-purpose cores by 28%.

Citations: 3
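The abstract credits a "simple load balancing strategy" with the 28% gain but does not spell it out here; a minimal example of the kind of heuristic in that family is Longest-Processing-Time-first scheduling, sketched below (purely illustrative — not necessarily the paper's strategy).

```python
import heapq

def lpt_schedule(task_costs, n_cores):
    """Longest-Processing-Time-first: sort tasks by decreasing cost and
    always assign the next task to the currently least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]   # (current_load, core_id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(n_cores)}
    for cost in sorted(task_costs, reverse=True):
        load, core = heapq.heappop(heap)
        assignment[core].append(cost)
        heapq.heappush(heap, (load + cost, core))
    makespan = max(sum(tasks) for tasks in assignment.values())
    return assignment, makespan
```

For alignment workloads, per-task costs are roughly predictable from sequence lengths, which is what makes such a static greedy assignment effective.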
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016649
N. Ali, S. Krishnamoorthy, M. Halappanavar, J. Daily

Abstract: Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.

Citations: 14
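The graph-matching machinery the abstract refers to can be illustrated with Kuhn's augmenting-path algorithm for maximum bipartite matching. This is only a sketch of the core primitive: the paper's correlated-failure and memory-balance constraints impose additional structure on which placements are eligible.

```python
def max_bipartite_matching(adj):
    """Kuhn's augmenting-path algorithm. adj[u] lists the right-side
    vertices left vertex u may be matched to (e.g. processors eligible
    to host a given parity block)."""
    match_right = {}                     # right vertex -> matched left vertex

    def try_assign(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can be
            # re-matched elsewhere along an augmenting path.
            if v not in match_right or try_assign(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matched = sum(try_assign(u, set()) for u in adj)
    return matched, match_right
```

A perfect matching here certifies that every parity block has a distinct, eligible host, which is the feasibility question underlying the placement problem.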
Efficient stack distance computation for priority replacement policies
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016607
G. Bilardi, K. Ekanadham, P. Pattnaik

Abstract: The concept of stack distance, applicable to the important class of inclusion replacement policies for the memory hierarchy, makes it possible to efficiently compute the number of misses incurred on a given address trace, for all cache sizes. The concept was introduced by Mattson, Gecsei, Slutz, and Traiger (Evaluation techniques for storage hierarchies, IBM Systems Journal, (9)2:78-117, 1970), together with a Linear-Scan algorithm, which takes time O(V) per access, in the worst case, where V is the number of distinct (virtual) items referenced within the trace. While subsequent work has lowered the time bound to O(log V) per access in the special case of the Least Recently Used policy, no improvements have been obtained for the general case.

This work introduces a class of inclusion policies called policies with nearly static priorities, which encompasses several of the policies considered in the literature. The Min-Tree algorithm is proposed for these policies. The performance of the Min-Tree algorithm is very sensitive to the replacement policy as well as to the address trace. Under suitable probabilistic assumptions, the expected time per access is O(log² V). Experimental evidence collected on a mix of benchmarks shows that the Min-Tree algorithm is significantly faster than Linear-Scan, for interesting policies such as OPT (or Belady), Least Frequently Used (LFU), and Most Recently Used (MRU). As a further advantage, Min-Tree can be parallelized to run in time O(log V) using O(V/log V) processors, in the worst case.

A more sophisticated Lazy Min-Tree algorithm is also developed which achieves O(√log V) worst-case time per access. This bound applies, in particular, to the policies OPT, LFU, and Least Recently/Frequently Used (LRFU), for which the best previously known bound was O(V).

Citations: 8
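For the LRU special case, the Linear-Scan baseline that this paper improves on for other priority policies is easy to reproduce: a reference's stack distance is its depth in the LRU stack, so a cache of C lines hits exactly when the distance is below C. A minimal sketch:

```python
def lru_stack_distances(trace):
    """Mattson et al.'s Linear-Scan specialized to LRU. Returns one stack
    distance per access; cold misses get infinite distance. O(V) worst-case
    time per access, where V is the number of distinct items."""
    stack, distances = [], []
    for item in trace:
        if item in stack:
            depth = stack.index(item)    # 0-based depth = stack distance
            stack.pop(depth)
        else:
            depth = float("inf")         # first reference: cold miss
        distances.append(depth)
        stack.insert(0, item)            # move to top (most recently used)
    return distances
```

From one pass over the trace, the miss count for every cache size C is just the number of distances ≥ C, which is the "all cache sizes at once" property the abstract describes.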
Pruning hardware evaluation space via correlation-driven application similarity analysis
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016610
Rosario Cammarota, A. Kejariwal, P. D'Alberto, Sapan Panigrahi, A. Veidenbaum, A. Nicolau

Abstract: System evaluation is routinely performed in industry to select one amongst a set of different systems to improve performance of proprietary applications. However, a wide range of system configurations is available every year on the market. This makes an exhaustive system evaluation progressively challenging and expensive.

In this paper we propose a novel similarity-based methodology for system selection. Our methodology prunes the set of candidate systems by eliminating those systems that are likely to reduce performance of a given proprietary application. The pruning process relies on applications that are similar to a given application of interest whose performance on the candidate systems is known. This obviates the need to install and run the given application on each and every candidate system.

The concept of similarity we introduce is performance centric. For a given application, we compute the Pearson's correlation between different types of resource stall and cycles per instruction. We refer to the vector of Pearson's correlation coefficients as an application signature. Next, we assess similarity between two applications as Spearman's correlation between their respective signatures. We use the former type of correlation to quantify the association between pipeline stalls and cycles per instruction, whereas we use the latter type of correlation to quantify the association of two signatures, hence to assess similarity, based on the difference in terms of rank ordering of their components.

We evaluate the proposed methodology on three different micro-architectures, viz., Intel's Harpertown, Nehalem and Westmere, using industry-standard SPEC CINT2006. We assess performance centric similarity among applications in SPEC CINT2006. We show how our methodology clusters applications with common performance issues. Finally, we show how to use the notion of similarity among applications to compare the three architectures with respect to a given Yahoo! property.

Citations: 10
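The signature construction described above is straightforward to sketch: one Pearson coefficient per stall type against CPI gives the signature, and Spearman correlation (Pearson applied to ranks; this sketch assumes no tied values, so no tie correction) compares two signatures.

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def signature(stall_series, cpi_series):
    # One Pearson coefficient per stall-type series vs. CPI.
    return [pearson(s, cpi_series) for s in stall_series]

def spearman(sig_a, sig_b):
    """Spearman correlation: Pearson on the rank ordering of the
    components (no tie handling in this sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(sig_a), ranks(sig_b))
```

Two applications whose signature components rank the same stall types in the same order get a Spearman similarity of 1.0, regardless of the coefficients' magnitudes — which is why rank correlation, not raw Pearson, is used between signatures.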
BarrierWatch: characterizing multithreaded workloads across and within program-defined epochs
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016611
Socrates Demetriades, Sangyeun Cho

Abstract: Characterizing the dynamic behavior of a program is essential for optimizing the program on a given system. Once the program's repetitive execution phases (and their boundaries) have been correctly identified, various phase-aware optimizations can be applied. Multithreaded workloads exhibit dynamic behavior that is further affected by the sharing of data and platform resources. As computer systems and workloads become denser and more parallel, this effect will intensify the dynamic variability of the executed workload.

In this work, we introduce a new relaxed concept for a parallel program phase, called epoch. Epochs are defined as time intervals between global synchronization points that programmers insert into their program codes for correct parallel execution. We characterize the behavior of multithreaded workloads across and within epochs and show that epochs have consistent and repetitive behaviors while their boundaries naturally indicate a shift in program behavior. We show that epoch changes can be easily captured at run time without complex monitoring and decision mechanisms and we employ simple run-time techniques to enable epoch-based adaptation. To highlight the efficacy of our approach, we present a case study of an epoch-based adaptive chip multiprocessor (CMP) architecture. We conclude that our approach provides an attractive new framework for lightweight phase-based resource management for future CMPs.

Citations: 7
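An epoch boundary, as defined above, is simply a global synchronization point, so a barrier's completion callback is a natural hook for epoch-granularity monitoring. The sketch below illustrates the idea with Python threads (the paper targets CMP hardware adaptation, not a threading library; the counter stands in for per-epoch statistics).

```python
import threading

epoch_log = []

def on_epoch_boundary():
    # A Barrier's action runs exactly once per crossing, after all threads
    # arrive: the natural point to snapshot stats and trigger adaptation.
    epoch_log.append(f"epoch {len(epoch_log)} ended")

def worker(barrier, n_epochs):
    for _ in range(n_epochs):
        pass                     # per-epoch work would go here
        barrier.wait()           # global synchronization point = epoch boundary

def run(n_threads=4, n_epochs=3):
    barrier = threading.Barrier(n_threads, action=on_epoch_boundary)
    threads = [threading.Thread(target=worker, args=(barrier, n_epochs))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(epoch_log)
```

Because the boundary is already in the program for correctness, no extra phase-detection machinery is needed — which is the "captured at run time without complex monitoring" point of the abstract.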
SoftHV: a HW/SW co-designed processor with horizontal and vertical fusion
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016606
Abhishek Deb, J. M. Codina, Antonio González

Abstract: In this paper we propose SoftHV, a high-performance HW/SW co-designed in-order processor that performs horizontal and vertical fusion of instructions.

SoftHV consists of a co-designed virtual machine (Cd-VM) which reorders, removes and fuses instructions from frequently executed regions of code. On the hardware front, SoftHV implements HW features for efficient execution of the Cd-VM and of the fused instructions. In particular, (1) Interlock Collapsing ALUs (ICALU) are included to execute pairs of dependent simple arithmetic operations in a single cycle, and (2) Vector Load units (VLDU) are added to execute parallel loads.

The key novelty of SoftHV resides in the efficient usage of HW using a Cd-VM in order to provide high performance by drastically cutting down processor complexity. The co-designed processor provides efficient mechanisms to exploit ILP and reduce the latency of certain code sequences.

Results presented in this paper show that SoftHV produces average performance improvements of 85% in SPECFP and 52% in SPECINT, and up to 2.35x, over a conventional four-way in-order processor. For a two-way in-order processor configuration SoftHV obtains improvements in performance of 72% and 47% for SPECFP and SPECINT, respectively. Overall, we show that such a co-designed processor based on an in-order core provides a compelling alternative to out-of-order processors for the low-end domain where high performance at low complexity is a key feature.

Citations: 11
Quantitative analysis of parallelism and data movement properties across the Berkeley computational motifs
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016625
V. Cabezas, Phillip Stanley-Marbell

Abstract: This work presents the first thorough quantitative study of the available instruction-level parallelism, basic-block-granularity thread parallelism, and data movement, across the Berkeley dwarfs/computational motifs. Although this classification was intended to group applications with common computation and (albeit coarse-grained) communication patterns, the applications analyzed exhibit a wide range of available machine-extractable parallelism and data motion within and across dwarfs.

Citations: 4
Increasing power/performance resource efficiency on virtualized enterprise servers
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016615
Emmanuel Arzuaga, D. Kaeli

Abstract: In this work, we analyze the impact that live VM migration has on virtualized data centers in terms of performance and power consumption. We present a metric that captures system efficiency in terms of VM resource usage. This metric is used to create a resource efficiency manager (REM) framework that issues live VM migrations to enhance the efficiency of system resources. We compare our framework to other commercially available solutions and show that we can improve performance by up to 9% while providing a better overall power/performance solution.

Citations: 0
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Pub Date: 2011-05-03 · DOI: 10.1145/2016604.2016608
C. Gou, G. Gaydadjiev

Abstract: One of the major problems with the GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. This results in conflicts at the writeback stage of the in-order pipeline and pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput, by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed elastic pipeline together with the co-designed bank-conflict aware warp scheduling reduces the pipeline stalls by up to 64.0% (with 42.3% on average) and improves the overall performance by up to 20.7% (on average 13.3%) for our benchmark applications, at trivial hardware overhead.

Citations: 20
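The varied latency the abstract describes comes from conflict replays: when the addresses of one shared-memory access map distinct words to the same bank, the access serializes by that degree. The sketch below computes this degree under the usual word-interleaved bank mapping; the bank count and word size are illustrative parameters, not tied to any particular GPU generation.

```python
def bank_conflict_degree(addresses, n_banks=32, word_bytes=4):
    """Serialization degree of one shared-memory access: the largest
    number of DISTINCT words mapped to the same bank. Degree 1 is
    conflict-free; degree k means k replays (the variable latency the
    elastic pipeline decouples from pipeline stalls). Accesses to the
    same word are deduplicated, modeling hardware broadcast."""
    per_bank = {}
    for addr in addresses:
        word = addr // word_bytes
        per_bank.setdefault(word % n_banks, set()).add(word)
    return max(len(words) for words in per_bank.values())
```

Stride-1 word accesses touch 32 different banks (degree 1), while stride-2 accesses fold onto 16 banks (degree 2) — the classic example of why access stride matters for shared-memory performance.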