2008 International Conference on Parallel Architectures and Compilation Techniques (PACT): Latest Publications

Meeting points: Using thread criticality to adapt multicore hardware to parallel regions
Qiong Cai, José González, R. Rakvic, G. Magklis, P. Chaparro, Antonio González
{"title":"Meeting points: Using thread criticality to adapt multicore hardware to parallel regions","authors":"Qiong Cai, José González, R. Rakvic, G. Magklis, P. Chaparro, Antonio González","doi":"10.1145/1454115.1454149","DOIUrl":"https://doi.org/10.1145/1454115.1454149","url":null,"abstract":"We present a novel mechanism, called meeting point thread characterization, to dynamically detect critical threads in a parallel region. We define the critical thread the one with the longest completion time in the parallel region. Knowing the criticality of each thread has many potential applications. In this work, we propose two applications: thread delaying for multi-core systems and thread balancing for simultaneous multi-threaded (SMT) cores. Thread delaying saves energy consumptions by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing non-critical threads. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. Our experiments on a detailed microprocessor simulator with the Recognition, Mining, and Synthesis applications from Intel research laboratory reveal that thread delaying can achieve energy savings up to more than 40% with negligible performance loss. Thread balancing can improve performance from 1% to 20%.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114158041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 94
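
As a rough illustration of the mechanism described in the abstract, here is a minimal software sketch of meeting-point criticality detection: each thread bumps a counter when it crosses the meeting point (e.g., the end of a chunk of the parallel loop), and the thread with the fewest crossings is deemed critical. The paper implements the counters and the frequency/voltage response in hardware, so all names below are illustrative assumptions.

```c
/* Minimal software sketch of meeting-point thread criticality.
 * The paper does this in hardware; these names are illustrative. */
#include <stdatomic.h>

#define NTHREADS 4

static atomic_long visits[NTHREADS]; /* one meeting-point counter per thread */

/* Called by each thread at the meeting point, e.g. after finishing
 * a chunk of the parallel loop. */
void meeting_point(int tid) {
    atomic_fetch_add(&visits[tid], 1);
}

/* Sampled periodically: the thread with the fewest meeting-point
 * crossings is making the slowest progress, i.e. is critical.
 * Thread delaying would scale down frequency/voltage on the other
 * cores; thread balancing would raise this thread's issue priority. */
int find_critical_thread(void) {
    int critical = 0;
    long min = atomic_load(&visits[0]);
    for (int t = 1; t < NTHREADS; t++) {
        long v = atomic_load(&visits[t]);
        if (v < min) { min = v; critical = t; }
    }
    return critical;
}
```
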
(How) can programmers conquer the multicore menace?
Saman P. Amarasinghe
{"title":"(How) can programmers conquer the multicore menace?","authors":"Saman P. Amarasinghe","doi":"10.1145/1454115.1454117","DOIUrl":"https://doi.org/10.1145/1454115.1454117","url":null,"abstract":"The document was not made available for publication as part of the conference proceedings.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123824006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Leveraging on-chip networks for data cache migration in chip multiprocessors
Noel Eisley, L. Peh, L. Shang
{"title":"Leveraging on-chip networks for data cache migration in chip multiprocessors","authors":"Noel Eisley, L. Peh, L. Shang","doi":"10.1145/1454115.1454144","DOIUrl":"https://doi.org/10.1145/1454115.1454144","url":null,"abstract":"Recently, chip multiprocessors (CMPs) have arisen as the de facto design for modern high-performance processors, with increasing core counts. An important property of CMPs is that remote, but on-chip, L2 cache accesses are less costly than off-chip accesses; this is in contrast to earlier chip-to-chip or board-to-board multiprocessors, where an access to a remote node is just as costly if not more so than a main memory access. This motivates on-chip cache migration as a means to retain more data on-chip. However, previously proposed techniques do not scale to high core counts: they do not leverage the on-chip caches of all cores nor have a scalable migration mechanism. In this paper we propose ascalable in-network migration technique which uses hints embedded within the router microarchitecture to steer L2 cache evictions towards free/invalid cache slots in any on-chip core cache, rather than evicting it off-chip. We show that our technique can provide an average of a 19% reduction in the number of off-chip memory accesses over the state-of-the-art, beating the performance of a pseudo-optimal migration technique. This can be done with negligible area overhead and a manageable traffic overhead of 13.4%.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126314601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 37
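
To make the steering idea concrete, here is a minimal sketch of the eviction decision under simplifying assumptions: a 1-D ring of cores and one free-slot hint bit per core, whereas the paper embeds such hints in the routers of the on-chip network itself.

```c
/* Software analogy of the in-network migration decision; the paper
 * makes this choice inside the on-chip routers using hint bits.
 * The 1-D ring topology and data layout are illustrative assumptions. */
#include <stdbool.h>

#define NCORES 16

/* Hint per core: does its L2 currently advertise free/invalid slots? */
static bool has_free_slot[NCORES];

/* On an L2 eviction at core `src`, pick the nearest core advertising
 * a free or invalid slot and migrate the block there instead of
 * evicting it off-chip. Returns the destination, or -1 for off-chip. */
int steer_eviction(int src) {
    for (int d = 1; d <= NCORES / 2; d++) {      /* search outward by hop count */
        int right = (src + d) % NCORES;
        int left  = (src - d + NCORES) % NCORES;
        if (has_free_slot[right]) return right;
        if (has_free_slot[left])  return left;
    }
    return -1; /* no on-chip slot: fall back to an off-chip writeback */
}
```
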
Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor
Henry Wong, Anne Bracy, E. Schuchman, Tor M. Aamodt, Jamison D. Collins, P. Wang, G. Chinya, Ankur Khandelwal Groen, Hong Jiang, Hong Wang
{"title":"Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor","authors":"Henry Wong, Anne Bracy, E. Schuchman, Tor M. Aamodt, Jamison D. Collins, P. Wang, G. Chinya, Ankur Khandelwal Groen, Hong Jiang, Hong Wang","doi":"10.1145/1454115.1454125","DOIUrl":"https://doi.org/10.1145/1454115.1454125","url":null,"abstract":"Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on a FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115709234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 62
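
The abstract does not name the three new instructions, so the sketch below models only the programming pattern they enable: a user-level spawn/completion handshake through shared memory, with no driver or OS round-trip. The mailbox protocol and the pthread "GPU worker" are assumptions for illustration, not Pangaea's actual ISA.

```c
/* Software analogy of Pangaea's fine-grain shared-memory thread
 * spawning: a CPU thread posts work to a mailbox that a (simulated)
 * GPU worker watches. The real design replaces this polling with the
 * paper's 3 new IA32 instructions; everything here is illustrative. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

struct task { int x; int result; };
static _Atomic(struct task *) mailbox = NULL;

static void *gpu_worker(void *arg) {
    (void)arg;
    struct task *t;
    while ((t = atomic_load(&mailbox)) == NULL) ; /* wait for a spawn */
    t->result = t->x * 2;                         /* run the "kernel" */
    atomic_store(&mailbox, NULL);                 /* signal completion */
    return NULL;
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, gpu_worker, NULL);

    struct task t = { .x = 21, .result = 0 };
    atomic_store(&mailbox, &t);                   /* "spawn" the GPU thread */
    while (atomic_load(&mailbox) != NULL) ;       /* wait via shared memory */

    pthread_join(worker, NULL);
    printf("result = %d\n", t.result);            /* prints 42 */
    return 0;
}
```
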
Runtime optimization of vector operations on large scale SMP clusters
Costin Iancu, S. Hofmeyr
{"title":"Runtime optimization of vector operations on large scale SMP clusters","authors":"Costin Iancu, S. Hofmeyr","doi":"10.1145/1454115.1454134","DOIUrl":"https://doi.org/10.1145/1454115.1454134","url":null,"abstract":"“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125544044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
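
A minimal sketch of the runtime selection step, assuming just two candidate implementations and hard-coded crossover points; the paper instead derives the decision regions from per-system micro-benchmark profiles that also account for NIC load.

```c
/* Profile-guided selection among alternative implementations of a
 * vector (multi-region) transfer. The two candidates and the
 * thresholds below are illustrative assumptions, not the paper's
 * measured profiles. */
#include <stddef.h>

enum impl { PACK_AND_SEND, ONE_MSG_PER_REGION };

enum impl choose_impl(size_t nregions, size_t region_bytes, int nic_load) {
    /* Assumed crossover points; a real profile encodes regions of the
     * application space where each protocol wins, per target system. */
    if (region_bytes < 2048 && nregions > 16 && nic_load < 8)
        return PACK_AND_SEND;   /* many small pieces: packing amortizes overhead */
    return ONE_MSG_PER_REGION;  /* large pieces or loaded NIC: avoid the copy */
}
```
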
Redundancy elimination revisited
K. Cooper, J. Eckhardt, K. Kennedy
{"title":"Redundancy elimination revisited","authors":"K. Cooper, J. Eckhardt, K. Kennedy","doi":"10.1145/1454115.1454120","DOIUrl":"https://doi.org/10.1145/1454115.1454120","url":null,"abstract":"This work proposes and evaluates improvements to previously known algorithms for redundancy elimination.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127520062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
Outer-loop vectorization - revisited for short SIMD architectures
Dorit Nuzman, A. Zaks
{"title":"Outer-loop vectorization - revisited for short SIMD architectures","authors":"Dorit Nuzman, A. Zaks","doi":"10.1145/1454115.1454119","DOIUrl":"https://doi.org/10.1145/1454115.1454119","url":null,"abstract":"Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer loop vectorization has traditionally been performed by interchanging an outer-loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer-loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost loop vectorization, when running on a Cell BE SPU and PowerPC970 processors respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122769472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 160
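
The contrast between the two strategies is easy to see on a small example. Below, the inner loop is a reduction, a poor candidate for innermost-loop vectorization, while the outer (row) loop iterations are independent; unroll-and-jam by an assumed vector width of 4 lets each SIMD lane accumulate a different row. This hand-written form only illustrates the transformation and is not the GCC 4.3 implementation.

```c
/* Outer-loop vectorization by unroll-and-jam, shown by hand. */
#define N 1024

/* Scalar original: the j-loop is a reduction, awkward to vectorize
 * directly on short-SIMD targets. */
void row_sums(const float a[N][N], float s[N]) {
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int j = 0; j < N; j++)
            sum += a[i][j];
        s[i] = sum;
    }
}

/* Unroll the outer i-loop by 4 and jam: a compiler can map the four
 * accumulators to one vector register and emit one SIMD add per
 * j-iteration. Note the lane loads a[i+0..3][j] are strided; handling
 * such accesses and alignment efficiently is part of what the paper's
 * reuse optimization addresses. */
void row_sums_outer_vec(const float a[N][N], float s[N]) {
    for (int i = 0; i < N; i += 4) {
        float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
        for (int j = 0; j < N; j++) {
            sum0 += a[i + 0][j];
            sum1 += a[i + 1][j];
            sum2 += a[i + 2][j];
            sum3 += a[i + 3][j];
        }
        s[i] = sum0; s[i + 1] = sum1; s[i + 2] = sum2; s[i + 3] = sum3;
    }
}
```
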
GPU evolution: Will graphics morph into compute?
Norman Rubin
{"title":"GPU evolution: Will graphics morph into compute?","authors":"Norman Rubin","doi":"10.1145/1454115.1454116","DOIUrl":"https://doi.org/10.1145/1454115.1454116","url":null,"abstract":"In the last several years GPU devices have started to evolve into supercomputers. New, non-graphics, features are rapidly appearing along with new more general programming languages. One reason for the quick pace of change is that, games and hardware evolve together: Hardware vendors review the most popular games, looking for places to add hardware while game developers review new hardware, looking for places to add more realism. Today, we see both GPU devices and games moving from a model of looks real to one of acts real. One consequence of acts real is that evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs. We will review the difference between a CPU and a GPU. Then we will describe hardware changes added to the current generation of AMD graphics processors, including the introduction of traditional compute operations such as double precision, scatter/gather and local memory. Along with new features, we have added new metrics like performance/watt and performance/dollar. The current AMD GPU processor delivers 9 gigaflops/watt and 5 gigaflops/dollar. For the last two generations, each AMD GPU has provided double the performance/watt of the prior machine. We believe the software community needs to become more aware and appreciate these metrics.\u0000 Because this has been a kind of co-evolution and not a process of radical change, current GPU devices have retained a number of odd sounding transitional features, including fixed functions like memory systems that can do filtering, depth buffers, a rasterizer and the like. Today, each of these remain because they are important for graphics performance. Software on GPU devices also shows transitional features. As AI/physics virtual reality starts to become important, development frameworks have started to shift. Graphics APIs have added compute shaders. Finally, there has been a set of transitional programs implemented by graphics programmers but whose only real connection with graphics is that the result is rendered. One early example is toy shop which contains a weak physical simulation of rain on window (it looks great but the random number generator would not pass any kind of test). A more recent and better acting program is March of the Froblins an AI program related to robotic path calculations. This program both simulates large crowds of independent creatures and shows how massively parallel compute can benefit character-centric entertainment.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124198098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Distributed Cooperative Caching
E. Herrero, José González, R. Canal
{"title":"Distributed Cooperative Caching","authors":"E. Herrero, José González, R. Canal","doi":"10.1145/1454115.1454136","DOIUrl":"https://doi.org/10.1145/1454115.1454136","url":null,"abstract":"This paper presents the Distributed Cooperative Caching, a scalable and energy-efficient scheme to manage chip multiprocessor (CMP) cache resources. The proposed configuration is based in the Cooperative Caching framework [3] but it is intended for large scale CMPs. Both centralized and distributed configurations have the advantage of combining the benefits of private and shared caches. In our proposal, the Coherence Engine has been redesigned to allow its partitioning and thus, eliminate the size constraints imposed by the duplication of all tags. At the same time, a global replacement mechanism has been added to improve the usage of cache space. Our framework uses several Distributed Coherence Engines spread across all the nodes to improve scalability. The distribution permits a better balance of the network traffic over the entire chip avoiding bottlenecks and increasing performance for a 32-core CMP by 21% over a traditional shared memory configuration and by 57% over the Cooperative Caching scheme. Furthermore, we have reduced the power consumption of the entire system by using a different tag allocation method and by reducing the number of tags compared on each request. For a 32-core CMP the Distributed Cooperative Caching framework provides an average improvement of the power/performance relation (MIPS3/W) of 3.66× over a traditional shared memory configuration and 4.30× over Cooperative Caching.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116895861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 48
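
A minimal sketch of the address-interleaving idea that spreads directory state across several coherence engines; the engine count, block size, and hash function below are illustrative assumptions.

```c
/* Sketch of distributing the directory: blocks are interleaved among
 * several Distributed Coherence Engines (DCEs) by an address hash, so
 * tag lookups and coherence traffic spread across the chip instead of
 * converging on one central engine holding all duplicated tags. */
#include <stdint.h>

#define NDCE 8          /* coherence engines spread over the chip (assumed) */
#define BLOCK_SHIFT 6   /* 64-byte cache blocks assumed */

/* Home DCE for a physical address: every node sends coherence
 * requests for this block to the same engine, which only needs to
 * hold the duplicated tags for its own partition of the address space. */
static inline int home_dce(uint64_t paddr) {
    return (int)((paddr >> BLOCK_SHIFT) % NDCE);
}
```
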
Profiler and compiler assisted adaptive I/O prefetching for shared storage caches
S. Son, Sai Prashanth Muralidhara, O. Ozturk, M. Kandemir, I. Kolcu, Mustafa Karaköy
{"title":"Profiler and compiler assisted adaptive I/O prefetching for shared storage caches","authors":"S. Son, Sai Prashanth Muralidhara, O. Ozturk, M. Kandemir, I. Kolcu, Mustafa Karaköy","doi":"10.1145/1454115.1454133","DOIUrl":"https://doi.org/10.1145/1454115.1454133","url":null,"abstract":"I/O prefetching has been employed in the past as one of the mechanisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this reduction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our proposed scheme improves performance, on average, by 19.9%, 11.9% and 10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122295806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
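
A minimal sketch of the profiling-driven clustering step, assuming a pairwise sharing matrix gathered during the profile run and a greedy grouping rule; in the paper, the compiler then generates one customized prefetcher thread per cluster automatically.

```c
/* Group threads whose profiled I/O access sets overlap, then spawn
 * one prefetcher thread per cluster. The similarity threshold, the
 * greedy rule, and the data layout are illustrative assumptions. */
#define NTHREADS 8

/* sharing[i][j]: fraction of I/O blocks threads i and j both touched,
 * measured during the profiling run. */
double sharing[NTHREADS][NTHREADS];
int cluster_of[NTHREADS];

/* Greedy clustering: a thread joins the cluster of the first earlier
 * thread it shares enough data with; otherwise it starts a new
 * cluster. Returns the cluster count, i.e. how many customized
 * prefetcher threads to generate. */
int cluster_threads(double threshold) {
    int nclusters = 0;
    for (int i = 0; i < NTHREADS; i++) {
        cluster_of[i] = -1;
        for (int j = 0; j < i; j++) {
            if (sharing[i][j] >= threshold) {
                cluster_of[i] = cluster_of[j]; /* j was assigned earlier */
                break;
            }
        }
        if (cluster_of[i] < 0)
            cluster_of[i] = nclusters++;
    }
    return nclusters;
}
```
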