Proceedings. International Symposium on Computer Architecture最新文献_第6页

Implementing a GPU programming model on a Non-GPU accelerator architecture 在非GPU加速器架构上实现GPU编程模型

Proceedings. International Symposium on Computer Architecture Pub Date : 2010-06-19 DOI: 10.1007/978-3-642-24322-6_5

Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen-mei W. Hwu, Sanjay J. Patel, S. Lumetta

引用次数: 3

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness 具有内存级和线程级并行意识的GPU架构的分析模型

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-20 DOI: 10.1145/1555754.1555775

Sunpyo Hong, Hyesoon Kim

{"title":"An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness","authors":"Sunpyo Hong, Hyesoon Kim","doi":"10.1145/1555754.1555775","DOIUrl":"https://doi.org/10.1145/1555754.1555775","url":null,"abstract":"GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications.\u0000 To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"28 1","pages":"152-163"},"PeriodicalIF":0.0,"publicationDate":"2009-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83800451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 699

Scalable high performance main memory system using phase-change memory technology 采用相变存储器技术的可扩展高性能主存储器系统

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555760

Moinuddin K. Qureshi, V. Srinivasan, J. Rivers

{"title":"Scalable high performance main memory system using phase-change memory technology","authors":"Moinuddin K. Qureshi, V. Srinivasan, J. Rivers","doi":"10.1145/1555754.1555760","DOIUrl":"https://doi.org/10.1145/1555754.1555760","url":null,"abstract":"The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer more density relative to DRAM, and can help increase main memory capacity of future systems while remaining within the cost and power constraints.\u0000 In this paper, we analyze a PCM-based hybrid main memory system using an architecture level model of PCM.We explore the trade-offs for a main memory system consisting of PCMstorage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline system of 16-cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions of the hybrid memory that reduces the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"29 1","pages":"24-33"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89935064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1358

Multi-execution: multicore caching for data-similar executions 多执行:用于数据类似执行的多核缓存

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555777

Susmit Biswas, D. Franklin, Alan Savage, Ryan Dixon, T. Sherwood, F. Chong

{"title":"Multi-execution: multicore caching for data-similar executions","authors":"Susmit Biswas, D. Franklin, Alan Savage, Ryan Dixon, T. Sherwood, F. Chong","doi":"10.1145/1555754.1555777","DOIUrl":"https://doi.org/10.1145/1555754.1555777","url":null,"abstract":"While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it as \"multi-execution.\" Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows a speedup in execution by 2.5x on average (ranging from 0.93x to 6.92x), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"61 1","pages":"164-173"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75108586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 48

Firefly: illuminating future network-on-chip with nanophotonics 萤火虫:用纳米光子学照亮未来的片上网络

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555808

Yan Pan, Prabhat Kumar, John Kim, G. Memik, Yu Zhang, A. Choudhary

{"title":"Firefly: illuminating future network-on-chip with nanophotonics","authors":"Yan Pan, Prabhat Kumar, John Kim, G. Memik, Yu Zhang, A. Choudhary","doi":"10.1145/1555754.1555808","DOIUrl":"https://doi.org/10.1145/1555754.1555808","url":null,"abstract":"Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. Crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple, logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves the performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. If the energy-delay-product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"12 1","pages":"429-440"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73953550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 420

Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors 用于芯片多处理器中动态性能、功率和资源管理的线程临界性预测器

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555792

A. Bhattacharjee, M. Martonosi

{"title":"Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors","authors":"A. Bhattacharjee, M. Martonosi","doi":"10.1145/1555754.1555792","DOIUrl":"https://doi.org/10.1145/1555754.1555792","url":null,"abstract":"With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computing systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster.\u0000 This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures.\u0000 We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"14 1","pages":"290-301"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81310358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 192

Dynamic MIPS rate stabilization in out-of-order processors 乱序处理器中的动态MIPS速率稳定

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555763

Jinho Suh, M. Dubois

{"title":"Dynamic MIPS rate stabilization in out-of-order processors","authors":"Jinho Suh, M. Dubois","doi":"10.1145/1555754.1555763","DOIUrl":"https://doi.org/10.1145/1555754.1555763","url":null,"abstract":"Today's microprocessor cores reach high performance levels not only by their high clock rate but also by the concurrent execution of a large number of instructions. Because of the relationship between power and frequency, it becomes attractive to run an OoO (Out-of-Order) core at a frequency lower than its nominal frequency in the context of embedded or real-time systems. Unfortunately, whereas OoO pipelines have high average throughput, their highly variable and hard-to-predict execution rate makes them unsuitable for real-time systems with hard or even soft deadlines. In this paper, we demonstrate that the execution time of an OoO processor can be stable and predictable by controlling its MIPS (Mega Instructions Per Second) rate via a PID (Proportional, Integral, and Differential gain) feedback controller and DVFS (Dynamic Voltage and Frequency Scaling). The stabilized processor uses much less power per committed instruction, because of the reduced average frequency. The EPI (Energy Per Instruction) is also cut by an average of 28% across our benchmark programs. Since a stable MIPS rate is maintained consistently with lower power/energy per instruction, OoO processors stabilized by a feedback controller can realistically be deployed in real-time systems. To demonstrate this capability we select a subset of the MiBench benchmarks that displays the widest execution rate variations and stabilize their MIPS rate in the context of a 1GHz Pentium III-like microarchitecture.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"18 1","pages":"46-56"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81759299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

A fault tolerant, area efficient architecture for Shor's factoring algorithm 肖尔因子分解算法的容错高效结构

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555802

M. Whitney, Nemanja Isailovic, Yatish Patel, J. Kubiatowicz

{"title":"A fault tolerant, area efficient architecture for Shor's factoring algorithm","authors":"M. Whitney, Nemanja Isailovic, Yatish Patel, J. Kubiatowicz","doi":"10.1145/1555754.1555802","DOIUrl":"https://doi.org/10.1145/1555754.1555802","url":null,"abstract":"We optimize the area and latency of Shor's factoring while simultaneously improving fault tolerance through: (1) balancing the use of ancilla generators, (2) aggressive optimization of error correction, and (3) tuning the core adder circuits. Our custom CAD flow produces detailed layouts of the physical components and utilizes simulation to analyze circuits in terms of area, latency, and success probability. We introduce a metric, called ADCR, which is the probabilistic equivalent of the classic Area-Delay product. Our error correction optimization can reduce ADCR by order of magnitude or more. Contrary to conventional wisdom, we show that the area of an optimized quantum circuit is not dominated exclusively by error\u0000 correction. Further, our adder evaluation shows that quantum carry-lookahead adders (QCLA) beat ripple-carry adders in ADCR, despite being larger and more complex. We conclude with what we believe is one of most accurate estimates of the area and latency required for 1024-bit Shor's factorization: 7659 mm2 for the smallest circuit and 6 x 108 seconds for the fastest circuit.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"66 1","pages":"383-394"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89280487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 53

End-to-end performance forecasting: finding bottlenecks before they happen 端到端性能预测:在瓶颈发生之前发现瓶颈

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555800

A. Saidi, N. Binkert, S. Reinhardt, T. Mudge

{"title":"End-to-end performance forecasting: finding bottlenecks before they happen","authors":"A. Saidi, N. Binkert, S. Reinhardt, T. Mudge","doi":"10.1145/1555754.1555800","DOIUrl":"https://doi.org/10.1145/1555754.1555800","url":null,"abstract":"Many important workloads today, such as web-hosted services, are limited not by processor core performance but by interactions among the cores, the memory system, I/O devices, and the complex software layers that tie these components together. Architects designing future systems for these workloads are challenged to identify performance bottlenecks because, as in any concurrent system, overheads in one component may be hidden due to overlap with other operations. These overlaps span the user/kernel and software/hardware boundaries, making traditional performance analysis techniques inadequate.\u0000 We present a methodology for identifying end-to-end critical paths across software and simulated hardware in complex networked systems. By modeling systems as collections of state machines interacting via queues, we can trace critical paths through multiplexed processing engines, identify when resources create bottlenecks (including abstract resources such as flow-control credits), and predict the benefit of eliminating bottlenecks by increasing hardware speeds or expanding available resources.\u0000 We implement our technique in a full-system simulator and analyze a TCP microbenchmark, a web server, the Linux TCP/IP stack, and an Ethernet controller. From a single run of the microbenchmark, our tool--within minutes--correctly identifies a series of bottlenecks, and predicts the performance of hypothetical systems in which these bottlenecks are successively eliminated, culminating in a total speedup of 3X.We then validate these predictions through hours of additional simulation, and find them to be accurate within 1--17%. We also analyze the web server, find it to be CPU-bound, and predict the performance of a system with an additional core within 6%.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"54 1","pages":"361-370"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76507998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Phastlane: a rapid transit optical routing network 相位网:一种快速传输的光路由网络

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555809

Mark J. Cianchetti, Joseph C. Kerekes, D. Albonesi

{"title":"Phastlane: a rapid transit optical routing network","authors":"Mark J. Cianchetti, Joseph C. Kerekes, D. Albonesi","doi":"10.1145/1555754.1555809","DOIUrl":"https://doi.org/10.1145/1555754.1555809","url":null,"abstract":"Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks.\u0000 In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieve 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"5 1","pages":"441-450"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76500609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 186