DMA++: on the fly data realignment for on-chip memories
Nikola Vujic, F. Cabarcas, Marc González, Alex Ramírez, X. Martorell, E. Ayguadé
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5463057

Abstract: Multimedia extensions based on Single-Instruction Multiple-Data (SIMD) units are widespread, and have long been used in both processors and accelerators (e.g., the Cell SPEs). SIMD units usually impose strict memory alignment constraints for the sake of power efficiency and design simplicity. This increases the complexity of compiler-generated code because, in the general case, the compiler cannot guarantee the proper alignment of data. To cope with this, the ISA provides either unaligned memory load and store instructions, or a special set of instructions that perform the realignment in software. In this paper, we propose a hardware realignment unit that takes advantage of the DMA transfers needed in accelerators with local memories. While the data is being transferred, it is realigned on the fly by our realignment unit and stored with the proper alignment in the accelerator memory. The accelerator can then access the data without special instructions. Finally, the data is also realigned properly when written back to main memory. Our experiments with four applications show that our approach does not penalize DMA transfer bandwidth, and the performance of the synthetic benchmarks shows that aligned code runs 1.5 to 2 times faster than unaligned code.
{"title":"Application performance modeling in a virtualized environment","authors":"Sajib Kundu, R. Rangaswami, K. Dutta, Ming Zhao","doi":"10.1109/HPCA.2010.5463058","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5463058","url":null,"abstract":"Performance models provide the ability to predict application performance for a given set of hardware resources and are used for capacity planning and resource management. Traditional performance models assume the availability of dedicated hardware for the application. With growing application deployment on virtualized hardware, hardware resources are increasingly shared across multiple virtual machines. In this paper, we build performance models for applications in virtualized environments. We identify a key set of virtualization architecture independent parameters that influence application performance for a diverse and representative set of applications. We explore several conventional modeling techniques and evaluate their effectiveness in modeling application performance in a virtualized environment. We propose an iterative model training technique based on artificial neural networks which is found to be accurate across a range of applications. The proposed approach is implemented as a prototype in Xen-based virtual machine environments and evaluated for accuracy, sensitivity to the training process, and overhead. Median modeling error in the range 1.16-6.65% across a diverse application set and low modeling overhead suggest the suitability of our approach in production virtualized environments.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132937386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Worth their watts? - an empirical study of datacenter servers
Arunchandar Vasan, A. Sivasubramaniam, Vikrant Shimpi, T. Sivabalan, R. Subbiah
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5463056

Abstract: The management of power consumption in datacenters has become an important problem. Addressing it requires a systematic evaluation of the as-is scenario to identify potential areas for improvement and to quantify the impact of any strategy. We present a measurement study of a production datacenter from a joint power and performance perspective at the individual server level. Our observations help correlate the power consumption of production servers with their activity, and identify easily implementable improvements. We find that production servers are underutilized from an activity perspective; are overrated from a power perspective; execute temporally similar workloads at a granularity of weeks; do not idle efficiently; and have power consumption that is well tracked by CPU utilization. Our measurements suggest the following steps for improvement: staggering periodic activities on servers, enabling deeper sleep states, and provisioning based on measurement.
Graphite: A distributed parallel simulator for multicores
Jason E. Miller, H. Kasture, George Kurian, Charles Gruenwald, Nathan Beckmann, Christopher Celio, J. Eastep, A. Agarwal
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5416635

Abstract: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this, including direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added, with near-linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.
LiteTM: Reducing transactional state overhead
Syed Ali Raza Jafri, Mithuna Thottethodi, T. N. Vijaykumar
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5416653

Abstract: Transactional memory (TM) has been proposed to address some of the programmability issues of chip multiprocessors. Hardware implementations of transactional memory (HTMs) have made significant progress in supporting features such as long transactions that spill out of the cache, and context switches, page migration, and thread migration in the middle of transactions. While essential for the adoption of HTMs in real products, supporting these features has resulted in significant state overhead. For instance, TokenTM adds at least 16 bits per block in the caches, which is significant in absolute terms, and steals 16 of the 64 (25%) memory ECC bits per block, weakening error protection. The state bits also nearly double the tag array size. These significant and practical concerns may impede the adoption of HTMs, squandering the progress they have achieved. The overhead comes from tracking the thread identifier and the transactional read-sharer count at the L1-block granularity. The thread identifier is used to identify the transaction, if there is only one, to which an L1-evicted block belongs. The read-sharer count is used to identify conflicts involving multiple readers (i.e., a write to a block with a non-zero count). To reduce this overhead, we observe that the thread identifiers and read-sharer counts are not needed in the majority of cases: (1) repeated misses to the same blocks are rare within a transaction (i.e., locality holds), and (2) transactional read-shared blocks that are both evicted from multiple sharers' L1s and involved in conflicts are rare. Exploiting these observations, we propose a novel HTM, called LiteTM, which completely eliminates the count and identifier and uses software to infer the lost information. Using simulations of the STAMP benchmarks running on 8 cores, we show that LiteTM reduces TokenTM's state overhead by about 87% while performing within 4% on average, and 10% in the worst case, of TokenTM.
{"title":"FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar","authors":"Yan Pan, John Kim, G. Memik","doi":"10.1109/HPCA.2010.5416626","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416626","url":null,"abstract":"On-chip network is becoming critical to the scalability of future many-core architectures. Recently, nanophotonics has been proposed for on-chip networks because of its low latency and high bandwidth. However, nanophotonics has relatively high static power consumption, which can lead to inefficient architectures. In this work, we propose FlexiShare — a nanopho-tonic crossbar architecture that minimizes static power consumption by fully sharing a reduced number of channels across the network. To enable efficient global sharing, we decouple the allocation of the channels and the buffers, and introduce novel photonic token-stream mechanism for channel arbitration and credit distribution The flexibility of FlexiShare introduces additional router complexity and electrical power consumption. However, with the reduced number of optical channels, the overall power consumption is reduced without loss in performance. Our evaluation shows that the proposed token-stream arbitration applied to a conventional crossbar design improves network throughput by 5.5× under permutation traffic. In addition, FlexiShare achieves similar performance as a token-stream arbitrated conventional crossbar using only half the amount of channels under balanced, distributed traffic. With the extracted trace traffic from MineBench and SPLASH-2, FlexiShare can further reduce the amount of channels by up to 87.5%, while still providing better performance — resulting in up to 72% reduction in power consumption compared to the best alternative.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124004242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture","authors":"Javier Merino, Valentin Puente, J. Gregorio","doi":"10.1109/HPCA.2010.5416641","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416641","url":null,"abstract":"This paper introduces a cost effective cache architecture called Enhanced Shared-Private Non-Uniform Cache Architecture (ESP-NUCA), which is suitable for highperformance Chip MultiProcessors (CMPs). This architecture enhances system stability by combining the advantages of private and shared caches. Starting from a shared NUCA, ESP-NUCA introduces a low-cost mechanism to dynamically allocate private cache blocks closer to their owner processor. In this way, average on-chip access latency is reduced and inter-core interference minimized. ESP-NUCA synergistically integrates victims and replicas thus making it possible to take advantage of multiple-readers for shared data, and to maximize cache usage under unbalanced core utilization. This architecture leads to stable behavior within the whole system across a broad spectrum of working scenarios. ESP-NUCA not only outperforms architectures with similar implementation costs such as private and shared caches by up to 20% and 40% respectively, but even outperforms much costlier architectures such as D-NUCA [13] by up to 28%, Adaptive Selective Replication [3] by up to 19%, and Cooperative Caching [5] by up to 15%. Moreover, performance variance throughout the set of benchmarks is 37% lower than with ASR, 87% lower than with D-NUCA, and 43% lower than with Cooperative Caching.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127170975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explaining cache SER anomaly using DUE AVF measurement
Arijit Biswas, C. Recchia, Shubhendu S. Mukherjee, V. Ambrose, L. Chan, A. Jaleel, A. Papathanasiou, M. Plaster, N. Seifert
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5416629

Abstract: We have discovered that processors can experience a super-linear increase in detected unrecoverable errors (DUE) when the write-back L2 cache is doubled in size. This paper explains how an increase in the cache tag's Architectural Vulnerability Factor (AVF) caused such a super-linear increase in the DUE rate. AVF expresses the fraction of faults that become user-visible errors. Our hypothesis is that this increase in AVF is caused by a super-linear increase in "dirty" data residence times in the L2 cache. Using proton-beam irradiation, we measured the DUE rates from the write-back cache tags and analyzed the data to show that our hypothesis holds. We used a combination of simulation and measurement to develop and prove this hypothesis. Our investigation reveals two mechanisms by which dirty-line residency causes super-linear increases in the L2 cache tag's AVF. One is a reduction in the miss rate as we move to the larger cache part, resulting in fewer evictions of data required for architecturally correct execution. The second is the occurrence of strided cache access patterns, which significantly increase the "dirty" residence times of cache lines without increasing the cache miss rate.
{"title":"DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance","authors":"Dan Tang, Yungang Bao, Weiwu Hu, Mingyu Chen","doi":"10.1109/HPCA.2010.5416638","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416638","url":null,"abstract":"As technology advances both in increasing bandwidth and in reducing latency for I/O buses and devices, moving I/O data in/out memory has become critical. In this paper, we have observed the different characteristics of I/O and CPU memory reference behavior, and found the potential benefits of separating I/O data from CPU data. We propose a DMA cache technique to store I/O data in dedicated on-chip storage and present two DMA cache designs. The first design, Decoupled DMA Cache (DDC), adopts additional on-chip storage as the DMA cache to buffer I/O data. The second design, Partition-Based DMA Cache (PBDC), does not require additional on-chip storage, but can dynamically use some ways of the processor's last level cache (LLC) as the DMA cache. We have implemented and evaluated the two DMA cache designs by using an FPGA-based emulation platform and the memory reference traces of real-world applications. Experimental results show that, compared with the existing snooping-cache scheme, DDC can reduce memory access latency (in bus cycles) by 34.8% on average (up to 58.4%), while PBDC can achieve about 80% of DDC's performance improvements despite no additional on-chip storage.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129780470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards scalable, energy-efficient, bus-based on-chip networks
Aniruddha N. Udipi, Naveen Muralimanohar, R. Balasubramonian
HPCA-16 (2010). DOI: https://doi.org/10.1109/HPCA.2010.5416639

Abstract: Future on-chip networks for many-core processors are expected to impose huge overheads in terms of energy, delay, complexity, verification effort, and area. There is a common belief that the bandwidth necessary for future applications can only be provided by packet-switched networks with complex routers and a scalable directory-based coherence protocol. We posit that such a scheme is likely overkill in a well-designed system, in addition to being expensive in terms of power because of its large number of power-hungry routers. We show that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance. We achieve these characteristics by dividing the chip into multiple segments, each with its own broadcast bus, and connecting these buses through a central bus. This eliminates expensive routers, but suffers the energy overhead of long wires. We propose the use of multiple Bloom filters to effectively track data presence in the cache and restrict bus broadcasts to a subset of segments, significantly reducing energy consumption. We further show that OS page coloring helps maximize locality and improves the effectiveness of the Bloom filters. We also employ low-swing wiring to further reduce the energy overhead of the links. Performance can also be improved at relatively low cost by utilizing more of the abundant on-chip metal budget and employing multiple address-interleaved buses rather than multiple routers. With the combination of all the above innovations, we extend the scalability of buses and believe that buses can be a viable and attractive option for future on-chip networks. We show energy reductions of up to 31× on average compared to many state-of-the-art packet-switched networks.