{"title":"NUcache: An efficient multicore cache organization based on Next-Use distance","authors":"R. Manikantan, K. Rajan, Ramaswamy Govindarajan","doi":"10.1109/HPCA.2011.5749733","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749733","url":null,"abstract":"The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122406213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy","authors":"Shekhar Srikantaiah, Emre Kultursay, Zhang Tao, M. Kandemir, M. J. Irwin, Yuan Xie","doi":"10.1109/HPCA.2011.5749732","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749732","url":null,"abstract":"Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. Mor-phCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist on the same architecture. Starting from per-core L2 and L3 cache slices as the basic design point, MorphCache alters the cache topology dynamically by merging or splitting cache slices and modifying the accessibility of different cache slice groups to different cores in a CMP. We evaluated MorphCache on a 16 core CMP on a full system simulator and found that it significantly improves both average throughput and harmonic mean of speedups of diverse multithreaded and multiprogrammed workloads. Specifically, our results show that MorphCache improves throughput of the multiprogrammed mixes by 29.9% over a topology with all-shared L2 and L3 caches and 27.9% over a topology with per core private L2 cache and shared L3 cache. In addition, we also compared MorphCache to partitioning a single shared cache at each level using promotion/insertion pseudo-partitioning (PIPP) [28] and managing per-core private cache at each level using dynamic spill receive caches (DSR) [18]. We found that MorphCache improves average throughput by 6.6% over PIPP and by 5.7% over DSR when applied to both L2 and L3 caches.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"280 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127391273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism","authors":"M. Mehrara, Po-Chun Hsu, M. Samadi, S. Mahlke","doi":"10.1109/HPCA.2011.5749719","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749719","url":null,"abstract":"As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in the web browsers. Furthermore, considering the wide-spread deployment of multi-cores in today's computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirement. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications on the client's browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18× speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127994621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I-CASH: Intelligently Coupled Array of SSD and HDD","authors":"Qing Yang, Jin Ren","doi":"10.1109/HPCA.2011.5749736","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749736","url":null,"abstract":"This paper presents a new disk I/O architecture composed of an array of a flash memory SSD (solid state disk) and a hard disk drive (HDD) that are intelligently coupled by a special algorithm. We call this architecture I-CASH: Intelligently Coupled Array of SSD and HDD. The SSD stores seldom-changed and mostly read reference data blocks whereas the HDD stores a log of deltas between currently accessed I/O blocks and their corresponding reference blocks in the SSD so that random writes are not performed in SSD during online I/O operations. High speed delta compression and similarity detection algorithms are developed to control the pair of SSD and HDD. The idea is to exploit the fast read performance of SSDs and the high speed computation of modern multi-core CPUs to replace and substitute, to a great extent, the mechanical operations of HDDs. At the same time, we avoid runtime SSD writes that are slow and wearing. An experimental prototype I-CASH has been implemented and is used to evaluate I-CASH performance as compared to existing SSD/HDD I/O architectures. Numerical results on standard benchmarks show that I-CASH reduces the average I/O response time by an order of magnitude compared to existing disk I/O architectures such as RAID and SSD/HDD storage hierarchy, and provides up to 2.8 speedup over state-of-the-art pure SSD storage. Furthermore, I-CASH reduces random writes to SSD implying reduced wearing and prolonged life time of the SSD.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115860124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Checked Load: Architectural support for JavaScript type-checking on mobile processors","authors":"O. Anderson, Emily Fortuna, L. Ceze, S. Eggers","doi":"10.1109/HPCA.2011.5749748","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749748","url":null,"abstract":"Dynamic languages such as Javascript are the de-facto standard for web applications. However, generating efficient code for dynamically-typed languages is a challenge, because it requires frequent dynamic type checks. Our analysis has shown that some programs spend upwards of 20% of dynamic instructions doing type checks, and 12.9% on average. In this paper we propose Checked Load, a low-complexity architectural extension that replaces software-based, dynamic type checking. Checked Load is comprised of four new ISA instructions that provide flexible and automatic type checks for memory operations, and whose implementation requires minimal hardware changes. We also propose hardware support for dynamic type prediction to reduce the cost of failed type checks. We show how to use Checked Load in the Nitro JavaScript just-in-time compiler (used in the Safari 5 browser). Speedups on a typical mobile processor range up to 44.6% (with a mean of 11.2%) in popular JavaScript benchmarks. While we have focused our work on JavaScript, Checked Load is sufficiently general to support other dynamically-typed languages, such as Python or Ruby.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130478531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing system-level trimming issues in on-chip nanophotonic networks","authors":"C. Nitta, M. Farrens, V. Akella","doi":"10.1109/HPCA.2011.5749722","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749722","url":null,"abstract":"The basic building block of on-chip nanophotonic interconnects is the microring resonator [14], and these resonators change their resonant wavelengths due to variations in temperature — a problem that can be addressed using a technique called ”trimming”, which involves correcting the drift via heating and/or current injection. Thus far system researchers have modeled trimming as a per ring fixed cost. In this work we show that at the system level using a fixed cost model is inappropriate — our simulations demonstrate that the cost of heating has a non-linear relationship with the number of rings, and also that current injection can lead to thermal runaway. We show that a very narrow Temperature Control Window (TCW) must be maintained in order for the network to work as desired. However, by exploiting the group drift property of co-located rings, it is possible to create a sliding window scheme which can increase the TCW. We also show that partially athermal rings can alleviate but not eliminate the problem.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"328 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115844572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Safe and efficient supervised memory systems","authors":"J. Bobba, Marc Lupon, M. Hill, D. Wood","doi":"10.1109/HPCA.2011.5749744","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749744","url":null,"abstract":"Supervised Memory systems use out-of-band metabits to control and monitor accesses to normal data memory for such purposes as transactional memory and memory typestate trackers. Previous proposals demonstrate the value of supervised memory systems, but have typically (1) assumed sequential consistency (while most deployed systems use weaker models), and (2) used ad hoc, informal memory specifications (that can be ambiguous and/or incorrect). This paper seeks to make many previous proposals more practical. This paper builds a foundation for future supervised memory systems which (1) operate with the TSO and ×86 memory models, and (2) are formally specified using two supervised memory models. The simpler TSOall model requires all metadata and data accesses to obey TSO, but precludes using store buffers for supervised accesses. The more complex TSOdata model relaxes some ordering constraints (allowing store buffer use) but makes programmer reasoning more difficult. To get the benefits of both models, we propose Safe Supervision, which asks programmers to avoid using metabits from one location to order accesses to another. Programmers that obey safe supervision can reason with the simpler semantics of TSOall while obtaining the higher performance of TSOdata. Our approach is similar to how data-race-free programs can run on relaxed systems and yet appear sequentially consistent. Finally, we show that TSOdata can (a) provide significant performance benefit (up to 22%) over TSOall and (b) can be incorporated correctly and with low overhead into the RTL of an industrial multi-core chip design (OpenSPARC T2).","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114377825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting criticality to reduce bottlenecks in distributed uniprocessors","authors":"Behnam Robatmili, Madhu Saravana Sibi Govindan, D. Burger, S. Keckler","doi":"10.1109/HPCA.2011.5749749","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749749","url":null,"abstract":"Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks that prevent efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework to exploit criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at decode and register forward pipeline stages of their executing cores. The framework exploits the fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergic manner and achieves scalable performance/power efficiency for sequential programs when running across a large number of cores.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127281371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new server I/O architecture for high speed networks","authors":"Guangdeng Liao, Xia Zhu, L. Bhuyan","doi":"10.1109/HPCA.2011.5749734","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749734","url":null,"abstract":"Traditional architectural designs are normally focused on CPUs and have been often decoupled from I/O considerations. They are inefficient for high-speed network processing with a bandwidth of 10Gbps and beyond. Long latency I/O interconnects on mainstream servers also substantially complicate the NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides data copy identified by previous works, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of the overheads is memory stalls and data relating to socket buffer (SKB) and page data structures are mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient for addressing the network processing bottlenecks. Motivated by the studies, we propose a new server I/O architecture where DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about data incurring memory stalls. NEngine relies on data lookups and preloads data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows DMA engine to have very fast access to descriptors and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves the network processing efficiency by 47% and web server throughput by 14%, while substantially reducing the NIC hardware complexity.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124734353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing","authors":"Feng Chen, Rubao Lee, Xiaodong Zhang","doi":"10.1109/HPCA.2011.5749735","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749735","url":null,"abstract":"Flash memory based solid state drives (SSDs) have shown a great potential to change storage infrastructure fundamentally through their high performance and low power. Most recent studies have mainly focused on addressing the technical limitations caused by special requirements for writes in flash memory. However, a unique merit of an SSD is its rich internal parallelism, which allows us to offset for the most part of the performance loss related to technical limitations by significantly increasing data processing throughput. In this work we present a comprehensive study of essential roles of internal parallelism of SSDs in high-speed data processing. Besides substantially improving I/O bandwidth (e.g. 7.2×), we show that by exploiting internal parallelism, SSD performance is no longer highly sensitive to access patterns, but rather to other factors, such as data access interferences and physical data layout. Specifically, through extensive experiments and thorough analysis, we obtain the following new findings in the context of concurrent data processing in SSDs. (1) Write performance is largely independent of access patterns (regardless of being sequential or random), and can even outperform reads, which is opposite to the long-existing common understanding about slow writes on SSDs. (2) One performance concern comes from interference between concurrent reads and writes, which causes substantial performance degradation. (3) Parallel I/O performance is sensitive to physical data-layout mapping, which is largely not observed without parallelism. (4) Existing application designs optimized for magnetic disks can be suboptimal for running on SSDs with parallelism. Our study is further supported by a group of case studies in database systems as typical data-intensive applications. With these critical findings, we give a set of recommendations to application designers and system architects for exploiting internal parallelism and maximizing the performance potential of SSDs.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132073447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}