ASPLOS XII: Latest Publications

Accurate and efficient regression modeling for microarchitectural performance and power prediction
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168881
Benjamin C. Lee, D. Brooks
{"title":"Accurate and efficient regression modeling for microarchitectural performance and power prediction","authors":"Benjamin C. Lee, D. Brooks","doi":"10.1145/1168857.1168881","DOIUrl":"https://doi.org/10.1145/1168857.1168881","url":null,"abstract":"We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference.Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117124274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 501
Hybrid transactional memory
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168900
P. Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, Daniel Nussbaum
{"title":"Hybrid transactional memory","authors":"P. Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, Daniel Nussbaum","doi":"10.1145/1168857.1168900","DOIUrl":"https://doi.org/10.1145/1168857.1168900","url":null,"abstract":"Transactional memory (TM) promises to substantially reduce the difficulty of writing correct, efficient, and scalable concurrent programs. But \"bounded\" and \"best-effort\" hardware TM proposals impose unreasonable constraints on programmers, while more flexible software TM implementations are considered too slow. Proposals for supporting \"unbounded\" transactions in hardware entail significantly higher complexity and risk than best-effort designs.We introduce Hybrid Transactional Memory (HyTM), an approach to implementing TMin software so that it can use best effort hardware TM (HTM) to boost performance but does not depend on HTM. Thus programmers can develop and test transactional programs in existing systems today, and can enjoy the performance benefits of HTM support when it becomes available.We describe our prototype HyTM system, comprising a compiler and a library. The compiler allows a transaction to be attempted using best-effort HTM, and retried using the software library if it fails. We have used our prototype to \"transactify\" part of the Berkeley DB system, as well as several benchmarks. By disabling the optional use of HTM, we can run all of these tests on existing systems. Furthermore, by using a simulated multiprocessor with HTM support, we demonstrate the viability of the HyTM approach: it can provide performance and scalability approaching that of an unbounded HTM implementation, without the need to support all transactions with complicated HTM support.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"140 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120939607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 505
Integrated network interfaces for high-bandwidth TCP/IP
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168897
N. Binkert, A. Saidi, S. Reinhardt
{"title":"Integrated network interfaces for high-bandwidth TCP/IP","authors":"N. Binkert, A. Saidi, S. Reinhardt","doi":"10.1145/1168857.1168897","DOIUrl":"https://doi.org/10.1145/1168857.1168897","url":null,"abstract":"This paper proposes new network interface controller (NIC) designs that take advantage of integration with the host CPU to provide increased flexibility for operating system kernel-based performance optimization.We believe that this approach is more likely to meet the needs of current and future high-bandwidth TCP/IP networking on end hosts than the current trend of putting more complexity in the NIC, while avoiding the need to modify applications and protocols. This paper presents two such NICs. The first, the simple integrated NIC (SINIC), is a minimally complex design that moves the responsibility for managing the network FIFOs from the NIC to the kernel. Despite this closer interaction between the kernel and the NIC, SINIC provides performance equivalent to a conventional DMA-based NIC without increasing CPU overhead. The second design, V-SINIC, adds virtual per-packet registers to SINIC, enabling parallel packet processing while maintaining a FIFO model. V-SINIC allows the kernel to decouple examining a packet's header from copying its payload to memory. We exploit this capability to implement a true zero-copy receive optimization in the Linux 2.6 kernel, providing bandwidth improvements of over 50% on unmodified sockets-based receive-intensive benchmarks.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129638348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
Geiger: monitoring the buffer cache in a virtual machine environment
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168861
Stephen T. Jones, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
{"title":"Geiger: monitoring the buffer cache in a virtual machine environment","authors":"Stephen T. Jones, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau","doi":"10.1145/1168857.1168861","DOIUrl":"https://doi.org/10.1145/1168857.1168861","url":null,"abstract":"Virtualization is increasingly being used to address server management and administration issues like flexible resource allocation, service isolation and workload migration. In a virtualized environment, the virtual machine monitor (VMM) is the primary resource manager and is an attractive target for implementing system features like scheduling, caching, and monitoring. However, the lackof runtime information within the VMM about guest operating systems, sometimes called the semantic gap, is a significant obstacle to efficiently implementing some kinds of services.In this paper we explore techniques that can be used by a VMM to passively infer useful information about a guest operating system's unified buffer cache and virtual memory system. We have created a prototype implementation of these techniques inside the Xen VMM called Geiger and show that it can accurately infer when pages are inserted into and evicted from a system's buffer cache. We explore several nuances involved in passively implementing eviction detection that have not previously been addressed, such as the importance of tracking disk block liveness, the effect of file system journaling, and the importance of accounting for the unified caches found in modern operating systems.Using case studies we show that the information provided by Geiger enables a VMM to implement useful VMM-level services. We implement a novel working set size estimator which allows the VMM to make more informed memory allocation decisions. We also show that a VMM can be used to drastically improve the hit rate in remote storage caches by using eviction-based cache placement without modifying the application or operating system storage interface. Both case studies hint at a future where inference techniques enable a broad new class of VMM-level functionality.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130846089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 184
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168893
Koushik Chakraborty, Philip M. Wells, G. Sohi
{"title":"Computation spreading: employing hardware migration to specialize CMP cores on-the-fly","authors":"Koushik Chakraborty, Philip M. Wells, G. Sohi","doi":"10.1145/1168857.1168893","DOIUrl":"https://doi.org/10.1145/1168857.1168893","url":null,"abstract":"In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes.When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115183966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 117
Software-based instruction caching for embedded processors
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168894
Jason E. Miller, A. Agarwal
{"title":"Software-based instruction caching for embedded processors","authors":"Jason E. Miller, A. Agarwal","doi":"10.1145/1168857.1168894","DOIUrl":"https://doi.org/10.1145/1168857.1168894","url":null,"abstract":"While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per access. Access times are also easier to predict with simple memories since there is no possibility of a \"miss.\" On the other hand, they are more difficult for the programmer to use since they are not automatically managed.In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be automatically managed as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can be built that provides performance within 10% of a traditional hardware cache on many benchmarks while using a cheaper, simpler, SRAM memory. On these same benchmarks, energy consumption is up to 3% lower than it would be using a hardware cache.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116387093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 48
HeapMD: identifying heap-based bugs using anomaly detection
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168885
Trishul M. Chilimbi, V. Ganapathy
{"title":"HeapMD: identifying heap-based bugs using anomaly detection","authors":"Trishul M. Chilimbi, V. Ganapathy","doi":"10.1145/1168857.1168885","DOIUrl":"https://doi.org/10.1145/1168857.1168885","url":null,"abstract":"We present the design, implementation, and evaluation of HeapMD, a dynamic analysis tool that finds heap-based bugs using anomaly detection. HeapMD is based upon the observation that, in spite of the evolving nature of the heap, several of its properties remain stable. HeapMD uses this observation in a novel way: periodically, during the execution of the program, it computes a suite of metrics which are sensitive to the state of the heap. These metrics track heap behavior, and the stability of the heap reflects quantitatively in the values of these metrics. The \"normal\" ranges of stable metrics, obtained by running a program on multiple inputs, are then treated as indicators of correct behaviour, and are used in conjunction with an anomaly detector to find heap-based bugs. Using HeapMD, we were able to find 40 heap-based bugs, 31 of them previously unknown, in 5 large, commercial applications.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130732524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Supporting nested transactional memory in LogTM
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168902
M. Moravan, J. Bobba, Kevin E. Moore, Luke Yen, M. Hill, B. Liblit, M. Swift, D. Wood
{"title":"Supporting nested transactional memory in logTM","authors":"M. Moravan, J. Bobba, Kevin E. Moore, Luke Yen, M. Hill, B. Liblit, M. Swift, D. Wood","doi":"10.1145/1168857.1168902","DOIUrl":"https://doi.org/10.1145/1168857.1168902","url":null,"abstract":"Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the toplevel transaction commits. Implementations may flatten nested transactions into the top-level one, resulting in a complete abort on conflict, or allow partial abort of inner transactions. Open nested transactions allow a committing inner transaction to immediately release isolation, which increases parallelism and expressiveness at the cost of both software and hardware complexity.This paper extends the recently-proposed flat Log-based Transactional Memory (LogTM) with nested transactions. Flat LogTM saves pre-transaction values in a log, detects conflicts with read (R) and write (W) bits per cache block, and, on abort, invokes a software handler to unroll the log. Nested LogTM supports nesting by segmenting the log into a stack of activation records and modestly replicating R/W bits. To facilitate composition with nontransactional code, such as language runtime and operating system services, we propose escape actions that allow trusted code to run outside the confines of the transactional memory system.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122399816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 200
A performance counter architecture for computing accurate CPI components
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168880
Stijn Eyerman, L. Eeckhout, T. Karkhanis, James E. Smith
{"title":"A performance counter architecture for computing accurate CPI components","authors":"Stijn Eyerman, L. Eeckhout, T. Karkhanis, James E. Smith","doi":"10.1145/1168857.1168880","DOIUrl":"https://doi.org/10.1145/1168857.1168880","url":null,"abstract":"A common way of representing processor performance is to use Cycles per Instruction (CPI) `stacks' which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions).This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain understanding into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures while being significantly more accurate than previous approaches.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127818606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 182
Bell: bit-encoding online memory leak detection
ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168866
Michael D. Bond, K. McKinley
{"title":"Bell: bit-encoding online memory leak detection","authors":"Michael D. Bond, K. McKinley","doi":"10.1145/1168857.1168866","DOIUrl":"https://doi.org/10.1145/1168857.1168866","url":null,"abstract":"Memory leaks compromise availability and security by crippling performance and crashing programs. Leaks are difficult to diagnose because they have no immediate symptoms. Online leak detection tools benefit from storing and reporting per-object sites (e.g., allocation sites) for potentially leaking objects. In programs with many small objects, per-object sites add high space overhead, limiting their use in production environments.This paper introduces Bit-Encoding Leak Location (Bell), a statistical approach that encodes per-object sites to a single bit per object. A bit loses information about a site, but given sufficient objects that use the site and a known, finite set of possible sites, Bell uses brute-force decoding to recover the site with high accuracy.We use this approach to encode object allocation and last-use sites in Sleigh, a new leak detection tool. Sleigh detects stale objects (objects unused for a long time) and uses Bell decoding to report their allocation and last-use sites. Our implementation steals four unused bits in the object header and thus incurs no per-object space overhead. Sleigh's instrumentation adds 29% execution time overhead, which adaptive profiling reduces to 11%. Sleigh's output is directly useful for finding and fixing leaks in SPEC JBB2000 and Eclipse, although sufficiently many objects must leak before Bell decoding can report sites with confidence. Bell is suitable for other leak detection approaches that store per-object sites, and for other problems amenable to statistical per-object metadata.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115623837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 105