ASPLOS XI最新文献

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024401

Timothy E. Denehy, John Bent, Florentina I. Popovici, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

引用次数: 25

Low-overhead memory leak detection using adaptive statistical profiling 使用自适应统计分析的低开销内存泄漏检测

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024412

Matthias Hauswirth, Trishul M. Chilimbi

{"title":"Low-overhead memory leak detection using adaptive statistical profiling","authors":"Matthias Hauswirth, Trishul M. Chilimbi","doi":"10.1145/1024393.1024412","DOIUrl":"https://doi.org/10.1145/1024393.1024412","url":null,"abstract":"Sampling has been successfully used to identify performance optimization opportunities. We would like to apply similar techniques to check program correctness. Unfortunately, sampling provides poor coverage of infrequently executed code, where bugs often lurk. We describe an adaptive profiling scheme that addresses this by sampling executions of code segments at a rate inversely proportional to their execution frequency. To validate our ideas, we have implemented SWAT, a novel memory leak detection tool. SWAT traces program allocations/ frees to construct a heap model and uses our adaptive profiling infrastructure to monitor loads/stores to these objects with low overhead. SWAT reports 'stale' objects that have not been accessed for a 'long' time as leaks. This allows it to find all leaks that manifest during the current program execution. Since SWAT has low runtime overhead (‹5%), and low space overhead (‹10% in most cases and often less than 5%), it can be used to track leaks in production code that take days to manifest. In addition to identifying the allocations that leak memory, SWAT exposes where the program last accessed the leaked data, which facilitates debugging and fixing the leak. SWAT has been used by several product groups at Microsoft for the past 18 months and has proved effective at detecting leaks with a low false positive rate (‹10%).","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125977553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 254

Dynamic tracking of page miss ratio curve for memory management 内存管理中页面缺失率曲线的动态跟踪

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024415

Pin Zhou, V. Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan Zhou, Sanjeev Kumar

{"title":"Dynamic tracking of page miss ratio curve for memory management","authors":"Pin Zhou, V. Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan Zhou, Sanjeev Kumar","doi":"10.1145/1024393.1024415","DOIUrl":"https://doi.org/10.1145/1024393.1024415","url":null,"abstract":"Memory can be efficiently utilized if the dynamic memory demands of applications can be determined and analyzed at run-time. The page miss ratio curve(MRC), i.e. page miss rate vs. memory size curve, is a good performance-directed metric to serve this purpose. However, dynamically tracking MRC at run time is challenging in systems with virtual memory because not every memory reference passes through the operating system (OS).This paper proposes two methods to dynamically track MRC of applications at run time. The first method is using a hardware MRC monitor that can track MRC at fine time granularity. Our simulation results show that this monitor has negligible performance and energy overheads. The second method is an OS-only implementation that can track MRC at coarse time granularity. Our implementation results on Linux show that it adds only 7--10% overhead.We have also used the dynamic MRC to guide both memory allocation for multiprogramming systems and memory energy management. Our real system experiments on Linux with applications including Apache Web Server show that the MRC-directed memory allocation can speed up the applications' execution/response time by up to a factor of 5.86 and reduce the number of page faults by up to 63.1%. Our execution-driven simulation results with SPEC2000 benchmarks show that the MRC-directed memory energy management can improve the Energy * Delay metric by 27--58% over previously proposed static and dynamic schemes.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"42 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128224733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 252

Heat-and-run: leveraging SMT and CMP to manage power density through the operating system 热运行:利用SMT和CMP通过操作系统管理功率密度

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024424

M. Gomaa, Michael D. Powell, T. N. Vijaykumar

{"title":"Heat-and-run: leveraging SMT and CMP to manage power density through the operating system","authors":"M. Gomaa, Michael D. Powell, T. N. Vijaykumar","doi":"10.1145/1024393.1024424","DOIUrl":"https://doi.org/10.1145/1024393.1024424","url":null,"abstract":"Power density in high-performance processors continues to increase with technology generations as scaling of current, clock speed, and device density outpaces the downscaling of supply voltage and thermal ability of packages to dissipate heat. Power density is characterized by localized chip hot spots that can reach critical temperatures and cause failure. Previous architectural approaches to power density have used global clock gating, fetch toggling, dynamic frequency scaling, or resource duplication to either prevent heating or relieve overheated resources in a superscalar processor. Previous approaches also evaluate design technologies where power density is not a major problem and most applications do not overheat the processor. Future processors, however, are likely to be chip multiprocessors (CMPs) with simultaneously-multithreaded (SMT) cores. SMT CMPs pose unique challenges and opportunities for power density. SMT and CMP increase throughput and thus on-chip heat, but also provide natural granularities for managing power-density. This paper is the first work to leverage SMT and CMP to address power density. We propose heat-and-run SMT thread assignment to increase processor-resource utilization before cooling becomes necessary by co-scheduling threads that use complimentary resources. We propose heat-and-run CMP thread migration to migrate threads away from overheated cores and assign them to free SMT contexts on alternate cores, leveraging availability of SMT contexts on alternate CMP cores to maintain throughput while allowing overheated cores to cool. We show that our proposal has an average of 9% and up to 34% higher throughput than a previous superscalar technique running the same number of threads.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132393808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 339

Continual flow pipelines 连续流管道

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024407

Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, A. Gandhi, M. Upton

{"title":"Continual flow pipelines","authors":"Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, A. Gandhi, M. Upton","doi":"10.1145/1024393.1024407","DOIUrl":"https://doi.org/10.1145/1024393.1024407","url":null,"abstract":"Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127626003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 197

HIDE: an infrastructure for efficiently protecting information leakage on the address bus 用于有效地保护地址总线上的信息泄漏的基础结构

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024403

Xiaotong Zhuang, Zhang Tao, S. Pande

{"title":"HIDE: an infrastructure for efficiently protecting information leakage on the address bus","authors":"Xiaotong Zhuang, Zhang Tao, S. Pande","doi":"10.1145/1024393.1024403","DOIUrl":"https://doi.org/10.1145/1024393.1024403","url":null,"abstract":"XOM-based secure processor has recently been introduced as a mechanism to provide copy and tamper resistant execution. XOM provides support for encryption/decryption and integrity checking. However, neither XOM nor any other current approach adequately addresses the problem of information leakage via the address bus. This paper shows that without address bus protection, the XOM model is severely crippled. Two realistic attacks are shown and experiments show that 70% of the code might be cracked and sensitive data might be exposed leading to serious security breaches.Although the problem of address bus leakage has been widely acknowledged both in industry and academia, no practical solution has ever been proposed that can provide an adequate security guarantee. The main reason is that the problem is very difficult to solve in practice due to severe performance degradation which accompanies most of the solutions. This paper presents an infrastructure called HIDE (Hardware-support for leakage-Immune Dynamic Execution) which provides a solution consisting of chunk-level protection with hardware support and a flexible interface which can be orchestrated through the proposed compiler optimization and user specifications that allow utilizing underlying hardware solution more efficiently to provide better security guarantees.Our results show that protecting both data and code with a high level of security guarantee is possible with negligible performance penalty (1.3% slowdown).","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123811959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 193

Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform 在基于itanium®2处理器的实验平台上，通过虚拟多线程实现助手线程

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024411

P. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, T. Sych, Stephen F. Moore, John Paul Shen

{"title":"Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform","authors":"P. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, T. Sych, Stephen F. Moore, John Paul Shen","doi":"10.1145/1024393.1024411","DOIUrl":"https://doi.org/10.1145/1024393.1024411","url":null,"abstract":"Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run ``assist'' threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of fly-weight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response to long-latency cache miss events, and to run helper threads in the shadow of cache misses when the main thread would be otherwise stalled.The concept of VMT is prototyped on an Itanium ® 2 processor using features provided by the Processor Abstraction Layer (PAL) firmware mechanism already present in currently shipping processors. On a 4-way MP physical system equipped with VMT-enabled Itanium 2 processors, helper threading via the VMT mechanism can achieve significant performance gains for a diverse set of real-world workloads, ranging from single-threaded workstation benchmarks to heavily multithreaded large scale decision support systems (DSS) using the IBM DB2 Universal Database. We measure a wall-clock speedup of 5.8% to 38.5% for the workstation benchmarks, and 5.0% to 12.7% on various queries in the DSS workload.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130570024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 28

Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign 标记清除垃圾收集的软件预取:硬件分析和软件重新设计

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024417

Chen-Yong Cher, Antony Lloyd Hosking, T. N. Vijaykumar

{"title":"Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign","authors":"Chen-Yong Cher, Antony Lloyd Hosking, T. N. Vijaykumar","doi":"10.1145/1024393.1024417","DOIUrl":"https://doi.org/10.1145/1024393.1024417","url":null,"abstract":"Tracing garbage collectors traverse references from live program variables, transitively tracing out the closure of live objects. Memory accesses incurred during tracing are essentially random: a given object may contain references to any other object. Since application heaps are typically much larger than hardware caches, tracing results in many cache misses. Technology trends will make cache misses more important, so tracing is a prime target for prefetching.Simulation of Java benchmarks running with the Boehm-De-mers-Weiser mark-sweep garbage collector for a projected hardware platform reveal high tracing overhead (up to 65% of elapsed time), and that cache misses are a problem. Applying Boehm's default prefetching strategy yields improvements in execution time (16% on average with incremental/generational collection for GC-intensive benchmarks), but analysis shows that his strategy suffers from significant timing problems: prefetches that occur too early or too late relative to their matching loads. This analysis drives development of a new prefetching strategy that yields up to three times the performance improvement of Boehm's strategy for GC-intensive benchmark (27% average speedup), and achieves performance close to that of perfect timing ie, few misses for tracing accesses) on some benchmarks. Validating these simulation results with live runs on current hardware produces average speedup of 6% for the new strategy on GC-intensive benchmarks with a GC configuration that tightly controls heap growth. In contrast, Boehm's default prefetching strategy is ineffective on this platform.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122261380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 26

Application-level checkpointing for shared memory programs 共享内存程序的应用程序级检查点

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024421

G. Bronevetsky, Daniel Marques, K. Pingali, P. Szwed, M. Schulz

{"title":"Application-level checkpointing for shared memory programs","authors":"G. Bronevetsky, Daniel Marques, K. Pingali, P. Szwed, M. Schulz","doi":"10.1145/1024393.1024421","DOIUrl":"https://doi.org/10.1145/1024393.1024421","url":null,"abstract":"Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduced overheads further.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121273717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 131

Formal online methods for voltage/frequency control in multiple clock domain microprocessors 多时钟域微处理器电压/频率控制的正式在线方法

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024423

Qiang Wu, Philo Juang, M. Martonosi, D. Clark

{"title":"Formal online methods for voltage/frequency control in multiple clock domain microprocessors","authors":"Qiang Wu, Philo Juang, M. Martonosi, D. Clark","doi":"10.1145/1024393.1024423","DOIUrl":"https://doi.org/10.1145/1024393.1024423","url":null,"abstract":"Multiple Clock Domain (MCD) processors are a promising future alternative to today's fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain independently. Most existing DVFS approaches are profile-based offline schemes which are mainly suitable for applications whose execution char-acteristics are constrained and repeatable. While some work has been published about online DVFS schemes, the prior approaches are typically heuristic-based. In this paper, we present an effective online DVFS scheme for an MCD processor which takes a formal analytic approach, is driven by dynamic workloads, and is suitable for all applications. In our approach, we model an MCD processor as a queue-domain network and the online DVFS as a feedback control problem with issue queue occupancies as feedback signals. A dynamic stochastic queuing model is first proposed and linearized through an accu-rate linearization technique. A controller is then designed and verified by stability analysis. Finally we evaluate our DVFS scheme through a cycle-accurate simulation with a broad set of applications selected from MediaBench and SPEC2000 benchmark suites. Compared to the best-known prior approach, which is heuristic-based, the proposed online DVFS scheme is substantially more effective due to its automatic regulation ability. For example, we have achieved a 2-3 fold increase in efficiency in terms of energy-delay product improvement. In addition, our control theoretic technique is more resilient, requires less tuning effort, and has better scalability as compared to prior online DVFS schemes.We believe that the techniques and methodology described in this paper can be generalized for energy control in processors other than MCD, such as tiled stream processors.","PeriodicalId":344295,"journal":{"name":"ASPLOS XI","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114656729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 194