2007 IEEE International Symposium on Performance Analysis of Systems & Software: Latest Publications

A Comparison of Two Approaches to Parallel Simulation of Multiprocessors

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363732

A. Over, Bill Clarke, P. Strazdins

Abstract: The design trend towards CMPs has made the simulation of multiprocessor systems a necessity and has also made multiprocessor systems widely available. While a serial multiprocessor simulation necessarily imposes a linear slowdown, running such a simulation in parallel may help mitigate this effect. In this paper we document our experiences with two different methods of parallelizing Sparc Sulima, a simulator of UltraSPARC IIICu-based multiprocessor systems. In the first approach, a simple interconnect model within the simulator is parallelized non-deterministically using careful locking. In the second, a detailed interconnect model is parallelized while preserving determinism using parallel discrete event simulation (PDES) techniques. While both approaches demonstrate a threefold speedup using 4 threads on workloads from the NAS parallel benchmarks, speedup proved constrained by load-balancing between simulated processors. A theoretical model is developed to help understand why observed speedup is less than ideal. An analysis of the related speed-accuracy tradeoff in the first approach with respect to the simulation time quantum is also given; the results show that, for both serial and parallel simulation, a quantum on the order of a few hundred cycles represents a 'sweet spot', but parallel simulation is significantly more accurate for a given quantum size. As with the speedup analysis, these effects are workload dependent.

Citations: 12
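The abstract above attributes the less-than-ideal speedup to load balancing between simulated processors. The paper's own theoretical model is not reproduced in this listing, so the following is only a minimal sketch under a stated assumption: host threads synchronize at a barrier after every simulation quantum, so each quantum costs as much as its slowest thread. The function name `barrier_speedup` and the sample costs are hypothetical.

```python
def barrier_speedup(work_per_quantum):
    """Estimate speedup when threads meet at a barrier after each quantum.

    work_per_quantum is a list of quanta; each quantum is a list holding the
    host-time cost of simulating one processor during that quantum.
    Assumption (not from the paper): one host thread per simulated processor.
    """
    # A serial simulator pays for every processor's work in turn.
    serial_time = sum(sum(quantum) for quantum in work_per_quantum)
    # A parallel simulator pays only for the slowest thread in each quantum.
    parallel_time = sum(max(quantum) for quantum in work_per_quantum)
    return serial_time / parallel_time


# Illustrative, made-up costs for 4 simulated processors.
balanced = barrier_speedup([[1, 1, 1, 1], [1, 1, 1, 1]])
imbalanced = barrier_speedup([[4, 3, 3, 2]])
```

Under this assumption, perfectly balanced work gives the ideal speedup of 4 on 4 threads, while the imbalanced quantum gives only 3, because three of the threads sit idle at the barrier for part of the quantum.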

Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363737

M. Alvarez, E. Salamí, Alex Ramírez, M. Valero

Citations: 31

Modeling and Single-Pass Simulation of CMP Cache Capacity and Accessibility

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363743

Xudong Shi, Feiqi Su, J. Peir, Ye Xia, Zhen Yang

Abstract: Future chip multiprocessors (CMPs) with a large number of cores face difficult issues in efficiently utilizing on-chip storage space. Tradeoffs between data accessibility and effective on-chip capacity have been studied extensively. Understanding a wide spectrum of design spaces requires costly simulations. In this paper, we first develop an abstract model for understanding the performance impact with respect to the degree of data replication. To overcome the lack of real-time interactions among multiple cores in the abstract model, we propose an efficient single-pass stack simulation method to study the performance of a variety of cache organizations on CMPs. The proposed global stack logically incorporates a shared stack and per-core private stacks to collect shared/private reuse (stack) distances for every memory reference in a single simulation pass. With the collected reuse distances, performance in terms of hits/misses and average memory access times can be calculated for multiple cache organizations. The basic stack simulation results can further be used to derive other CMP cache organizations with various degrees of data replication. We verify both the modeling and the stack results against individual execution-driven simulations that consider realistic cache parameters and delays using a set of commercial multithreaded workloads. We also compare the simulation time saved by the stack simulation. The results show that stack simulation can accurately model the performance of the various studied cache organizations with 2-9% error margins using only about 8% of the simulation time. The results also show that the effectiveness of various techniques for optimizing the CMP on-chip storage is closely related to the working sets of the workloads as well as the total cache sizes.

Citations: 6
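The single-pass idea in the entry above rests on reuse (stack) distances: once every reference has a distance, hit counts for many cache sizes follow without re-running the trace. The sketch below illustrates only the classic single-stack LRU case for a fully associative cache; it is not the paper's global stack with shared and per-core private stacks, and the function names and the sample trace are hypothetical.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = []      # least recently used first, most recently used last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Depth from the top of the stack, counting the address itself.
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold miss: never seen before
        stack.append(addr)          # addr becomes the most recently used
    return distances


def hits_for_capacity(distances, capacity):
    """Count hits in a fully associative LRU cache holding `capacity` lines."""
    return sum(1 for d in distances if d is not None and d <= capacity)


trace = ["a", "b", "a", "c", "b", "a"]
dists = stack_distances(trace)
hits_by_size = {size: hits_for_capacity(dists, size) for size in (1, 2, 3)}
```

One pass over the trace yields `dists`, and `hits_by_size` then answers the hit-count question for three cache sizes at once. The list-based stack is quadratic in the worst case; it is chosen here for readability rather than speed.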

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363747

Wangyuan Zhang, Xin Fu, Tao Li, J. Fortes

Abstract: Semiconductor transient faults (i.e., soft errors) have become an increasingly important threat to microprocessor reliability. Simultaneous multithreaded (SMT) architectures exploit thread-level parallelism to improve overall processor throughput. A great amount of research has been conducted in the past to investigate performance and power issues of SMT architectures. Nevertheless, the effect of multithreaded execution on a microarchitecture's vulnerability to soft errors remains largely unexplored. To address this issue, we have developed a microarchitecture-level soft error vulnerability analysis framework for SMT architectures. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify the impact of multithreading on a wide range of microarchitecture structures. We examine how the baseline SMT microarchitecture reliability profile varies with workload behavior, the number of threads and fetch policies. Our experimental results show that the overall vulnerability rises in multithreading architectures, while each individual thread shows less vulnerability. By considering both performance and reliability, SMT outperforms superscalar architectures. The SMT reliability and its tradeoff with performance vary across different fetch policies. With a detailed analysis of the experimental results, we point out a set of potential opportunities to reduce SMT microarchitecture vulnerability, which can serve as guidance for exploiting thread-aware reliability optimization techniques in the near future. To our knowledge, this paper presents the first effort to characterize microarchitecture vulnerability to soft errors on SMT processors.

Citations: 16

Benefits of I/O Acceleration Technology (I/OAT) in Clusters

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363752

K. Vaidyanathan, D. Panda

Abstract: Packet processing in the TCP/IP stack at multi-gigabit data rates accounts for a significant portion of the system overhead. Though there are several techniques to reduce the packet processing overhead on the sender side, the receiver side remains a bottleneck. I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features designed specifically to reduce the receiver-side packet processing overhead. This paper studies the benefits of I/OAT through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) a parallel virtual file system (PVFS). Our micro-benchmark evaluations show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

Citations: 22

Using Model Trees for Computer Architecture Performance Analysis of Software Applications

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363742

ElMoustapha Ould-Ahmed-Vall, J. Woodlee, Charles R. Yount, K. Doshi, S. Abraham

Abstract: The identification of performance issues on specific computer architectures has a variety of important benefits such as tuning software to improve performance, comparing the performance of various platforms and assisting in the design of new platforms. In order to enable this analysis, most modern microprocessors provide access to hardware-based event counters. Unfortunately, features such as out-of-order execution, prefetching and speculation complicate the interpretation of the raw data. Thus, the traditional approach of assigning a uniform estimated penalty to each event does not accurately identify and quantify performance limiters. This paper presents a novel method employing a statistical regression-modeling approach to better achieve this goal. Specifically, a model-tree approach based on the M5' algorithm is implemented and validated that accounts for event interactions and workload characteristics. Data from a subset of the SPEC CPU2006 suite is used by the algorithm to automatically build a performance-model tree, identifying the unique performance classes (phases) found in the suite and associating with each class a unique, explanatory linear model of performance events. These models can be used to identify performance problems for a given workload and estimate the potential gain from addressing each problem. This information can help focus available time and resources for performance optimization on the techniques most likely to address the performance problems with the highest potential gain. The model tree exhibits high correlation (more than 0.98) and low relative absolute error (less than 8%) between predicted and measured performance, attesting to it as a sound approach for performance analysis of modern superscalar machines.

Citations: 54

Phase-Guided Small-Sample Simulation

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363739

J. L. Kihm, Samuel D. Strom, D. Connors

Abstract: Detailed cycle-accurate simulation is a critical component of processor design. However, with the increasing complexity of modern processors and application workloads, full detailed simulation is prohibitively slow and thereby severely limits design space exploration. Sampled simulation techniques eliminate the need for full simulation by simulating in detail a very small but representative subset of a target application's overall execution. Two effective and accurate sampling techniques are phase-based simulation and small-sample simulation. Both of these techniques have been adopted by the architecture design and simulation communities for research. However, both techniques were derived using a single benchmark evaluation suite and promote the same sampling method for all applications. Alternatively, an execution-aware sampling-based simulation technique can adapt during execution to the characteristics of the individual application being simulated and achieve the most efficient and accurate simulation acceleration. To evaluate the impact of application characteristics on simulation approaches, we compare several simulation techniques using the SPEC2000 benchmark suite. Our results yield key conclusions about combining the strengths of previous simulation techniques into a single approach: phase-guided small-sample simulation (PGSS). PGSS adapts sampling to the characteristics of the application, thereby achieving high sampling accuracy and requiring an order of magnitude less detailed simulation time than previous techniques.

Citations: 6

Characterizing a Complex J2EE Workload: A Comprehensive Analysis and Opportunities for Optimizations

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363735

Yefim Shuf, I. Steiner

Abstract: While past studies of relatively simple Java benchmarks like SPECjvm98 and SPECjbb2000 have been integral in advancing the server industry, this paper presents an analysis of a significantly more complex 3-tier J2EE (Java 2 Enterprise Edition) commercial workload, SPECjAppServer2004. Understanding the nature of such commercial workloads is critical to developing the next generation of servers and identifying promising directions for systems and software research. In this study, we validate and disprove several assumptions commonly made about Java workloads. For instance, on a tuned system with an appropriately sized heap, the fraction of CPU time spent on garbage collection for this complex workload is small (<2%) compared to commonly studied client-side Java benchmarks. Unlike small benchmarks, this workload has a rather "flat" method profile with no obvious hot spots. Therefore, new performance analysis techniques and tools to identify opportunities for optimizations are needed because the traditional 90/10 rule of thumb does not apply. We evaluate hardware performance monitor data and use insights to motivate future research. We find that this workload has a relatively high CPI and branch misprediction rate. We observe that almost one half of executed instructions are loads and stores and that the data working set is large. There are very few cache-to-cache "modified data" transfers, which limits opportunities for intelligent thread co-scheduling. We note that while using large pages for a Java heap is a simple and effective way to reduce TLB misses and improve performance, there is room to reduce translation misses further by placing executable code into large pages. We use statistical correlation to quantify the relationship between various hardware events and overall system performance. We find that CPI is strongly correlated with branch mispredictions, translation misses, instruction cache misses, and bursty data cache misses that trigger data prefetching. We note that target address mispredictions for indirect branches (corresponding to Java virtual method calls) are strongly correlated with instruction cache misses. Our observations can be used by hardware and runtime architects to estimate potential benefits of performance enhancements being considered.

Citations: 8
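The J2EE entry above uses statistical correlation to relate hardware events to CPI. The abstract does not say which correlation measure was used, so the sketch below assumes the ordinary Pearson coefficient. The sample values are made up purely for illustration and are not data from the paper; the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up per-interval samples (illustration only, not measurements).
cpi = [1.8, 2.1, 2.6, 2.0, 3.0]
branch_mispredicts_per_kilo_instr = [4.0, 5.0, 7.5, 4.5, 9.0]
r = pearson(cpi, branch_mispredicts_per_kilo_instr)
```

A value of `r` close to 1 would indicate that intervals with more branch mispredictions also tend to have higher CPI. Correlation alone does not show which event causes the slowdown.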

Performance Characterization of Decimal Arithmetic in Commercial Java Workloads

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363736

M. Bhat, John Crawford, R. Morin, K. Shiv

Abstract: Binary floating-point numbers with finite precision cannot represent all decimal numbers with complete accuracy. This can often lead to errors while performing calculations involving floating point numbers. For this reason many commercial applications use special decimal representations for performing these calculations, but their use carries performance costs such as bi-directional conversion. The purpose of this study was to understand the total application performance impact of using these decimal representations in commercial workloads, and provide a foundation of data to justify pursuing optimized hardware support for decimal math. In Java, a popular development environment for commercial applications, the BigDecimal class is used for performing accurate decimal computations. BigDecimal provides operations for arithmetic, scale manipulation, rounding, comparison, hashing, and format conversion. We studied the impact of BigDecimal usage on the performance of server-side Java applications by analyzing its usage on two standard enterprise benchmarks, SPECjbb2005 and SPECjAppServer2004 as well as a real-life mission-critical financial workload, Morgan Stanley's Trade Completion. In this paper, we present detailed performance characteristics and we conclude that, relative to total application performance, the overhead of using software decimal implementations is low, and at least from the point of view of these workloads, there is insufficient performance justification to pursue hardware solutions.

Citations: 6
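The decimal-arithmetic entry above starts from the fact that binary floating point cannot represent every decimal fraction exactly. The paper studies Java's BigDecimal; the sketch below uses Python's standard `decimal.Decimal` as a stand-in to show the same effect, so it illustrates the motivation rather than the paper's measurements.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 are stored as nearby binary fractions,
# so their sum is not exactly the float nearest to 0.3.
binary_sum = 0.1 + 0.2
binary_is_exact = binary_sum == 0.3

# Decimal arithmetic: values built from strings are held exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = decimal_sum == Decimal("0.3")
```

Here `binary_is_exact` is `False` and `decimal_is_exact` is `True`. Constructing `Decimal` values from strings rather than floats matters, because `Decimal(0.1)` would inherit the binary rounding error.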

Using Wavelet Domain Workload Execution Characteristics to Improve Accuracy, Scalability and Robustness in Program Phase Analysis

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363744

Chang-Burm Cho, Tao Li

Abstract: Program phase analysis has many applications in computer architecture design and optimization. Recently, there has been a growing interest in employing wavelets as a tool for phase analysis. Nevertheless, the examined scope of workload characteristics and the explored benefits due to wavelet-based analysis are quite limited. This work further extends prior research by applying wavelet analysis to abundant types of program execution statistics and quantifying the benefits of wavelet analysis in terms of accuracy, scalability and robustness in phase classification. Experimental results on SPEC CPU 2000 benchmarks show that compared with methods that work in the time domain, wavelet domain phase analysis achieves higher accuracy and exhibits superior scalability and robustness. We examine and contrast the effectiveness of applying wavelets to a wide range of runtime workload execution characteristics. We find that the wavelet transform significantly reduces temporal dependence in the sampled workload statistics and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, we show that different types of workload execution characteristics in the wavelet domain can be assembled together to further improve phase classification accuracy. For long-running, complex and real-world workloads, a scalable phase analysis technique is essential to capture the manifested large-scale program behavior. In this study, we show that such scalability can be achieved by applying wavelet analysis of high-dimension sampled workload statistics to alleviate the counter overflow problem, which can negatively affect phase classification accuracy. By exploiting the wavelet denoising capability, we show in this paper that phase classification can be performed robustly under program execution variability. To our knowledge, this work presents the first effort on using wavelets to improve scalability and robustness in phase analysis.

Citations: 11
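The wavelet entry above moves sampled workload statistics from the time domain into the wavelet domain. The abstract does not name the wavelet family used, so the sketch below assumes the simplest one, a single level of the orthonormal Haar transform. The function name and the sample series are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approximation = [(a + b) / scale for a, b in pairs]
    detail = [(a - b) / scale for a, b in pairs]
    return approximation, detail


# Made-up per-interval samples: two stable phases with different levels.
samples = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
approx, detail = haar_step(samples)
```

For this series every detail coefficient is zero, since neighbouring samples within each phase are equal, and the phase change shows up only as a jump in the approximation coefficients. Applying `haar_step` again to `approx` would give the next coarser level.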