2006 IEEE International Symposium on Workload Characterization: Latest Papers

Evaluating Benchmark Subsetting Approaches

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-12-01 DOI: 10.1109/IISWC.2006.302733

J. Yi, Resit Sendag, L. Eeckhout, A. Joshi, D. Lilja, L. John

Abstract: To reduce the simulation time to a tractable amount or due to compilation (or other related) problems, computer architects often simulate only a subset of the benchmarks in a benchmark suite. However, if the architect chooses a subset of benchmarks that is not representative, the subsequent simulation results will, at best, be misleading or, at worst, yield incorrect conclusions. To address this problem, computer architects have recently proposed several statistically-based approaches to subset a benchmark suite. While some of these approaches are well-grounded statistically, what has not yet been thoroughly evaluated is the 1) absolute accuracy; 2) relative accuracy across a range of processor and memory subsystem enhancements; and 3) representativeness and coverage of each approach for a range of subset sizes. Specifically, this paper evaluates statistically-based subsetting approaches based on principal components analysis (PCA) and the Plackett and Burman (P&B) design, in addition to prevailing approaches such as integer vs. floating-point, core vs. memory-bound, by language, and at random. Our results show that the two statistically-based approaches, PCA and P&B, have the best absolute and relative accuracy for CPI and energy-delay product (EDP), produce subsets that are the most representative, and choose benchmark and input set pairs that are most well-distributed across the benchmark space. To achieve a 5% absolute CPI and EDP error across a wide range of configurations, PCA and P&B typically need about 17 benchmark and input set pairs, while the other five approaches often choose more than 30 benchmark and input set pairs.
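The absolute-accuracy criterion in the abstract above can be illustrated with a small sketch. It assumes the error of a subset is the relative difference between the subset's mean CPI and the full suite's mean CPI; the abstract does not spell out the exact metric, and the benchmark names and CPI values below are made up purely for illustration.

```python
from statistics import mean

# Hypothetical per-benchmark CPI values (illustrative only, not from the paper).
suite_cpi = {
    "bench_a": 1.20,
    "bench_b": 0.80,
    "bench_c": 2.10,
    "bench_d": 1.50,
    "bench_e": 0.95,
}


def absolute_cpi_error(suite, subset_names):
    """Relative error (percent) of the subset's mean CPI versus the full suite's mean CPI."""
    full_mean = mean(suite.values())
    subset_mean = mean(suite[name] for name in subset_names)
    return abs(subset_mean - full_mean) / full_mean * 100.0


def meets_target(suite, subset_names, target_percent=5.0):
    """True if the subset reproduces the suite's mean CPI within the target error."""
    return absolute_cpi_error(suite, subset_names) <= target_percent


if __name__ == "__main__":
    subset = ["bench_a", "bench_d"]
    print(f"error = {absolute_cpi_error(suite_cpi, subset):.2f}%")
    print("within 5% target:", meets_target(suite_cpi, subset))
```

Under this assumed metric, a subsetting approach is better when it reaches the 5% target with fewer benchmark and input set pairs.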

Citations: 36

An Architectural Characterization Study of Data Mining and Bioinformatics Workloads

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-25 DOI: 10.1109/IISWC.2006.302730

Berkin Özisikyilmaz, R. Narayanan, Joseph Zambreno, G. Memik, A. Choudhary

Citations: 32

"Software Performance Tuning with the Apple CHUD Tools"

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302722

R. Altherr, R. D. Bois, L. Hammond, Eric Miller

Abstract: Summary form only given. Many tools have been created to allow software engineers to analyze the execution of their code. While tools such as gprof often work well, most are not integrated very well with each other or the rest of the development environment, and interpreting the data that they provide can be a challenge. Because Apple's MacOS X is based on UNIX, most open source performance analysis tools can be used. However, we have also integrated several key performance tools together and added graphical data visualization to produce the CHUD toolset (available at http://developer.apple.com/tools/download/). With the CHUD tools, programmers can examine the performance of their code using a set of integrated tools that can perform most common performance-measurement tasks, including: traces of function call behavior (like gprof); sampled measurements of program execution timing; traces of software events, such as system calls; and hardware event counter measurements. Moreover, instead of just presenting a few key figures from these measurements in a brief report, the CHUD tools present their results in several textual and graphical formats, with integrated hyperlinks to related assembly and source code, so that programmers can easily examine how their programs work at a large scale or zoom in and look at individual program phases in several different ways. This tutorial is targeted primarily at students and software engineers who work on UNIX-based systems and want to expand the repertoire of tools that they can use to analyze and improve the performance of their code. However, the material should also be useful to educators who teach performance-oriented programming techniques, as the graphical nature of Shark's output makes it easy to demonstrate program behaviors in an eye-catching manner.

Citations: 1

Workload Characterization of 3D Games

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302726

Jordi Roca, Victor Moya Del Barrio, Carlos González, C. Solis, Agustín Fernández, R. Espasa

Citations: 18

Comparing Benchmarks Using Key Microarchitecture-Independent Characteristics

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302732

Kenneth Hoste, L. Eeckhout

Abstract: Understanding the behavior of emerging workloads is important for designing next generation microprocessors. For addressing this issue, computer architects and performance analysts build benchmark suites of new application domains and compare the behavioral characteristics of these benchmark suites against well-known benchmark suites. Current practice typically compares workloads based on microarchitecture-dependent characteristics generated from running these workloads on real hardware. There is one pitfall though with comparing benchmarks using microarchitecture-dependent characteristics, namely that completely different inherent program behavior may yield similar microarchitecture-dependent behavior. This paper proposes a methodology for characterizing benchmarks based on microarchitecture-independent characteristics. This methodology minimizes the number of inherent program characteristics that need to be measured by exploiting correlation between program characteristics. In fact, we reduce our 47-dimensional space to an 8-dimensional space without compromising the methodology's ability to compare benchmarks. The important benefits of this methodology are that (i) only a limited number of microarchitecture-independent characteristics need to be measured, and (ii) the resulting workload characterization is easy to interpret. Using this methodology we compare 122 benchmarks from 6 recently proposed benchmark suites. We conclude that some benchmarks in emerging benchmark suites are indeed similar to benchmarks from well-known benchmark suites as suggested through a microarchitecture-dependent characterization. However, other benchmarks are dissimilar based on a microarchitecture-independent characterization although a microarchitecture-dependent characterization suggests the opposite to be true.
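The idea of shrinking a set of program characteristics by exploiting correlation between them can be sketched as below. This is a simple greedy correlation filter used as a stand-in, not the selection procedure from the paper (which the abstract does not describe); the characteristic names, the sample values, and the 0.9 threshold are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant characteristic carries no correlation signal
    return cov / (spread_x * spread_y)


def select_uncorrelated(characteristics, threshold=0.9):
    """Greedily keep a characteristic only if it is weakly correlated with all kept ones."""
    kept = []
    for name, values in characteristics.items():
        if all(abs(pearson(values, characteristics[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical measurements: one value per benchmark for each characteristic.
    measured = {
        "ilp": [1.0, 2.0, 3.0, 4.0],
        "ilp_scaled": [2.0, 4.0, 6.0, 8.0],  # redundant: perfectly correlated with "ilp"
        "branch": [4.0, 1.0, 3.0, 2.0],
    }
    print(select_uncorrelated(measured))
```

A characteristic that is almost a linear function of one already kept adds little information for comparing benchmarks, which is why it can be left unmeasured in this sketch.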

Citations: 77

Load Instruction Characterization and Acceleration of the BioPerf Programs

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302731

P. Ratanaworabhan, Martin Burtscher

Citations: 4

"Evolve or Die: Making SPEC's CPU Suite Relevant Today and Tomorrow"

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302735

J. Reilly

Citations: 4

DFS: A Simple to Write Yet Difficult to Execute Benchmark

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302741

R. Murphy, Jonathan W. Berry, William C. McLendon, B. Hendrickson, Douglas P. Gregor, A. Lumsdaine

Abstract: Many emerging applications are built upon large, unstructured datasets that exhibit highly irregular (or even nearly random) memory access patterns. Examples include informatics applications, and other problems that are often represented by unstructured graph-based data structures. It is well known that these applications are challenging for conventional architectures to execute (either serially or in parallel). The depth first search (DFS) benchmark proposed in this work uses the Boost Graph Library to perform a depth-first search on a large power-law graph, representing "small world" phenomena. The graph in question exhibits a small average distance between any two vertices, a small diameter, and has a few high-degree vertices with a large number of low-degree vertices. Graphs such as this appear in many fields, including networking, biology, social networks, and data mining. Many of these applications are of critical importance to researchers, and the challenge of executing them on conventional machines increases as the graph size grows. The benchmark proposed in this work is used as the basis for many fundamental algorithms in graph theory, is critical to several emerging applications, is memory intensive, and exhibits poor performance on conventional machines. Section 2 quantitatively demonstrates the memory characteristics of the benchmark in an architecture independent fashion, showing that it is extremely memory intensive. Section 3 describes the execution phases of the benchmark, and Section 4 presents the conclusions.
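For readers unfamiliar with the traversal at the heart of this benchmark, here is a minimal sketch of an iterative depth-first search over an adjacency list. It is not the benchmark's code (the abstract states that the benchmark uses the Boost Graph Library, a C++ library); the tiny hub-and-spoke graph is a hypothetical stand-in for a large power-law graph.

```python
def depth_first_order(graph, start):
    """Return vertices in the order an iterative depth-first search first visits them."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        vertex = stack.pop()
        if vertex in visited:
            continue
        visited.add(vertex)
        order.append(vertex)
        # Push neighbours in reverse so the first-listed neighbour is explored first.
        for neighbour in reversed(graph.get(vertex, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


if __name__ == "__main__":
    # Hypothetical graph: vertex 0 is a high-degree hub, the rest are low-degree.
    graph = {
        0: [1, 2, 3, 4],
        1: [0, 5],
        2: [0],
        3: [0],
        4: [0],
        5: [1],
    }
    print(depth_first_order(graph, 0))
```

Each step follows an edge to a vertex that may live anywhere in memory, which is one way to see why the access pattern becomes irregular once the graph is large.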

Citations: 18

Exploring Small-Scale and Large-Scale CMP Architectures for Commercial Java Servers

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302744

R. Iyer, M. Bhat, Li Zhao, R. Illikkal, S. Makineni, Michael Jones, K. Shiv, D. Newell

Abstract: As we enter the era of chip multiprocessor (CMP) architectures, it is important that we explore the scaling characteristics of mainstream server workloads on these platforms. In this paper, we analyze the performance of an Enterprise Java workload (SPECjbb2005) on two important classes of CMP architectures. One class of CMP platforms comprises "small-scale" CMP (SCMP) processors with a few large out-of-order cores on the die. Another class of CMP platforms comprises "large-scale" CMP (LCMP) processors with several small in-order cores on the die. For these classes of CMP architectures to succeed, it is important that there are sufficient resources (cache, memory and interconnect) to allow for a balanced scalable platform. In this paper, we focus on evaluating the resource scaling characteristics (cores, caches and memory) of SPECjbb2005 on these two architectures and understanding architectural trade-offs that may be required in future CMP offerings. The overall evaluation is uniquely conducted using four different methodologies (measurements on latest platforms, trace-based cache simulation, trace-based platform simulation and execution-driven emulation). Based on our findings, we summarize the architectural recommendations for future CMP server platforms (e.g., the need for large DRAM caches).

Citations: 13

Workload Characterization of a Parallel Video Mining Application on a 16-Way Shared-Memory Multiprocessor System

2006 IEEE International Symposium on Workload Characterization Pub Date : 2006-10-01 DOI: 10.1109/IISWC.2006.302725

Wenlong Li, E. Li, C. Dulong, Yen-kuang Chen, Tao Wang, Yimin Zhang

Abstract: As video data become more and more pervasive, mining information from multimedia data sources becomes increasingly important, e.g., automatically extracting highlights from soccer game video content. However, the huge computation requirement of mining the data of interest limits its wide use in practice. Since the hardware imperative behind computer architecture is shifting from uniprocessors to multi-core processors, exploiting the thread-level parallelism that exists in multimedia mining applications is critical to utilizing the hardware resources and accelerating the complex processing of highlight event detection. In this paper we analyze the view type and playfield detection application, a widely used application in sports video mining systems, and we present several different schemes (task-level, data-slicing-level, and a hybrid parallel scheme, as well as variations of the hybrid parallel scheme) for parallelizing this application. The hybrid parallel scheme, which exploits data-level and task-slicing-level parallelism, outperforms basic task-level and data-slicing-level schemes, delivering much better performance in terms of execution time and speedup. On a 16-way shared-memory multi-processing system with hardware prefetch enabled, the hybrid scheme achieves a speedup of 10.6x. Detailed performance analysis shows that because of the large working set, the workload often requires data from the off-chip memory. Therefore, saturated bus bandwidth utilization is the likely bottleneck preventing perfect scalability. With hardware prefetch enabled, the bus utilization rate on the 16-processor system is about 76% for the hybrid scheme, and the projected bus bandwidth requirement for perfect scalability is about 3.1 GB/s for 16 processors and 6.2 GB/s for 32 processors. In addition, our experiments reveal no other obvious scaling-limiting factors, e.g., synchronization and load imbalance overheads remain very low even with up to 16 processors.
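The scaling figures quoted in the abstract above can be related with a little arithmetic. The sketch below computes parallel efficiency as speedup divided by processor count and extrapolates the bandwidth requirement linearly in the number of processors. The linear model is an assumption that happens to match the two quoted figures; the abstract does not state the paper's actual projection method, and the function names are illustrative.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


def projected_bandwidth(known_gb_per_s, known_processors, target_processors):
    """Linearly extrapolate a bandwidth requirement to a different processor count."""
    return known_gb_per_s / known_processors * target_processors


if __name__ == "__main__":
    # Figures taken from the abstract: 10.6x speedup on 16 processors,
    # about 3.1 GB/s projected for perfect scalability on 16 processors.
    print(f"efficiency on 16 processors: {parallel_efficiency(10.6, 16):.1%}")
    print(f"projected need for 32 processors: {projected_bandwidth(3.1, 16, 32):.1f} GB/s")
```

Under this assumed model, doubling the processor count doubles the bus bandwidth needed for perfect scalability, consistent with the 3.1 GB/s and 6.2 GB/s figures.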

Citations: 8