{"title":"A portable, open-source implementation of the SPC-1 workload","authors":"S. Daniel, R. Faith","doi":"10.1109/IISWC.2005.1526014","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526014","url":null,"abstract":"This paper describes an open-source implementation of the Storage Performance Council's SPC benchmark-1 (SPC-1). Although this implementation cannot be used to generate official SPC-1 results, the code can be used for research and other unofficial work. We begin with a brief introduction to SPC-1, concentrating on qualities necessary to understand and use our implementation. Then we discuss important features of our open-source implementation and how it was validated against the official SPC-1 benchmark implementation.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127493684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload characterization for the design of future servers","authors":"B. Maron, T. Chen, D. Vianney, B. Olszewski, S. Kunkel, A. Mericas","doi":"10.1109/IISWC.2005.1526009","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526009","url":null,"abstract":"Workload characterization has become an integral part of the design of future servers, since workload characteristics can guide developers in understanding workload requirements and how the underlying architecture can optimize the performance of the intended workload. In this paper, we give an overview of the POWER5 architecture. We also introduce the POWER5 performance monitor facilities and performance events that lead to the construction of a CPI (cycles per instruction) breakdown model. For our study, we characterize four different groups of workloads: commercial, HPC, memory, and scientific. Using the data obtained from the POWER5 performance counters, we break down the CPI stack into a base component, when the processor is completing work, and a stall component, when the processor is not completing instructions. The stall component can be further divided into cycles when the pipeline was empty and cycles when the pipeline was not empty but completion is stalled. With this model, we enumerate the number of processing cycles, i.e., a fraction of the CPI, a workload spent while progressing through the core resources and the penalty incurred upon encountering those resource usage inhibitors. The results show the CPI breakdown for each workload, identify where each workload spends its processing cycles, and give the associated CPI cost when accessing the core resources.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133100785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The ALPBench benchmark suite for complex multimedia applications","authors":"Man-Lap Li, Ruchira Sasanka, S. Adve, Yen-kuang Chen, E. Debes","doi":"10.1109/IISWC.2005.1525999","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1525999","url":null,"abstract":"Multimedia applications are becoming increasingly important for a large class of general-purpose processors. Contemporary media applications are highly complex and demand high performance. A distinctive feature of these applications is that they have significant parallelism, including thread-, data-, and instruction-level parallelism, that is potentially well-aligned with the increasing parallelism supported by emerging multi-core architectures. Designing systems to meet the demands of these applications therefore requires a benchmark suite that comprises such complex applications and exposes the parallelism present in them. This paper makes two contributions. First, it presents ALPBench, a publicly available benchmark suite that pulls together five complex media applications from various sources: speech recognition (CMU Sphinx 3), face recognition (CSU), ray tracing (Tachyon), MPEG-2 encode (MSSG), and MPEG-2 decode (MSSG). We have modified the original applications to expose thread-level and data-level parallelism using POSIX threads and sub-word SIMD (Intel's SSE2) instructions, respectively. Second, the paper provides a performance characterization of the ALPBench benchmarks, with a focus on parallelism. Such a characterization is useful for architects and compiler writers for designing systems and compiler optimizations for these applications.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128357687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The FeasNewt benchmark","authors":"T. Munson, P. Hovland","doi":"10.1109/IISWC.2005.1526011","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526011","url":null,"abstract":"We describe the FeasNewt mesh-quality optimization benchmark. The performance of the code is dominated by three phases - gradient evaluation, Hessian evaluation and assembly, and sparse matrix-vector products - that have very different mixtures of floating-point operations and memory access patterns. The code includes an optional runtime data- and iteration-reordering phase, making it suitable for research on irregular memory access patterns. Mesh-quality optimization (or \"mesh smoothing\") is an important ingredient in the solution of nonlinear partial differential equations (PDEs) as well as an excellent surrogate for finite-element or finite-volume PDE solvers.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115865171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SenseBench: toward an accurate evaluation of sensor network processors","authors":"L. Nazhandali, M. Minuth, T. Austin","doi":"10.1109/IISWC.2005.1526017","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526017","url":null,"abstract":"Sensor network processors introduce an unprecedented level of compact and portable computing. These small processing systems reside in the environment which they monitor, combining sensing, computation, storage, communication, and power supplies into small form factors. Sensor processors have a wide variety of applications in medical monitoring, environmental sensing, industrial inspection, and military surveillance. Despite efforts to design suitable processors for these systems (Ekanayake et al., 2004; Hempstead et al., 2005; Nazhandali et al., 2005; Warneke and Pister, 2004), there is no well-defined method to evaluate their performance and energy consumption. The historically used MIPS (millions of instructions per second) and EPI (energy per instruction) metrics cannot provide an accurate comparison because of their dependence on the nature of instructions, which differ across instruction set architectures. On the other hand, the current well-defined benchmarks (1989; Guthaus et al., 2001; Lee et al., 1997) do not represent typical workloads of sensor network systems, and hence, are not suitable to compare sensor processors. This paper defines a set of stream applications representing the typical real-time workload of a sensor processor. Furthermore, three new metrics, EPB (energy per bundle), xRT (times real-time), and CFP (composition footprint), are introduced to evaluate and compare such systems.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"23 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126320364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel processing in biological sequence comparison using general purpose processors","authors":"Friman Sánchez, E. Salamí, Alex Ramírez, M. Valero","doi":"10.1109/IISWC.2005.1526005","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526005","url":null,"abstract":"The comparison and alignment of DNA and protein sequences are important tasks in molecular biology and bioinformatics. One of the most well known algorithms to perform the string-matching operation present in these tasks is the Smith-Waterman algorithm (SW). However, it is a computation intensive algorithm, and many researchers have developed heuristic strategies to avoid using it, especially when searching large databases. There are several efficient implementations of the SW algorithm on general purpose processors. These implementations try to extract data-level parallelism by taking advantage of single-instruction multiple-data (SIMD) extensions, capable of performing several operations in parallel on a set of data. In this paper, we propose a more efficient data parallel implementation of the SW algorithm. Our proposed implementation obtains a 30% reduction in execution time relative to the previous best data-parallel alternative. We review different alternative implementations of the SW algorithm, compare them with our proposal, and present preliminary results for some heuristic implementations. Finally, we present a detailed study of the computational complexity of the different alignment algorithms presented and their behavior on different aspects of the CPU microarchitecture.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125245376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting program microarchitecture independent characteristics and phase behavior for reduced benchmark suite simulation","authors":"L. Eeckhout, Jack Sampson, Brad Calder","doi":"10.1109/IISWC.2005.1525996","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1525996","url":null,"abstract":"Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to complete. Simulating the full execution of the whole benchmark suite for one architecture configuration can take months. To address this issue, researchers have examined using targeted sampling based on phase behavior to significantly reduce the simulation time of each program in the benchmark suite. However, even with this sampling approach, simulating the full benchmark suite across a large range of architecture designs can take days to weeks to complete. The goal of this paper is to further reduce simulation time for architecture design space exploration. We reduce simulation time by finding similarity between benchmarks and program inputs at the level of samples (100M instructions of execution). This allows us to use a representative sample of execution from one benchmark to accurately represent a sample of execution of other benchmarks and inputs. The end result of our analysis is a small number of sample points of execution. These are selected across the whole benchmark suite in order to accurately represent the complete simulation of the whole benchmark suite for design space exploration. We show that this provides approximately the same accuracy as the SimPoint sampling approach while reducing the number of simulated instructions by a factor of 1.5.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114254593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and analysis of HMMER and SVM-RFE parallel bioinformatics applications","authors":"U. Srinivasan, Peng-Sheng Chen, Q. Diao, C. Lim, E. Li, Yongjian Chen, R. Ju, Yimin Zhang","doi":"10.1109/IISWC.2005.1526004","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526004","url":null,"abstract":"Bioinformatics applications constitute an emerging data-intensive, high-performance computing (HPC) domain. While there is much research on algorithmic improvements (2004), the actual performance of an application also depends on how well the program maps to the target hardware. This paper presents a performance study of two parallel bioinformatics applications, HMMER (sequence alignment) and SVM-RFE (gene expression analysis), on Intel x86 based hyperthread-capable (2002) shared-memory multiprocessor systems. The performance characteristics varied according to the application and target hardware characteristics. For instance, HMMER is compute intensive and showed better scalability on a 3.0 GHz system versus a 2.2 GHz system. However, SVM-RFE is memory intensive and showed better absolute performance on the 2.2 GHz machine, which has better memory bandwidth. The performance is also impacted by processor features, e.g. hyperthreading (HT) (2002) and prefetching. With HMMER we could obtain ~75% of the performance with HT enabled with respect to doubling the number of CPUs. While load balancing optimizations can provide a speedup of ~30% for HMMER on a hyperthreading-enabled system, the load balancing has to adapt to the target number of processors and threads. SVM-RFE benefits differently from the same load-balancing and thread scheduling tuning. We conclude that compiler and runtime optimizations play an important role in achieving the best performance for a given bioinformatics algorithm.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133711097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing sources and remedies for packet loss in network intrusion detection systems","authors":"Lambert Schaelicke, J. C. Freeland","doi":"10.1109/IISWC.2005.1526016","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526016","url":null,"abstract":"Network intrusion detection is becoming an increasingly important tool to protect critical information and infrastructure from unauthorized access. Network intrusion detection systems (NIDS) are commonly based on general-purpose workstations connected to a network tap. However, these general-purpose systems, although cost-efficient, are not able to sustain the packet rates of modern high-speed networks. The resulting packet loss degrades the system's overall effectiveness, since attackers can intentionally overload the NIDS to evade detection. This paper studies the performance requirements of a commonly used open-source NIDS on a modern workstation architecture. Using full-system simulation, this paper characterizes the impact of a number of system-level optimizations and architectural trends on packet loss, and highlights the key bottlenecks for this type of network-intensive workload. Results suggest that interrupt aggregation combined with rule set pruning is most effective in minimizing packet loss. Surprisingly, the workload also exhibits sufficient locality to benefit from larger level-2 caches as well. On the other hand, many other common architecture and system optimizations have only a negligible impact on throughput.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129237254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient power analysis using synthetic testcases","authors":"R. Bell, L. John","doi":"10.1109/IISWC.2005.1526007","DOIUrl":"https://doi.org/10.1109/IISWC.2005.1526007","url":null,"abstract":"Power dissipation has become an important consideration for processor designs. Assessing power using simulators is problematic given the long runtimes of real applications. Researchers have responded with techniques to reduce the total number of simulated instructions while still maintaining representative simulation behavior. Synthetic testcases have been shown to reduce the number of necessary instructions significantly while still achieving accurate performance results for many workload characteristics. In this paper, we show that the synthetic testcases can rapidly and accurately assess the dynamic power dissipation of real programs. Synthetic versions of the SPEC2000 and STREAM benchmarks can predict the total power per cycle to within 6.8% error on average, with a maximum of 15% error, and total power per instruction to within 4.4% error. In addition, for many design changes for which IPC and power change significantly, the synthetic testcases show small errors, many less than 5%. We also show that simulated power dissipation for both applications and synthetics correlates well with the IPCs of the real programs, often giving a correlation coefficient greater than 0.9.","PeriodicalId":275514,"journal":{"name":"IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132728052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}