IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004: latest publications

Evaluating performance of BLAST on Intel Xeon and Itanium2 processors

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-12-13 DOI: 10.1007/978-3-540-30566-8_115

R. Radhakrishnan, R. Ali, G. Kochhar, Kalyana Chadalavada, R. Rajagopalan, J. Hsieh, O. Celebioglu

Citations: 4

On the extraction and analysis of prevalent dataflow patterns

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437389

P. Sassone, D. S. Wills

Citations: 8

GENIUS: a generator of interactive user media sessions

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437393

C. Costa, C. Ramos, I. Cunha, J. Almeida

Citations: 10

Evaluation of a speculative multithreading compiler by characterizing program dependences

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437390

A. Bhowmik, M. Franklin

Citations: 0

Construction and performance characterization of parallel interior point solver on 4-way Intel Itanium 2 multiprocessor system

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437402

P. Koka, T. Suh, M. Smelyanskiy, R. Grzeszczuk, C. Dulong

Abstract: In recent years the interior point method (IPM) has become a dominant choice for solving large convex optimization problems in many scientific, engineering and commercial applications. Two reasons for the success of the IPM are its good scalability on existing multiprocessor systems with a small number of processors and its potential to deliver scalable performance on systems with a large number of processors. The scalability of a parallel IPM is determined by several key issues, such as exploiting parallelism due to the sparsity of the problem, reducing communication overhead, and proper load balancing. In this paper we present an implementation of a parallel linear programming IPM workload and characterize its scalability on a 4-way Itanium® 2 system. We show a speedup of up to 3 times for some of the datasets. We also present a detailed micro-architectural analysis of the workload using the VTune™ performance analyzer. Our results suggest that a good IPM implementation is latency-bound. Based on these findings, we make suggestions on how to improve the performance of the IPM workload in the future.
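As a rough reading of the reported speedup (up to 3 on a 4-way system), the sketch below computes parallel efficiency and the Karp-Flatt serial fraction. Both are standard metrics chosen here for illustration; the abstract does not say the paper uses them, and the function names are hypothetical.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric).

    Assumes an Amdahl-style model; this is an illustrative assumption,
    not something the abstract states.
    """
    if processors < 2:
        raise ValueError("the metric needs at least two processors")
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Figures taken from the abstract: speedup of up to 3 on a 4-way system.
efficiency = parallel_efficiency(3.0, 4)
serial_fraction = karp_flatt_serial_fraction(3.0, 4)
print(f"efficiency={efficiency:.2f}, serial fraction={serial_fraction:.3f}")
```

Under these assumptions, the best-case datasets run at three quarters of ideal efficiency, which would correspond to roughly one ninth of the work behaving as if it were serial.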

Citations: 3

Does halting make hardware trace collection inaccurate? A study using Pentium 4 performance counters and SPEC2000

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437397

M. Watson, J. Flanagan

Abstract: Processor address traces are invaluable for characterizing workloads and testing proposed memory hierarchies. Long traces are needed to exercise modern cache designs and produce meaningful results, but are difficult to collect with hardware monitors because microprocessors access memory too frequently for disks or other large storage to keep up. The small, fast buffers of the monitors fill quickly; in order to obtain long contiguous traces, the processor must be stopped while the buffer is emptied. This halting may perturb the traces collected, but this cannot be measured directly, since long uninterrupted traces cannot be collected. We make the case that hardware performance counters, which collect runtime statistics without influencing execution, can be used to measure halting effects. We use the performance counters of the Pentium 4 processor to collect statistics while halting the processor as if traces were being collected. We then compare these results to the statistics obtained from unhalted runs. We present our results in terms of which counters are affected, why, and what this means for trace-collection systems.

Citations: 2

The USAR characterization model

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437399

A. Pereira, G. Franco, L. Silva, Wagner Meira, Jr, W. Santos

Citations: 11

Characterizing the impact of different memory-intensity levels

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437388

R. Kotla, A. Devgan, S. Ghiasi, T. Keller, F. Rawson

Abstract: Applications on today's high-end processors typically make varying load demands over time. A single application may have many different phases during its lifetime, and workload mixes show interleaved phases. This work examines and uses the differences between memory- and CPU-intensive phases to reduce power. Today's processors provide resources that are underutilized during memory-intensive phases, consuming power while producing little incremental gain in performance. This work examines a deployed system consisting of identical cores with a goal of running each one at a different effective frequency. The initial goal is to find the appropriate frequency at which to run each phase. This paper demonstrates that memory intensity directly affects the throughput of applications. The results indicate that simple metrics such as IPC (instructions per cycle) cannot be used to determine at what frequency to run a phase. Instead, the approach uses performance counters that directly monitor memory behavior to identify memory-intensive phases, which can then be run on a slower core without incurring significant performance penalties. The key result of the paper is the introduction of a very simple, online model that uses the performance counter data to predict the performance of a program phase at any particular frequency setting. The information from this model allows a scheduler to decide which core to use to execute the program phase. Using a sophisticated power model for the processor family shows that this approach significantly reduces power consumption. The model was evaluated using a subset of SPECCPU and the SPECjbb and TPC-W benchmarks. It predicts performance with an average error of less than 10%. The power modeling shows that memory-intensive benchmarks achieve up to a 58% power reduction at a performance loss of less than 20% when run at 80% of nominal frequency.
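To illustrate why memory-intensive phases lose little performance at a lower frequency, here is a minimal sketch of a textbook-style model in which memory stall time does not scale with core frequency. This is an assumption made for illustration, not the online model the paper introduces; `relative_performance` and its parameters are hypothetical names.

```python
def relative_performance(memory_stall_fraction: float,
                         frequency_ratio: float) -> float:
    """Predicted performance at a scaled frequency, relative to nominal.

    memory_stall_fraction: share of nominal run time spent waiting on
        memory (assumed to be unaffected by core frequency).
    frequency_ratio: new frequency divided by nominal frequency.
    """
    if not 0.0 <= memory_stall_fraction <= 1.0:
        raise ValueError("memory_stall_fraction must be in [0, 1]")
    if frequency_ratio <= 0.0:
        raise ValueError("frequency_ratio must be positive")
    compute_fraction = 1.0 - memory_stall_fraction
    # Compute time stretches as frequency drops; memory time stays fixed.
    relative_time = compute_fraction / frequency_ratio + memory_stall_fraction
    return 1.0 / relative_time


# Compare a CPU-bound phase with a memory-bound one at 80% of nominal frequency.
for stall in (0.0, 0.5, 0.9):
    perf = relative_performance(stall, 0.8)
    print(f"stall fraction {stall:.1f}: {100 * (1 - perf):.1f}% performance loss")
```

In this simplified model a fully CPU-bound phase loses the full 20% at 80% of nominal frequency, while the loss shrinks as the stall fraction grows, which is consistent in direction with the abstract's claim.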

Citations: 44

Experiments with subsetting benchmark suites

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437398

H. Vandierendonck, K. De Bosschere

Abstract: Benchmarks are one of the most popular tools to compare the performance of computing systems. Benchmark suites typically contain multiple benchmark programs with more or less the same properties. Hence the suite contains redundancy, which increases the cost of executing or simulating the benchmark suite without adding value. To limit simulation time, researchers frequently subset benchmark suites. However, correctly identifying a representative subset is of paramount importance to perform a trustworthy evaluation. This paper shows that subsetting a benchmark suite in such a way that representativeness of the suite is maintained is non-trivial. We show that a small randomly selected subset is not representative of the full benchmark suite. We discuss algorithms to subset the SPEC CPU 2000 benchmark suite and show that they provide more representative subsets than randomly selected subsets. However, the algorithms evaluated in this paper do not always compute representative subsets: the algorithms produce bad results for some subset sizes. In this sense, these algorithms are unreliable, as it remains necessary to validate the benchmark suite subset. We find one subsetting algorithm that is reliable. It is, however, uncertain whether this algorithm is also reliable under other circumstances.
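The claim that small subsets can misrepresent the full suite can be illustrated with a toy experiment. The sketch below enumerates every subset of a given size and measures how far each subset's mean score falls from the full-suite mean. The scores are invented, the benchmark names are placeholders, and the arithmetic-mean error is an assumed stand-in for whatever representativeness measure the paper actually uses.

```python
from itertools import combinations
from statistics import mean


def subset_errors(scores: dict, subset_size: int) -> dict:
    """Relative error of each subset's mean against the full-suite mean."""
    full_mean = mean(scores.values())
    errors = {}
    for names in combinations(sorted(scores), subset_size):
        subset_mean = mean(scores[name] for name in names)
        errors[names] = abs(subset_mean - full_mean) / full_mean
    return errors


# Hypothetical per-benchmark scores; "d" is a deliberate outlier.
hypothetical_scores = {"a": 1.0, "b": 1.2, "c": 0.8, "d": 3.0, "e": 1.1, "f": 0.9}

errors = subset_errors(hypothetical_scores, subset_size=2)
worst = max(errors, key=errors.get)
best = min(errors, key=errors.get)
print(f"{len(errors)} subsets; best {best}: {errors[best]:.1%}, "
      f"worst {worst}: {errors[worst]:.1%}")
```

Picking a subset at random amounts to drawing one entry from `errors`, so the spread between the best and worst entries gives a feel for the risk; a subsetting algorithm is useful only if it lands reliably near the low end.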

Citations: 27

Micro-architectural anatomy of a commercial TCP/IP stack

IEEE International Workshop on Workload Characterization, 2004. WWC-7. 2004 Pub Date : 2004-10-25 DOI: 10.1109/WWC.2004.1437394

R. Illikkal, R. Iyer, D. Newell

Abstract: Over the last couple of decades, computer architects and performance analysts have routinely attempted to profile the overhead of TCP/IP processing in an effort to understand where the time was spent. It is well understood that this is a rather difficult problem since the processing time is spread across various software modules such as the network stack, interrupt routines, drivers, O/S scheduler, etc. As a result, the problem of extracting the micro-architectural characteristics of TCP/IP processing is significantly more challenging. In this paper, we start by covering the previous attempts at this problem and show what existing tools can provide in terms of execution time characteristics. We then propose a detailed methodology that combines full-system simulation, cycle-accurate performance simulations and symbol annotation to provide a rich cycle-accurate view of TCP/IP packet processing execution. We discuss initial results based on our profiling methodology and discuss where the time is spent. This includes an analysis of micro-architectural characteristics (such as instruction breakdown, CPI, MPI and TLB misses on a state-of-the-art microprocessor).

Citations: 2