IEEE International Symposium on Workload Characterization (IISWC'10): Latest Papers (Page 2)

Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649549

Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li

Abstract: GPUs are emerging as general-purpose high-performance computing devices. Growing GPGPU research has made numerous GPGPU workloads available. However, a systematic approach to characterizing these benchmarks and analyzing their implications for GPU microarchitecture design evaluation is still lacking. In this research, we propose a set of microarchitecture-agnostic GPGPU workload characteristics to represent the workloads in a microarchitecture-independent space. A correlated dimensionality reduction process and clustering analysis are used to understand these workloads. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. With a growing number of GPGPU workloads, this approach of analysis provides meaningful, accurate and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads to stress their intended functional block of the GPU microarchitecture. We present a diversity analysis of GPU benchmark suites such as Nvidia CUDA SDK, Parboil and Rodinia. Our results show that with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces. We have also explored diversity in different workload subspaces (e.g. memory coalescing and branch divergence). Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort, and Nearest Neighbor workloads exhibit relatively large variation in branch divergence characteristics compared to others. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score and Parallel Reduction.
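The abstract describes a pipeline of characterization, dimensionality reduction, and clustering, which can be illustrated with a minimal sketch. The abstract does not name the exact algorithms, so this sketch assumes PCA (first component only, via power iteration) for the reduction and a simple largest-gap split as a stand-in for the clustering step. The function names and the three feature columns are hypothetical, not taken from the paper.

```python
import math

def standardize(rows):
    """Z-score each column so that no characteristic dominates by scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def top_component(rows, iters=200):
    """First principal direction of standardized rows, by power iteration."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumption: the start vector is not orthogonal to the answer
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

def project(rows, v):
    """Project every workload onto the direction v (one score per workload)."""
    return [sum(a * b for a, b in zip(r, v)) for r in rows]

def two_clusters(scores):
    """Split 1-D scores at the largest gap (stand-in for real clustering)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    gaps = [(scores[order[k + 1]] - scores[order[k]], k)
            for k in range(len(order) - 1)]
    _, k = max(gaps)
    return set(order[:k + 1]), set(order[k + 1:])

# Hypothetical per-workload characteristics (e.g. divergence, coalescing, ...).
workloads = [[0.10, 0.20, 0.15], [0.12, 0.25, 0.10],
             [0.80, 0.90, 0.85], [0.85, 0.95, 0.90]]
z = standardize(workloads)
low, high = two_clusters(project(z, top_component(z)))
```

With these made-up numbers the first two workloads land in one group and the last two in the other. A real study would keep several principal components and use a proper clustering algorithm.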

Citations: 77

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650274

Shuai Che, J. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, K. Skadron

Citations: 305

Characterization of workload and resource consumption for an online travel and booking site

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649408

Nicolás Poggi, David Carrera, Ricard Gavaldà, J. Torres, E. Ayguadé

Abstract: Online travel and ticket booking is one of the top E-Commerce industries. As travel sites present a mix of products (flights, hotels, tickets, restaurants, activities and vacation packages), they rely on a wide range of technologies to support them: JavaScript, AJAX, XML, B2B Web services, caching, search algorithms and affiliation, resulting in a very rich and heterogeneous workload. Moreover, visits to travel sites vary greatly with time of day, season, promotions, events, and linking, creating bursty traffic and making capacity planning a challenge. It is therefore of great importance to understand how users and crawlers interact on travel sites, and their effect on server resources, in order to devise cost-effective infrastructures and improve the Quality of Service for users. In this paper we present a detailed workload and resource consumption characterization of the web site of a top national Online Travel Agency. Characterization is performed on server logs, including both HTTP data and resource consumption of the requests, as well as the server load status during the execution. From the dataset we characterize user sessions, their patterns, and how response time is affected as load on Web servers increases. We provide a fine-grained analysis by performing experiments that differentiate types of request, time of day, products, and the resource requirements of each. Results show that the workload is bursty, as expected; that it exhibits different properties between day and night traffic in terms of request type mix; that user session lengths cover a wide range of durations; that response time grows proportionally to server load; and that the response time of external data providers also increases at peak hours, amongst other results. Such results can be useful for optimizing infrastructure costs, improving QoS for users, and developing realistic workload generators for similar applications.
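One common way to quantify the burstiness mentioned in the abstract is a peak-to-mean ratio over fixed time buckets. The following is a minimal sketch under that assumption. The abstract does not state which burstiness measure the authors used, and the function names and the 60-second bucket size are hypothetical.

```python
from collections import Counter

def requests_per_bucket(timestamps, bucket_seconds=60):
    """Count requests in consecutive fixed-size time buckets.

    timestamps: request arrival times in seconds (e.g. parsed from a log).
    Empty buckets between the first and last request are kept as zeros.
    """
    counts = Counter(int(t // bucket_seconds) for t in timestamps)
    first, last = min(counts), max(counts)
    return [counts.get(b, 0) for b in range(first, last + 1)]

def peak_to_mean(series):
    """Peak-to-mean ratio: 1.0 for perfectly flat traffic, larger when bursty."""
    return max(series) / (sum(series) / len(series))

# Hypothetical arrivals: a quiet minute followed by a burst.
arrivals = [0, 10, 20, 61, 125, 126, 127, 128, 129, 130]
series = requests_per_bucket(arrivals)  # [3, 1, 6]
ratio = peak_to_mean(series)
```

Running the same computation separately on day and night portions of a log would be one way to compare the two traffic regimes the abstract distinguishes.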

Citations: 30

Toward a more accurate understanding of the limits of the TLS execution paradigm

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649169

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Craig Pocock, Gavin Brown, M. Luján, I. Watson, Marcelo H. Cintra

Citations: 19

Benchmark synthesis for architecture and compiler exploration

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650208

Luk Van Ertvelde, L. Eeckhout

Abstract: This paper presents a novel benchmark synthesis framework with three key features. First, it generates synthetic benchmarks in a high-level programming language (C in our case), in contrast to prior work in benchmark synthesis which generates synthetic benchmarks in assembly. Second, the synthetic benchmarks hide proprietary information from the original workloads they are built after. Hence, companies may want to distribute synthetic benchmark clones to third parties as proxies for their proprietary codes; third parties can then optimize the target system without having access to the original codes. Third, the synthetic benchmarks are shorter running than the original workloads they are modeled after, yet they are representative. In summary, the proposed framework generates small (thus quick to simulate) and representative benchmarks that can serve as proxies for other workloads without revealing proprietary information; and because the benchmarks are generated in a high-level programming language, they can be used to explore both the architecture and compiler spaces. The results obtained with our initial framework are promising. We demonstrate that we can generate synthetic proxy benchmarks for the MiBench benchmarks, and we show that they are representative across a range of machines with different instruction-set architectures, microarchitectures, and compilers and optimization levels, while being 30 times shorter running on average. We also verify using software plagiarism detection tools that the synthetic benchmark clones hide proprietary information from the original workloads.

Citations: 27

Analyzing and scaling parallelism for network routing protocols

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650317

A. Dhanotia, Sabina Grover, G. Byrd

Citations: 1

Characterizing datasets for data deduplication in backup applications

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650369

Nohhyun Park, D. Lilja

Abstract: The compression and throughput performance of a data deduplication system is directly affected by the input dataset. We propose two sets of evaluation metrics, and the means to extract those metrics, for deduplication systems. The first set of metrics represents how the composition of segments changes within the deduplication system over five full backups. This in turn allows more insight into how the compression ratio will change as data accumulate. The second set of metrics represents index table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves the overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the data deduplication parameters and 4) potentially decide which portion of the dataset is best suited for deduplication.
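The idea that the compression ratio changes as successive full backups accumulate can be shown with a small sketch. This is an illustration only, not the paper's metric definitions: it assumes fixed-size segments and SHA-1 fingerprints, and the function names and the 4-byte segment size are hypothetical choices for a toy example.

```python
import hashlib

def segments(data, size=4):
    """Split a byte string into fixed-size segments (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cumulative_ratios(backups, size=4):
    """Compression ratio (logical bytes / stored bytes) after each backup.

    A segment is stored only the first time its fingerprint is seen;
    later copies, in the same or a later backup, count as duplicates.
    """
    seen = set()
    logical = stored = 0
    ratios = []
    for data in backups:
        for seg in segments(data, size):
            logical += len(seg)
            fingerprint = hashlib.sha1(seg).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                stored += len(seg)
        ratios.append(logical / stored)
    return ratios

# Three toy "full backups": one segment changes once, then nothing changes.
backups = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBDDDD"]
print(cumulative_ratios(backups))  # [1.0, 1.5, 2.25]
```

The ratio rises with each backup because most segments repeat. A dataset whose segments change heavily between backups would show a much flatter curve, which is the kind of difference such metrics are meant to expose.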

Citations: 49

Performance of multi-process and multi-thread processing on multi-core SMT processors

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5650174

H. Inoue, T. Nakatani

Abstract: Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence achieving good performance scalability of programs with respect to the number of cores (core scalability) and SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (using multiple processes and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability compared to the multi-process model by reducing the number of cache misses and DTLB misses. However, both models achieve roughly equal core scalability. We show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory page for each processor core. When using all of the hardware threads on a Niagara, the core-aware allocator reduces the DTLB misses by 46.7% compared to the default allocator, and it improves the performance by 3.0%.
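The intuition behind a core-aware allocator can be illustrated with a toy model that counts how many distinct pages each core's blocks end up on. This is a sketch under strong assumptions, not the authors' PHP allocator: it treats the number of distinct pages per core as a rough proxy for DTLB pressure, uses simple bump allocation, and the function name and parameters are hypothetical.

```python
def pages_touched(requests, blocks_per_page, core_aware):
    """Count the distinct pages holding each core's blocks.

    requests: core ids, one per block allocation, in arrival order.
    core_aware=False: all cores bump-allocate from one shared arena.
    core_aware=True: each core bump-allocates from its own arena.
    """
    next_slot = {}  # arena key -> number of blocks already handed out
    touched = {}    # core id -> set of (arena, page index) pairs
    for core in requests:
        arena = core if core_aware else "shared"
        slot = next_slot.get(arena, 0)
        next_slot[arena] = slot + 1
        touched.setdefault(core, set()).add((arena, slot // blocks_per_page))
    return {core: len(pages) for core, pages in touched.items()}

# Two cores allocating alternately, 8 blocks each, 4 blocks per page.
requests = [0, 1] * 8
shared = pages_touched(requests, blocks_per_page=4, core_aware=False)
aware = pages_touched(requests, blocks_per_page=4, core_aware=True)
print(shared)  # {0: 4, 1: 4}
print(aware)   # {0: 2, 1: 2}
```

In this model the shared arena interleaves blocks from both cores on every page, so each core touches twice as many pages as it would with its own arena. Real DTLB behavior also depends on access patterns, page size, and TLB capacity, none of which this sketch models.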

Citations: 18

A limit study of JavaScript parallelism

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649419

Emily Fortuna, O. Anderson, L. Ceze, S. Eggers

Citations: 42

Analysis on semantic transactional memory footprint for hardware transactional memory

IEEE International Symposium on Workload Characterization (IISWC'10) Pub Date : 2010-12-02 DOI: 10.1109/IISWC.2010.5649529

Jaewoong Chung, Dhruva R. Chakrabarti, C. Minh

Abstract: We analyze various characteristics of semantic transactional memory footprint (STMF) that consists of only the memory accesses the underlying hardware transactional memory (HTM) system has to manage for the correct execution of transactional programs. Our analysis shows that STMF can be significantly smaller than declarative transactional memory footprint (DTMF) that contains all memory accesses within transaction boundaries (i.e., only 8.3% of DTMF in the applications examined). This result encourages processor designers and software toolchain developers to explore new design points for low-cost HTM systems and intelligent software toolchains to find and leverage STMF efficiently. We identify seven code patterns that belong to DTMF, but not to STMF, and show that they take up 91.7% of all memory accesses in transactional boundaries, on average, for the transactional programs examined. A new instruction prefix is proposed to express STMF efficiently, and the existing compiler techniques are examined to check their applicability to deduce STMF from DTMF. Our trace analysis shows that using STMF significantly reduces the ratio of transactions overflowing a 32KB L1 cache, from 12.80% to 2.00%, and substantially lowers the false positive probability of Bloom filters used for transaction signature management, from 23.60% to less than 0.001%. The simulation result shows that the STAMP applications with the STMF expression run 40% faster on average than those with the DTMF expression.
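Why a smaller footprint lowers the Bloom filter false positive probability follows from the standard approximation p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted addresses, m the number of filter bits, and k the number of hash functions. The sketch below evaluates that textbook formula. The filter size, hash count, and address counts are hypothetical and are not the configuration behind the percentages reported in the abstract.

```python
import math

def bloom_false_positive(n_items, m_bits, k_hashes):
    """Standard approximation of a Bloom filter's false positive probability."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Hypothetical signature: 2048 bits, 4 hash functions.
M_BITS, K_HASHES = 2048, 4
full_footprint = 1000                         # addresses tracked per transaction
reduced_footprint = int(full_footprint * 0.083)  # 8.3% of the above, as a scale

p_full = bloom_false_positive(full_footprint, M_BITS, K_HASHES)
p_reduced = bloom_false_positive(reduced_footprint, M_BITS, K_HASHES)
```

Because the probability grows steeply with n for a fixed m and k, tracking only a small fraction of the addresses reduces false positives by far more than the same fraction.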

Citations: 1