Latest publications from the 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Machine learning for performance and power modeling/prediction
L. John
DOI: 10.1109/ISPASS.2017.7975264 (published 2017-04-24)

Abstract: Effective design space exploration relies on fast and accurate pre-silicon performance and power models. Simulation is commonly used for understanding architectural tradeoffs; however, many emerging workloads cannot even run on many full-system simulators, and even when one does run, only a tiny part of the workload can be simulated because detailed simulators are prohibitively slow. This talk presents examples of how machine learning can be used to solve some of the problems haunting the performance evaluation field. One application of machine learning is cross-platform performance and power prediction: if one model is slow to run real-world benchmarks/workloads, is it possible to predict/estimate its performance/power using runs on another platform? Are there correlations that machine learning can exploit to make cross-platform performance and power predictions? A methodology for cross-platform performance/power prediction will be presented in this talk. Another application, using machine learning to calibrate analytical power estimation models, will also be discussed. Yet another application of machine learning has been to create max-power stressmarks. Manually developing and tuning so-called stressmarks is extremely tedious and time-consuming while requiring an intimate understanding of the processor. In our past research, we created a framework that uses machine learning for the automated generation of stressmarks. In this talk, the methodology for creating automatic stressmarks will be explained, and experiments on multiple platforms validating the proposed approach will also be described.
Citations: 1
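The cross-platform prediction question the talk poses can be made concrete with a toy sketch: fit a model on measurements taken on one platform and use it to estimate power on another. Everything below (the single IPC feature, the synthetic counter values, the linear model form) is an illustrative assumption, not the talk's actual methodology, which would use many counters and richer learners.

```python
# Illustrative sketch (not the talk's actual model): predict platform-B
# power from a platform-A performance counter via ordinary least squares.
# All counter values below are synthetic.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Synthetic training data: IPC measured on platform A vs. watts on platform B.
ipc_a   = [0.8, 1.0, 1.2, 1.6, 2.0]
watts_b = [20.0, 24.0, 28.0, 36.0, 44.0]

a, b = fit_line(ipc_a, watts_b)

def predict_power(ipc):
    """Estimate platform-B power from a platform-A IPC measurement."""
    return a * ipc + b
```

In practice one would validate such a model against held-out measurements before trusting cross-platform estimates.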
dist-gem5: Distributed simulation of computer clusters
Mohammad Alian, Umur Darbaz, G. Dózsa, S. Diestelhorst, Daehoon Kim, N. Kim
DOI: 10.1109/ISPASS.2017.7975287 (published 2017-04-24)

Abstract: When analyzing a distributed computer system, we often observe that the complex interplay among processor, node, and network sub-systems can profoundly affect the performance and power efficiency of the distributed computer system. Therefore, to effectively cross-optimize hardware and software components of a distributed computer system, we need a full-system simulation infrastructure that can precisely capture the complex interplay. Responding to the aforementioned need, we present dist-gem5, a flexible, detailed, and open-source full-system simulation infrastructure that can model and simulate a distributed computer system using multiple simulation hosts. We validate dist-gem5 against a physical cluster and show that the latency and bandwidth of the simulated network sub-system are within 18% of the physical one. Compared with the single-threaded and parallel versions of gem5, dist-gem5 speeds up the simulation of a 63-node computer cluster by 83.1x and 12.8x, respectively.
Citations: 33
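A distributed simulator like the one described must keep its per-host simulation processes loosely in step; a common mechanism for this is quantum-based synchronization, where each process advances by a fixed quantum of simulated time and then waits at a global barrier. The sketch below illustrates that general mechanism with threads; the quantum size, node count, and names are our own, and this is not dist-gem5 code.

```python
# Conceptual sketch of quantum-based synchronization: every simulation
# process advances one quantum of simulated time, then waits at a barrier
# so no process races ahead of the others by more than one quantum.
import threading

QUANTUM = 1000    # simulated ticks per synchronization interval
N_NODES = 4       # simulated cluster nodes, one thread each
N_QUANTA = 5

barrier = threading.Barrier(N_NODES)
local_time = [0] * N_NODES
max_skew = 0
skew_lock = threading.Lock()

def simulate(node):
    global max_skew
    for _ in range(N_QUANTA):
        local_time[node] += QUANTUM      # simulate one quantum of work
        with skew_lock:
            skew = max(local_time) - min(local_time)
            max_skew = max(max_skew, skew)
        barrier.wait()                   # global sync bounds the time skew

threads = [threading.Thread(target=simulate, args=(i,)) for i in range(N_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every thread must reach the barrier before any proceeds, observed skew between any two nodes' simulated clocks never exceeds one quantum, which is what lets cross-node network packets be delivered at a consistent simulated time.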
Chai: Collaborative heterogeneous applications for integrated-architectures
Juan Gómez-Luna, I. E. Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia De Gonzalo, T. Jablin, Antonio J. Peña, Wen-mei W. Hwu
DOI: 10.1109/ISPASS.2017.7975269 (published 2017-04-24)

Abstract: Heterogeneous system architectures are evolving towards tighter integration among devices, with emerging features such as shared virtual memory, memory coherence, and system-wide atomics. Languages, device architectures, system specifications, and applications are rapidly adapting to the challenges and opportunities of tightly integrated heterogeneous platforms. Programming languages such as OpenCL 2.0, CUDA 8.0, and C++ AMP allow programmers to exploit these architectures for productive collaboration between CPU and GPU threads. To evaluate these new architectures and programming languages, and to empower researchers to experiment with new ideas, a suite of benchmarks targeting these architectures with close CPU-GPU collaboration is needed. In this paper, we classify applications that target heterogeneous architectures into generic collaboration patterns including data partitioning, fine-grain task partitioning, and coarse-grain task partitioning. We present Chai, a new suite of 14 benchmarks that cover these patterns and exercise different features of heterogeneous architectures with varying intensity. Each benchmark in Chai has seven different implementations in different programming models such as OpenCL, C++ AMP, and CUDA, with and without the use of the latest heterogeneous architecture features. We characterize the behavior of each benchmark with respect to varying input sizes and collaboration combinations, and evaluate the impact of using the emerging features of heterogeneous architectures on application performance.
Citations: 72
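The data-partitioning collaboration pattern the paper classifies can be sketched in a few lines: one input is split by a tunable ratio and the CPU and GPU portions are processed concurrently. In this hedged sketch both "devices" are plain threads and the ratio parameter name (alpha) is our own; a real Chai benchmark would launch an actual GPU kernel on the second partition.

```python
# Illustrative data-partitioning sketch: split work between a "CPU" worker
# and a "GPU" worker (both just threads here) and merge the results.
from concurrent.futures import ThreadPoolExecutor

def cpu_kernel(chunk):
    return [x * x for x in chunk]

def gpu_kernel(chunk):            # stand-in for a real GPU kernel launch
    return [x * x for x in chunk]

def collaborative_map(data, alpha=0.5):
    """Process `data`, giving a fraction `alpha` of it to the CPU."""
    split = int(len(data) * alpha)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(cpu_kernel, data[:split])
        gpu_part = pool.submit(gpu_kernel, data[split:])
        return cpu_part.result() + gpu_part.result()
```

Tuning alpha per input size is exactly the kind of collaboration-combination sweep the characterization in the paper performs.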
OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel
Hyoukjun Kwon, T. Krishna
DOI: 10.1109/ISPASS.2017.7975291 (published 2017-04-01)

Abstract: The chip industry faces two key challenges today: the impending end of Moore's Law and the rising costs of chip design and verification (millions of dollars today). Heterogeneous IPs (cores and domain-specific accelerators) are a promising answer to the first challenge, enabling performance and energy benefits no longer provided by technology scaling. IP-reuse with plug-and-play designs can help with the second challenge, amortizing NRE costs tremendously. A key requirement in a heterogeneous IP-based plug-and-play SoC environment is an interconnection fabric to connect these IPs together. This fabric needs to be scalable (low latency, low energy, and low area) and yet be flexible/parametrizable for use across designs. The key scalability challenge in any Network-on-Chip (NoC) today is that latency increases in proportion to the number of hops. In this work, we present a NoC generator called OpenSMART, which generates low-latency NoCs based on SMART. SMART is a recently proposed NoC microarchitecture that enables multi-hop on-chip traversals within a single cycle, removing the dependence of latency on hop count. SMART leverages the wire delay of the underlying repeated wires, and augments each router with the ability to request and set up bypass paths. OpenSMART takes SMART from a NoC optimization to a design methodology for SoCs, enabling users to generate verified RTL for a class of user-specified network configurations, such as network size, topology, routing algorithm, number of VCs/buffers, router pipeline stages, and so on. OpenSMART also provides the ability to generate any heterogeneous topology with low- and high-radix routers and optimized single-stage pipelines, leveraging fast logic delays in today's technology nodes. OpenSMART v1.0 comes with both Bluespec System Verilog and Chisel implementations, and this paper also presents a case study of our experiences with both languages. OpenSMART is available for download and is going to be a key addition to the emerging open-source hardware movement, providing a glue for interconnecting existing and emerging IPs.
Citations: 52
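The latency argument in the abstract reduces to simple arithmetic: if a flit can bypass up to HPC (hops per cycle) routers in one cycle, an H-hop traversal costs roughly ceil(H / HPC) cycles instead of one cycle per hop. The back-of-the-envelope model below illustrates that relationship only; it ignores contention, bypass-setup cycles, and router pipeline depth, and the HPC value is an assumed example.

```python
# Back-of-the-envelope NoC latency model illustrating the SMART bypass idea.
import math

def baseline_latency(hops, router_cycles=1):
    """Conventional NoC: every hop pays the router pipeline latency."""
    return hops * router_cycles

def smart_latency(hops, hpc=8):
    """SMART-style bypass: up to `hpc` hops collapse into a single cycle."""
    return math.ceil(hops / hpc)
```

For a 15-hop traversal with HPC = 8, the bypass model yields 2 cycles versus 15 in the baseline, which is the "removing the dependence of latency on hops" effect the abstract describes.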
A taxonomy of out-of-order instruction commit
M. Alipour, Trevor E. Carlson, S. Kaxiras
DOI: 10.1109/ISPASS.2017.7975283 (published 2017-04-01)

Abstract: While in-order instruction commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance without the need for additional hardware resources. In this paper, we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by introducing a taxonomy and evaluating three different micro-architectures that have this technique enabled. We show how smaller processors can benefit from simple out-of-order commit strategies, but that larger, aggressive cores require more aggressive strategies to improve performance.
Citations: 1
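The resource-release difference the taxonomy studies can be shown with a toy model: given the cycle at which each in-flight instruction completes, in-order commit frees an entry only after all older entries are free, while an idealized, unrestricted out-of-order commit frees it at completion. The function names and the example completion times are our own illustrations, not the paper's evaluated micro-architectures.

```python
# Toy model of reorder-buffer entry release under two commit policies.

def inorder_release(completion):
    """Entry i is released once instructions 0..i have all completed,
    so a slow older instruction delays everything behind it."""
    release, latest = [], 0
    for c in completion:
        latest = max(latest, c)
        release.append(latest)
    return release

def ooo_release(completion):
    """Idealized out-of-order commit: release each entry at completion."""
    return list(completion)

# A long-latency miss (cycle 50) early in program order blocks all
# younger, already-finished instructions under in-order commit.
completion = [50, 3, 4, 5]
```

The gap between the two release schedules is exactly the resource occupancy that out-of-order commit strategies reclaim.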
Treelogy: A benchmark suite for tree traversals
Nikhil Hegde, Jianqiao Liu, Kirshanthan Sundararajah, Milind Kulkarni
DOI: 10.1109/ISPASS.2017.7975294 (published 2017-04-01)

Abstract: An interesting class of irregular algorithms is tree traversal algorithms, which repeatedly traverse various trees to perform efficient computations. Tree traversal algorithms form the algorithmic kernels in an important set of applications in scientific computing, computer graphics, bioinformatics, and data mining. There has been increasing interest in understanding tree traversal algorithms, optimizing them, and applying them in a wide variety of settings. Crucially, while there are many possible optimizations for tree traversal algorithms, which optimizations apply to which algorithms depends on algorithmic characteristics. In this work, we present a suite of tree traversal kernels, drawn from diverse domains, called Treelogy, to explore the connection between tree traversal algorithms and state-of-the-art optimizations. We characterize these algorithms by developing an ontology based on their structural properties. The attributes extracted through our ontology, for a given traversal kernel, can aid in quick analysis of the suitability of platform- and application-specific as well as independent optimizations. We provide reference implementations of these kernels for three platforms: shared-memory multicores, distributed-memory systems, and GPUs, and evaluate their scalability.
Citations: 8
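A minimal example of the kernel class the suite collects: many queries each traverse the same tree, and the traversal prunes subtrees that cannot contribute to the answer. The sketch below (a range count over a balanced binary search tree) is our own illustration and not a Treelogy kernel; the structural properties it exhibits (repeated traversals, data-dependent pruning) are what the paper's ontology captures.

```python
# A tiny tree-traversal kernel: count tree values inside [lo, hi],
# pruning subtrees that cannot intersect the query range.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def build(sorted_vals):
    """Build a balanced BST from a sorted list."""
    if not sorted_vals:
        return None
    mid = len(sorted_vals) // 2
    return Node(sorted_vals[mid],
                build(sorted_vals[:mid]),
                build(sorted_vals[mid + 1:]))

def range_count(node, lo, hi):
    if node is None:
        return 0
    total = 1 if lo <= node.val <= hi else 0
    if lo < node.val:                 # left subtree may hold values >= lo
        total += range_count(node.left, lo, hi)
    if hi > node.val:                 # right subtree may hold values <= hi
        total += range_count(node.right, lo, hi)
    return total

tree = build(list(range(0, 100, 7)))  # values 0, 7, 14, ..., 98
```

How much work the pruning saves depends on the query and the tree shape, which is why such kernels are interesting scalability benchmarks.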
Prefetching for cloud workloads: An analysis based on address patterns
Jiajun Wang, Reena Panda, L. John
DOI: 10.1109/ISPASS.2017.7975288 (published 2017-04-01)

Abstract: Cloud computing is gaining popularity due to its ability to provide infrastructure, platform, and software services to clients on a global scale. Using cloud services, clients reduce the cost and complexity of buying and managing the underlying hardware and software layers. Popular services like web search, data analytics, and data mining typically work with big data sets that do not fit into top-level caches. Thus the performance efficiency of last-level caches and the off-chip memory becomes a crucial determinant of cloud application performance. In this paper we use CloudSuite as an example and study how prefetching schemes affect cloud workloads. We conduct detailed analysis on address patterns to explore the correlation between prefetching performance and intrinsic workload characteristics. Our work focuses particularly on the behavior of memory accesses at the last-level cache and beyond. We observe that cloud workloads in general do not have dominant strides. State-of-the-art prefetching schemes are only able to improve performance for some cloud applications such as web search. Our analysis shows that cloud workloads with long temporal reuse patterns often get negatively impacted by prefetching, especially if their working set is larger than the cache size.
Citations: 9
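The core observation (cloud workloads lack dominant strides) rests on the kind of address-pattern analysis sketched below: compute the distribution of successive-address deltas in a trace and check whether any single stride dominates. The 50% dominance threshold and the synthetic traces are our illustrative choices, not the paper's methodology.

```python
# Sketch of stride analysis over an address trace: a workload with no
# dominant delta between successive addresses defeats a stride prefetcher.
from collections import Counter

def stride_histogram(addresses):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    return Counter(deltas)

def dominant_stride(addresses, threshold=0.5):
    """Return the stride covering more than `threshold` of all deltas,
    or None if no single stride dominates."""
    hist = stride_histogram(addresses)
    total = sum(hist.values())
    stride, count = hist.most_common(1)[0]
    return stride if count / total > threshold else None

# A streaming scan has a strong 64-byte stride; a pointer chase does not.
streaming = [0x1000 + 64 * i for i in range(100)]
pointer_chase = [0x1000, 0x8F40, 0x2310, 0x7A80, 0x0090, 0x5550]
```

Running this over last-level-cache miss addresses, rather than all accesses, is closer to the "at the last-level cache and beyond" focus of the paper.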
PTAT: An efficient and precise tool for collecting detailed TLB miss traces
Jiutian Zhang, Yuhang Liu, Xiaojing Zhu, Yuan Ruan, Mingyu Chen
DOI: 10.1109/ISPASS.2017.7975284 (published 2017-04-01)

Abstract: It is well known that TLB performance impacts memory system performance, which is critical for overall system performance. Similar to multi-level caches, multi-level TLBs have become an important lever for boosting data access performance. Applications have increasingly large working sets, and servers targeting such applications have thus been built with ever-larger main memory capacities, but there has been no commensurate growth in TLB sizes. Designing high-performance and energy-efficient memory hierarchies requires insight into the behavior of current designs: when do they work well, and when do they fall short of expectations? Profiling TLB misses is the prerequisite for memory system optimization, and both designing efficient TLB architectures and writing TLB-friendly applications require analysis of TLB miss behavior. Although researchers have extensively studied TLB behavior, current approaches have issues in either efficiency or precision.
Citations: 0
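To make concrete what a "detailed TLB miss trace" contains, the sketch below models a tiny fully-associative, LRU TLB with 4 KiB pages and records each miss together with the evicted victim page. This is a software reference model of our own design for illustration; the actual tool collects such traces from real executions.

```python
# Minimal software TLB model: fully-associative, LRU replacement,
# 4 KiB pages. Each miss is logged as (faulting address, victim vpn).
from collections import OrderedDict

PAGE_SHIFT = 12          # 4 KiB pages

class TLB:
    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()        # vpn -> True, in LRU order
        self.miss_trace = []

    def access(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.map:
            self.map.move_to_end(vpn)   # hit: refresh LRU position
            return True
        victim = None
        if len(self.map) >= self.entries:
            victim, _ = self.map.popitem(last=False)   # evict LRU entry
        self.map[vpn] = True
        self.miss_trace.append((vaddr, victim))
        return False

tlb = TLB(entries=2)
for addr in [0x0000, 0x1000, 0x0000, 0x2000, 0x1000]:
    tlb.access(addr)
```

A trace with victim information, as opposed to a bare miss count, is what enables the reuse-distance and working-set analyses the paper motivates.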
Predicting memory page stability and its application to memory deduplication and live migration
Karim Elghamrawy, D. Franklin, F. Chong
DOI: 10.1109/ISPASS.2017.7975278 (published 2017-04-01)

Abstract: There are various applications and operations in virtualized environments that rely on memory page stability to achieve satisfactory performance. These applications include VM live migration and memory deduplication. Unfortunately, there is a large gap between existing prediction mechanisms and actual behavior. This is the gap we hope to narrow.
Citations: 2
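A simple baseline for the prediction problem the paper studies is a history-based predictor: a page is predicted stable (worth deduplicating, or safe to pre-copy during migration) if it has not been written for the last k scan intervals. The threshold k and the example histories are our illustrative assumptions, not the paper's proposed mechanism.

```python
# History-based page-stability predictor sketch.

def predict_stable(dirty_history, k=3):
    """dirty_history: booleans, oldest first (True = the page was written
    during that scan interval). Predict stable if the page has at least k
    intervals of history and was clean for the last k of them."""
    return len(dirty_history) >= k and not any(dirty_history[-k:])

code_page  = [True, False, False, False, False]   # settled down: stable
hot_page   = [False, True, False, True, True]     # keeps changing
young_page = [False]                              # not enough history yet
```

The gap the abstract mentions is precisely that real page behavior often defeats such simple history heuristics, e.g. pages that stay clean for long stretches and then burst with writes.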
StressRight: Finding the right stress for accurate in-development system evaluation
Jaewon Lee, Hanhwi Jang, Jae-Eon Jo, Gyu-hyeon Lee, Jangwoo Kim
DOI: 10.1109/ISPASS.2017.7975292 (published 2017-04-01)

Abstract: Computer architects use a variety of workloads to measure system performance. For many workloads, the workload configuration determines the stress applied to the system and the corresponding performance behavior. Therefore, architects must make great efforts to explore and find the correct workload configuration before performing detailed analysis. However, such exploration becomes impossible for in-development systems, which exist only as a software model. Existing system modeling platforms are either accurate but too slow, or fast but too inaccurate, to obtain the workload-reported performance metrics (e.g., latency and throughput) that are necessary for configuring workloads. In this paper, we propose StressRight, a method to quickly model the first-order performance of full-system workloads and reconstruct the workload-reported performance metrics. StressRight lets architects explore how workload configurations affect stress and performance. The key idea is to execute workloads on a fast but timing-agnostic platform (e.g., an emulator), and efficiently reconstruct the timing/performance details by analyzing only the unique code blocks. Our evaluation using memcached and PARSEC shows that StressRight achieves an 8~45x speedup compared to a cycle-level simulator while maintaining good accuracy.
Citations: 2
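The key idea in the abstract, analyzing only unique code blocks, can be sketched as follows: a fast, timing-agnostic run records which blocks execute and how often, a detailed timing model is consulted once per unique block, and total time is reconstructed as per-block cost times execution count. The cost table, block names, and counter are synthetic illustrations, not the tool's implementation.

```python
# Sketch of timing reconstruction from unique code blocks: the expensive
# detailed model runs once per unique block, not once per trace entry.
from collections import Counter

detailed_model_calls = 0

def detailed_block_cost(block):
    """Stand-in for an expensive cycle-level model of one code block."""
    global detailed_model_calls
    detailed_model_calls += 1
    return {"init": 120, "loop_body": 35, "epilogue": 60}[block]

def reconstruct_time(block_trace):
    counts = Counter(block_trace)
    cost = {b: detailed_block_cost(b) for b in counts}  # unique blocks only
    return sum(cost[b] * n for b, n in counts.items())

# 1002 trace entries, but only 3 unique blocks need detailed modeling.
trace = ["init"] + ["loop_body"] * 1000 + ["epilogue"]
total_cycles = reconstruct_time(trace)
```

The speedup comes from the ratio of trace length to unique-block count, which is large for loop-dominated workloads; this is consistent with the reported gains over cycle-level simulation.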