Latest publications from the 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Machine learning for performance and power modeling/prediction
L. John
DOI: 10.1109/ISPASS.2017.7975264 (published 2017-04-24)

Abstract: Effective design space exploration relies on fast and accurate pre-silicon performance and power models. Simulation is commonly used for understanding architectural tradeoffs; however, many emerging workloads cannot even run on many full-system simulators, and even when one does run, only a tiny part of the workload can be simulated because detailed simulators are prohibitively slow. This talk presents examples of how machine learning can be used to solve some of the problems haunting the performance evaluation field. One application of machine learning is cross-platform performance and power prediction: if one model is slow to run real-world benchmarks/workloads, is it possible to predict/estimate its performance/power using runs on another platform? Are there correlations that machine learning can exploit to make cross-platform performance and power predictions? A methodology for cross-platform performance/power prediction will be presented in this talk. Another application, using machine learning to calibrate analytical power estimation models, will also be discussed. Yet another application of machine learning has been to create max-power stressmarks. Manually developing and tuning so-called stressmarks is extremely tedious and time-consuming while requiring an intimate understanding of the processor. In our past research, we created a framework that uses machine learning for the automated generation of stressmarks. In this talk, the methodology for creating automatic stressmarks will be explained, and experiments on multiple platforms validating the proposed approach will also be described.
Citations: 1
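The cross-platform prediction question the talk poses can be made concrete with a toy sketch: fit a model on measurements taken on one platform and use it to estimate power on another. Everything below (the single IPC feature, the synthetic counter values, the linear model form) is an illustrative assumption, not the talk's actual methodology, which would use many counters and richer learners.

```python
# Illustrative sketch (not the talk's actual model): predict platform-B
# power from a platform-A performance counter via ordinary least squares.
# All counter values below are synthetic.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Synthetic training data: IPC measured on platform A vs. watts on platform B.
ipc_a   = [0.8, 1.0, 1.2, 1.6, 2.0]
watts_b = [20.0, 24.0, 28.0, 36.0, 44.0]

a, b = fit_line(ipc_a, watts_b)

def predict_power(ipc):
    """Estimate platform-B power from a platform-A IPC measurement."""
    return a * ipc + b
```

In practice one would validate such a model against held-out measurements before trusting cross-platform estimates.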
dist-gem5: Distributed simulation of computer clusters
Mohammad Alian, Umur Darbaz, G. Dózsa, S. Diestelhorst, Daehoon Kim, N. Kim
DOI: 10.1109/ISPASS.2017.7975287 (published 2017-04-24)

Abstract: When analyzing a distributed computer system, we often observe that the complex interplay among processor, node, and network sub-systems can profoundly affect the performance and power efficiency of the distributed computer system. Therefore, to effectively cross-optimize hardware and software components of a distributed computer system, we need a full-system simulation infrastructure that can precisely capture the complex interplay. Responding to the aforementioned need, we present dist-gem5, a flexible, detailed, and open-source full-system simulation infrastructure that can model and simulate a distributed computer system using multiple simulation hosts. We validate dist-gem5 against a physical cluster and show that the latency and bandwidth of the simulated network sub-system are within 18% of the physical one. Compared with the single-threaded and parallel versions of gem5, dist-gem5 speeds up the simulation of a 63-node computer cluster by 83.1x and 12.8x, respectively.
Citations: 33
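A distributed simulator like the one described must keep its per-host simulation processes loosely in step; a common mechanism for this is quantum-based synchronization, where each process advances by a fixed quantum of simulated time and then waits at a global barrier. The sketch below illustrates that general mechanism with threads; the quantum size, node count, and names are our own, and this is not dist-gem5 code.

```python
# Conceptual sketch of quantum-based synchronization: every simulation
# process advances one quantum of simulated time, then waits at a barrier
# so no process races ahead of the others by more than one quantum.
import threading

QUANTUM = 1000    # simulated ticks per synchronization interval
N_NODES = 4       # simulated cluster nodes, one thread each
N_QUANTA = 5

barrier = threading.Barrier(N_NODES)
local_time = [0] * N_NODES
max_skew = 0
skew_lock = threading.Lock()

def simulate(node):
    global max_skew
    for _ in range(N_QUANTA):
        local_time[node] += QUANTUM      # simulate one quantum of work
        with skew_lock:
            skew = max(local_time) - min(local_time)
            max_skew = max(max_skew, skew)
        barrier.wait()                   # global sync bounds the time skew

threads = [threading.Thread(target=simulate, args=(i,)) for i in range(N_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every thread must reach the barrier before any proceeds, observed skew between any two nodes' simulated clocks never exceeds one quantum, which is what lets cross-node network packets be delivered at a consistent simulated time.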
Chai: Collaborative heterogeneous applications for integrated-architectures
Juan Gómez-Luna, I. E. Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia De Gonzalo, T. Jablin, Antonio J. Peña, Wen-mei W. Hwu
DOI: 10.1109/ISPASS.2017.7975269 (published 2017-04-24)

Abstract: Heterogeneous system architectures are evolving towards tighter integration among devices, with emerging features such as shared virtual memory, memory coherence, and system-wide atomics. Languages, device architectures, system specifications, and applications are rapidly adapting to the challenges and opportunities of tightly integrated heterogeneous platforms. Programming languages such as OpenCL 2.0, CUDA 8.0, and C++ AMP allow programmers to exploit these architectures for productive collaboration between CPU and GPU threads. To evaluate these new architectures and programming languages, and to empower researchers to experiment with new ideas, a suite of benchmarks targeting these architectures with close CPU-GPU collaboration is needed. In this paper, we classify applications that target heterogeneous architectures into generic collaboration patterns including data partitioning, fine-grain task partitioning, and coarse-grain task partitioning. We present Chai, a new suite of 14 benchmarks that cover these patterns and exercise different features of heterogeneous architectures with varying intensity. Each benchmark in Chai has seven different implementations in different programming models such as OpenCL, C++ AMP, and CUDA, with and without the use of the latest heterogeneous architecture features. We characterize the behavior of each benchmark with respect to varying input sizes and collaboration combinations, and evaluate the impact of using the emerging features of heterogeneous architectures on application performance.
Citations: 72
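The data-partitioning collaboration pattern the paper classifies can be sketched in a few lines: one input is split by a tunable ratio and the CPU and GPU portions are processed concurrently. In this hedged sketch both "devices" are plain threads and the ratio parameter name (alpha) is our own; a real Chai benchmark would launch an actual GPU kernel on the second partition.

```python
# Illustrative data-partitioning sketch: split work between a "CPU" worker
# and a "GPU" worker (both just threads here) and merge the results.
from concurrent.futures import ThreadPoolExecutor

def cpu_kernel(chunk):
    return [x * x for x in chunk]

def gpu_kernel(chunk):            # stand-in for a real GPU kernel launch
    return [x * x for x in chunk]

def collaborative_map(data, alpha=0.5):
    """Process `data`, giving a fraction `alpha` of it to the CPU."""
    split = int(len(data) * alpha)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(cpu_kernel, data[:split])
        gpu_part = pool.submit(gpu_kernel, data[split:])
        return cpu_part.result() + gpu_part.result()
```

Tuning alpha per input size is exactly the kind of collaboration-combination sweep the characterization in the paper performs.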
OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel
Hyoukjun Kwon, T. Krishna
DOI: 10.1109/ISPASS.2017.7975291 (published 2017-04-01)

Abstract: The chip industry faces two key challenges today: the impending end of Moore's Law and the rising costs of chip design and verification (millions of dollars today). Heterogeneous IPs (cores and domain-specific accelerators) are a promising answer to the first challenge, enabling performance and energy benefits no longer provided by technology scaling. IP-reuse with plug-and-play designs can help with the second challenge, amortizing NRE costs tremendously. A key requirement in a heterogeneous IP-based plug-and-play SoC environment is an interconnection fabric to connect these IPs together. This fabric needs to be scalable (low latency, low energy, and low area) and yet be flexible/parametrizable for use across designs. The key scalability challenge in any Network-on-Chip (NoC) today is that latency increases in proportion to the number of hops. In this work, we present a NoC generator called OpenSMART, which generates low-latency NoCs based on SMART. SMART is a recently proposed NoC microarchitecture that enables multi-hop on-chip traversals within a single cycle, removing the dependence of latency on hop count. SMART leverages the wire delay of the underlying repeated wires, and augments each router with the ability to request and set up bypass paths. OpenSMART takes SMART from a NoC optimization to a design methodology for SoCs, enabling users to generate verified RTL for a class of user-specified network configurations, such as network size, topology, routing algorithm, number of VCs/buffers, router pipeline stages, and so on. OpenSMART also provides the ability to generate any heterogeneous topology with low- and high-radix routers and optimized single-stage pipelines, leveraging fast logic delays in today's technology nodes. OpenSMART v1.0 comes with both Bluespec System Verilog and Chisel implementations, and this paper also presents a case study of our experiences with both languages. OpenSMART is available for download and is going to be a key addition to the emerging open-source hardware movement, providing a glue for interconnecting existing and emerging IPs.
Citations: 52
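The latency argument in the abstract reduces to simple arithmetic: if a flit can bypass up to HPC (hops per cycle) routers in one cycle, an H-hop traversal costs roughly ceil(H / HPC) cycles instead of one cycle per hop. The back-of-the-envelope model below illustrates that relationship only; it ignores contention, bypass-setup cycles, and router pipeline depth, and the HPC value is an assumed example.

```python
# Back-of-the-envelope NoC latency model illustrating the SMART bypass idea.
import math

def baseline_latency(hops, router_cycles=1):
    """Conventional NoC: every hop pays the router pipeline latency."""
    return hops * router_cycles

def smart_latency(hops, hpc=8):
    """SMART-style bypass: up to `hpc` hops collapse into a single cycle."""
    return math.ceil(hops / hpc)
```

For a 15-hop traversal with HPC = 8, the bypass model yields 2 cycles versus 15 in the baseline, which is the "removing the dependence of latency on hops" effect the abstract describes.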
A taxonomy of out-of-order instruction commit
M. Alipour, Trevor E. Carlson, S. Kaxiras
DOI: 10.1109/ISPASS.2017.7975283 (published 2017-04-01)

Abstract: While in-order instruction commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance without the need for additional hardware resources. In this paper, we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by introducing a taxonomy and evaluating three different micro-architectures that have this technique enabled. We show how smaller processors can benefit from simple out-of-order commit strategies, but that larger, aggressive cores require more aggressive strategies to improve performance.
Citations: 1
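The resource-release difference the taxonomy studies can be shown with a toy model: given the cycle at which each in-flight instruction completes, in-order commit frees an entry only after all older entries are free, while an idealized, unrestricted out-of-order commit frees it at completion. The function names and the example completion times are our own illustrations, not the paper's evaluated micro-architectures.

```python
# Toy model of reorder-buffer entry release under two commit policies.

def inorder_release(completion):
    """Entry i is released once instructions 0..i have all completed,
    so a slow older instruction delays everything behind it."""
    release, latest = [], 0
    for c in completion:
        latest = max(latest, c)
        release.append(latest)
    return release

def ooo_release(completion):
    """Idealized out-of-order commit: release each entry at completion."""
    return list(completion)

# A long-latency miss (cycle 50) early in program order blocks all
# younger, already-finished instructions under in-order commit.
completion = [50, 3, 4, 5]
```

The gap between the two release schedules is exactly the resource occupancy that out-of-order commit strategies reclaim.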
Treelogy: A benchmark suite for tree traversals
Nikhil Hegde, Jianqiao Liu, Kirshanthan Sundararajah, Milind Kulkarni
DOI: 10.1109/ISPASS.2017.7975294 (published 2017-04-01)

Abstract: An interesting class of irregular algorithms is tree traversal algorithms, which repeatedly traverse various trees to perform efficient computations. Tree traversal algorithms form the algorithmic kernels in an important set of applications in scientific computing, computer graphics, bioinformatics, and data mining. There has been increasing interest in understanding tree traversal algorithms, optimizing them, and applying them in a wide variety of settings. Crucially, while there are many possible optimizations for tree traversal algorithms, which optimizations apply to which algorithms depends on algorithmic characteristics. In this work, we present a suite of tree traversal kernels, drawn from diverse domains, called Treelogy, to explore the connection between tree traversal algorithms and state-of-the-art optimizations. We characterize these algorithms by developing an ontology based on their structural properties. The attributes extracted through our ontology, for a given traversal kernel, can aid in quick analysis of the suitability of platform- and application-specific as well as independent optimizations. We provide reference implementations of these kernels for three platforms: shared-memory multicores, distributed-memory systems, and GPUs, and evaluate their scalability.
Citations: 8
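A minimal example of the kernel class the suite collects: many queries each traverse the same tree, and the traversal prunes subtrees that cannot contribute to the answer. The sketch below (a range count over a balanced binary search tree) is our own illustration and not a Treelogy kernel; the structural properties it exhibits (repeated traversals, data-dependent pruning) are what the paper's ontology captures.

```python
# A tiny tree-traversal kernel: count tree values inside [lo, hi],
# pruning subtrees that cannot intersect the query range.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def build(sorted_vals):
    """Build a balanced BST from a sorted list."""
    if not sorted_vals:
        return None
    mid = len(sorted_vals) // 2
    return Node(sorted_vals[mid],
                build(sorted_vals[:mid]),
                build(sorted_vals[mid + 1:]))

def range_count(node, lo, hi):
    if node is None:
        return 0
    total = 1 if lo <= node.val <= hi else 0
    if lo < node.val:                 # left subtree may hold values >= lo
        total += range_count(node.left, lo, hi)
    if hi > node.val:                 # right subtree may hold values <= hi
        total += range_count(node.right, lo, hi)
    return total

tree = build(list(range(0, 100, 7)))  # values 0, 7, 14, ..., 98
```

How much work the pruning saves depends on the query and the tree shape, which is why such kernels are interesting scalability benchmarks.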
Prefetching for cloud workloads: An analysis based on address patterns
Jiajun Wang, Reena Panda, L. John
DOI: 10.1109/ISPASS.2017.7975288 (published 2017-04-01)

Abstract: Cloud computing is gaining popularity due to its ability to provide infrastructure, platform, and software services to clients on a global scale. Using cloud services, clients reduce the cost and complexity of buying and managing the underlying hardware and software layers. Popular services like web search, data analytics, and data mining typically work with big data sets that do not fit into top-level caches. Thus the performance efficiency of last-level caches and the off-chip memory becomes a crucial determinant of cloud application performance. In this paper we use CloudSuite as an example and study how prefetching schemes affect cloud workloads. We conduct detailed analysis on address patterns to explore the correlation between prefetching performance and intrinsic workload characteristics. Our work focuses particularly on the behavior of memory accesses at the last-level cache and beyond. We observe that cloud workloads in general do not have dominant strides. State-of-the-art prefetching schemes are only able to improve performance for some cloud applications such as web search. Our analysis shows that cloud workloads with long temporal reuse patterns often get negatively impacted by prefetching, especially if their working set is larger than the cache size.
Citations: 9
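The core observation (cloud workloads lack dominant strides) rests on the kind of address-pattern analysis sketched below: compute the distribution of successive-address deltas in a trace and check whether any single stride dominates. The 50% dominance threshold and the synthetic traces are our illustrative choices, not the paper's methodology.

```python
# Sketch of stride analysis over an address trace: a workload with no
# dominant delta between successive addresses defeats a stride prefetcher.
from collections import Counter

def stride_histogram(addresses):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    return Counter(deltas)

def dominant_stride(addresses, threshold=0.5):
    """Return the stride covering more than `threshold` of all deltas,
    or None if no single stride dominates."""
    hist = stride_histogram(addresses)
    total = sum(hist.values())
    stride, count = hist.most_common(1)[0]
    return stride if count / total > threshold else None

# A streaming scan has a strong 64-byte stride; a pointer chase does not.
streaming = [0x1000 + 64 * i for i in range(100)]
pointer_chase = [0x1000, 0x8F40, 0x2310, 0x7A80, 0x0090, 0x5550]
```

Running this over last-level-cache miss addresses, rather than all accesses, is closer to the "at the last-level cache and beyond" focus of the paper.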
PTAT: An efficient and precise tool for collecting detailed TLB miss traces
Jiutian Zhang, Yuhang Liu, Xiaojing Zhu, Yuan Ruan, Mingyu Chen
DOI: 10.1109/ISPASS.2017.7975284 (published 2017-04-01)

Abstract: It is well known that TLB performance impacts memory system performance, which is critical for overall system performance. Similar to multi-level caches, multi-level TLBs have become an important lever for boosting data access performance. Applications have increasingly large working sets, and servers targeting such applications have thus been built with ever-larger main memory capacities, but there has been no commensurate growth in TLB sizes. Designing high-performance and energy-efficient memory hierarchies requires insight into the behavior of current designs: when do they work well, and when do they fall short of expectations? Profiling TLB misses is the prerequisite for memory system optimization, and both designing efficient TLB architectures and writing TLB-friendly applications require analysis of TLB miss behavior. Although researchers have extensively studied TLB behavior, current approaches have issues in either efficiency or precision.
Citations: 0
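To make concrete what a "detailed TLB miss trace" contains, the sketch below models a tiny fully-associative, LRU TLB with 4 KiB pages and records each miss together with the evicted victim page. This is a software reference model of our own design for illustration; the actual tool collects such traces from real executions.

```python
# Minimal software TLB model: fully-associative, LRU replacement,
# 4 KiB pages. Each miss is logged as (faulting address, victim vpn).
from collections import OrderedDict

PAGE_SHIFT = 12          # 4 KiB pages

class TLB:
    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()        # vpn -> True, in LRU order
        self.miss_trace = []

    def access(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.map:
            self.map.move_to_end(vpn)   # hit: refresh LRU position
            return True
        victim = None
        if len(self.map) >= self.entries:
            victim, _ = self.map.popitem(last=False)   # evict LRU entry
        self.map[vpn] = True
        self.miss_trace.append((vaddr, victim))
        return False

tlb = TLB(entries=2)
for addr in [0x0000, 0x1000, 0x0000, 0x2000, 0x1000]:
    tlb.access(addr)
```

A trace with victim information, as opposed to a bare miss count, is what enables the reuse-distance and working-set analyses the paper motivates.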
Predicting memory page stability and its application to memory deduplication and live migration
Karim Elghamrawy, D. Franklin, F. Chong
DOI: 10.1109/ISPASS.2017.7975278 (published 2017-04-01)

Abstract: There are various applications and operations in virtualized environments that rely on memory page stability to achieve satisfactory performance. These applications include VM live migration and memory deduplication. Unfortunately, there is a large gap between existing prediction mechanisms and actual behavior. This is the gap we hope to narrow.
Citations: 2
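A simple baseline for the prediction problem the paper studies is a history-based predictor: a page is predicted stable (worth deduplicating, or safe to pre-copy during migration) if it has not been written for the last k scan intervals. The threshold k and the example histories are our illustrative assumptions, not the paper's proposed mechanism.

```python
# History-based page-stability predictor sketch.

def predict_stable(dirty_history, k=3):
    """dirty_history: booleans, oldest first (True = the page was written
    during that scan interval). Predict stable if the page has at least k
    intervals of history and was clean for the last k of them."""
    return len(dirty_history) >= k and not any(dirty_history[-k:])

code_page  = [True, False, False, False, False]   # settled down: stable
hot_page   = [False, True, False, True, True]     # keeps changing
young_page = [False]                              # not enough history yet
```

The gap the abstract mentions is precisely that real page behavior often defeats such simple history heuristics, e.g. pages that stay clean for long stretches and then burst with writes.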
StressRight: Finding the right stress for accurate in-development system evaluation
Jaewon Lee, Hanhwi Jang, Jae-Eon Jo, Gyu-hyeon Lee, Jangwoo Kim
DOI: 10.1109/ISPASS.2017.7975292 (published 2017-04-01)

Abstract: Computer architects use a variety of workloads to measure system performance. For many workloads, the workload configuration determines the stress applied to the system and the corresponding performance behavior. Therefore, architects must make great efforts to explore and find the correct workload configuration before performing detailed analysis. However, such exploration becomes impossible for in-development systems, which exist only as a software model. Existing system modeling platforms are either accurate but too slow, or fast but too inaccurate, to obtain the workload-reported performance metrics (e.g., latency and throughput) that are necessary for configuring workloads. In this paper, we propose StressRight, a method to quickly model the first-order performance of full-system workloads and reconstruct the workload-reported performance metrics. StressRight lets architects explore how workload configurations affect stress and performance. The key idea is to execute workloads on a fast but timing-agnostic platform (e.g., an emulator), and efficiently reconstruct the timing/performance details by analyzing only the unique code blocks. Our evaluation using memcached and PARSEC shows that StressRight achieves an 8~45x speedup compared to a cycle-level simulator while maintaining good accuracy.
Citations: 2
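The key idea in the abstract, analyzing only unique code blocks, can be sketched as follows: a fast, timing-agnostic run records which blocks execute and how often, a detailed timing model is consulted once per unique block, and total time is reconstructed as per-block cost times execution count. The cost table, block names, and counter are synthetic illustrations, not the tool's implementation.

```python
# Sketch of timing reconstruction from unique code blocks: the expensive
# detailed model runs once per unique block, not once per trace entry.
from collections import Counter

detailed_model_calls = 0

def detailed_block_cost(block):
    """Stand-in for an expensive cycle-level model of one code block."""
    global detailed_model_calls
    detailed_model_calls += 1
    return {"init": 120, "loop_body": 35, "epilogue": 60}[block]

def reconstruct_time(block_trace):
    counts = Counter(block_trace)
    cost = {b: detailed_block_cost(b) for b in counts}  # unique blocks only
    return sum(cost[b] * n for b, n in counts.items())

# 1002 trace entries, but only 3 unique blocks need detailed modeling.
trace = ["init"] + ["loop_body"] * 1000 + ["epilogue"]
total_cycles = reconstruct_time(trace)
```

The speedup comes from the ratio of trace length to unique-block count, which is large for loop-dominated workloads; this is consistent with the reported gains over cycle-level simulation.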