2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): Latest Publications

A detailed and flexible cycle-accurate Network-on-Chip simulator
Nan Jiang, Daniel U. Becker, George Michelogiannakis, J. Balfour, Brian Towles, D. E. Shaw, John Kim, W. Dally
{"title":"A detailed and flexible cycle-accurate Network-on-Chip simulator","authors":"Nan Jiang, Daniel U. Becker, George Michelogiannakis, J. Balfour, Brian Towles, D. E. Shaw, John Kim, W. Dally","doi":"10.1109/ISPASS.2013.6557149","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557149","url":null,"abstract":"Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulations to evaluate the performance impact and analyze the cost of novel NoC architectures. In this work, we present BookSim, a cycle-accurate simulator for NoCs. The simulator is designed for simulation flexibility and accurate modeling of network components. It features a modular design and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes. BookSim furthermore emphasizes detailed implementations of network components that accurately model the behavior of actual hardware. We have validated the accuracy of the simulator against RTL implementations of NoC routers.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121842858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 639
Understanding the implications of virtual machine management on processor microarchitecture design
Xiufeng Sui, Tao Sun, Tao Li, Lixin Zhang
{"title":"Understanding the implications of virtual machine management on processor microarchitecture design","authors":"Xiufeng Sui, Tao Sun, Tao Li, Lixin Zhang","doi":"10.1109/ISPASS.2013.6557145","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557145","url":null,"abstract":"Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution to the utilization of modern multicore processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates deeper understanding of the interactions between virtual machine management and the micro-architecture behaviors of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitecture in virtualized datacenters. In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while performing various VM management operations. Our study shows that today's state-of-the-art processor still has room for further optimizations when executing virtualized cloud workloads, particularly in the organization of last level caches and on-chip cache coherence protocol. Specifically, our analysis shows that: shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; the cache coherence protocol could support a high degree of data sharing of the privileged domain; and cache capacity or CPU utilization occupied by the privileged domain could be effectively managed when performing management workflows to achieve high system throughput.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116491918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Wall-clock based synchronization: A parallel simulation technology for cluster systems
Xiaodong Zhu, Junmin Wu, Guoliang Chen, Tao Li
{"title":"Wall-clock based synchronization: A parallel simulation technology for cluster systems","authors":"Xiaodong Zhu, Junmin Wu, Guoliang Chen, Tao Li","doi":"10.1109/ISPASS.2013.6557166","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557166","url":null,"abstract":"A common practice for reducing synchronization overheads in parallel simulation of a large-scale cluster is to relax synchronization with lengthened synchronous steps. However, as a side effect, simulation accuracy degrades considerably. This paper proposes a novel mechanism that keeps the running speeds of different nodes consistent by synchronizing logical clocks with the wall clock periodically within each lax step. Because speed deviations of nodes are the main source of time causality errors, through aligning speeds our mechanism only causes modest precision loss while achieving a close performance to lax synchronization. The experimental results show that it improves the performance by 2 to 11 times relative to the baseline barrier synchronization with a high accuracy (e.g. 99% in most cases). Compared to the recently proposed adaptive mechanism, it also achieves nearly 30% performance improvement.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125939030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Use of simple analytic performance models for streaming data applications deployed on diverse architectures
J. Beard, R. Chamberlain
{"title":"Use of simple analytic performance models for streaming data applications deployed on diverse architectures","authors":"J. Beard, R. Chamberlain","doi":"10.1109/ISPASS.2013.6557162","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557162","url":null,"abstract":"Modern hardware is often heterogeneous. With heterogeneity comes multiple abstraction layers that hide underlying complex systems. This complexity makes quantitative performance modeling a difficult task. Designers of high-performance streaming applications for heterogeneous systems must contend with unpredictable and often non-generalizable models to predict performance of a particular application and hardware mapping. This paper outlines a computationally simple approach that can be used to model the overall throughput and buffering needs of a streaming application on heterogeneous hardware. The model presented is based upon a hybrid maximum flow and decomposed discrete queueing model. The utility of the model is assessed using a set of real and synthetic benchmarks with model predictions compared to measured application performance.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"31 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129399605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Power measurement techniques on standard compute nodes: A quantitative comparison
D. Hackenberg, T. Ilsche, R. Schöne, Daniel Molka, Maik Schmidt, W. Nagel
{"title":"Power measurement techniques on standard compute nodes: A quantitative comparison","authors":"D. Hackenberg, T. Ilsche, R. Schöne, Daniel Molka, Maik Schmidt, W. Nagel","doi":"10.1109/ISPASS.2013.6557170","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557170","url":null,"abstract":"Energy efficiency is of steadily growing importance in virtually all areas from mobile to high performance computing. Therefore, lots of research projects focus on this topic and strongly rely on power measurements from their test platforms. The need for finer grained measurement data, both in terms of temporal and spatial resolution (component breakdown), often collides with very rudimentary measurement setups that rely e.g., on non-professional power meters, IPMI based platform data or model-based interfaces such as RAPL or APM. This paper presents an in-depth study of several different AC and DC measurement methodologies as well as model approaches on test systems with the latest processor generations from both Intel and AMD. We analyze most important aspects such as signal quality, time resolution, accuracy, and measurement overhead and use a calibrated, professional power analyzer as our reference.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115783773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 132
Parallel GPU architecture simulation framework exploiting work allocation unit parallelism
Sangpil Lee, W. Ro
{"title":"Parallel GPU architecture simulation framework exploiting work allocation unit parallelism","authors":"Sangpil Lee, W. Ro","doi":"10.1109/ISPASS.2013.6557151","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557151","url":null,"abstract":"GPU computing is at the forefront of high-performance computing, and it has greatly affected current studies on parallel software and hardware design because of its massively parallel architecture. Therefore, numerous studies have focused on the utilization of GPUs in various fields. However, studies of GPU architectures are constrained by the lack of a suitable GPU simulator. Previously proposed GPU simulators do not have sufficient simulation speed for advanced software and architecture studies. In this paper, we propose a new parallel simulation framework and a parallel simulation technique called work-group parallel simulation in order to improve the simulation speed for modern many-core GPUs. The proposed framework divides the GPU architecture into parallel and shared components, and it determines which GPU component can be effectively parallelized and can work correctly in multithreaded simulation. In addition, the work-group parallel simulation technique effectively boosts the performance of parallelized GPU simulation by eliminating the synchronization overhead. Experimental results obtained using a simulator with the proposed framework show that the proposed parallel simulation technique has a speed-up of up to 4.15 as compared to an existing sequential GPU simulator on an 8-core machine providing minimized cycle errors.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"155 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133657820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Sampled simulation of multi-threaded applications
Trevor E. Carlson, W. Heirman, L. Eeckhout
{"title":"Sampled simulation of multi-threaded applications","authors":"Trevor E. Carlson, W. Heirman, L. Eeckhout","doi":"10.1109/ISPASS.2013.6557141","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557141","url":null,"abstract":"Sampling is a well-known workload reduction technique that allows one to speed up architectural simulation while accurately predicting performance. Previous sampling methods have been shown to accurately predict single-threaded application runtime based on its overall IPC. However, these previous approaches are unsuitable for general multi-threaded applications, for which IPC is not a good proxy for runtime. Additionally, we find that issues such as application periodicity and inter-thread synchronization play a significant role in determining how best to sample these applications. The proposed multi-threaded application sampling methodology is able to derive an effective sampling strategy for candidate applications using architecture-independent metrics. Using this methodology, large input sets can now be simulated which would otherwise be infeasible, allowing for more accurate conclusions to be made than from studies using scaled-down input sets. Through the use of the proposed methodology, we can simulate less than 10% of the total application runtime in detail. On the SPEComp, NPB and PARSEC benchmarks, running on an 8-core simulated system, we achieve an average absolute error of 3.5%.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117192984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Advancing computer systems without technology progress
C. Kozyrakis
{"title":"Advancing computer systems without technology progress","authors":"C. Kozyrakis","doi":"10.1109/ISPASS.2013.6557164","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557164","url":null,"abstract":"Summary form only given. Computing is now an essential tool for all aspects of human endeavor, including healthcare, education, science, commerce, government, and entertainment. We expect our computers, whether those hidden away in data-centers or those in a handheld form factor, to be capable of running sophisticated algorithms that process rapidly growing volumes of data. In other words, we expect our computers to have exponentially increasing performance at constant cost (energy and chip area). For decades, CMOS technology has been our ally, providing exponential improvements in both transistor density and energy consumption, which we turned into exponential improvements in system performance. Unfortunately, we are now in a phase where transistor cost and energy consumption are barely scaling, making it necessary to rethink the way we build scalable systems. In this talk, we will consider how to advance computer systems without technology progress. There are several promising directions that combined can provide improvements equivalent to several decades of Moore's law. These directions include massive parallelism with locality awareness, specialization, removing the bloat from our infrastructure, increasing system utilization, and embracing approximate computing. We will review motivating results in these areas, establish that they require cross-layer optimizations across both hardware and software, and discuss the remaining challenges that systems researchers must address.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121464867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
An analytical framework for estimating TCO and exploring data center design space
D. Hardy, Marios Kleanthous, I. Sideris, A. Saidi, Emre Ozer, Yiannakis Sazeides
{"title":"An analytical framework for estimating TCO and exploring data center design space","authors":"D. Hardy, Marios Kleanthous, I. Sideris, A. Saidi, Emre Ozer, Yiannakis Sazeides","doi":"10.1109/ISPASS.2013.6557146","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557146","url":null,"abstract":"In this paper, we present EETCO: an estimation and exploration tool that provides qualitative assessment of data center design decisions on Total-Cost-of-Ownership (TCO) and environmental impact. It can capture the implications of many parameters including server performance, power, cost, and Mean-Time-To-Failure (MTTF). The tool includes a model for spare estimation needed to account for server failures and performance variability. The paper describes the tool model and its implementation, and presents experiments that explore tradeoffs offered by different server configurations, performance variability, MTTF, 2D vs 3D-stacked processors, and ambient temperature. These experiments reveal, for the data center configurations used in this study, several opportunities for profit and optimization in the datacenter ecosystem: (i) servers with different computing performance and power consumption merit exploration to minimize TCO and the environmental impact, (ii) performance variability is desirable if it comes with a drastic cost reduction, (iii) shorter processor MTTF is beneficial if it comes with a moderate processor cost reduction, (iv) increasing by few degrees the ambient datacenter temperature reduces the environmental impact with a minor increase in the TCO and (v) a higher cost for a 3D-stacked processor with shorter MTTF and higher power consumption can be preferred, over a conventional 2D processor, if it offers a moderate performance increase.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131031783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Energy efficiency of lossless data compression on a mobile device: An experimental evaluation
Armen Dzhagaryan, A. Milenković, Martin Burtscher
{"title":"Energy efficiency of lossless data compression on a mobile device: An experimental evaluation","authors":"Armen Dzhagaryan, A. Milenković, Martin Burtscher","doi":"10.1109/ISPASS.2013.6557156","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557156","url":null,"abstract":"Lossless compression and decompression are routinely used in mobile computing devices to reduce the costs of communicating and storing data. This paper presents the results of an experimental evaluation of common compression utilities on Pandaboard, a development platform similar to current commercial mobile devices. We study the compression ratio, compression and decompression throughput, and energy efficiency of different usage scenarios typical for mobile computing. We observe a wide variety of energy costs associated with data compression and provide practical guidelines for selecting the most energy-efficient configurations.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127702955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10