Proceedings of the 7th ACM international conference on Computing frontiers: Latest Publications

Low cost and low intrusive approach to test on-line the scheduler of high performance microprocessors
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787309
Daniele Rossi, M. Omaña, Gianluca Berghella, C. Metra, A. Jas, C. Tirumurti, R. Galivanche
Abstract: We propose a low-cost and low-intrusive approach to test on-line the scheduler of high-performance microprocessors. Unlike traditional approaches, it exploits the information redundancy that the scheduler inherently has due to the functionality it performs, rather than adding such redundancy for on-line test purposes.
Citations: 2
Session details: Neuroscience
P. Kelly
Pub Date: 2010-05-17 | DOI: 10.1145/3251907
Citations: 0
Session details: Poster session
S. Vinoski
Pub Date: 2010-05-17 | DOI: 10.1145/3254736
Citations: 0
Session details: Parallel systems
Thomas R. Gross
Pub Date: 2010-05-17 | DOI: 10.1145/3251920
Citations: 0
NCID: a non-inclusive cache, inclusive directory architecture for flexible and efficient cache hierarchies
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787314
Li Zhao, R. Iyer, S. Makineni, D. Newell, Liqun Cheng
Abstract: Chip-multiprocessor (CMP) architectures employ multi-level cache hierarchies with private L2 caches per core and a shared L3 cache, as in Intel's Nehalem and AMD's Barcelona processors. When designing a multi-level cache hierarchy, one of the key design choices is the inclusion policy: inclusive, non-inclusive, or exclusive. Each choice has its benefits and drawbacks. An inclusive cache hierarchy (like Nehalem's L3) has the benefit of allowing incoming snoops to be filtered at the L3 cache, but suffers from (a) reduced space efficiency due to replication between the L2 and L3 caches and (b) reduced flexibility, since it cannot bypass the L3 cache for transient or low-priority data. In an inclusive L2/L3 cache hierarchy, it also becomes difficult to flexibly shrink the L3 cache (or grow the L2 cache) for different product instantiations, because inclusion can start to hurt performance due to significant back-invalidates. In this paper, we present a novel approach that addresses the drawbacks of inclusive caches while retaining their positive feature of snoop filtering. We present NCID: a non-inclusive cache, inclusive directory architecture that allows data in the L3 to be non-inclusive or exclusive, but retains tag inclusion in the directory to support complete snoop filtering. We then describe and evaluate a range of NCID-based architecture options and policies. Our evaluation shows that NCID enables a flexible and efficient cache hierarchy for future CMP platforms and has the potential to improve performance significantly for several important server benchmarks.
Citations: 32
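The core NCID idea from the abstract can be sketched in a few lines. This is an illustrative toy model (class and method names are invented, not from the paper): the L3 *data* array is non-inclusive of the private L2s, but a separate *tag directory* stays inclusive of everything any L2 holds, so an incoming snoop can still be filtered at a single place.

```python
class NCIDLevel3:
    """Toy model of a non-inclusive-cache, inclusive-directory L3."""

    def __init__(self):
        self.data = {}          # tag -> line; NON-inclusive of L2 contents
        self.directory = set()  # tags of every line held in any private L2 (inclusive)

    def l2_fill(self, tag, line, bypass_l3=False):
        """A core brings a line into its private L2."""
        self.directory.add(tag)   # the tag directory always stays inclusive
        if not bypass_l3:         # transient/low-priority data may skip the L3 data array
            self.data[tag] = line

    def l2_evict(self, tag):
        """A private L2 drops a line; the directory no longer needs its tag."""
        self.directory.discard(tag)

    def snoop(self, tag):
        """Incoming snoop: the inclusive directory filters it completely."""
        return tag in self.directory or tag in self.data
```

The point the abstract makes falls out directly: a line that bypassed the L3 data array is still visible to snoops via the directory, so snoop filtering is preserved without data replication.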
Enabling a highly-scalable global address space model for petascale computing
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787326
V. Tipparaju, E. Aprá, Weikuan Yu, J. Vetter
Abstract: Over the past decade, the trajectory to the petascale has been built on increasing complexity and scale in the underlying parallel architectures, while software developers have struggled to provide tools that maintain the productivity of the computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward and easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the 2.3-PetaFLOPS Cray XT5 at Oak Ridge National Laboratory. We optimized our implementation with the flow-intimation technique introduced in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays (GA) programming model and a real-world chemistry application, NWChem, from small jobs up through 180,000 cores.
Citations: 14
Automatic tuning of MPI runtime parameter settings by using machine learning
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787310
Simone Pellegrini, T. Fahringer, Herbert Jordan, H. Moritsch
Abstract: MPI implementations provide several hundred runtime parameters that can be tuned for performance. The ideal parameter setting depends not only on the target multiprocessor architecture but also on the application and on its problem and communicator sizes. This paper presents ATune, an automatic performance-tuning tool that uses machine-learning techniques to determine the program-specific optimal settings for a subset of Open MPI's runtime parameters. ATune learns the behaviour of a target system in a training phase, during which several MPI benchmarks and MPI applications are run on the target architecture for varying problem and communicator sizes. For a new input program, only one run is required for ATune to deliver a prediction of the optimal runtime parameter values. Experiments based on the NAS Parallel Benchmarks, performed on a cluster of SMP machines, demonstrate the effectiveness of ATune. In these experiments, ATune derives MPI runtime parameter settings that are on average within 4% of the maximum performance achievable on the target system, resulting in a performance gain of up to 18% over the default parameter settings.
Citations: 8
Scalable event-driven native parallel processing: the SpiNNaker neuromimetic system
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787279
Alexander D. Rast, Xin Jin, F. Galluppi, L. Plana, Cameron Patterson, S. Furber
Abstract: Neural networks present a fundamentally different model of computation from the conventional sequential digital model, so modelling large networks on conventional hardware tends to be inefficient, if not impossible. Neither dedicated neural chips, with their model limitations, nor FPGA implementations, with their scalability limitations, offer a satisfactory solution, even though both have improved simulation performance dramatically. SpiNNaker introduces a different approach, the "neuromimetic" architecture, which maintains the neural optimisation of dedicated chips while offering FPGA-like universal configurability. Central to this parallel multiprocessor is an asynchronous event-driven model that uses interrupt-generating dedicated hardware on the chip to support real-time neural simulation. While this architecture is particularly suitable for spiking models, it can also implement "classical" neural models such as the MLP efficiently. Nonetheless, event handling, particularly servicing incoming packets, requires careful and innovative design to avoid local processor congestion and possible deadlock. Using two exemplar models, a spiking network of Izhikevich neurons and an MLP network, we illustrate how to implement efficient service routines to handle input events. These routines form the beginnings of a library of "drop-in" neural components. Ultimately, the goal is a library-based development system that allows the modeller to describe a model in a high-level neural description environment of their choice and use an automated tool chain to create the appropriate SpiNNaker instantiation. The complete system (universal hardware, automated tool chain, embedded system management) represents the "ideal" neural modelling environment: a general-purpose platform that can generate an arbitrary neural network and run it with hardware speed and scale.
Citations: 45
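The event-servicing pattern the SpiNNaker abstract warns about (keep the interrupt handler minimal so incoming packets never congest the local processor) is a standard split between a fast interrupt path and deferred background work. The sketch below is illustrative only; it is not SpiNNaker's real API, and the names are invented.

```python
from collections import deque

deferred = deque()  # events captured by the fast interrupt handler, drained later

def packet_interrupt(source_neuron, weight):
    """Fast path: an incoming spike packet raises an interrupt.
    The handler only records the event; all heavy state updates are deferred,
    so the handler finishes before the next packet can pile up behind it."""
    deferred.append((source_neuron, weight))

def background_loop(membrane):
    """Slow path: between interrupts, drain deferred events and apply them
    to per-source membrane state (a trivial accumulate stands in for a real
    neuron update such as the Izhikevich model)."""
    while deferred:
        src, w = deferred.popleft()
        membrane[src] = membrane.get(src, 0.0) + w
    return membrane
```

The design choice mirrors the abstract's concern: if the interrupt handler itself did the neuron update, a burst of packets could stall the core or deadlock the service routine; deferring bounds the work done at interrupt priority.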
Reversible online BIST using bidirectional BILBO
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787333
Jiaoyan Chen, D. Vasudevan, E. Popovici, M. Schellekens
Abstract: Test generation for reversible circuits is currently gaining interest due to their feasibility for quantum implementation and their asymptotically zero power dissipation. This paper proposes a novel BIST (Built-In Self-Test) method for reversible circuits. New bidirectional D-latch and D-flip-flop designs are introduced, and a reversible BILBO (Built-In Logic Block Observer), based on the conventional BILBO, is designed to facilitate the BIST procedure. The complete test procedure is executed, and experimental results are analyzed for both stuck-at and missing-gate faults (MGF), achieving 100% fault coverage.
Citations: 5
EXACT: explicit dynamic-branch prediction with active updates
Pub Date: 2010-05-17 | DOI: 10.1145/1787275.1787321
Muawya Al-Otoom, E. Forbes, E. Rotenberg
Abstract: Branches that depend directly or indirectly on load instructions are a leading cause of mispredictions by state-of-the-art branch predictors. For a branch of this type, there is a unique dynamic instance of the branch for each unique combination of producer-load addresses. Based on this definition, a study of mispredictions reveals two related problems. (i) Global branch history often fails to distinguish between different dynamic branches. In this case, the predictor is unable to specialize predictions for different dynamic branches, causing mispredictions if their outcomes differ. Ideally, the remedy is to predict a dynamic branch using its program counter (PC) and the addresses of its producer loads, since this context uniquely identifies the dynamic branch; we call this context the identity, or ID, of the dynamic branch. In general, producer loads are unlikely to have generated their addresses by the time the dynamic branch is fetched. We show that the ID of a distant retired branch in the global branch stream, combined with recent global branch history, is effective context for predicting the current branch. (ii) Fixing the first problem exposes another problem: a store to an address on which a dynamic branch depends may flip its outcome when it is next encountered. With conventional passive updates, the branch suffers a misprediction before the predictor is retrained. We propose that stores to the memory addresses on which a dynamic branch depends directly update its prediction in the predictor. This novel "active update" concept avoids mispredictions that are otherwise incurred by conventional passive training. We highlight two practical features that enable large EXACT predictors: the prediction path is scalably pipelinable by virtue of its decoupled indexing strategy, and active updates tolerate hundreds of cycles of latency, making this component ideal for virtualization in the general-purpose memory hierarchy. We also present a compact form of the predictor that caches only the dynamic instances of a static branch that differ from its overall bias.
Citations: 17
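The two EXACT ideas summarized above can be sketched with a toy table. The data structures and the simplification that a store sets one new outcome for all dependent branches are invented for illustration; the real predictor is a hardware structure, not a dictionary.

```python
class ExactPredictorSketch:
    """Toy model of EXACT: (i) predictions indexed by a dynamic branch's
    'identity' (branch PC + producer-load addresses), and (ii) 'active
    updates' where a store to a producer address rewrites the stored
    prediction instead of waiting for a misprediction to retrain it."""

    def __init__(self):
        self.table = {}       # branch ID (pc, producer addrs) -> predicted outcome
        self.addr_index = {}  # producer address -> set of dependent branch IDs

    def train(self, pc, producer_addrs, outcome):
        branch_id = (pc, tuple(sorted(producer_addrs)))
        self.table[branch_id] = outcome
        for addr in producer_addrs:
            self.addr_index.setdefault(addr, set()).add(branch_id)

    def predict(self, pc, producer_addrs, default=True):
        return self.table.get((pc, tuple(sorted(producer_addrs))), default)

    def active_update(self, addr, new_outcome):
        """A store to `addr` immediately rewrites the prediction of every
        dependent dynamic branch -- no misprediction needed to retrain."""
        for branch_id in self.addr_index.get(addr, ()):
            self.table[branch_id] = new_outcome
```

With passive training, the first re-execution after the store would mispredict; here the store itself repairs the entry, which is the misprediction the abstract says active updates avoid.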