{"title":"Addressing the challenges of future large-scale many-core architectures","authors":"P. Petrides, P. Trancoso","doi":"10.1145/2482767.2482776","DOIUrl":"https://doi.org/10.1145/2482767.2482776","url":null,"abstract":"Current processor trends show an increasing number of cores and a diversity of characteristics among them. Such processors offer a large potential for achieving high performance for different applications. Nevertheless, exploiting the characteristics of such processors is a challenge. In particular, considering all cores to be the same for scheduling tasks is not valid any longer. In this work we address three important characteristics for future many-core processors: (1) a many-core processor will include groups of different cores, (2) the latency to access off-chip memory will be larger for cores further from the on-chip memory controller and (3) as the number of cores per memory controller increases so does the pressure regarding the off-chip access bandwidth. To address these issues we propose a task assignment policy that monitors the demands of the application task and accordingly assigns the task to a better matching core if available. The assignment policy triggers, if needed, task migration in order to optimize both the execution time and the power consumption. 
In this paper we describe the assignment algorithm and how we will implement it on a many-core system.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121687741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance analysis and predictability of the software layer in dynamic binary translators/optimizers","authors":"Aleksandar Brankovic, Kyriakos Stavrou, E. Gibert, Antonio González","doi":"10.1145/2482767.2482786","DOIUrl":"https://doi.org/10.1145/2482767.2482786","url":null,"abstract":"Dynamic Binary Translators and Optimizers (DBTOs) have been established as a common architecture during the last years. They are used in many different systems, such as emulation, instrumentation tools and innovative HW/SW co-designed microarchitectures. Although many researchers worked on characterizing and reducing the emulation overhead, there are no published results that explain how the DBTO behaves from the microarchitectural prospective and how its behavior may be predicted based on high-level, guest application statistics. Such results are important for guiding design decisions and system optimization.\u0000 In this paper we study the DBTO as an independent application by dividing its functionality into modules. We show that the behavior of the DBTO is not constant at all. The contribution of the different modules in the total overhead, the overhead itself, the microarchitectural interaction with the emulated application and the microarchitectural profile of the different modules changes significantly based on the emulated application. This result comes in contrast to numerous papers that consider this behavior constant and exclude the DBTO from the simulation. 
Throughout this paper we detail this variance, we quantify it and we explain the reasons behind it.\u0000 The insights presented in this work can be exploited towards the design of more efficient DBTOs and their early performance evaluation.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134191691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Load balancing in a changing world: dealing with heterogeneity and performance variability","authors":"Michael Boyer, K. Skadron, Shuai Che, N. Jayasena","doi":"10.1145/2482767.2482794","DOIUrl":"https://doi.org/10.1145/2482767.2482794","url":null,"abstract":"Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL#8482; applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133990016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"To cache or not to cache: a trade-off analysis for locally cached database systems","authors":"K. Rietveld, H. Wijshoff","doi":"10.1145/2482767.2482807","DOIUrl":"https://doi.org/10.1145/2482767.2482807","url":null,"abstract":"In this paper, we study the feasibility of using performance models to support an analysis of the computational load in local database caching. Local database caching is used, for example, to relieve the computational load of a main DBMS in large deployments of web applications. This is done by caching part of the database contents in a DBMS local to the application server. While for common scenarios with a high browse-to-order ratio this frequently results in a reduction of the computational load, there are also scenarios in which there is not a clear advantage of local database caching. This is especially the case when each local write also results in a write into the main database server, thereby increasing the computational resource requirements. In this paper, two methods are presented which can be used to obtain significant computational parameters. We demonstrate how these parameters are used on two different hardware platforms and show that a reasonable prediction accuracy within actual measured results can be reached.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TCNet: cross-node virtual machine communication acceleration","authors":"Chunkun Bo, Rui Hou, Junmin Wu, Tao Jiang, Liuhang Zhang","doi":"10.1145/2482767.2482810","DOIUrl":"https://doi.org/10.1145/2482767.2482810","url":null,"abstract":"Driven by rapid development of cloud computing, virtualized environments are becoming popular in data center. Frequent communication among multiple virtual machines is required by a large amount of applications. Although many virtualization acceleration techniques have been proposed, the network performance is still a hot research topic due to the complicated and costly implementations of I/O virtualization mechanism. Some previous research focuses on improving the efficiency of communication among virtual machines in the same host. But studying how to accelerate cross-node virtual machine communication is also necessary. On the other hand, many high efficient, tight-coupling interconnects have been proposed as data center interconnects. They have advantages in performance and efficiency, while traditional Ethernet and InfiniBand have good scalability. However, these two kinds of interconnects can coexist very well. Tight-coupling protocol is suitable for connecting small-scale data center nodes, which we call super-node, while super-node is connected by traditional interconnect. In our opinion, data center with such hybrid interconnect architecture is one of important trends. Targeting the hybrid interconnect architecture, this paper proposes an efficient mechanism, named as TCNet (abbreviation for tight-coupling network), to accelerate cross-node virtual machine communication.\u0000 To verify the acceleration mechanism, we build a prototype system which chooses PCIe (for inner-super-node interconnect) and Ethernet (for inter-super-node interconnect) as the hybrid interconnect and use KVM as software environments. We use several benchmarks to evaluate the mechanism. 
The latency of TCNet is 23% shorter than that of Gigabit Ethernet on average and the bandwidth is 1.14 times as large as that of Gigabit Ethernet on average. Besides, we use Specweb2006 to evaluate its web service ability. TCNet can support 20% more clients simultaneously than that of Ethernet and response requests 19% faster. The results demonstrate that TCNet has great potential to accelerate cross-node virtual machine communication for data center with hybrid interconnect.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131602880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating hybrid SSD FTL schemes for Hadoop workloads","authors":"Hyeran Jeon, Kaoutar El Maghraoui, G. Kandiraju","doi":"10.1145/2482767.2482793","DOIUrl":"https://doi.org/10.1145/2482767.2482793","url":null,"abstract":"The Flash Translation Layer (FTL) is the core engine for Solid State Disks (SSD). It is responsible for managing the virtual to physical address mappings and emulating the functionality of a normal block-level device. SSD performance is highly dependent on the design of the FTL. For the last few years, several FTL schemes have been proposed. Hybrid FTL schemes have gained more popularity since they try to combine the benefits of both page-level mapping and block-level mapping schemes. Examples include BAST, FAST, LAST, etc. To provide high performance, FTL designers face several cross cutting issues: the right balance between coarse and fine grain address mapping, the asymmetric nature of reads and writes, the write amplification property of Flash memory, and the wear-out behavior of flash.\u0000 The MapReduce paradigm has become a very popular paradigm for performing parallel and distributed computations on large data. Hadoop, an open-source implementation of MapReduce, has accelerated MapReduce adoption. Flash SSD is increasingly being used as a storage solution in Hadoop deployments for faster processing and better energy utilization. Little work has been done to understand the endurance implications of SSD on Hadoop-based workloads. In this paper, using a highly flexible and reconfigurable kernel-level simulation infrastructure, we investigate the internal characteristics of various hybrid FTL schemes using a representative set of Hadoop workloads. 
Our investigation brings out the wear-out behavior of SSD for Hadoop-based workloads including wear-leveling details, garbage collection, translation and block/page mappings, and advocates the need for dynamic tuning of FTL parameters for these workloads.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130399597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"D3AS project: a different approach to the manycore challenges","authors":"L. Verdoscia, R. Vaccaro","doi":"10.1145/2212908.2212948","DOIUrl":"https://doi.org/10.1145/2212908.2212948","url":null,"abstract":"The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. The main aim of Demand Data Driven Architecture System (D3AS) project is an attempt to provide a new programming model and architecture to allow efficient programming of highly parallel systems based on thousands of simple, thin cores. After a detailed description of the proposed prototype, some experimental results, obtained by a demonstrator, are discussed. Results show that the D3AS approach is feasible and promising.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125272280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SnCTM: reducing false transaction aborts by adaptively changing the source of conflict detection","authors":"Isuru Herath, Demian Rosas-Ham, M. Luján, I. Watson","doi":"10.1145/2212908.2212919","DOIUrl":"https://doi.org/10.1145/2212908.2212919","url":null,"abstract":"Optimistic concurrency provided by Transactional Memory (TM) makes it a good candidate for maintaining synchronization in future multi-core processors. Speculative execution and bulk level conflict detection enable TM to provide synchronization at fine grain without the complexity of managing fine grain locks. Early hardware TM systems proposed to store the information needed for checking conflicts in the Level 1 (L1) cache, thereby limiting the size of a transaction to the size of the L1 cache. The introduction of signatures to TM systems removed this limitation and allowed transactions to be of any size.\u0000 However signatures produce false positives which leads to performance degradation in TM systems. The objective of introducing signatures to TM is that the size of a transaction can be bigger than the L1 cache. Once signatures are integrated to a TM system, they are used to detect conflicts regardless of the size of a transaction. This means signatures are being used even for transactions that can store their read and write sets in the L1 cache.\u0000 Based on this observation we propose SnCTM, a TM system that adaptively changes the source used to detect conflicts. In our approach, when a transaction fits in the L1 cache, cache line information is used to detect conflicts and signatures are used otherwise. By adaptively changing the source, SnCTM achieved up to 4.62 and 2.93 times speed-up over a baseline TM using lazy versioning and lazy conflict detection with two commonly used signature configurations. 
We also show that our system, even with a smaller signature (64 bit), can achieve performance comparable to a system with a perfect signature (8k bit).","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"88 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127027363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mont-Blanc: towards energy-efficient HPC systems","authors":"Nikola Puzovic","doi":"10.1145/2212908.2212961","DOIUrl":"https://doi.org/10.1145/2212908.2212961","url":null,"abstract":"This talk will present the Mont-Blanc project, an European initiative to build exascale systems using energy-efficient parts coming from the embedded market. The energy consumption of current general purpose and high-performance chips would require an unaffordable total power budget for an exascale system to be build using these parts.\u0000 The Mont-Blanc project aims to lower the total power of exascale systems by using parts from the embedded market which have a much higher FLOPS/Watt ration than traditional general purpose processor, at the cost of a lower peak performance per chip. Hence, exascale systems built using embedded parts would require a very high number of processors. In this context, overlapping communications and computations is key for applications to reach the system peak performance. This would require highly tuned application code which most users would not be able to afford. The Mont-Blanc project heavily relies on the OmpSs programming model. OmpSs provide a simple parallel programming interface that most users can easily use, and an advanced runtime system that automatically overlaps computation and communication. 
Furthermore, the OmpSs runtime system is also able to dynamically adapt the load of each node to accomplish the overall system load balance.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"546 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122503123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GA-GPU: extending a library-based global address spaceprogramming model for scalable heterogeneouscomputing systems","authors":"V. Tipparaju, J. Vetter","doi":"10.1145/2212908.2212918","DOIUrl":"https://doi.org/10.1145/2212908.2212918","url":null,"abstract":"Scalable heterogeneous computing (SHC) architectures are emerging as a response to new requirements for low cost, power efficiency, and high performance. For example, numerous contemporary HPC systems are using commodity Graphical Processing Units (GPU) to supplement traditional multicore processors. Yet scientists still face a number of challenges in utilizing SHC systems. First and foremost, they are forced to combine a number of programming models and then delicately optimize the data movement among these multiple programming systems on each architecture. In this paper, we investigate a new programming model for SHC systems that attempts to unify data access to the aggregate memory available in GPUs in the system. In particular, we extend the popular and easy to use Global Address Space (GAS) programming model to SHC systems. We explore multiple implementation options, and demonstrate our solution in the context of Global Arrays, a library based GAS model. We then evaluate these options in the context of kernels and applications, such as a scalable chemistry application: NWChem. 
Our results reveal that GA-GPU can offer considerable benefit to users in terms of programmability, and both our empirical results and performance model provide encouraging performance benefits for future systems that offer a tightly integrated memory system.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129301585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}