{"title":"Implementing the Open Community Runtime for Shared-Memory and Distributed-Memory Systems","authors":"J. Dokulil, Martin Sandrieser, S. Benkner","doi":"10.1109/PDP.2016.81","DOIUrl":"https://doi.org/10.1109/PDP.2016.81","url":null,"abstract":"The extreme scale, complexity and performance variability of future high performance computing systems pose many new challenges to parallel programming models and runtime systems. The Open Community Runtime (OCR) is a recent effort for a task-based runtime system for extreme scale parallel systems. We have implemented the OCR specification in a shared-memory environment on top of TBB, providing an alternative to the implementation created by the OCR consortium. We have created an experimental extension that supports parallel accelerators programmed with OpenCL. We also have an implementation that targets distributed-memory systems. Despite being in an early stage of development, our implementations can achieve reasonable performance with some applications. We describe the main aspects of our OCR implementations and report on early experimental results on shared-memory and distributed-memory systems.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134603892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a General Framework for Ensuring and Reusing Proofs of Termination Detection in Distributed Computing","authors":"Maha Boussabbeh, M. Tounsi, A. Kacem, M. Mosbah","doi":"10.1109/PDP.2016.113","DOIUrl":"https://doi.org/10.1109/PDP.2016.113","url":null,"abstract":"Distributed algorithms are designed to run on interconnected autonomous computing entities for achieving a common task: each entity executes asynchronously the same code and interacts locally with its immediate neighbours. It is widely agreed that the lack of knowledge of the global state makes termination detection one of the most important and complex problems in distributed computing. By relying on refinement, we prove that an algorithm computing a spanning tree with Local Termination Detection (each entity is able to determine only its own termination condition), can be reused and adapted in order to compute the same algorithm with Global Termination Detection (at least one entity is aware that the entire computation is achieved in the network). The main idea relies upon specifying a combination of a well known algorithm namely SSP and the spanning tree algorithm, following a top/down approach. This paper is a starting point towards a general framework for enhancing termination detection property of distributed algorithms and reusing their proofs.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116117666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Time Synchronization Protocol for Modular Robots","authors":"André Naz, Benoît Piranda, S. Goldstein, J. Bourgeois","doi":"10.1109/PDP.2016.73","DOIUrl":"https://doi.org/10.1109/PDP.2016.73","url":null,"abstract":"In this paper, we propose the Modular Robot Time Protocol (MRTP), a network-wide time synchronization protocol for modular robots. Our protocol achieves its performance by combining several mechanisms: central time master election, low-level time-stamping and clock skew compensation using linear regression. We evaluate our protocol on the Blinky Blocks hardware. Experimental results show that MRTP can potentially manage real systems composed of up to 27,775 Blinky Blocks. We observe that the synchronization precision depends on the hardware, the hop distance to the time master, the synchronization periods and the number of synchronization points used for the linear regressions. Furthermore, we show that our protocol is able to keep a Blinky Blocks system synchronized to a few milliseconds, using few network resources at runtime, even though the Blinky Blocks hardware clocks exhibit very poor accuracy and resolution.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124102059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation Models for NoSQL Database Consistency Characteristics","authors":"A. Burdakov, Y. Grigorev, A. Ploutenko, Eugene Ttsviashchenko","doi":"10.1109/PDP.2016.23","DOIUrl":"https://doi.org/10.1109/PDP.2016.23","url":null,"abstract":"This article considers NoSQL database replication problems. It analyzes the influence of the N, W, R replication parameters on the consistency characteristics of database record replicas (N -- the total number of one record's replicas, W -- number of replicas for write operation execution into a database, R -- number of replicas for record read operation execution from a database). It describes a developed model for eventual consistency (W+R ≤ N), obtaining probability estimate that during the process of N-W replica updates there will be at least one read request out of non-updated replicas. It also proposes a model for strong consistency of the replicas in NoSQL databases, which allows for estimation of random wait time of the read request for the record update completion. It describes the process for preparation and execution of experiments in the cloud for model calibration and its validation.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of the Memory Communication Traffic in a Hierarchical Cache Model for Massively-Manycore Processors","authors":"Sharifa Al Khanjari, W. Vanderbauwhede","doi":"10.1109/PDP.2016.30","DOIUrl":"https://doi.org/10.1109/PDP.2016.30","url":null,"abstract":"The scaling of semiconductor technologies is leading to processors with increasing numbers of cores. A key enabler in manycore systems is the use of Networks-on-Chip (NoC) as a global communication mechanism. The adoption of NoCs in manycore systems requires a shift in focus from computation to communication, as communication is fast becoming the dominant factor in processor performance. Many researchers have focused on direct communication between cores in the NoC, however in a manycore processor the communication is actually between the cores and the memory hierarchy. In this work, we investigate the memory communication traffic of shared threads in a hierarchical cache architecture. We argue that the performance scalability for shared-memory applications in a hierarchical cache architecture for systems with thousands of processor cores depends on the distance between threads sharing memory in terms of the cache hierarchy (the \"memory distance\"). We present latency and throughput results comparing fat quadtree, concentrated mesh and mesh topologies as a function of the \"memory distance\" between the threads. Our results using the ITRS physical data for 2023 show that the model of thread placement and the distance of placing them significantly affects the NoC performance, and that scale-invariant topologies perform better than flat topologies.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121468446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-Accelerated Texture Analysis Using Steerable Riesz Wavelets","authors":"A. Vizitiu, L. Itu, Ranveer Joyseeree, A. Depeursinge, H. Müller, C. Suciu","doi":"10.1109/PDP.2016.105","DOIUrl":"https://doi.org/10.1109/PDP.2016.105","url":null,"abstract":"Visual pattern recognition is a key research topic in the field of image processing and computer vision. Texture analysis based on steerable Riesz wavelets is powerful, but requires computing pixel-wise operations resulting in a run time in the order of days when large volumes of data are processed. To overcome this limitation we propose a Graphics Processing Unit (GPU) based solution. A standard CPU version is used as starting point for the development of baseline GPU versions. To further increase the performance, and to overcome compute and memory limitations we apply a series of optimization techniques, leading to five versions in total. The best performing GPU solution ensures a speed-up of 93× for the parallelized section of the application and of 29.6× for the entire application. Furthermore, we show that a higher Riesz order and/or a higher image resolution further increases the speed-up.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125505908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reasoning about Fences and Relaxed Atomics","authors":"Mengda He, Viktor Vafeiadis, S. Qin, J. Ferreira","doi":"10.1109/PDP.2016.103","DOIUrl":"https://doi.org/10.1109/PDP.2016.103","url":null,"abstract":"For efficiency reasons, weak (or relaxed) memory is now the norm on modern architectures. To cater for this trend, modern programming languages are adapting their memory models. The new C11 memory model [1] allows several levels of memory weakening, including non-atomics, relaxed atomics, release-acquire atomics, and sequentially consistent atomics. Under such weak memory models, multithreaded programs exhibit more behaviours, some of which would have been inconsistent under the traditional strong (i.e. sequentially consistent) memory model. This makes the task of reasoning about concurrent programs even more challenging. The GPS framework, recently developed by Turon et al.[22], has made a step forward towards tackling this challenge. By integrating ghost states, per-location protocols and separation logic, GPS can successfully verify programs with release-acquire atomics. In this paper, we present a program logic, an enhancement of the GPS framework, that can support the verification of a bigger class of C11 programs, that is, programs with release-acquire atomics, relaxed atomics and release-acquire fences. Key elements of our proposed logic include two new types of assertions, a more expressive resource model and a set of newly-designed verification rules.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"441 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134276467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy Efficient Scheduling of Real Time Signal Processing Applications through Combined DVFS and DPM","authors":"Erwan Nogues, M. Pelcat, D. Ménard, Alexandre Mercat","doi":"10.1109/PDP.2016.15","DOIUrl":"https://doi.org/10.1109/PDP.2016.15","url":null,"abstract":"This paper proposes a framework to design energy efficient signal processing systems. The energy efficiency is provided by combining Dynamic Frequency and Voltage Scaling (DVFS) and Dynamic Power Management (DPM). The framework is based on Synchronous Dataflow (SDF) modeling of signal processing applications. A transformation to a single rate form is performed to expose the application parallelism. An automated scheduling is then performed, minimizing the constraint of energy efficiency and providing DVFS and DPM decisions. This framework uses an architecture model including the number of available cores, the per-actor processing load and the energy per-cycle, derived from time and power measurements of modelled applications. After introducing the proposed framework, the energy characterization of big.LITTLE SoC systems is described. A generic approach is presented to generate the energy model of a platform from power measurements as customized polynomials. Finally, the experimental results on a Samsung Exynos 5410 big.LITTLE processor show that the energy optimal execution is not obtained by Linux governors that can execute either as-fast-as-possible or as-slow-as-possible. Instead, the most energy efficient scheduling is obtained by adapting both DVFS and DPM to application needs.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130727189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting Performance and Power Consumption of Parallel Applications","authors":"D. D. Sensi","doi":"10.1109/PDP.2016.41","DOIUrl":"https://doi.org/10.1109/PDP.2016.41","url":null,"abstract":"Current architectures provide many control knobs for the reduction of power consumption of applications, like reducing the number of used cores or scaling down their frequency. However, choosing the right values for these knobs in order to satisfy requirements on performance and/or power consumption is a complex task and trying all the possible combinations of these values is an unfeasible solution since it would require too much time. For this reason, there is the need for techniques that allow an accurate estimation of the performance and power consumption of an application when a specific configuration of the control knobs values is used. Usually, this is done by executing the application with different configurations and by using this information to predict its behaviour when the values of the knobs are changed. However, since this is a time consuming process, we would like to execute the application in the fewest number of configurations possible. In this work, we consider as control knobs the number of cores used by the application and the frequency of these cores. We show that on most Parsec benchmark programs, by executing the application in 1% of the total possible configurations and by applying a multiple linear regression model we are able to achieve an average accuracy of 96% in predicting its execution time and power consumption in all the other possible knobs combinations.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121096555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Nested Graphs to Distribute Parallel and Distributed Multi-agent Systems","authors":"A. Rousset, B. Herrmann, C. Lang, L. Philippe, Hadrien Bride","doi":"10.1109/PDP.2016.91","DOIUrl":"https://doi.org/10.1109/PDP.2016.91","url":null,"abstract":"Simulation has become an indispensable tool for researchers to explore systems without having recourse to real experiments. In this context multi-agent systems are often used to model and simulate complex systems. Depending on the characteristics of the modelled system, methods used to represent the system may vary. Whatever the modelling techniques used, increasing the size and the precision of a model increases the amount of computation needed, requiring the use of parallel systems when it becomes too large. Usually, to efficiently run on parallel resources, the model must be adapted to be distributed. In this paper, we propose a new modelling approach, based on nested graphs, that allows the design of large, complex and multi-scale multi-agent models which can be efficiently distributed on parallel resources. A PDMAS (Parallel and Distributed Multi-Agent Platform) that supports this approach and efficiently runs parallel multi-agent models is introduced.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126189693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}