{"title":"Improved Route Selection Approaches using Q-learning framework for 2D NoCs","authors":"Niyati Gupta, Manoj Kumar, Ashish Sharma, M. Gaur, V. Laxmi, M. Daneshtalab, M. Ebrahimi","doi":"10.1145/2768177.2768180","DOIUrl":"https://doi.org/10.1145/2768177.2768180","url":null,"abstract":"With the emergence of large multi-core architectures, a volume of research has been focused on distributing traffic evenly over the whole network. However, increase in traffic density may lead to congestion and subsequently degrade the performance by increased latency in the network. In this paper, we propose two novel route selection strategies for on-chip networks which are based on the Q-learning framework. The proposed strategies use variable learning rate to dynamically capture the current congestion status of the network using an additional parameter and improves the learning process to select a less congested output channel. Both the proposed selection strategies are found to adapt significantly faster to the changes in traffic load and traffic patterns by avoiding congested areas. The results demonstrate that proposed strategies achieve significant performance improvement over conventional Q-routing and its variants with slight area-overhead.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"449 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116332052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Programming Model for the Epiphany Many-Core Coprocessor Using Threaded MPI","authors":"J. Ross, D. Richie, S. Park, D. Shires","doi":"10.1145/2768177.2768183","DOIUrl":"https://doi.org/10.1145/2768177.2768183","url":null,"abstract":"The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. It offers high computational energy efficiency for both integer and floating point calculations as well as parallel scalability. Yet despite the interesting architectural features, a compelling programming model has not been presented to date. This paper demonstrates an efficient parallel programming model for the Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a conventional parallel distributed cluster of serial cores. Our approach enables MPI codes to execute on the RISC array processor with little modification and achieve high performance. We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix-matrix multiplication, N-body particle interaction, a five-point 2D stencil update, and 2D FFT) and highlight the importance of fast inter-core communication for the architecture.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122489368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FOLCS: A Lightweight Implementation of a Cycle-accurate NoC Simulator on FPGAs","authors":"Takahiro Naruko, K. Hiraki","doi":"10.1145/2768177.2768182","DOIUrl":"https://doi.org/10.1145/2768177.2768182","url":null,"abstract":"Recent trends toward multi- and many-core architectures make computer architecture simulation time-consuming. Although core counts are increasing, it is difficult to exploit parallelism in simulators because of synchronization overheads. FPGAs are effective tools to reduce simulation time. The size of a circuit implementable on them, however, is limited by the number of block RAMs and slices they have. It is important to develop a lightweight simulator of each processor component so that a full-system simulator as a whole fits into an FPGA. In this paper, we focus on a network-on-chip (NoC), which is an intra-chip communication fabric to connect cores and memory controllers. We present Flit-Oriented Lightweight Cycle-accurate network Simulator (FOLCS) that is a NoC simulator running on an FPGA. FOLCS provides a cycle-accurate NoC model with moderate resource requirements. The accuracy is validated by case studies that compare network latency computed by FOLCS and a reference software simulator. The post place-and-route report shows that FOLCS requires less block RAMs than previous methods.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133732104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Feasibility of Advanced Cache Indexing for High-Performance and Energy-Efficient GPGPU Computing","authors":"Kyu Yeun Kim, Seunghoe Kim, Woongki Baek","doi":"10.1145/2768177.2768179","DOIUrl":"https://doi.org/10.1145/2768177.2768179","url":null,"abstract":"To achieve higher performance and energy efficiency, GPGPU architectures have recently begun to employ hardware caches. Adding hardware caches to GPGPUs, however, does not automatically guarantee improved performance and energy efficiency due to the thrashing in small hardware caches shared by thousands of threads. While prior work has proposed warp scheduling and cache bypassing techniques to address this issue, relatively little work has been done in the context of advanced cache indexing. To bridge this gap, this work investigates the feasibility of advanced cache indexing for high-performance and energy-efficient GPGPU computing. We first discuss the design and implementation of static and adaptive cache indexing schemes for GPGPUs. We then quantify the effectiveness of the advanced indexing schemes using GPGPU benchmarks. Our quantitative evaluation demonstrates that the advanced cache indexing schemes are promising in that they significantly outperform the conventional cache indexing scheme. In addition, for a subset of cache-sensitive benchmarks, the adaptive indexing scheme substantially outperforms the static indexing scheme by effectively identifying and utilizing high-quality indexing bits based on runtime information. Finally, our evaluation shows that the effectiveness of advanced cache indexing is sensitive to different warp schedulers, motivating further research on coordinated cache indexing and warp scheduling techniques.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"226 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134604680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","authors":"M. Ebrahimi, D. Goehringer","doi":"10.1145/2768177","DOIUrl":"https://doi.org/10.1145/2768177","url":null,"abstract":"","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"58 S274","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120835317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Scheduler Performance on the Plural Many-Core Architecture","authors":"Itai Avron, R. Ginosar","doi":"10.1145/2768177.2768184","DOIUrl":"https://doi.org/10.1145/2768177.2768184","url":null,"abstract":"The Plural many-core architecture combines hundreds of simple cores, lock-free shared memory, hardware scheduler and a task-based programming model. The hardware scheduler enables fast scheduling and allocation of fine grain tasks to all cores. Scheduler performance is evaluated based on an architectural simulator and on multiple benchmarks representing a wide variety of inherent parallelism. Several architectural alternatives and scheduler configurations are simulated. It is shown that a scheduler with capacity to schedule and terminate 10 task-instances per cycle, along with a task queue of as little as two slots near each core, is sufficient to utilize 256 cores.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121980357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Viability of Maximum Flexibility Selection Function in Bufferless 2D Meshes","authors":"M. A. A. ElMohsen, H. M. El-Boghdadi","doi":"10.1145/2768177.2768185","DOIUrl":"https://doi.org/10.1145/2768177.2768185","url":null,"abstract":"Bufferless NoCs have emerged as a solution to reduce power and area by eliminating buffers used for routing. Such networks handle contention using packet dropping or deflection. In this paper, we study the effect of MaxFlex selection function on 2D bufferless meshes for both a fixed and a variable step size. For fixed step size, we perform an analytical study for the effect of using MaxFlex with different step size on the performance of 2D bufferless meshes. The analysis indicates that, as the step size increases the traffic in the central part of the network bisection relaxes. Simulation results show that, both average packet latency and average deflection count decrease as the step size used increases. Additionally, over different sizes of meshes, the results show that the network performs best if the step size is equal 60--80% of the mesh dimension. Then, we consider using variable step size in which a packet is routed using a step size dependent on the Manhattan distance, d, between the source and destination. Simulation results show that, using MaxFlex, a step size of 60% of the distance d enhances the packet latency over using fixed step size, straight line selection function and random productive port selection function by around 29%, 97% and 99% respectively.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125059536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Design Methodology for Performance Maintenance of 3D Network-on-Chip with Multiplexed Through-Silicon Vias","authors":"Mostafa Said, Farhad Mehdipour, K. Murakami, M. El-Sayed","doi":"10.1145/2768177.2768178","DOIUrl":"https://doi.org/10.1145/2768177.2768178","url":null,"abstract":"3D integration is an emerging technology that overcomes 2D integration process limitations. The use of short Through-Silicon Vias (TSVs) introduces a significant reduction in routing area, power consumption, and delay. Though, there are still several challenges in 3D integration technology need to be addressed. It is shown in literature that reducing TSV count has a considerable effect in improving yield. The TSV multiplexing technique called TSVBOX was introduced in [1] to reduce the TSV count without affecting the direct benefits of TSVs. The TSVBOX introduces some delay to the signals to be multiplexed. In this paper, we analyse the TSVBOX timing requirements and deduce a design methodology for TSVBOX-based 3D Network-on-Chip (NoC) to overcome the TSVBOX speed degradation. Performance comparisons under different traffic patterns are conducted to verify our solution. We show that TSVBOX-based 3D NoC performance is highly dependent on the NoC traffic pattern and in most simulation scenarios we tried, it shows almost the same performance of the conventional 3D NoC.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124346246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoPHy: A Software Platform for Hybrid Resource Management of Homogeneous Many-core Accelerators","authors":"Taeyoung Kim, Jintaek Kang, Sungchan Kim, S. Ha","doi":"10.1145/2768177.2768181","DOIUrl":"https://doi.org/10.1145/2768177.2768181","url":null,"abstract":"As demand of higher computing power is steadily increasing, it becomes popular to equip a many-core accelerator in a computer system to run current applications. Efficient management of compute resources in such a system is challenging because various factors such as workload variation, QoS requirement change, and hardware failure may cause dynamic change of system status. Recently a variety of resource management techniques for many-core accelerators have been proposed. They are usually tailored to a specific target architecture. In this paper, we propose a software platform, SoPHy, which supports various types of many-core architectures, based on a hybrid resource management technique. SoPHy has been implemented on two different many-core architectures: the Xeon Phi coprocessor and a NoC virtual prototype. Experimental results prove that SoPHy is capable of adapting to the runtime workload variation effectively with affordable overhead of runtime resource management.","PeriodicalId":374555,"journal":{"name":"Proceedings of the 3rd International Workshop on Many-core Embedded Systems","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124391859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}