{"title":"Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators","authors":"Tian Chen , Yu-an Tan , Zheng Zhang , Nan Luo , Bin Li , Yuanzhang Li","doi":"10.1016/j.jpdc.2024.104869","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104869","url":null,"abstract":"<div><p>Because convolution layers have been shown to be the most time-consuming operations in convolutional neural network (CNN) algorithms, many efficient CNN accelerators have been designed to boost the performance of convolution operations. Previous works on CNN acceleration usually use fixed design variables for diverse convolutional layers, which leads to inefficient data movement and low utilization of computing resources. We tackle this issue by proposing a flexible dataflow optimization method with design variables estimation for different layers. The optimization method first narrows the design space using a priori constraints, and then enumerates all legal solutions to select the optimal design variables. We demonstrate the effectiveness of the proposed optimization method by implementing representative CNN models (VGG-16, ResNet-18 and MobileNet V1) on Enflame Technology's programmable CNN accelerator, General Computing Unit (GCU). The results indicate that our optimization can significantly enhance the throughput of the convolution layers in ResNet, VGG and MobileNet on GCU, with improvement of up to 1.84×. 
Furthermore, it achieves up to 2.08× of GCU utilization specifically for the convolution layers of ResNet on GCU.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140067279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive patch grid strategy for parallel protein folding using atomic burials with NAMD","authors":"Emerson A. Macedo, Alba C.M.A. Melo","doi":"10.1016/j.jpdc.2024.104868","DOIUrl":"10.1016/j.jpdc.2024.104868","url":null,"abstract":"<div><p>The definition of protein structures is currently an important research topic in molecular biology, since there is a direct relationship between the function of the protein in the organism and the 3D geometric configuration it adopts. The transformations that occur in the protein structure from the 1D configuration to the 3D form are called protein folding. <em>Ab initio</em> protein folding methods use physical forces to model the interactions among the atoms that compose the protein. In order to accelerate those methods, parallel tools such as NAMD were proposed. In this paper, we propose two contributions to parallel protein folding simulations: (a) adaptive patch grid (APG) and (b) the addition of atomic burials (AB) to the traditional forces used in the simulation. With APG, we are able to adapt the simulation box (patch grid) to the current shape of the protein during the folding process. AB forces relate the 3D protein structure to its geometric center and are adequate for modeling globular proteins. Thus, adding AB to the forces used in parallel protein folding potentially increases the quality of the result for this class of proteins. APG and AB were implemented in NAMD and tested in supercomputer environments. Our results show that, with APG, we are able to reduce the execution time of the folding simulation of protein 4LNZ (5,714 atoms, 15 million time steps) from 12 hours and 36 minutes to 11 hours and 8 minutes, using 16 nodes (256 CPU cores). 
We also show that our APG+AB strategy was successfully used in a realistic protein folding simulation (1.7 billion time steps).</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140054484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HoneyTwin: Securing smart cities with machine learning-enabled SDN edge and cloud-based honeypots","authors":"Mohammed M. Alani","doi":"10.1016/j.jpdc.2024.104866","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104866","url":null,"abstract":"<div><p>With the promise of higher throughput and better response times, 6G networks provide a significant enabler for smart cities to evolve. The rapidly growing reliance on connected devices within the smart city context encourages malicious actors to target these devices to achieve various malicious goals. In this paper, we present a novel defense technique that creates a cloud-based virtualized honeypot/twin that is designed to receive malicious traffic through an edge-based machine learning-enabled detection system. The proposed system performs early identification of malicious traffic in a software defined network-enabled edge routing point to divert that traffic away from the 6G-enabled smart city endpoints. Testing of the proposed system showed an accuracy exceeding 99.8%, with an <span><math><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span> score of 0.9984.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139942060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical sort-based parallel algorithm for dynamic interest matching","authors":"Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen","doi":"10.1016/j.jpdc.2024.104867","DOIUrl":"10.1016/j.jpdc.2024.104867","url":null,"abstract":"<div><p>Publish–subscribe communication is a fundamental service used for message-passing between decoupled applications in distributed simulation. When abundant unnecessary data transfer is introduced, interest-matching services are needed to filter irrelevant message traffic. Frequent demands during simulation execution make interest matching a bottleneck as simulation scale increases. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources. Parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transferring the region-matching task to one of node-matching. It utilizes the association between adjacent nodes and the hierarchical relation between parent‒child nodes to eliminate redundant operations, and achieves incremental parallel matching that only compares changed regions. We analyze the time and space complexity of this process. 
The new algorithm performs better and is more scalable than state-of-the-art algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139923545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting I/O bandwidth-sharing strategies for HPC applications","authors":"Anne Benoit , Thomas Herault , Lucas Perotin , Yves Robert , Frédéric Vivien","doi":"10.1016/j.jpdc.2024.104863","DOIUrl":"10.1016/j.jpdc.2024.104863","url":null,"abstract":"<div><p>This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (<figure><img></figure>) or fair-sharing the bandwidth across them (<span>FairShare</span>). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely <em>steady-state windows</em>, which enables us to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. 
The main conclusion is that two of our simple and low-complexity greedy strategies significantly outperform <figure><img></figure>, <span>FairShare</span> and I/O-Sets, and we recommend that the I/O community implement them for further assessment.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139878546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(24)00023-6","DOIUrl":"https://doi.org/10.1016/S0743-7315(24)00023-6","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000236/pdfft?md5=8661326c859cab793505056ef1edee51&pid=1-s2.0-S0743731524000236-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139726370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring multiprocessor approaches to time series analysis","authors":"Ricardo Quislant, Eladio Gutierrez, Oscar Plata","doi":"10.1016/j.jpdc.2024.104855","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104855","url":null,"abstract":"<div><p>Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, and economics. <em>Matrix Profile</em>, a state-of-the-art algorithm for time series analysis, finds the most similar and dissimilar subsequences in a time series in deterministic time, and it is exact. Matrix Profile has low arithmetic intensity and operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an alternative optimistic synchronization method that executes transactions speculatively in parallel while keeping track of memory accesses to detect and resolve conflicts.</p><p>This work evaluates one of the best implementations of Matrix Profile, exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers), as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. 
The experimental evaluation shows that our proposals can achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000194/pdfft?md5=a25b14cc13a327c9c4b6c5f9abde8126&pid=1-s2.0-S0743731524000194-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast recovery for large disk enclosures based on RAID2.0: Algorithms and evaluation","authors":"Qiliang Li , Min Lyu , Liangliang Xu , Yinlong Xu","doi":"10.1016/j.jpdc.2024.104854","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104854","url":null,"abstract":"<div><p>The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources like memory and CPU cause RAID2.0 to execute batch recovery for disk failures. The traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down the recovery speed. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks with multiple solutions of reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. 
With varied rebuilding bandwidth, the improvement can reach up to 65.00%.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the effectiveness of Bat optimization in an adaptive and energy-efficient network-on-chip routing framework","authors":"B. Naresh Kumar Reddy , Aruru Sai Kumar","doi":"10.1016/j.jpdc.2024.104853","DOIUrl":"10.1016/j.jpdc.2024.104853","url":null,"abstract":"<div><p>Adaptive routing is effective in maintaining high processor performance in a multiprocessor system on chip: it sends packets over minimal or non-minimal alternate routes to avoid congestion. However, many systems cannot deal with the fact that sending packets over an alternative path rather than the shorter, fixed-priority route can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through different paths. In real-world network systems, there are strategies and algorithms to efficiently handle out-of-order packets without requiring infinite memory. Techniques like buffering, sliding windows, and sequence number management are used to reorder packets while considering the practical constraints of available memory and processing power. The specific method used depends on the network protocol and the requirements of the application. We propose a novel technique that aims to improve the performance of multiprocessor systems on chip by implementing adaptive routing based on the Bat algorithm. The framework employs a 5-stage pipeline router that receives a complete packet and forwards it in the best direction in an adaptive mode. The Bat algorithm is used to enhance performance by optimizing the route over which packets are transmitted to the destination. The technique was tested on NoC sizes of 6 × 6 and 8 × 8 under multimedia benchmarks, compared with other related algorithms, and implemented on a Kintex-7 FPGA board. 
The outcomes of the simulation illustrate that the proposed algorithm reduces delay and improves the throughput over the other traditional adaptive algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative dispersion by silent robots","authors":"Barun Gorain , Partha Sarathi Mandal , Kaushik Mondal , Supantha Pandit","doi":"10.1016/j.jpdc.2024.104852","DOIUrl":"10.1016/j.jpdc.2024.104852","url":null,"abstract":"<div><p>In the dispersion problem, a set of <em>k</em> co-located mobile robots must relocate themselves in distinct nodes of an unknown network. The network is modeled as an anonymous graph <span><math><mi>G</mi><mo>=</mo><mo>(</mo><mi>V</mi><mo>,</mo><mi>E</mi><mo>)</mo></math></span>, where the graph's nodes are not labeled. The edges incident to a node <em>v</em> with degree <em>d</em> are labeled with port numbers in the range <span><math><mo>{</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><mi>d</mi><mo>−</mo><mn>1</mn><mo>}</mo></math></span> at <em>v</em>. The robots have unique IDs in the range <span><math><mo>[</mo><mn>0</mn><mo>,</mo><mi>L</mi><mo>]</mo></math></span>, where <span><math><mi>L</mi><mo>≥</mo><mi>k</mi></math></span>, and are initially placed at a source node <em>s</em>. The task of the dispersion was traditionally achieved based on the assumption of two types of communication abilities: (a) when some robots are at the same node, they can communicate by exchanging messages between them, and (b) any two robots in the network can exchange messages between them. This paper investigates whether this communication ability among co-located robots is absolutely necessary to achieve dispersion. We establish that even in the absence of the ability of communication, the task of the dispersion by a set of mobile robots can be achieved in a much weaker model, where a robot at a node <em>v</em> has access to the following very restricted information at the beginning of any round: (1) am I alone at <em>v</em>? 
(2) did the number of robots at <em>v</em> increase or decrease compared to the previous round?</p><p>We propose a deterministic distributed algorithm that achieves the dispersion on any given graph <span><math><mi>G</mi><mo>=</mo><mo>(</mo><mi>V</mi><mo>,</mo><mi>E</mi><mo>)</mo></math></span> in time <span><math><mi>O</mi><mrow><mo>(</mo><mi>k</mi><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><msup><mrow><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></mrow></math></span>, where Δ is the maximum degree of a node in <em>G</em>. Further, each robot uses <span><math><mi>O</mi><mo>(</mo><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></math></span> additional memory, i.e., memory other than the memory required to store its id. We also prove that the task of the dispersion cannot be achieved by a set of mobile robots with <span><math><mi>o</mi><mo>(</mo><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></math></span> additional memory.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}