{"title":"A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation","authors":"Ye Yu;Niraj K. Jha","doi":"10.1109/TMSCS.2018.2882433","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2882433","url":null,"abstract":"The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer due to the increasing clock rate of a single CPU core, but on account of the increase of core counts in the CMP. More transistors are integrated within the same footprint area as the technology node shrinks to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds the coping capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all the devices powered-on. This is known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture reduces both power and energy significantly at a better performance for both types of applications. 
Compared to the baseline, our architecture achieves an average of 43.1× and 2.5× speedup on compute-intensive and memory-intensive benchmarks, respectively. The power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and 2.0× and 4.2×, respectively, for memory-intensive applications. This translates to 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"533-547"},"PeriodicalIF":0.0,"publicationDate":"2018-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2882433","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"H$^2$OEIN: A Hierarchical Hybrid Optical/Electrical Interconnection Network for Exascale Computing Systems","authors":"Yunfeng Lu;Huaxi Gu;Krishnendu Chakrabarty;Yintang Yang","doi":"10.1109/TMSCS.2018.2881715","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2881715","url":null,"abstract":"The performance of high-performance computing (HPC) systems is largely determined by the interconnection network. The rising demand for computing capability leads to an expansion of the interconnection network and a corresponding increase in system cost and power consumption. The growing use of optical interconnects not only reduces the network cost and power consumption, but also meets the system-scaling bandwidth demands. However, unlike in an electrical switch, the lack of a buffer in the optical switch makes it hard to operate an all-optical network at packet-level granularity. In this paper, we propose a hierarchical hybrid optical/electrical interconnection network (H$^2$OEIN) based on low-radix switches and arrayed waveguide grating routers (AWGRs). In the lower layers, the use of low-radix switches results in lower cost and power consumption. The modular structure composed of low-radix switches facilitates the expansion of the network. At higher layers, high bandwidth and fast switching can be achieved using AWGR based optical interconnects. Because the higher layers of the network are passive, the power consumption can be reduced to a large extent. 
Network simulation results show that H$^2$OEIN reduces the cost by 25 percent and the power consumption by 45 percent compared to a dragonfly network in configurations with over 300,000 nodes.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"722-733"},"PeriodicalIF":0.0,"publicationDate":"2018-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2881715","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel, Simulator for Heterogeneous Cloud Systems that Incorporate Custom Hardware Accelerators","authors":"Nikolaos Tampouratzis;Ioannis Papaefstathiou","doi":"10.1109/TMSCS.2018.2879601","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2879601","url":null,"abstract":"The growing use of hardware accelerators in both embedded (e.g., automotive) and high end systems (e.g., Cloud infrastructure) triggers an urgent demand for simulation frameworks that can simulate in an integrated manner all the components (i.e., CPUs, Memories, Networks, and Hardware Accelerators) of a system-under-design(SuD). By utilizing such a simulator, software design can proceed in parallel with hardware development which results in the reduction of the so important time-to-market. The main problem, however, is that currently there is a shortage of such simulation frameworks; most simulators used for modelling the user applications (i.e., full-system CPU/Mem/Peripheral simulators) lack any type of support for tailor-made hardware accelerators. The presented ACSIM framework is the first known open-source, high-performance simulator that can handle holistically system-of-systems including processors, peripherals, accelerators, and networks; such an approach is, for example, very appealing for the design of Cloud Servers that incorporate FPGAs as PCI-connected accelerators. ACSIM is an extension of the COSSIM simulation framework and it integrates, in a novel and efficient way, a combined system and network simulator with a SystemC simulator, in a transparent to the end-user way. ACSIM has been evaluated when executing several real-world use cases; the end results demonstrate that the presented approach has up to 99 percent accuracy in the reported SuD aspects (when compared with the corresponding characteristics measured in the real systems), while the overall simulation time can be accelerated almost linearly with the number of CPUs utilized by the simulator. 
More importantly, the presented interconnection scheme between the Processing and the SystemC simulators is orders of magnitude faster than the existing solutions, while ACSIM can efficiently simulate up to several hundreds of processing nodes with hardware accelerators interconnected together, in a fully distributed manner.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"565-576"},"PeriodicalIF":0.0,"publicationDate":"2018-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2879601","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enforcing End-to-End I/O Policies for Scientific Workflows Using Software-Defined Storage Resource Enclaves","authors":"Suman Karki;Bao Nguyen;Joshua Feener;Kei Davis;Xuechen Zhang","doi":"10.1109/TMSCS.2018.2879096","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2879096","url":null,"abstract":"Data-intensive knowledge discovery requires scientific applications to run concurrently with analytics and visualization codes executing in situ for timely output inspection and knowledge extraction. Consequently, I/O pipelines of scientific workflows can be long and complex because they comprise many stages of analytics across different layers of the I/O stack of high-performance computing systems. Performance limitations at any I/O layer or stage can cause an I/O bottleneck resulting in greater than expected end-to-end I/O latency. In this paper, we present the design and implementation of a novel data management infrastructure called Software-Defined Storage Resource Enclaves (SIREN) at system level to enforce end-to-end policies that dictate an I/O pipeline's performance. SIREN provides an I/O performance interface for users to specify the desired storage resources in the context of in-situ analytics. If suboptimal performance of analytics is caused by an I/O bottleneck when data are transferred between simulations and analytics, schedulers in different layers of the I/O stack automatically provide the guaranteed lower bounds on I/O throughput. 
Our experimental results demonstrate that SIREN provides performance isolation among scientific workflows sharing multiple storage servers across two I/O layers (burst buffer and parallel file systems) while maintaining high system scalability and resource utilization.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"662-675"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2879096","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Register-Complexity Systolic Digit-Serial Multiplier Over $GF(2^m)$ Based on Trinomials","authors":"Jiafeng Xie;Pramod Kumar Meher;Xiaojun Zhou;Chiou-Yng Lee","doi":"10.1109/TMSCS.2018.2878437","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2878437","url":null,"abstract":"Digit-serial systolic multipliers over $GF(2^m)$ based on the National Institute of Standards and Technology (NIST) recommended trinomials play a critical role in the real-time operations of cryptosystems. Systolic multipliers over $GF(2^m)$ involve a large number of registers of size $O(m^2)$ which results in significant increase in area complexity. In this paper, we propose a novel low register-complexity digit-serial trinomial-based finite field multiplier. The proposed architecture is derived through two novel coherent interdependent stages: (i) derivation of an efficient hardware-oriented algorithm based on a novel input-operand feeding scheme and (ii) appropriate design of novel low register-complexity systolic structure based on the proposed algorithm. The extension of the proposed design to Karatsuba algorithm (KA)-based structure is also presented. The proposed design is synthesized for FPGA implementation and it is shown that it (the design based on regular multiplication process) could achieve more than 12.1 percent saving in area-delay product and nearly 2.8 percent saving in power-delay product. 
To the best of the authors’ knowledge, the register-complexity of proposed structure is so far the least among the competing designs for trinomial based systolic multipliers (for the same type of multiplication algorithm).","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"773-783"},"PeriodicalIF":0.0,"publicationDate":"2018-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2878437","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency Offset-Based Ring Oscillator Physical Unclonable Function","authors":"Jiliang Zhang;Xiao Tan;Yuanjing Zhang;Weizheng Wang;Zheng Qin","doi":"10.1109/TMSCS.2018.2877737","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2877737","url":null,"abstract":"Weak Physical Unclonable Function (PUF) is a promising lightweight hardware security primitive that is used for secret key generation without the requirement of secure nonvolatile electrically erasable programmable read-only memory (EEPROM) or battery backed static random-access memory (SRAM) for resource-limited applications such as Internet of Thing (IoT) and embedded systems. The Ring Oscillator (RO) PUF is one of the most popular weak PUFs that can generate the volatile key by comparing the frequency difference between any two ROs. However, it is difficult for the RO PUF to maintain an absolutely stable response with operating environment varies. In order to eliminate the impact of environment factors, previous RO PUFs incur significant hardware overheads to improve the reliability. This paper proposes a frequency offset-based RO PUF structure which exhibits high reliability and low hardware overhead. The key idea is to make the frequency difference larger than a given threshold by offsetting the frequencies of RO pairs to improve reliability. 
Prototype implementation on Xilinx 65 nm Field-programmable Gate Arrays (FPGAs) shows the low overhead of the new structure and 100 percent reliability with temperature range of 45 $^\\circ\\mathrm{C}$ $\\sim$ 95 $^\\circ\\mathrm{C}$.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"711-721"},"PeriodicalIF":0.0,"publicationDate":"2018-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2877737","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TPR: Traffic Pattern-Based Adaptive Routing for Dragonfly Networks","authors":"Peyman Faizian;Juan Francisco Alfaro;Md Shafayat Rahman;Md Atiqul Mollah;Xin Yuan;Scott Pakin;Michael Lang","doi":"10.1109/TMSCS.2018.2877264","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2877264","url":null,"abstract":"The Cray Cascade architecture uses Dragonfly as its interconnect topology and employs a globally adaptive routing scheme called UGAL. UGAL directs traffic based on link loads but may make inappropriate adaptive routing decisions in various situations, which degrades its performance. In this work, we propose traffic pattern-based adaptive routing (TPR) for Dragonfly that improves UGAL by incorporating a traffic pattern-based adaptation mechanism. The idea is to explicitly use the link usage statistics that are collected in performance counters to infer the traffic pattern, and to take the inferred traffic pattern plus link loads into consideration when making adaptive routing decisions. Our performance evaluation results on a diverse set of traffic conditions indicate that by incorporating the traffic pattern-based adaptation mechanism, TPR is much more effective in making adaptive routing decisions and achieves significant lower latency under low load and higher throughput under high load than its underlying UGAL.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"931-943"},"PeriodicalIF":0.0,"publicationDate":"2018-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2877264","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67861394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Fluid-Chip Co-Design for Digital Microfluidic Biochips Considering Cost Drivers and Design Convergence","authors":"Arpan Chakraborty;Piyali Datta;Rajat Kumar Pal","doi":"10.1109/TMSCS.2018.2874248","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2874248","url":null,"abstract":"The design process for digital microfluidic biochips (DMFBs) is becoming more complex due to the growing need for essential bio-protocols. A number of significant fluid- and chip-level synthesis tools have been offered previously for designing an efficient system. Several important cost drivers like bioassay schedule length, total pin count, congestion-free wiring, total wire length, and total layer count together measure the efficiency of the DMFBs. Besides, existing design gaps among the sub-tasks of the fluid and chip level make the design process expensive delaying the time-to-market and increasing the overall cost. In this context, removal of design cycles among the sub-tasks is a prior need to obtain a low-cost and efficient platform. Hence, this paper aims to propose a fluid-chip co-design methodology in dealing with the consideration of the fluid-chip cost drivers, while reducing the design cycles in between. A simulation study considering a number of benchmarks has been presented to observe the performance.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"548-564"},"PeriodicalIF":0.0,"publicationDate":"2018-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2874248","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIRIUS: Enabling Progressive Data Exploration for Extreme-Scale Scientific Data","authors":"Zhenbo Qiao;Tao Lu;Huizhang Luo;Qing Liu;Scott Klasky;Norbert Podhorszki;Jinzhen Wang","doi":"10.1109/TMSCS.2018.2886851","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2886851","url":null,"abstract":"Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops SIRIUS, a progressive JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and data storage, taking the hardware characteristics of each storage tier into considerations. With reasonably low overhead, our approach refactors simulation data, using either topological or uniform decimation, into a much smaller, reduced-accuracy base dataset, and a series of deltas that is used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, SIRIUS provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on-the-fly. This paper further develops algorithms to preserve statistics for data decimation, a common requirement for reducing data. We assess the impact of SIRIUS on unstructured triangular meshes, a pervasive data model used in scientific simulations. 
In particular, we evaluate two realistic use cases: the blob detection in fusion and high-pressure area extraction in computational fluid dynamics.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"900-913"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2886851","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"2018 Index IEEE Transactions on Multi-Scale Computing Systems Vol. 4","authors":"","doi":"10.1109/TMSCS.2019.2902963","DOIUrl":"https://doi.org/10.1109/TMSCS.2019.2902963","url":null,"abstract":"Presents the 2018 subject/author index for this publication.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2019.2902963","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}