N. Amirafshar, A. S. Baroughi, H. Shahhoseini, N. Taherinejad
{"title":"An Approximate Carry Disregard Multiplier with Improved Mean Relative Error Distance and Probability of Correctness","authors":"N. Amirafshar, A. S. Baroughi, H. Shahhoseini, N. Taherinejad","doi":"10.1109/DSD57027.2022.00016","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00016","url":null,"abstract":"Nowadays, a wide range of applications can tolerate certain computational errors. Hence, approximate computing has become one of the most attractive topics in computer architecture. Reducing accuracy in computations in a premeditated and appropriate manner reduces architectural complexity and, as a result, can significantly improve performance, power consumption, and area. This paper proposes a novel approximate multiplier design. The proposed design has been implemented using 45 nm CMOS technology and has been extensively evaluated. Compared to existing approximate architectures, the proposed approximate multiplier has higher accuracy. It also improves critical path delay, power consumption, and area by up to 47.54%, 75.24%, and 92.49%, respectively. Compared to precise multipliers, our evaluations show that the critical path delay, power consumption, and area are improved by 39%, 18%, and 6%, respectively.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125120892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anna Schröder, M. Maktabi, R. Thieme, B. Jansen-Winkeln, I. Gockel, C. Chalopin
{"title":"Evaluation of artificial neural networks for the detection of esophagus tumor cells in microscopic hyperspectral images","authors":"Anna Schröder, M. Maktabi, R. Thieme, B. Jansen-Winkeln, I. Gockel, C. Chalopin","doi":"10.1109/DSD57027.2022.00116","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00116","url":null,"abstract":"Microscopic analysis of histological slides of cancer tissue samples is standardly performed under white light microscopy. Researchers have demonstrated the potential of artificial intelligence (AI) methods for the automatic identification of tumor cells. Hyperspectral imaging (HSI) combined with AI approaches can improve the accuracy, reliability, and speed of the analysis. In this work, an HSI camera was coupled with a standard microscope to acquire microscopic hyperspectral (HS) images of stained histological slides of esophagus cancer tissue of 95 patients. The HS images were analyzed with deep learning algorithms to discriminate healthy cells (squamous epithelium) and tumors (stroma tumor and esophagus adenocarcinoma, EAC). Five models were considered: a 2D CNN, a 2D CNN preserving the spatial relationship between spectral layers, a 3D CNN, a pre-trained 3D CNN and a recurrent neural network (RNN). They were evaluated using a leave-one-patient-out cross-validation. The two predicted classes were visualized with false colors. The RNN obtained the highest quantitative results with an accuracy of 0.791, an AUC of 0.79 and a computing time of 7.57 s per 10,000 patches. The best visual result was obtained on two selected HS images with the 2D CNN model. The performance of the automatic classification was higher on tissue that had not previously been treated with neoadjuvant therapy. The combination of HSI with deep learning methods is promising for the automatic analysis of histological slides for cancer diagnosis.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116645594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Resilient QDI Pipeline Implementations","authors":"Zaheer Tabassam, A. Steininger","doi":"10.1109/DSD57027.2022.00093","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00093","url":null,"abstract":"QDI circuits are robust towards timing issues, but this elasticity makes them vulnerable in value-domain fault scenarios because data-accepting windows are flexibly defined by the handshakes, and during these windows any data transition gets latched, even those originating from single event transients. As a solution, locking the data-accepting windows after the first transition contributes to robustness, but still needs consideration. We examine WCHB variants called Interlocking-WCHB and Input/Output-Interlocking-WCHB in this respect. To highlight the relevant error triggering conditions, we chose two target circuits to investigate the behavior in detail: FIFO and pipelined multiplier. Based on the experimental results we investigate the observed errors to understand the main cause of their generation and propagation. We highlight the problematic scenarios and propose modifications in buffer styles that resolve most of these while minimizing the area overhead to 50%.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122113224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Viktor Herrmann, Justin Knapheide, Fritjof Steinert, B. Stabernack
{"title":"A YOLO v3-tiny FPGA Architecture using a Reconfigurable Hardware Accelerator for Real-time Region of Interest Detection","authors":"Viktor Herrmann, Justin Knapheide, Fritjof Steinert, B. Stabernack","doi":"10.1109/DSD57027.2022.00021","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00021","url":null,"abstract":"With the recent advances in the fields of machine learning, neural networks and deep-learning algorithms have become a prevalent subject of computer vision. Especially for tasks like object classification and detection, Convolutional Neuronal Networks (CNNs) have surpassed the previous traditional approaches. In addition to these applications, CNNs can recently also be found in other applications. For example, the parametrization of video encoding algorithms, as used in our example, is quite a new application domain. The high recognition rate of CNNs makes them particularly suitable for finding Regions of Interest (ROIs) in video sequences, which can be used for adapting the data rate of the compressed video stream accordingly. On the downside, these CNNs require an immense amount of processing power and memory bandwidth. Object detection networks such as You Only Look Once (YOLO) try to balance processing speed and accuracy but still rely on power-hungry GPUs to meet real-time requirements. Specialized hardware like Field Programmable Gate Array (FPGA) implementations have proved to strongly reduce this problem while still providing sufficient computational power. In this paper we propose a flexible architecture for object detection hardware acceleration based on the YOLO v3-tiny model. The reconfigurable accelerator comprises a high throughput convolution engine, custom blocks for all additional CNN operations and a programmable control unit to manage on-chip execution. The model can be deployed without significant changes based on 32-bit floating point values and without further methods that would reduce the model accuracy. Experimental results show a high capability of the design to accelerate the object detection task with a processing time of 27.5 ms per frame. It is thus real-time-capable for 30 FPS applications at a frequency of 200 MHz.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114578570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-Software Codesign of a CNN Accelerator","authors":"Changjae Yi, Donghyun Kang, S. Ha","doi":"10.1109/DSD57027.2022.00054","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00054","url":null,"abstract":"The explosive growth of deep learning applications based on convolutional neural networks (CNNs) in embedded systems is spurring the development of a hardware CNN accelerator, called a neural processing unit (NPU). In this work, we present how the hardware-software codesign methodology could be applied to the design of a novel adder-type NPU. After devising a baseline datapath that enables fully-pipelined execution of layers, we define a high-level behavior model based on which a high-level compiler and a virtual prototyping system are built concurrently. Since it is easy to change the microarchitecture of an NPU by modifying the simulation models of the hardware modules, we could explore the design space of NPU microarchitecture easily. In addition, we could evaluate the effect of hardware extensions to support various types of non-convolutional operations that recent CNN models use widely. After the final datapath is determined, we design the control structure and low-level compiler and implement the NPU prototype. Implementation results on an FPGA prototype show the viability of the proposed methodology and its outcome.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130540886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coherency Traffic Reduction in Manycore Systems","authors":"Erdem Derebaşoğlu, I. Kadayif, O. Ozturk","doi":"10.1109/DSD57027.2022.00043","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00043","url":null,"abstract":"With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although snooping-based protocols are appropriate solutions for small-scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large-scale manycores require directory-based solutions where a hardware structure called a directory holds the information. This directory keeps track of all memory blocks and which caches store a copy of these blocks. The directory sends messages only to caches that store relevant blocks and also coordinates simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this paper, we present software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some data accesses, such as those to read-only (private) data, do not disrupt cache coherency, but they still produce coherency messages among cores. However, if data is accessed by at least two cores and at least one of the accesses is a write operation, it is called shared data and requires cache coherency. In our proposed system, private data and shared data are determined at compile time, and the cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use these analyses to decide whether the cache coherency protocol will be applied at runtime. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic by up to 13%.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131944345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Jafari, M. Mayahinia, Soyed Tuhin Ahmed, Christopher Münch, M. Tahoori
{"title":"MVSTT: A Multi-Value Computation-in-Memory based on Spin-Transfer Torque Memories","authors":"A. Jafari, M. Mayahinia, Soyed Tuhin Ahmed, Christopher Münch, M. Tahoori","doi":"10.1109/DSD57027.2022.00052","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00052","url":null,"abstract":"Analog Computation-in-Memory (CiM) with emerging non-volatile memories leads to significant gains in performance and energy efficiency. Spin-Transfer Torque Magnetic Memory (STT-MRAM) is one of the promising technologies for CiM architectures. Although STT-MRAM has various benefits, it does not have the potential to be used directly in analog multi-value CiM operations due to its limited levels of cell resistance states. In this paper, we propose a novel flexible multi-value design for STT-MRAM (MVSTT) with the potential to be used for multi-value CiM. In the multi-value CiM, we are able to have various $2^s$ resistive state combinations from $s$ selected MTJs, which is not possible in the normal STT-MRAM CiM. The size of the MVSTT can be adjusted at run-time depending on the application's requirements. The benefits of the proposed scheme are quantified in representative applications such as multi-value matrix multiplications, which are the basic computation of neural network applications. For the multi-value matrix multiplication, the energy and delay gains are up to 9.7× and 13.3×, respectively, compared to non-CiM matrix-vector multiplication. Also, for the neural network, the proposed design allows up to a 32× reduction in the STT-MRAM cells per crossbar to achieve a similar inference accuracy as the binarized neural network.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123698341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SecDec: Secure Decode Stage thanks to masking of instructions with the generated signals","authors":"Gaëtan Leplus, O. Savry, L. Bossuet","doi":"10.1109/DSD57027.2022.00080","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00080","url":null,"abstract":"Physical attacks are becoming a major security issue in IoT applications. One of the main vectors of attacks on processors is the corruption of the execution flow. Fault injections allow the modification of instructions, in particular jumps and branches. The proposed approach involves making a RISC-V processor's instruction path more resistant by introducing dependencies between succeeding instructions. The signals extracted from the instruction decoding stage are used to unmask the following instruction, all instructions having been previously masked during compilation with the expected mask. We show that this solution has a very low hardware overhead of 3.25% and power consumption overhead of 4.33%, as well as a software overhead of 1.61% in code size and 1.12% in execution time. An instruction corruption or a jump will be detected on average in fewer than 2 cycles after the fault, while disassembly from side-channel leakage becomes more difficult.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115172869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Biagioni, P. Cretaro, O. Frezza, F. L. Cicero, A. Lonardo, Michele Martinelli, P. Paolucci, E. Pastorelli, F. Simula, Matteo Turisini, P. Vicini, R. Ammendola, Pascale Bernier-Bruna, Claire Chen, Said Derradji, Stephane Guez, Pierre-Axel Lagadec, G. Pichon, Etienne Walter, G. D. Gassowski, Matthieu Hautreaux, Stephane Mathieu, G. Moreau, Marc Pérache, Hugo Taboada, T. Hoefler, Timo Schneider, Matteo Barnaba, G. Brandino, F. D. Giorgi, Matteo Poggi, I. Mavroidis, Y. Papaefstathiou, N. Tampouratzis, Benjamin Kalisch, U. Krackhardt, Mondrian Nuessle, Pantelis Xirouchakis, Vangelis Mageiropoulos, Michalis Gianioudis, Harisis Loukas, Aggelos D. Ioannou, Nikos Kallimanis, N. Chrysos, M. Katevenis, Wolfang Frings, Dominik Gottwald, Felime Guimaraes, M. Holicki, Volker Marx, Ya N Muller, Carsten Clauss, H. Falter, Xu Huang, Jennifer Lopez Barillao, Thomas Moschny, Simon Pickartz, F. J. Alfaro, J. Escudero-Sahuquillo, P. García, F. Quiles, J. L. Sánchez, Adrián Castelló, José Duro, M. E. Gómez, E. S. Quintana‐O
{"title":"RED-SEA: Network Solution for Exascale Architectures","authors":"A. Biagioni, P. Cretaro, O. Frezza, F. L. Cicero, A. Lonardo, Michele Martinelli, P. Paolucci, E. Pastorelli, F. Simula, Matteo Turisini, P. Vicini, R. Ammendola, Pascale Bernier-Bruna, Claire Chen, Said Derradji, Stephane Guez, Pierre-Axel Lagadec, G. Pichon, Etienne Walter, G. D. Gassowski, Matthieu Hautreaux, Stephane Mathieu, G. Moreau, Marc Pérache, Hugo Taboada, T. Hoefler, Timo Schneider, Matteo Barnaba, G. Brandino, F. D. Giorgi, Matteo Poggi, I. Mavroidis, Y. Papaefstathiou, N. Tampouratzis, Benjamin Kalisch, U. Krackhardt, Mondrian Nuessle, Pantelis Xirouchakis, Vangelis Mageiropoulos, Michalis Gianioudis, Harisis Loukas, Aggelos D. Ioannou, Nikos Kallimanis, N. Chrysos, M. Katevenis, Wolfang Frings, Dominik Gottwald, Felime Guimaraes, M. Holicki, Volker Marx, Ya N Muller, Carsten Clauss, H. Falter, Xu Huang, Jennifer Lopez Barillao, Thomas Moschny, Simon Pickartz, F. J. Alfaro, J. Escudero-Sahuquillo, P. García, F. Quiles, J. L. Sánchez, Adrián Castelló, José Duro, M. E. Gómez, E. S. Quintana‐O","doi":"10.1109/DSD57027.2022.00100","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00100","url":null,"abstract":"In order to enable Exascale computing, next generation interconnection networks must scale to hundreds of thousands of nodes, and must provide features to also allow the HPC, HPDA, and AI applications to reach Exascale, while benefiting from new hardware and software trends. 
RED-SEA will pave the way to the next generation of European Exascale interconnects, including the next generation of BXI, as follows: (i) specify the new architecture using hardware-software co-design and a set of applications representative of the new terrain of converging HPC, HPDA, and AI; (ii) test, evaluate, and/or implement the new architectural features at multiple levels, according to the nature of each of them, ranging from mathematical analysis and modeling, to simulation, or to emulation or implementation on FPGA testbeds; (iii) enable seamless communication within and between resource clusters, and therefore development of a high-performance low-latency gateway, bridging seamlessly with Ethernet; (iv) add efficient network resource management, thus improving congestion resiliency, virtualization, adaptive routing, and collective operations; (v) open the interconnect to new kinds of applications and hardware, with enhancements for end-to-end network services - from programming models to reliability, security, low latency, and new processors; (vi) leverage open standards and compatible APIs to develop innovative reusable libraries and Fabrics management solutions.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123079713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generation of Verified Programs for In-Memory Computing","authors":"Saman Froehlich, R. Drechsler","doi":"10.1109/DSD57027.2022.00114","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00114","url":null,"abstract":"In order to overcome the von Neumann bottleneck, the paradigm of in-memory computing has recently emerged. Here, instead of transferring data from the memory to the CPU for computation, the computation is directly performed within the memory. ReRAM, a resistance-based storage device, is a promising technology for this paradigm. Based on ReRAM, the PLiM computer architecture and LiM-HDL, an HDL for specifying PLiM programs, have emerged. In this paper, we first present a novel levelization algorithm for LiM-HDL. Based on this novel algorithm, large circuits can be compiled to PLiM programs. Then, we present a verification scheme for these programs. This scheme is separated into two steps: (1) a proof of purity and (2) a proof of equivalence. Finally, in the experiments, we first apply our levelization algorithm to a well-known benchmark set, where we show that we can generate PLiM programs for large benchmarks for which existing levelization algorithms fail. Then, we apply our proposed verification scheme to these PLiM programs.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122654848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}