{"title":"FPGA-based stereo matching for crop height measurement using monocular camera","authors":"Iman Firmansyah, Yoshiki Yamaguchi, Tsutomu Maruyama, Yuta Matsuura, Zhang Heming, Shin Kawai, Hajime Nobuhara","doi":"10.1016/j.micpro.2024.105063","DOIUrl":"10.1016/j.micpro.2024.105063","url":null,"abstract":"<div><p>We have proposed a hardware-accelerated drone that analyzes the condition of farmland on site; as a first step, we report that the proposed system can measure crop height with high accuracy using a monocular camera. The three-dimensional farmland model is generated using stereo matching, where a drone with a monocular camera can extend the parallax distance to the length between the two positions at which ground images are taken. This means that our approach can improve the accuracy of the reconstructed 3D farmland. In addition, toward real-time computation and low power consumption, the proposed hardware design accelerates image processing efficiently. To achieve this, we propose a strategy that combines semi-global matching (SGM) with a single path direction and the sum of absolute differences (SAD) with a reduced disparity search length. Specifically, SGM is employed to smooth the disparity map before the consistency check, with the scan performed in one direction, from left to right, to reduce the computation time. The experimental results show that the computation time on a Xilinx Zynq ZCU102 FPGA is 0.77 s for stereo data set images with a resolution of 1536 × 1024 pixels. To meet real-time requirements and reduce FPGA resource usage toward lower power consumption, the experiment examines reducing the disparity search length for the SAD computation. In our experiment, the execution time is less than 40 milliseconds, and the circuit size is around 9,500 LUTs, which is equivalent to a small-size FPGA.
Finally, we estimated object heights: 0.43 m for an object with a physical height of 0.45 m, and 0.63 m for an object with a physical height of 0.65 m.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"108","pages":"Article 105063"},"PeriodicalIF":2.6,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141053941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpSAVE: Eviction Based Scheme for Efficient Optical Network-on-Chip","authors":"Uzmat Ul Nisa, Janibul Bashir","doi":"10.1016/j.micpro.2024.105061","DOIUrl":"10.1016/j.micpro.2024.105061","url":null,"abstract":"<div><p>For on-chip networks, nanophotonics has been considered a strong alternative owing to its high speed (due to low latency) and high bandwidth (due to wavelength division multiplexing). However, the major hurdle in the adoption of nanophotonic-based on-chip networks is their high static power consumption. Various proposals in the literature try to reduce the static power consumption either by modulating the laser or by allowing the on-chip stations to share the photonic channels. In this paper, we propose <em>OpSAVE</em>, an optical NoC that combines the above two strategies to effectively reduce static power consumption. It proposes a superior prediction mechanism based on the eviction details from the private caches. It explains how shared channels can be used to dynamically balance the load and at the same time handle mispredictions. It allows the optical stations to share both the power and the available bandwidth to increase their utilization. Moreover, <em>OpSAVE</em> uses a double pumping strategy to improve the system performance. We compared our scheme with the state-of-the-art proposals in this domain, and the results show that our scheme consumes 4.4X less optical power and at the same time improves the performance by nearly 28%.
The evaluation uses multicore benchmarks from the Splash and Parsec benchmark suites.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"108","pages":"Article 105061"},"PeriodicalIF":2.6,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141052214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low latency FPGA implementation of NTT for Kyber","authors":"Mohamed Saoudi, Akram Kermiche, Omar Hocine Benhaddad, Nadir Guetmi, Boufeldja Allailou","doi":"10.1016/j.micpro.2024.105059","DOIUrl":"10.1016/j.micpro.2024.105059","url":null,"abstract":"<div><p>This paper presents an FPGA implementation of the Number Theoretic Transform (NTT) for the Kyber Post-Quantum Cryptographic (PQC) standard. NTT is the slowest process within Kyber; thus, considerable effort has been devoted to enhancing its computational efficiency. Leveraging parallelism and dedicated multipliers, our design achieves state-of-the-art latency, performing NTT/INTT in just 0.4/<span><math><mrow><mn>0</mn><mo>.</mo><mn>5</mn><mspace></mspace><mi>μ</mi><mi>s</mi></mrow></math></span>, surpassing existing designs by at least 3.75/3 times. The proposed design is implemented on the cost-effective Artix-7 FPGA.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"107","pages":"Article 105059"},"PeriodicalIF":2.6,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140792938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ExTern: Boosting RISC-V core performance using ternary encoding","authors":"Farhad EbrahimiAzandaryani, Dietmar Fey","doi":"10.1016/j.micpro.2024.105058","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105058","url":null,"abstract":"<div><p>This paper presents an effective <span><math><mi>μ</mi></math></span>-architectural design method, called ExTern, to enhance the performance of a RISC-V processor experiencing computation bottlenecks. ExTern integrates the Canonical Signed Digit (CSD) representation, a ternary number system enabling carry/borrow-free addition/subtraction in constant time <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mn>1</mn><mo>)</mo></mrow></mrow></math></span>, into the RISC-V processor, particularly its execution stage. Furthermore, it adopts an extended six-stage pipeline architecture to maximize the benefits of the employed encoding, leading to further improvement in overall execution time and throughput. Despite the presence of optimized circuits, such as fast carry chain (CARRY4) modules, for binary encoding on FPGAs, the customized processor applying ExTern, RISC-VT, shows a remarkable improvement in computation performance.
Experimental results demonstrate a 34.3% (12.2%) improvement in working frequency, leading to a 31% (11.5%) lower execution time and a 32% (12%) increase in throughput compared to a state-of-the-art open-source five(six)-stage RISC-V processor.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"107","pages":"Article 105058"},"PeriodicalIF":2.6,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S014193312400053X/pdfft?md5=5219c364add625230da3e174054a963d&pid=1-s2.0-S014193312400053X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140620839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-driven modeling of reconfigurable multi-accelerator systems under dynamic workloads","authors":"Juan Encinas, Alfonso Rodríguez, Andrés Otero, Eduardo de la Torre","doi":"10.1016/j.micpro.2024.105050","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105050","url":null,"abstract":"<div><p>Reconfigurable multi-accelerator systems used as computing offloading platforms in edge-cloud continuum scenarios usually have to deal with highly dynamic workloads and operating conditions. In order to properly take advantage of their parallel processing capabilities and increase execution performance for a given workload, these systems need to continuously adapt their configuration (i.e., number and type of accelerators) at run time. When working at the edge, additional requirements such as energy efficiency must also be met. In this paper, Machine Learning techniques are applied to extract predictive models of the execution of different combinations of hardware accelerators on a reconfigurable multi-accelerator platform, aiming at satisfying the previously mentioned continuous optimization needs. One of the key benefits of the proposed approach is that its data-driven models can transparently estimate the impact of the complex interactions between hardware accelerators due to run-time resource contention among them and with the rest of the system, as opposed to traditional modeling approaches that cannot include that information in an easy and scalable way (e.g., analytical models). The proposed models are complemented with a complete infrastructure to generate, execute and monitor dynamic workloads in FPGA-based systems. This infrastructure has been used to (i) quantitatively analyze resource contention in reconfigurable multi-accelerator systems and (ii) produce the training and evaluation datasets for the ML-based models using annotated power consumption and execution performance traces.
Experimental results obtained with a reconfigurable multi-accelerator platform based on the ARTICo<sup>3</sup> framework running the MachSuite benchmarks show that the proposed modeling approach is highly effective, with a relative prediction error of less than 5% on average for both power consumption and execution performance. Results also show that the ML-based models achieve high accuracy levels when predicting the impact of resource contention and accelerator interaction on both metrics, with a mean relative prediction error of less than 0.6% and a standard deviation below 4.7% for the worst case.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"107","pages":"Article 105050"},"PeriodicalIF":2.6,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0141933124000450/pdfft?md5=a52d32f5fafee4bda56df513540d6eb8&pid=1-s2.0-S0141933124000450-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140545665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware accelerated Active Noise Cancellation system using Haar wavelets","authors":"P. Santos , E. Mendes , J. Carvalho , F. Alves , J. Azevedo , J. Cabral","doi":"10.1016/j.micpro.2024.105047","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105047","url":null,"abstract":"<div><p>Active Noise Cancellation (<em>ANC</em>) systems are widely used to mitigate unwanted noises in several applications, such as automotive environments and high-end headsets. Multi-Channel (<em>MC</em>) <em>ANC</em> systems have shown promise in creating improved silent zones. Typically, these systems are implemented on <em>FPGA</em> platforms due to the systolic nature and granularity of optimization of these devices. This article describes the design, implementation, and evaluation of a wavelet-based <em>MC ANC</em> Filtered-x Normalized Least Mean Square (<em>FxNLMS</em>) on an <em>FPGA</em> platform.</p><p>The use of wavelet transform enables the decomposition of complex noise signals into spectrally more compact signals (i.e., easier to process). In this work, for each decomposed signal, an independent <em>NLMS</em> is applied. The system implements 64 parallel <em>NLMS</em> with 1000 coefficients. Additionally, the static <em>FIR</em> filters employed for secondary and tertiary path estimations are of the 2047th order. The system adopts an integer arithmetic architecture and operates at a sampling rate of 47.97 kHz. To assess the performance of the wavelet-based approach, benchmark tests were conducted by comparing it against a similar implementation without the wavelet transform. The evaluation was performed using noise reduction (<em>NR</em>) tests with spectrally rich (20 Hz to 10 kHz) and high dynamic range noises. 
The experimental setup involved two error microphones and two secondary sources.</p><p>The results show that the wavelet-based version has overall better performance than the traditional implementation, particularly in the higher frequency band of the spectrum (1 kHz to 8 kHz). For instance, in the case of city ambient noise (a realistic noise with high dynamic range), the relative <em>NR</em> achieved was 8.23 dB.</p><p>To the authors’ knowledge, this is the first implementation and field test of a wavelet-based <em>MC ANC</em> on an <em>FPGA</em> platform. Moreover, the obtained results show that the novel approach is better at reducing complex noises than the traditional implementation without wavelets.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"107","pages":"Article 105047"},"PeriodicalIF":2.6,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0141933124000425/pdfft?md5=694a4b8ef90eac68e2e659134a17a6f8&pid=1-s2.0-S0141933124000425-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140539189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASIC design of power and area efficient programmable FIR filter using optimized Urdhva-Tiryagbhyam Multiplier for impedance cardiography","authors":"Sudhanshu Janwadkar, Rasika Dhavse","doi":"10.1016/j.micpro.2024.105048","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105048","url":null,"abstract":"<div><p>Impedance cardiography (ICG) is a rapidly growing non-invasive cardiac health monitoring approach. Synchronous detection of ICG requires an FIR filter to remove the high-frequency carrier signal. Low power consumption and compact area are critical considerations in the design of portable biomedical systems. This paper proposes a novel product quantization-based optimization strategy for the Urdhva Tiryagbhyam Sutra-based multiplier architecture. This paper presents an ASIC design of a low-power and low-area 64th-order programmable FIR filter architecture using the optimized Urdhva Tiryagbhyam Multiplier. The programmable architecture empowers medical practitioners to select the carrier frequency at which the ICG analysis will be performed. The elimination of redundant multipliers from the design based on the filter coefficients is demonstrated. The programmable Vedic FIR filter architecture (described in VHDL) is implemented on the Basys-3 FPGA board for rapid prototyping. The RTL-to-GDSII flow has been completed using Cadence digital design and sign-off tools for the SCL-180 nm technology. 
The results indicate that the proposed filter architecture occupies 41.33% less area and consumes 42.16% less power than the contemporary designs described in the literature.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"107","pages":"Article 105048"},"PeriodicalIF":2.6,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140641142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IoT-Edge technology based cloud optimization using artificial neural networks","authors":"Amjad Rehman, Tanzila Saba, Khalid Haseeb, Teg Alam, Gwanggil Jeon","doi":"10.1016/j.micpro.2024.105049","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105049","url":null,"abstract":"<div><p>In recent decades, artificial intelligence techniques have been adopted for many real-time applications. The Internet of Things (IoT) network comprises many sensing devices and physical objects for information gathering and further transmission. The collected data must not only be sent to the receiving nodes but also arrive promptly. Many solutions have been proposed for IoT-based embedded systems using edge computing, but they are not fully protected against unidentified communication threats. In such circumstances, the trust ratio of these systems decreases and communication performance is compromised. In this research, we describe an optimization model based on IoT-edge technology that incorporates cloud computational intelligence. Furthermore, edge nodes employ artificial intelligence algorithms to select trustworthy forwarded data and lengthen the connection time of smart devices. Firstly, the edge devices extract useful information from the IoT nodes and, accordingly, provide a decision module based on optimization computing. Secondly, utilizing cryptographic approaches, edge technology secures the multiple layers of the IoT system and ensures data privacy and integrity.
Finally, the proposed model is tested, and its performance is verified against other related studies in terms of energy consumption, packet delivery ratio, and data delay.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"106","pages":"Article 105049"},"PeriodicalIF":2.6,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140351221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hand-held GPU accelerated device for multiclass classification of X-ray images using CNN model","authors":"K.G. Satheeshkumar, V. Arunachalam, S. Deepika","doi":"10.1016/j.micpro.2024.105046","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105046","url":null,"abstract":"<div><p>Chest X-ray (CXR) images are the primary investigation aid for many lung diseases and their follow-ups. For the diagnosis of SARS-CoV-2, the RT–PCR test and chest Computed Tomography (CT) are commonly used, but both suffer from false negatives when ruling out the infection. There is therefore a pressing need to develop a system that combines Artificial Intelligence (AI) with CXR imaging to detect COVID-19 patients and limit the spread of the disease. Here, a robust and efficient handheld device is proposed. It uses the computational power of the Graphics Processing Unit (GPU) and pre-trained deep learning models for analyzing the CXR images. A ResNet-50 CNN model is deployed on an NVIDIA Jetson Nano GPU module for the real-time classification of COVID-19, Tuberculosis, and Normal cases using CXR images. The device can classify CXR images from a portable X-ray machine in real time into one of the above categories. For extensive training, a database of 680 COVID-19, 1230 Tuberculosis, and 1050 normal CXR images was assembled by combining several global databases such as Kaggle, SIRM, RSNA, and Radiopaedia. The classification accuracy, precision, and loss rate were 0.9879, 0.9758, and 0.0196, respectively, and the model is expected to improve with larger data sets.
The highly accurate and high-performance GPU device plays a significant role in COVID-19 diagnosis using chest X-rays, which could be beneficial for triage in the health system and for combating the outbreak of COVID-19.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"106","pages":"Article 105046"},"PeriodicalIF":2.6,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140537103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CNC: A lightweight architecture for Binary Ring-LWE based PQC","authors":"Shaik Ahmadunnisa, Sudha Ellison Mathe","doi":"10.1016/j.micpro.2024.105044","DOIUrl":"https://doi.org/10.1016/j.micpro.2024.105044","url":null,"abstract":"<div><p>In lattice-based cryptography, Ring Learning with Errors (RLWE) is a computationally hard cryptographic problem, comprising three basic mechanisms, i.e., key generation, encryption, and decryption. Binary Ring Learning with Errors (BRLWE), a new variant of RLWE, has been proposed recently to reduce the key size and computational complexity compared to previous RLWE-based schemes. Based on this BRLWE scheme, efficient hardware architectures have been obtained in recent works for lightweight applications. The key operation involved in this scheme is <span><math><mrow><mi>A</mi><mi>B</mi><mo>+</mo><mi>C</mi></mrow></math></span>, where <span><math><mi>A</mi></math></span> and <span><math><mi>C</mi></math></span> are integer polynomials and <span><math><mi>B</mi></math></span> is a binary polynomial. This paper proposes an efficient hardware architecture for a BRLWE-based scheme targeted at lightweight applications. The architecture computes the arithmetic operation <span><math><mrow><mi>A</mi><mi>B</mi><mo>+</mo><mi>C</mi></mrow></math></span>, which includes polynomial multiplication and addition over the polynomial ring <span><math><mrow><msub><mrow><mi>Z</mi></mrow><mrow><mi>q</mi></mrow></msub><mo>/</mo><mrow><mo>(</mo><msup><mrow><mi>x</mi></mrow><mrow><mi>n</mi></mrow></msup><mo>+</mo><mn>1</mn><mo>)</mo></mrow></mrow></math></span>. The proposed architecture is applied under two conditions: fixed and variable values of <span><math><mi>q</mi></math></span>.
Experimental results show that the proposed architecture has a 50% lower Area-Delay Product (ADP) and a 20% lower Power-Delay Product (PDP) than the recently reported work for <span><math><mrow><mi>n</mi><mo>=</mo><mn>256</mn></mrow></math></span>.</p></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"106","pages":"Article 105044"},"PeriodicalIF":2.6,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140309393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}