{"title":"Ferroelectric Transistor-Based Synaptic Crossbar Arrays: The Impact of Ferroelectric Thickness and Device-Circuit Interactions","authors":"Chunguang Wang;Sumeet Kumar Gupta","doi":"10.1109/JXCDC.2024.3502053","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3502053","url":null,"abstract":"Ferroelectric transistors (FeFETs)-based crossbar arrays have shown immense promise for computing-in-memory (CiM) architectures targeted for neural accelerator designs. Offering CMOS compatibility, nonvolatility, compact bit cell, and CiM-amenable features, such as multilevel storage and voltage-driven conductance tuning, FeFETs are among the foremost candidates for synaptic devices. However, device and circuit nonideal attributes in FeFETs-based crossbar arrays cause the output currents to deviate from the expected value, which can induce error in CiM of matrix-vector multiplications (MVMs). In this article, we analyze the impact of ferroelectric thickness ($T_{\\text{FE}}$) and cross-layer interactions in FeFETs-based synaptic crossbar arrays accounting for device-circuit nonidealities. First, based on a physics-based model of multidomain FeFETs calibrated to experiments, we analyze the impact of $T_{\\text{FE}}$ on the characteristics of FeFETs as synaptic devices, highlighting the connections between the multidomain physics and the synaptic attributes. Based on this analysis, we investigate the impact of $T_{\\text{FE}}$ in conjunction with other design parameters, such as number of bits stored per device (bit slice), wordline (WL) activation schemes, and FeFETs width on the error probability, area, energy, and latency of CiM at the array level. 
Our results show that FeFETs with $T_{\\text{FE}}$ around 7 nm achieve the highest CiM robustness, while FeFETs with $T_{\\text{FE}}$ around 10 nm offer the lowest CiM energy and latency. While the CiM robustness for bit slice 2 is less than bit slice 1, its robustness can be brought to a target level via additional design techniques, such as partial wordline activation and optimization of FeFETs width.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"144-152"},"PeriodicalIF":2.0,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10756727","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpecPCM: A Low-Power PCM-Based In-Memory Computing Accelerator for Full-Stack Mass Spectrometry Analysis","authors":"Keming Fan;Ashkan Moradifirouzabadi;Xiangjin Wu;Zheyu Li;Flavio Ponzina;Anton Persson;Eric Pop;Tajana Rosing;Mingu Kang","doi":"10.1109/JXCDC.2024.3498837","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3498837","url":null,"abstract":"Mass spectrometry (MS) is essential for proteomics and metabolomics but faces impending challenges in efficiently processing the vast volumes of data. This article introduces SpecPCM, an in-memory computing (IMC) accelerator designed to achieve substantial improvements in energy and delay efficiency for both MS spectral clustering and database (DB) search. SpecPCM employs analog processing with low-voltage swing and utilizes recently introduced phase change memory (PCM) devices based on superlattice materials, optimized for low-voltage and low-power programming. Our approach integrates contributions across multiple levels: application, algorithm, circuit, device, and instruction sets. We leverage a robust hyperdimensional computing (HD) algorithm with a novel dimension-packing method and develop specialized hardware for the end-to-end MS pipeline to overcome the nonideal behavior of PCM devices. We further optimize multilevel PCM devices for different tasks by using different materials. We also perform a comprehensive design exploration to improve energy and delay efficiency while maintaining accuracy, exploring various combinations of hardware and software parameters controlled by the instruction set architecture (ISA). 
SpecPCM, with up to three bits per cell, achieves speedups of up to $82\\times$ and $143\\times$ for MS clustering and DB search tasks, respectively, along with a four-orders-of-magnitude improvement in energy efficiency compared with state-of-the-art (SoA) CPU/GPU tools.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"161-169"},"PeriodicalIF":2.0,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753646","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142859023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"X-TIME: Accelerating Large Tree Ensembles Inference for Tabular Data With Analog CAMs","authors":"Giacomo Pedretti;John Moon;Pedro Bruel;Sergey Serebryakov;Ron M. Roth;Luca Buonanno;Archit Gajjar;Lei Zhao;Tobias Ziegler;Cong Xu;Martin Foltin;Paolo Faraboschi;Jim Ignowski;Catherine E. Graves","doi":"10.1109/JXCDC.2024.3495634","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3495634","url":null,"abstract":"Structured, or tabular, data are the most common format in data science. While deep learning models have proven formidable in learning from unstructured data such as images or speech, they are less accurate than simpler approaches when learning from tabular data. In contrast, modern tree-based machine learning (ML) models shine in extracting relevant information from structured data. An essential requirement in data science is to reduce model inference latency in cases where, for example, models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware acceleration community has mostly focused on deep neural networks and largely ignored other forms of ML. Previous work has described the use of an analog content addressable memory (CAM) component for efficiently mapping random forests (RFs). In this work, we develop an analog-digital architecture that implements a novel increased precision analog CAM and a programmable chip for inference of state-of-the-art tree-based ML models, such as eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and others. 
Thanks to hardware-aware training, X-TIME reaches state-of-the-art accuracy and $119\\times$ higher throughput at $9740\\times$ lower latency with $>150\\times$ improved energy efficiency compared with a state-of-the-art GPU for models with up to 4096 trees and depth of 8, with a 19-W peak power consumption.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"116-124"},"PeriodicalIF":2.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753423","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximated 2-Bit Adders for Parallel In-Memristor Computing With a Novel Sum-of-Product Architecture","authors":"Christian Simonides;Dominik Gausepohl;Peter M. Hinkel;Fabian Seiler;Nima Taherinejad","doi":"10.1109/JXCDC.2024.3497720","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3497720","url":null,"abstract":"Conventional computing methods struggle with the exponentially increasing demand for computational power, caused by applications including image processing and machine learning (ML). Novel computing paradigms such as in-memory computing (IMC) and approximate computing (AxC) provide promising solutions to this problem. Due to their low energy consumption and inherent ability to store data in a nonvolatile fashion, memristors are an increasingly popular choice in these fields. There is a wide range of logic forms compatible with memristive IMC, each offering different advantages. We present a novel mixed-logic solution that utilizes properties of the sum-of-product (SOP) representation and propose a full-adder circuit that works efficiently in 2-bit units. To further improve the speed, area usage, and energy consumption, we propose two additional approximate (Ax) 2-bit adders that exhibit inherent parallelization capabilities. We apply the proposed adders in selected image processing applications, where our Ax approach reduces the energy consumption by 31%–40% and improves the speed by 50%. To demonstrate the potential gains of our approximations in more complex applications, we applied them in ML. 
Our experiments indicate that with up to $6/16$ Ax adders, there is no accuracy degradation when applied in a convolutional neural network (CNN) that is evaluated on MNIST. Our approach can save up to 125.6 mJ of energy and 505 million steps compared to our exact approach.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"135-143"},"PeriodicalIF":2.0,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10752571","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Efficiency Charge-Domain Compute-in-Memory 1F1C Macro Using 2-bit FeFET Cells for DNN Processing","authors":"Nellie Laleni;Franz Müller;Gonzalo Cuñarro;Thomas Kämpfe;Taekwang Jang","doi":"10.1109/JXCDC.2024.3495612","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3495612","url":null,"abstract":"This article introduces a 1FeFET-1Capacitance (1F1C) macro based on a 2-bit ferroelectric field-effect transistor (FeFET) cell operating in the charge domain, marking a significant advancement in nonvolatile memory (NVM) and compute-in-memory (CIM). Traditionally, NVMs, such as FeFETs or resistive RAMs (RRAMs), have operated in a single-bit fashion, limiting their computational density and throughput. In contrast, the proposed 2-bit FeFET cell enables higher storage density and improves the computational efficiency in CIM architectures. The macro achieves 111.6 TOPS/W, highlighting its energy efficiency, and demonstrates robust performance on the CIFAR-10 dataset, achieving 89% accuracy with a VGG-8 neural network. These findings underscore the potential of charge-domain, multilevel NVM cells in pushing the boundaries of artificial intelligence (AI) acceleration and energy-efficient computing.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"153-160"},"PeriodicalIF":2.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10750057","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System-Technology Co-Optimization for Dense Edge Architectures Using 3-D Integration and Nonvolatile Memory","authors":"Leandro M. Giacomini Rocha;Mohamed Naeim;Guilherme Paim;Moritz Brunion;Priya Venugopal;Dragomir Milojevic;James Myers;Mustafa Badaroglu;Marian Verhelst;Julien Ryckaert;Dwaipayan Biswas","doi":"10.1109/JXCDC.2024.3496118","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3496118","url":null,"abstract":"High-performance edge artificial intelligence (Edge-AI) inference applications aim for high energy efficiency, memory density, and small form factor, requiring a design-space exploration across the whole stack: workloads, architecture, mapping, and co-optimization with emerging technology. In this article, we present a system-technology co-optimization (STCO) framework that interfaces with workload-driven system scaling challenges and physical design-enabled technology offerings. The framework is built on three engines that provide the physical design characterization, dataflow mapping optimizer, and system efficiency predictor. The framework builds on a systolic array accelerator to provide the design-technology characterization points using advanced imec A10 nanosheet CMOS node along with emerging, high-density voltage-gated spin-orbit torque (VGSOT) magnetic memories (MRAM), combined with memory-on-logic fine-pitch 3-D wafer-to-wafer hybrid bonding. We observe that the 3-D system integration of static random-access memory (SRAM)-based design leads to 9% power savings with 53% footprint reduction at iso-frequency with respect to 2-D implementation for the same memory capacity. Three-dimensional nonvolatile memory (NVM)-VGSOT allows $4\\times$ memory capacity increase with 30% footprint reduction at iso-power compared with 2-D SRAM $1\\times$. 
Our exploration with two diverse workloads, image resolution enhancement (FSRCNN) and eye tracking (EDSNet), shows that more resources allow better workload mapping possibilities, which are able to compensate peak system energy efficiency degradation on high memory capacity cases. We show that a 25% peak efficiency reduction on a $32\\times$ memory capacity can lead to a $7.4\\times$ faster execution with $5.7\\times$ higher effective TOPS/W than the $1\\times$ memory capacity case on the same technology.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"125-134"},"PeriodicalIF":2.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10750212","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design Considerations for Sub-1-V 1T1C FeRAM Memory Circuits","authors":"Mohammad Adnaan;Sou-Chi Chang;Hai Li;Yu-Ching Liao;Ian A. Young;Azad Naeemi","doi":"10.1109/JXCDC.2024.3488578","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3488578","url":null,"abstract":"We present a comprehensive benchmarking framework for one transistor-one capacitor (1T1C) low-voltage ferroelectric random access memory (FeRAM) circuits. We focus on the most promising ferroelectric materials, hafnium zirconium oxide (HZO) and barium titanate (BTO), known for their fast switching speeds and low coercive voltages. We model ferroelectric capacitors using physics-based phase-field models and calibrate the polarization switching speed and hysteresis loop versus experimental data. Ferroelectric memory cells are designed using a 28-nm process design kit (PDK), incorporating peripheral circuitry and interconnect parasitics. We set up the memory array circuit design and analyze its performance by varying the row/column size of the memory array, as well as driver and capacitor sizes. Our results are compared with other emerging memory technologies, particularly magnetic/spintronic memories, in terms of read/write latencies and energy consumption. 
We identify the critical aspects of the ferroelectric memory array performance, such as the effect of plateline driver and bitline capacitances, and provide recommendations to further optimize the performance of such low operating voltage ferroelectric memory circuits.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"107-115"},"PeriodicalIF":2.0,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10738514","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Integration Technologies for Artificial Intelligence Applications","authors":"Madison Manley;Ashita Victor;Hyunggyu Park;Ankit Kaul;Mohanalingam Kathaperumal;Muhannad S. Bakir","doi":"10.1109/JXCDC.2024.3484958","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3484958","url":null,"abstract":"The rapid advancement of artificial intelligence (AI) has been enabled by semiconductor-based electronics. However, the conventional methods of transistor scaling are not enough to meet the exponential demand for computing power driven by AI. This has led to a technological shift toward system-level scaling approaches, such as heterogeneous integration (HI). HI is becoming increasingly implemented in many AI accelerator products due to its potential to enhance overall system performance while also reducing electrical interconnect delays and energy consumption, which are critical for supporting data-intensive AI workloads. In this review, we introduce current and emerging HI technologies and their potential for high-performance systems. We then survey recent industrial and research progress in 3-D HI technologies that enable high bandwidth systems and finally present the emergence of glass core packaging for high-performance AI chip packages.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"89-97"},"PeriodicalIF":2.0,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10731842","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Logic Area With Multitier Standard Cells","authors":"Florian Freye;Christian Lanius;Hossein Hashemi Shadmehri;Diana Göhringer;Tobias Gemmeke","doi":"10.1109/JXCDC.2024.3482464","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3482464","url":null,"abstract":"While the footprint of digital complementary metal-oxide–semiconductor (CMOS) circuits has continued to decrease over the years, physical limitations for further intralayer geometric scaling become apparent. To further increase the logic density, the international roadmap for devices and systems (IRDS) predicts a transition from a single layer of transistors per die to monolithically stacking transistors in multiple tiers starting in 2031. This raises the question of the extent to which these can be exploited in 3-D standard cells to improve logic density. In this work, we investigate the scaling potential of realizing standard cells employing two or three dedicated tiers. For this, specific multitier virtual physical design kits are derived based on the open ASAP7. A typical RISC-V implementation realized in a classic standard cell library is used to identify the subset of the most relevant standard cells. 
In accordance with the virtual physical design kit (PDK), 3-D derivatives of the single-tier standard cells are crafted and evaluated with respect to achievable logic density considering standard synthesis benchmarks and blocks on the architecture level.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"82-88"},"PeriodicalIF":2.0,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10720813","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-/Carbon-Aware Evaluation and Optimization of 3-D IC Architecture With Digital Compute-in-Memory Designs","authors":"Hyung Joon Byun;Udit Gupta;Jae-Sun Seo","doi":"10.1109/JXCDC.2024.3479100","DOIUrl":"https://doi.org/10.1109/JXCDC.2024.3479100","url":null,"abstract":"Several 2-D architectures have been presented, including systolic arrays or compute-in-memory (CIM) arrays for energy-efficient artificial intelligence (AI) inference. To increase the energy efficiency within constrained area, 3-D technologies have been actively investigated, which have the potential to decrease the data path length or increase the activation buffer size, enabling higher energy efficiency. Several works have reported the 3-D architectures using non-CIM designs, but investigations on 3-D architectures with CIM macros have not been well studied in prior works. In this article, we investigate digital CIM (DCIM) macros and various 3-D architectures to find the opportunity of increased energy efficiency compared with 2-D structures. Moreover, we also investigated the carbon footprint of 3-D architectures. We have built in-house simulators calculating energy and area given high-level hardware descriptions and DNN workloads and integrated with carbon estimation tool to analyze the embodied carbon of various hardware designs. We have investigated different types of 3-D DCIM architectures and dataflows, which have shown 42.5% energy savings compared with 2-D systolic arrays on average. 
Also, we have analyzed the tradeoff between performance and carbon footprint and their optimization opportunities.","PeriodicalId":54149,"journal":{"name":"IEEE Journal on Exploratory Solid-State Computational Devices and Circuits","volume":"10 ","pages":"98-106"},"PeriodicalIF":2.0,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10714410","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}