A 22nm 8Mb STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices
Nonvolatile-memory-based computing in memory (nvCIM) [1–6] is ideal for low-power edge-AI devices requiring neural network (NN) parameter storage in the power-off mode, a rapid response to device wake-up, and high energy efficiency for MAC operations (EFMAC). Current analog nvCIMs impose a tradeoff between the signal margin (SM) and the number of accumulations (NACU) per cycle versus EFMAC and computing latency (TCD-MAC). Near-memory computing (NMC), with high precision for inputs (IN), weights (W), and outputs (OUT), and a high NACU, is a trend to improve EFMAC, TCD-MAC, and accuracy. A prior STT-MRAM NMC uses vertical-weight mapping (VWM) to improve the EFMAC; however, further improvement is challenging due to: (1) the large energy consumption in reading repetitious weight data across multiple inputs for a single NN layer; (2) a high bitstream toggling rate (BTR) in digital MAC circuits (DCMAC), which reduces EFMAC; and (3) a limited SM and memory readout latency (TCD-M) for memories with a small R-ratio (e.g., STT-MRAM, see Fig. 33.2.1). In developing an STT-MRAM nvCIM macro, this work moves beyond circuit-level novelty by using system-software-circuit co-design. This work achieves a high EFMAC, a short TCD-M, a high read bandwidth (R-BW), a high IN-W-OUT precision, and a high NACU by using the following novel schemes: (1) a hardware-based weight-feature-aware read (WFAR) to reduce weight accesses and improve EFMAC with a minimal area overhead; (2) toggling-aware weight-tuning (TAWT) to obtain fine-tuned weights (WFT) with a low BTR, which is based on VWM to enhance the EFMAC of the...
A 4nm 6163-TOPS/W/b 4790-TOPS/mm²/b SRAM Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update
The computational load for accurate AI workloads is moving from large server clusters to edge devices, thus enabling richer and more personalized AI applications. Compute-in-memory (CIM) is beneficial for edge-AI workloads, specifically ones that are MAC-intensive. However, realizing better power-performance-area (PPA) and high accuracy is a major challenge for practical CIM implementation. Recent work examined tradeoffs between MAC throughput, energy efficiency, and accuracy for analog-based CIM [1–3]. On the other hand, digital CIMs (DCIM), which use small, distributed SRAM banks and a customized MAC unit, have demonstrated massively parallel computation with no accuracy loss and a higher PPA with technology scaling. In this paper, we introduce a 4-nm SRAM-based DCIM macro that handles variable 8/12b-integer weights and 8/12/16b-integer inputs in a single macro. The proposed 8-transistor 2b OAI (or-and-invert) cell achieves an 11% smaller combined bit cell and multiplier area, and supports ultra-low-voltage operation, down to 0.32V. Furthermore, the signed-extended carry-look-ahead adder (signed-CLA) and an adder tree pipeline are introduced to boost throughput. Figure 7.4.1 shows the implementation of the bit cell structure, and a neural network accuracy comparison with various bit precisions. Because we target concurrent write and MAC operations (ping-pong between weight updates and MAC operations), the array needs to have an even number of rows. A classical approach is to use two 12T bit cells and a 2-input NOR. The 12T cell supports simultaneous read and write operations, as its read and write ports are independent. The 2-input NOR is used for bitwise multiplication of input activations (XIN) and weights (W). On the other hand, two 8T cells and an OAI are used in the proposed SRAM-based DCIM macro.
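As a rough illustration of how a 2-input NOR can serve as a 1b multiplier in the classical cell, the sketch below assumes the gate receives the complemented input and weight bits, so that by De Morgan's law its output equals the AND (the 1b product). The function names, the unsigned operands, and the bit-serial accumulation loop are illustrative assumptions; the abstract does not specify the signal polarity, and the macro's signed arithmetic, adder tree, and pipelining are not modeled here.

```python
def nor(a: int, b: int) -> int:
    """2-input NOR on single bits (0 or 1)."""
    return 1 - (a | b)


def bit_multiply(x_bit: int, w_bit: int) -> int:
    """1b x 1b product as NOR of the complemented operands.

    Assumption: the gate is driven by inverted input and weight bits,
    so NOR(~x, ~w) == x AND w.
    """
    return nor(1 - x_bit, 1 - w_bit)


def bit_serial_mac(inputs, weights, in_bits=8):
    """Unsigned MAC built from 1b products, one input bit-plane at a time.

    Hypothetical software model only; real DCIM hardware sums the 1b
    products with an adder tree and handles signed operands.
    """
    acc = 0
    for k in range(in_bits):                  # input bit-plane k
        partial = 0
        for x, w in zip(inputs, weights):
            x_bit = (x >> k) & 1
            for j in range(w.bit_length()):   # each stored weight bit
                partial += bit_multiply(x_bit, (w >> j) & 1) << j
        acc += partial << k                   # weight the bit-plane by 2^k
    return acc


if __name__ == "__main__":
    xs, ws = [3, 5, 7], [2, 4, 6]
    print(bit_serial_mac(xs, ws), sum(x * w for x, w in zip(xs, ws)))
```

The model reproduces an ordinary dot product for unsigned operands that fit in `in_bits`, which is the property the bitwise-multiply-then-accumulate structure relies on.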
In the proposed bitcell topology, the 8T bitcells act as memory data storage and row selection for the write operation. The OAI performs row selection and bitwise mu...
A 73.53TOPS/W 14.74TOPS Heterogeneous RRAM In-Memory and SRAM Near-Memory SoC for Hybrid Frame and Event-Based Target Tracking
Vision-based high-speed target identification and tracking is a critical application in unmanned aerial vehicles (UAV) with wide military and commercial usage. Traditional frame cameras, processed through convolutional neural networks (CNN), exhibit high target-identification accuracy but low throughput (hence low tracking speed) and high power. On the other hand, event cameras or dynamic vision sensors (DVS) generate a stream of binary asynchronous events corresponding to the changing intensity of the pixels, capturing the high-speed temporal information characteristic of high-speed tracking. Such event streams, with high spatial sparsity, processed with bio-mimetic spiking neural networks (SNN) provide low power consumption and high throughput. However, the accuracy of object detection using such event cameras and SNNs is limited. Thus, a frame pipeline with a CNN and an event pipeline with an SNN (Fig. 29.5.1) possess complementary strengths in capturing and processing the spatial and temporal details, respectively. Hence, a hybrid network that fuses frame data processed using a CNN pipeline with event data processed through an SNN pipeline provides a platform for high-speed, high-accuracy, and low-power target identification and tracking. To address this need, we present a fully programmable heterogeneous ARM Cortex-based SoC with an in-memory low-power RRAM-based CNN and a near-memory high-speed SRAM-based SNN in a hybrid architecture with applications in high-speed target identification and tracking.
A Nonvolatile AI-Edge Processor with 4MB SLC-MLC Hybrid-Mode ReRAM Compute-in-Memory Macro and 51.4-251TOPS/W
Low-power AI edge devices should provide short-latency (TWK-RP) and low-energy (EWK-RP) wakeup responses from power-off mode to handle event-triggered computing tasks with high inference accuracy (IA). This requires high-capacity nonvolatile memory (NVM) to store high-precision weight data in power-off mode, and high-bit-precision multiply-and-accumulate (MAC) operations with high energy efficiency. SRAM computing-in-memory (CIM) and digital processors suffer from large EWK-RP and long TWK-RP due to the movement of weight data from off-chip NVM to the on-chip buffer and processing unit after wakeup. Thus, on-chip nonvolatile CIM (nvCIM) is preferred for AI-edge processors because it combines NVM storage and computing tasks on the same macro. Among nvCIM structures, in-memory compute (IMC) provides short computing latency and high energy efficiency, but suffers from low computing yield. Near-memory compute (NMC) [2–4] provides high computing yield, but suffers from long computing latency and low energy efficiency.
A 2.38 MCells/mm² 9.81-350 TOPS/W RRAM Compute-in-Memory Macro in 40nm CMOS with Hybrid Offset/IOFF Cancellation and ICELL RBLSL Drop Mitigation
A dense compute-in-memory (CIM) macro using resistive random-access memory (RRAM) is presented, showing solutions to read channel mismatch, high IOFF, ADC offset, IR drop, and cell resistance variation. By combining a hybrid analog/mixed-signal offset cancellation scheme and ICELL RBLSL drop mitigation with a low cell bias target voltage, the proposed macro demonstrates robust operation (post-ECC bit error rate (BER) < 5×10⁻⁸ for 8WL CIM) while maintaining an effective cell density 1.03-33.1× higher than prior art and achieving 1.74-13.35× improved average MAC efficiency relative to the previous highest-density RRAM CIM macro.
A 28nm Nonvolatile AI Edge Processor using 4Mb Analog-Based Near-Memory-Compute ReRAM with 27.2 TOPS/W for Tiny AI Edge Devices
Tiny AI edge processors prefer using nvCIM to achieve low standby power, high energy efficiency (EF), and short wakeup-to-response latency (TWR). Most nvCIMs use in-memory computing for MAC operations; however, this imposes a tradeoff between EF and accuracy, due to MAC accumulation number (NACU) versus signal margin and readout quantization. To achieve high EF and high accuracy, we developed a system-level nvCIM-friendly control scheme and an nvCIM macro with two analog near-memory computing schemes. The proposed 28nm nonvolatile AI edge processor with 4Mb ReRAM-nvCIM achieved high EF (27.2 TOPS/W), short TWR (3.19 ms), and low accuracy loss (<0.5%). The EF of the ReRAM-nvCIM macro was 38.6 TOPS/W.
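The tradeoff between accumulation number and signal margin can be made concrete with a deliberately idealized model. The sketch below assumes binary inputs and binary weights, a fixed sensing range shared by all MAC results, and evenly spaced levels; the 400 mV range and the function name are hypothetical values chosen for illustration, not figures from the paper.

```python
def ideal_signal_margin(v_range_mv: float, n_acu: int) -> float:
    """Spacing between adjacent MAC levels in an idealized analog readout.

    Assumptions (not from the paper): 1b inputs and 1b weights, so the MAC
    result takes n_acu + 1 values (0..n_acu), spread evenly over a fixed
    sensing range of v_range_mv millivolts.
    """
    if n_acu < 1:
        raise ValueError("n_acu must be at least 1")
    return v_range_mv / n_acu


if __name__ == "__main__":
    # Hypothetical 400 mV sensing range: the margin halves each time
    # the number of accumulations per cycle doubles.
    for n in (4, 8, 16, 32):
        print(f"NACU={n:2d}  margin={ideal_signal_margin(400.0, n):.1f} mV")
```

Under these assumptions, accumulating more products per cycle raises throughput per readout but leaves less margin between adjacent results, which is the pressure that motivates moving accumulation out of the analog in-memory path.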
A 22nm 4Mb STT-MRAM Data-Encrypted Near-Memory Computation Macro with a 192GB/s Read-and-Decryption Bandwidth and 25.1-55.1TOPS/W 8b MAC for AI Operations
Nonvolatile computing-in-memory (nvCIM) is ideal for battery-powered tiny artificial intelligence (AI) edge devices that require nonvolatile data storage and low system-level power consumption. Data encryption/decryption (data-ED) is also required to prevent access to the neural network (NN) model weights and the personalized data used to improve inference accuracy. This paper presents an AI nvCIM data-ED-capable macro with high energy efficiency (EFMAC), a low macro-level read latency (tAC-M), a high read bandwidth (R-BW), and high-precision inputs (IN), weights (W), and outputs (OUT) for multiply-and-accumulate (MAC) operations. Prior nvCIM macros designed for MAC operations do not support data-ED or a high number of accumulations (ACU). The use of a single NN layer also requires multiple cycles for full-channel MAC (MACFC-L) operations. A low-computing-latency (tAC-FC-L) and high-precision nvCIM macro with data-ED design faces the following challenges: (1) long tAC-FC-L and low EFMAC for MACFC-L operations, which require multiple memory accesses with a limited R-BW; (2) long tAC-M due to BL pre-charge (tPRE), signal development (tSD), sensing (tSA), and data-D (tOE); (3) high power consumption for BL pre-charge, particularly when using a high BL read voltage (VRD) to increase sensing yield.
A 40-nm, 2M-Cell, 8b-Precision, Hybrid SLC-MLC PCM Computing-in-Memory Macro with 20.5 - 65.0TOPS/W for Tiny-AI Edge Devices
Efficient edge computing, with sufficiently large on-chip memory capacity, is essential in the internet-of-everything era. Nonvolatile computing-in-memory (nvCIM) reduces the data transfer overhead by bringing computation into closer proximity to the memory. The multi-level cell (MLC) has a higher storage density than the single-level cell (SLC). A few MLC or analog nvCIM designs have been proposed, but they either target simpler neural-net models or are implemented using a less area-efficient differential cell. Furthermore, representing the entire weight vector using one storage type does not exploit the drastic accuracy difference between the upper and the lower bits.
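The point about upper and lower bits can be sketched with a toy weight layout. The example below assumes an 8b unsigned weight whose upper 4 bits are stored one bit per SLC cell and whose lower 4 bits are stored two bits per MLC cell; this 4/4 split, the 2b-per-cell MLC, and the helper names are illustrative assumptions, not the mapping used in the paper.

```python
def split_weight(w, upper_bits=4, total_bits=8, mlc_bits_per_cell=2):
    """Map a weight to (SLC cells, MLC cells), most significant cell first.

    Hypothetical layout: upper bits go to 1b SLC cells, lower bits are
    packed into multi-bit MLC cells. The number of lower bits must be a
    multiple of mlc_bits_per_cell.
    """
    lower_bits = total_bits - upper_bits
    upper = w >> lower_bits
    lower = w & ((1 << lower_bits) - 1)
    slc = [(upper >> i) & 1 for i in reversed(range(upper_bits))]
    mask = (1 << mlc_bits_per_cell) - 1
    mlc = [(lower >> s) & mask
           for s in reversed(range(0, lower_bits, mlc_bits_per_cell))]
    return slc, mlc


def merge_weight(slc, mlc, mlc_bits_per_cell=2):
    """Rebuild the weight from the cell contents produced by split_weight."""
    upper = 0
    for bit in slc:
        upper = (upper << 1) | bit
    lower = 0
    for level in mlc:
        lower = (lower << mlc_bits_per_cell) | level
    return (upper << (len(mlc) * mlc_bits_per_cell)) | lower


if __name__ == "__main__":
    slc, mlc = split_weight(0b10110110)
    print(slc, mlc, len(slc) + len(mlc))   # 6 cells instead of 8 SLC cells

    # A flipped SLC MSB shifts the weight by 128; a one-level error in the
    # least significant MLC cell shifts it by only 1.
    msb_err = merge_weight([1 - slc[0]] + slc[1:], mlc)
    lsb_err = merge_weight(slc, mlc[:-1] + [mlc[-1] - 1])
    print(abs(msb_err - 0b10110110), abs(lsb_err - 0b10110110))
```

In this toy layout the weight needs six cells rather than eight, and the denser but less robust MLC cells hold only the bits whose errors perturb the weight least.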
A 40nm 60.64TOPS/W ECC-Capable Compute-in-Memory/Digital 2.25MB/768KB RRAM/SRAM System with Embedded Cortex M3 Microprocessor for Edge Recommendation Systems
Resistive RAM (RRAM) is an exciting technology that exhibits various new properties that have long been absent in traditional charge-based memories. RRAM features high bit density, non-volatile storage, accurate compute-in-memory (CIM), and both process and voltage compatibility. Each of these properties makes RRAM a compelling candidate for AI applications, particularly at the edge. To demonstrate the utility of these properties, we direct our effort to real-world event-driven and memory-constrained applications, such as recommendation systems and natural language processing (NLP). To enable these applications at the edge, higher memory capacity and bandwidth must be achieved despite irregular data access patterns that prevent effective caching and data reuse. Furthermore, we find that these applications are rarely (if ever) run continuously; instead, execution is triggered by events. The combination of these two challenges makes RRAM an ideal candidate, given its high density and non-volatility, which enable near-zero leakage power and complete power-down. To address these challenges, this paper presents a 2.25MB RRAM-based CIM accelerator with 765kB of SRAM and an embedded Cortex M3 processor for edge devices.
A 40nm 64kb 26.56TOPS/W 2.37Mb/mm² RRAM Binary/Compute-in-Memory Macro with 4.23x Improvement in Density and >75% Use of Sensing Dynamic Range
Compute-in-Memory (CIM) using emerging nonvolatile memory (eNVM) technologies, such as resistive random-access memory (RRAM), has been shown by several implemented macros to be an energy-efficient alternative to traditional von Neumann architectures. Since moving data on- and off-chip has a high energy cost, area efficiency is important to the practical utility of CIM with RRAM. Many systems demonstrated so far have not reported area efficiency or addressed the challenges CIM with RRAM presents with respect to practical area-constrained integrated circuits.
Over the last decade, we have witnessed a steep rise of Artificial Intelligence (AI) as an alternative computing paradigm. Although the idea has been around since the 1950s, AI needed progress in algorithms, capable hardware, and sufficiently large training data to become a practical and powerful tool. Progress in computing hardware has been a key ingredient for the AI renaissance and will become increasingly critical to realizing future AI applications.
We are particularly well-positioned to supply the most advanced AI hardware to our customers thanks to our leading-edge logic, memory, and packaging technologies. We have established a research pipeline for technology to enable leading-edge AI devices, circuits, and systems for decades to come. Near- and in-memory computing, embedded non-volatile memory technologies, 3D integration, and error-resilient computing are amongst our specific AI hardware research areas. Our in-house research is complemented by strong academic and governmental partnerships, which allow us to interact with and influence leading AI researchers around the world.