Artificial Intelligence

Over the last decade, we have witnessed a steep rise of Artificial Intelligence (AI) as an alternative computing paradigm. Although the idea has been around since the 1950s, AI needed progress in algorithms, capable hardware, and sufficiently large training data to become a practical and powerful tool. Progress in computing hardware has been a key ingredient of the AI renaissance and will remain critical to realizing future AI applications.

We are particularly well-positioned to supply the most advanced AI hardware to our customers thanks to our leading-edge logic, memory, and packaging technologies. We have established a technology research pipeline to enable leading-edge AI devices, circuits, and systems for decades to come. Near- and in-memory computing, embedded non-volatile memory technologies, 3D integration, and error-resilient computing are among our specific AI hardware research areas. Our in-house research is complemented by strong academic and governmental partnerships, which allow us to interact with and influence leading AI researchers around the world.

  • Focus Session Invited Paper: Next Generation TSMC-SoIC® Platform for Ultra-High Bandwidth HPC Application

    2024
    This work introduces the next-generation System-on-Integrated-Chips (SoIC) 3D stacking technology, which enables ultra-high bandwidth density between stacked dies while maintaining stable device performance. Combined with TSMC's cutting-edge System-on-Chip nodes and advanced packaging solutions, this innovative 3D technology pushes the boundaries of Moore's Law for the next generation of High-Performance Computing applications.
  • MRAM Design-Technology-System Co-Optimization for Artificial Intelligence Edge Devices

    2024
    STT-MRAM shows great promise for use in artificial intelligence (AI) edge devices due to its compact bitcell area and high endurance. However, it faces read challenges because of its low TMR and RP. Conventional sense amplifiers have limitations in optimizing read energy and robustness while providing the flexibility to exploit neural-net error tolerance. This article explores the design challenges of conventional sense amplifiers and examines how device parameters (TMR and RP) impact read performance. A novel capacitive-coupling sense amplifier is introduced to offer a new design space for balancing read energy and robustness. By combining the exploitation of neural-net error tolerance with sense-amplifier and device co-design, a Design-Technology-System Co-Optimization (DTSCO) approach demonstrates a read energy reduction of 27.1% to 45.3% with minimal inference accuracy degradation in edge AI applications.
  • Novel Parallel Digital Optical Computing System (DOC) for Generative A.I.

    2024
    Generative A.I.'s (GAI) popularity has made photonics-based computation an attractive approach for its potential to meet the demand for higher energy efficiency performance (EEP). However, previous optical solutions for multiply-accumulate (MAC) operations focused either on analog architectures [1-7], which are limited by their precision and data conversion, or on free-space optical architectures with limited scalability [8]. Here, the world's first on-chip large-scale Digital Optical Computing System (DOC) for GAI training is reported. DOC employs a novel wafer-based system integration technology featuring a multilayer low-loss photonic interconnect fan-out (PIFO) and an EIC/PIC stack architecture leveraging TSMC SoIC®. It reduces data movement and the memory hierarchy, leading to improvements in critical-path latency and system energy efficiency (EE). Compared to conventional electrical designs, DOC can be scaled to larger coherent networks and operate at higher speeds with lower energy per MAC operation. A low energy consumption of <0.08 pJ/MAC at 8-bit operation, with a >20x improvement in EEP compared to the state-of-the-art GPU [10], is achieved for a 512 x 512 large-scale MAC operation. The EEP further improves at higher precisions due to the relatively small fan-out energy. This architecture has full potential for continued EEP scaling in future generations.
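
    A back-of-the-envelope check of the headline figure above, written as a short Python sketch; the per-pass total and the implied GPU number are rough estimates derived from the reported 0.08 pJ/MAC and >20x EEP claims, not values stated in the paper:

        macs_per_pass = 512 * 512                 # 262,144 MAC operations in a 512 x 512 pass
        energy_per_mac_pj = 0.08                  # reported upper bound, pJ per 8-bit MAC
        pass_energy_nj = macs_per_pass * energy_per_mac_pj / 1e3
        print(f"~{pass_energy_nj:.0f} nJ per 512x512 MAC pass")   # ~21 nJ
        # A >20x EEP improvement implies the reference GPU spends on the
        # order of 20 * 0.08 = 1.6 pJ or more per equivalent 8-bit MAC.
        print(f"implied GPU energy: >{20 * energy_per_mac_pj:.1f} pJ/MAC")
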
  • A 16nm 96Kb Integer-floating-point Dual-mode Gain-cell-Computing-in-Memory Macro with 53.3-163.3TOPS/W and 23.2-121.2TFLOPS/W for AI-Edge Devices

    2024
    Advanced AI-edge chips require computational flexibility and high energy efficiency (EEF) with sufficient inference accuracy for a variety of applications. Floating-point (FP) numerical representation can be used for complex neural networks (NN) requiring high inference accuracy; however, such an approach requires more energy and more parameter storage than a fixed-point integer (INT) numerical representation. Many compute-in-memory (CIM) designs have a good EEF for INT multiply-and-accumulate (MAC) operations; however, few support FP-MAC operations [1–3]. Implementing INT/FP dual-mode (DM) MAC operations presents challenges (Fig. 34.2.1), including (1) low area efficiency, since FP-MAC functions become idle during INT-MAC operations; (2) high system-level latency, due to NN data-update interruptions on small-capacity SRAM-CIM without concurrent write-and-compute functionality; and (3) high energy consumption, due to repeated system-to-CIM data transfers during computation. This work presents an INT/FP DM macro featuring (1) a DM zone-based input (IN) processing scheme (ZB-IPS) to eliminate subtraction in exponent (EXP) computation while reusing the alignment circuit in INT mode to improve EEF and area efficiency (AEF); (2) a DM local-computing-cell (DM-LCC), which reuses the EXP addition as an adder-tree stage for INT-MAC to improve AEF in INT mode; and (3) a stationary-based two-port gain-cell (GC) array (SB-TP-GCA) to support concurrent data updates and computation, while reducing system-to-CIM and internal data accesses to improve EEF and latency (T_MAC). A 16nm 96-Kb INT-FP DM GC-CIM macro with 4T GCs is fabricated to support FP-MAC with 64 accumulations (N_ACCU) for BF16-IN, BF16-W, and FP32-OUT, as well as INT-MAC with N_ACCU = 128 for 8b-IN, 8b-W, and 23b-OUT. This CIM macro achieves a 163.3TOPS/W INT-MAC and a 91.2TFLOPS/W FP-MAC EEF.
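
    As a rough illustration of the FP mode described above (BF16 inputs and weights, FP32 output, N_ACCU = 64), the Python sketch below emulates BF16 by truncating the low 16 bits of float32 and accumulates in FP32; the macro's zone-based exponent processing and circuit details are not modeled, and all values here are illustrative:

        import numpy as np

        def to_bf16(x):
            # Emulate BF16 by zeroing the low 16 mantissa bits of a float32 value.
            u = np.asarray(x, dtype=np.float32).view(np.uint32)
            return (u & np.uint32(0xFFFF0000)).view(np.float32)

        rng = np.random.default_rng(0)
        inputs  = to_bf16(rng.standard_normal(64))    # N_ACCU = 64 BF16 inputs
        weights = to_bf16(rng.standard_normal(64))    # 64 BF16 weights
        fp32_out = np.float32(0.0)
        for a, w in zip(inputs, weights):             # accumulate the products in FP32
            fp32_out += np.float32(a) * np.float32(w)
        print(fp32_out)                               # FP32-OUT dot product
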
  • A 22nm Nonvolatile AI-Edge Processor with 21.4TFLOPS/W using 47.25Mb Lossless-Compressed-Computing STT-MRAM Near-Memory-Compute Macro

    2024
    Battery-powered AI-edge processors require a short wakeup-to-response latency (T_WR) and high energy efficiency (EF) for accurate real-time inference. This necessitates high-capacity nvCIM macros to store floating-point (FP) neural network (NN) data (e.g., BF16) and perform MAC operations with short latency (T_CD) and high EF. This paper presents an STT-MRAM nvCIM macro with lossless compression computation, a near-far aware readout scheme, and system-level CIM-friendly hybrid weight mapping. The proposed 22nm nonvolatile processor (nvProcessor) with a 47.25-Mb STT-MRAM nvCIM achieved a high EF (21.4TFLOPS/W), a short T_WR (428.58μs), and a high macro-level EF (27.6TFLOPS/W). Keywords: multiply-and-accumulate (MAC), nonvolatile memory (NVM), nonvolatile compute-in-memory (nvCIM)
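
    For context, the reported processor-level efficiency can be converted into energy per operation; the short sketch below does this arithmetic (the fJ/FLOP and fJ/MAC figures are derived estimates, not numbers stated in the paper, and a MAC is counted as two FLOPs only by convention):

        eff_tflops_per_w = 21.4                          # reported processor-level efficiency
        joules_per_flop = 1.0 / (eff_tflops_per_w * 1e12)
        print(f"~{joules_per_flop * 1e15:.1f} fJ per FLOP")                              # ~46.7 fJ
        print(f"~{2 * joules_per_flop * 1e15:.1f} fJ per BF16 MAC (1 MAC = 2 FLOPs)")    # ~93.5 fJ
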
  • MINOTAUR: An Edge Transformer Inference and Training Accelerator with 12 MBytes On-Chip Resistive RAM and Fine-Grained Spatiotemporal Power Gating

    2024
    MINOTAUR is the first energy-efficient edge SoC for inference and training of Transformers (and other networks, e.g., CNNs) with all memory on-chip. MINOTAUR leverages a configurable 8-bit posit-based accelerator, fine-grained spatiotemporal power gating enabled by on-chip resistive RAM (RRAM) for dynamically adjustable bandwidth, and on-chip fine-tuning through full-network low-rank adaptation (LoRA). MINOTAUR achieves average utilizations of 93% and 74% and energies of 8.1 mJ and 8.2 mJ on ResNet-18 and MobileBERT Tiny inference, respectively, and on-chip fine-tuning within 1.7% of offline training without RRAM-induced energy limitations or endurance degradations.
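
    Low-rank adaptation (LoRA), which the abstract says MINOTAUR uses for on-chip fine-tuning, can be summarized with the minimal sketch below; the layer sizes, rank, and initialization are illustrative assumptions, not MINOTAUR's actual configuration:

        import numpy as np

        # LoRA keeps the pre-trained weight W frozen and learns a low-rank update A @ B,
        # so only (d*r + r*k) parameters are trained instead of d*k.
        d, k, r = 256, 256, 8                 # illustrative layer size and adaptation rank
        rng = np.random.default_rng(0)
        W = rng.standard_normal((d, k))       # frozen pre-trained weights
        A = rng.standard_normal((d, r)) * 0.01
        B = np.zeros((r, k))                  # B starts at zero, so W is unchanged at step 0

        def forward(x):
            # Effective weight is W + A @ B; only A and B receive gradient updates.
            return x @ (W + A @ B)

        x = rng.standard_normal((1, d))
        print(forward(x).shape)               # (1, 256)
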
  • A 22nm 8Mb STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices

    2023
    Nonvolatile-memory-based computing in memory (nvCIM) [1–6] is ideal for low-power edge-AI devices requiring neural network (NN) parameter storage in the power-off mode, a rapid response to device wake-up, and high energy efficiency for MAC operations (EF_MAC). Current analog nvCIMs impose a tradeoff between the signal margin (SM) and the number of accumulations (N_ACU) per cycle versus EF_MAC and computing latency (T_CD-MAC). Near-memory computing (NMC), with high precision for inputs (IN), weights (W), and outputs (OUT), and a high N_ACU, is a trend to improve EF_MAC, T_CD-MAC, and accuracy. A prior STT-MRAM NMC [1] uses vertical-weight mapping (VWM) to improve the EF_MAC; however, further improvement is challenging due to (1) the large energy consumption in reading repetitious weight data across multiple inputs for a single NN layer; (2) a high bitstream toggling rate (BTR) for digital MAC circuits (DCMAC), which reduces EF_MAC; and (3) a limited SM and memory readout latency (T_CD-M) for memories with a small R-ratio (e.g. STT-MRAM, see Fig. 33.2.1). In developing an STT-MRAM nvCIM macro, this work moves beyond circuit-level novelty by using system-software-circuit co-design. This work achieves a high EF_MAC, a short T_CD-M, a high read bandwidth (R-BW), a high IN-W-OUT precision, and a high N_ACU by using the following novel schemes: (1) a hardware-based weight-feature-aware read (WFAR) to reduce weight accesses and improve EF_MAC with a minimal area overhead; (2) toggling-aware weight tuning (TAWT) to obtain fine-tuned weights (W_FT) with a low BTR, which is based on VWM to enhance the EF_MAC of the...
  • A 4nm 6163-TOPS/W/b 4790-TOPS/mm²/b SRAM-Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update

    2023
    The computational load for accurate AI workloads is moving from large server clusters to edge devices, thus enabling richer and more personalized AI applications. Compute-in-memory (CIM) is beneficial for edge-AI workloads, specifically ones that are MAC-intensive. However, realizing better power-performance-area (PPA) and high accuracy is a major challenge for practical CIM implementation. Recent work examined tradeoffs between MAC throughput, energy efficiency, and accuracy for analog-based CIM [1–3]. On the other hand, digital CIMs (DCIM), which use small, distributed SRAM banks and a customized MAC unit, have demonstrated massively parallel computation with no accuracy loss and a higher PPA with technology scaling [4]. In this paper, we introduce a 4-nm SRAM-based DCIM macro that handles variable 8/12b-integer weights and 8/12/16b-integer inputs in a single macro. The proposed 8-transistor 2b OAI (or-and-invert) cell achieves an 11% smaller combined bit-cell and multiplier area, and supports ultra-low-voltage operation, down to 0.32V. Furthermore, a sign-extended carry-look-ahead adder (signed-CLA) and an adder-tree pipeline are introduced to boost throughput. Figure 7.4.1 shows the implementation of the bit-cell structure, and a neural-network accuracy comparison with various bit precisions. Since we targeted concurrent write and MAC operations, ping-ponging between weight updates and MAC operations, the array needs to have an even number of rows: a classical approach is to use two 12T bit cells and a 2-input NOR. The 12T cell supports simultaneous read and write operations, as its read and write ports are independent. The 2-input NOR is used for bitwise multiplication of input activations (X_IN) and weights (W). In contrast, two 8T cells and an OAI are used in the proposed SRAM-based DCIM macro. In the proposed bitcell topology, the 8T bitcells act as memory data storage and row selection for the write operation. The OAI performs row selection and bitwise mu...
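
    A minimal functional model of the digital-CIM idea described above: weight bits are held in the array, input bits are applied one plane at a time, each bit cell produces a 1b x 1b product (modeled here as an AND; the macro uses an OAI cell), and an adder tree plus shift-and-add rebuilds the full-precision dot product. Bit widths, the unsigned-only handling, and the function name are illustrative assumptions:

        def dcim_mac(inputs, weights, in_bits=8, w_bits=8):
            # Unsigned-integer dot product computed bit-plane by bit-plane.
            acc = 0
            for i_bit in range(in_bits):                      # serialize input bits
                for w_bit in range(w_bits):                   # one column of stored weight bits
                    partial = sum(((x >> i_bit) & 1) & ((w >> w_bit) & 1)
                                  for x, w in zip(inputs, weights))   # adder-tree reduction
                    acc += partial << (i_bit + w_bit)          # shift-and-add recombination
            return acc

        xs = [3, 10, 7, 200]
        ws = [5, 2, 9, 1]
        assert dcim_mac(xs, ws) == sum(x * w for x, w in zip(xs, ws))
        print(dcim_mac(xs, ws))                                # 298
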
  • A 73.53TOPS/W 14.74TOPS Heterogeneous RRAM In-Memory and SRAM Near-Memory SoC for Hybrid Frame and Event-Based Target Tracking

    2023
    Vision-based high-speed target identification and tracking is a critical application in unmanned aerial vehicles (UAVs) with wide military and commercial usage. Traditional frame cameras processed through convolutional neural networks (CNNs) exhibit high target-identification accuracy but suffer from low throughput (hence low tracking speed) and high power. On the other hand, event cameras or dynamic vision sensors (DVS) generate a stream of binary asynchronous events corresponding to the changing intensity of the pixels, capturing the high-speed temporal information characteristic of high-speed tracking. Such event streams, with high spatial sparsity, processed with bio-mimetic spiking neural networks (SNNs) provide low power consumption and high throughput. However, the accuracy of object detection using such event cameras and SNNs is limited. Thus, a frame pipeline with a CNN and an event pipeline with an SNN (Fig. 29.5.1) possess complementary strengths in capturing and processing spatial and temporal details, respectively. Hence, a hybrid network that fuses frame data processed through a CNN pipeline with event data processed through an SNN pipeline provides a platform for high-speed, high-accuracy, and low-power target identification and tracking. To address this need, we present a fully programmable heterogeneous ARM Cortex-based SoC with an in-memory low-power RRAM-based CNN and a near-memory high-speed SRAM-based SNN in a hybrid architecture, with applications in high-speed target identification and tracking.
  • A Nonvolatile AI-Edge Processor with 4MB SLC-MLC Hybrid-Mode ReRAM Compute-in-Memory Macro and 51.4-251TOPS/W

    2023
    Low-power AI edge devices should provide short-latency (T_WK-RP) and low-energy (E_WK-RP) wakeup responses from power-off mode to handle event-triggered computing tasks with high inference accuracy (IA). This requires high-capacity nonvolatile memory (NVM) to store high-precision weight data during power-off and to perform high bit-precision multiply-and-accumulate (MAC) operations with high energy efficiency. SRAM computing-in-memory (CIM) and digital processors suffer a large E_WK-RP and a long T_WK-RP due to the movement of weight data from off-chip NVM to the on-chip buffer and processing unit after wakeup. Thus, on-chip nonvolatile CIM (nvCIM) is preferred for AI-edge processors, as it combines NVM storage and computing tasks in the same macro. Among nvCIM structures, in-memory compute (IMC) [1] provides short computing latency and high energy efficiency but suffers from low computing yield. Near-memory compute (NMC) [2–4] provides high computing yield but suffers from long computing latency and low energy efficiency.