
Artificial Intelligence

Over the last decade, we have witnessed the steep rise of Artificial Intelligence (AI) as an alternative computing paradigm. Although the idea has been around since the 1950s, AI needed progress in algorithms, capable hardware, and sufficiently large training data to become a practical and powerful tool. Progress in computing hardware has been a key ingredient of the AI renaissance and will become increasingly critical to realizing future AI applications.

We are particularly well-positioned to supply the most advanced AI hardware to our customers thanks to our leading-edge logic, memory, and packaging technologies. We have established a research pipeline for technology to enable leading-edge AI devices, circuits, and systems for decades to come. Near- and in-memory computing, embedded non-volatile memory technologies, 3D integration, and error-resilient computing are amongst our specific AI hardware research areas. Our in-house research is complemented by strong academic and governmental partnerships, which allow us to interact with and influence leading AI researchers around the world.


A 16nm 216kb, 188.4TOPS/W and 133.5TFLOPS/W Microscaling Multi-Mode Gain-Cell CIM Macro for Edge-AI Devices
2025
TSMC and National Tsing Hua University demonstrate the first CIM macro to support the microscaling data format, achieving 133.5TFLOPS/W in a 16nm process.
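As a rough illustration of the idea behind a microscaling data format, the sketch below quantizes a block of values so that all elements share one power-of-two scale while each element is stored as a small signed integer. The 8-bit element width and the names `mx_quantize` and `mx_dequantize` are assumptions made for illustration; the sketch is a simplification and does not describe the format variants or circuits used in the macro above.

```python
import math

def mx_quantize(block, elem_bits=8):
    """Quantize a block to signed integers that share one power-of-two scale.

    Returns (shared_exponent, integer_elements). Hypothetical helper.
    """
    qmax = 2 ** (elem_bits - 1) - 1              # e.g. 127 for 8-bit elements
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent such that amax / 2**exp still fits within qmax.
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Round each element to the shared grid and clamp to the integer range.
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, q

def mx_dequantize(exp, q):
    """Recover approximate real values from the shared exponent and integers."""
    return [v * 2.0 ** exp for v in q]

exp, q = mx_quantize([0.5, -1.0, 0.25, 0.75])
print(exp, q, mx_dequantize(exp, q))
```

Because the scale is stored once per block rather than once per element, the per-element storage stays small while the block as a whole can still cover a wide dynamic range.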
A 22nm 104.5TOPS/W μ-NMC-Δ-IMC Heterogeneous STT-MRAM CIM Macro for Noise-Tolerant Bayesian Neural Networks
2025
National Tsing Hua University and TSMC present an STT-MRAM CIM macro for noise-tolerant Bayesian neural networks with a heterogeneous in- and near-memory MAC structure. The 22nm macro achieves 104.5TOPS/W with a 0.03% accuracy loss for CIFAR-100.
Focus Session Invited Paper: Next Generation TSMC-SoIC® Platform for Ultra-High Bandwidth HPC Application
2024
This work introduces the next generation System-on-Integrated-Chips (SoIC) 3D stacking technology, facilitating ultra-high bandwidth density between stacked dies, with device performance remaining stable. This innovative 3D technology, with TSMC's cutting-edge System-on-Chip nodes and advanced package solutions, pushes the boundaries of Moore’s Law for the next generation of High-Performance Computing applications.
MRAM Design-Technology-System Co-Optimization for Artificial Intelligence Edge Devices
2024
STT-MRAM shows great promise for use in artificial intelligence (AI) edge devices due to its compact bitcell area and high endurance. However, it faces read challenges because of its low TMR and RP. Conventional sense amplifiers have limitations in optimizing read energy and robustness while providing flexibility to exploit neural-net error tolerance. This article explores the design challenges of conventional sense amplifiers and examines how device parameters (TMR and RP) impact read performance. A novel capacitive-coupling sense amplifier is introduced to offer a new design space for balancing read energy and robustness. Combining the exploitation of neural-net error tolerance with sense amplifier and device co-design, a Design-Technology-System Co-Optimization (DTSCO) approach demonstrates a read energy reduction of 27.1% to 45.3% with minimal inference accuracy degradation in edge AI applications.
Novel Parallel Digital Optical Computing System (DOC) for Generative A.I.
2024
Generative A.I.’s (GAI) popularity has made photonics-based computation an attractive approach for its potential to meet the demands for higher energy efficiency performance (EEP). However, previous optical solutions for multiply-accumulate (MAC) operations focused on either analog architecture [1-7], which is limited by its precision and data conversion, or free-space optical architecture with limited scalability [8]. Here, the world’s first on-chip large-scale Digital Optical Computing System (DOC) for GAI training is reported. DOC employs a novel wafer-based system integration technology featuring multilayer low-loss photonic interconnect fan-out (PIFO) and an EIC/PIC stack architecture leveraging TSMC-SoIC®. It reduces data movement and the memory hierarchy, leading to improvements in critical path latency and system energy efficiency (EE). Compared to conventional electrical designs, DOC can be scaled to larger coherent networks and operate at higher speeds with lower energy per MAC operation. A low energy consumption of <0.08 pJ/MAC at 8-bit operation, with a >20x improvement in EEP compared to the state-of-the-art GPU [10], is achieved for a 512 x 512 MAC large-scale operation. The EEP further improves at higher precisions due to the relatively minimal fan-out energy. This architecture has full potential for continuous EEP scaling in future generations.
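To relate the energy-per-MAC figure above to the TOPS/W metric used by other entries on this page, the conversion can be sketched as follows. Counting one MAC as two operations (a multiply and an add) and reading "512 x 512 MAC" as 512 × 512 individual MACs are both assumptions, not statements from the paper; the function names are hypothetical.

```python
def tops_per_watt(pj_per_mac, ops_per_mac=2):
    """Convert energy per MAC in picojoules to TOPS/W.

    1 TOPS/W is 1e12 operations per joule, i.e. 1 pJ per operation,
    so the conversion reduces to ops_per_mac / pj_per_mac.
    ops_per_mac=2 is an assumed convention (one multiply plus one add).
    """
    return ops_per_mac / pj_per_mac

def total_energy_nj(pj_per_mac, rows, cols):
    """Energy in nanojoules for rows * cols MACs at a fixed cost per MAC."""
    return pj_per_mac * rows * cols / 1000.0

print(tops_per_watt(0.08))
print(total_energy_nj(0.08, 512, 512))
```

Under these assumptions, the 0.08 pJ/MAC bound corresponds to roughly 25 TOPS/W or better, and 512 × 512 MACs would cost at most about 21 nJ.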
A 16nm 96Kb Integer-floating-point Dual-mode Gain-cell-Computing-in-Memory Macro with 53.3-163.3TOPS/W and 23.2-121.2TFLOPS/W for AI-Edge Devices
2024
Advanced AI-edge chips require computational flexibility and high energy efficiency (EEF) with sufficient inference accuracy for a variety of applications. Floating-point (FP) numerical representation can be used for complex neural networks (NN) requiring a high inference accuracy; however, such an approach requires higher energy and more parameter storage than does a fixed-point integer (INT) numerical representation. Many compute-in-memory (CIM) designs have a good EEF for INT multiply-and-accumulate (MAC) operations; however, few support FP-MAC operations [1–3]. Implementing INT/FP dual-mode (DM) MAC operations presents challenges (Fig. 34.2.1), including (1) low area efficiency, since FP-MAC functions become idle during INT-MAC operations; (2) a high system-level latency, due to NN data update interruptions on small-capacity SRAM-CIM without concurrent write-and-compute functionality; and (3) high energy consumption, due to repeated system-to-CIM data transfers during computation. This work presents an INT/FP DM macro featuring (1) a DM zone-based input (IN) processing scheme (ZB-IPS) to eliminate subtraction in exponent (EXP) computation, while reusing the alignment circuit in INT mode to improve EEF and area efficiency (AEF); (2) a DM local-computing-cell (DM-LCC), which reuses the EXP addition as an adder tree stage for INT-MAC to improve AEF in INT mode; and (3) a stationary-based two-port gain-cell (GC) array (SB-TP-GCA) to support concurrent data updates and computation, while reducing system-to-CIM and internal data accesses to improve EEF and latency (T_MAC). A 16nm 96-Kb INT-FP DM GC-CIM macro with 4T GCs is fabricated to support FP-MAC with 64 accumulations (N_ACCU) for BF16-IN, BF16-W, and FP32-OUT as well as an INT-MAC with N_ACCU=128 for 8b-IN, 8b-W, and 23b-OUT. This CIM macro achieves a 163.3TOPS/W INT-MAC and a 91.2TFLOPS/W FP-MAC EEF.
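The 23b-OUT width quoted for the INT-MAC mode is consistent with a standard overflow-free accumulator sizing rule: the product of two signed operands needs the sum of their widths, and each doubling of the number of accumulated terms adds one more bit. The sketch below works through that arithmetic; it is a plausible reading of the figure, not a description of the macro's actual datapath, and the function names are hypothetical.

```python
import math

def accumulator_bits(in_bits, w_bits, n_accu):
    """Bits sufficient to sum n_accu signed in_bits x w_bits products without overflow."""
    product_bits = in_bits + w_bits                  # width of one signed product
    growth_bits = math.ceil(math.log2(n_accu))       # extra bits from accumulation
    return product_bits + growth_bits

def fits_signed(value, bits):
    """Check whether value is representable as a two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# Largest-magnitude case for 8-bit signed operands over 128 accumulations.
worst_case = (-128) * (-128) * 128

print(accumulator_bits(8, 8, 128))
print(fits_signed(worst_case, 23))
```

For 8-bit inputs, 8-bit weights, and 128 accumulations this gives 16 + 7 = 23 bits, and the worst-case sum of 2^21 fits comfortably within that range.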
A 22nm Nonvolatile AI-Edge Processor with 21.4TFLOPS/W using 47.25Mb Lossless-Compressed-Computing STT-MRAM Near-Memory-Compute Macro
2024
Battery-powered AI-edge processors require short wakeup-to-response latency (T_WR) and high energy efficiency (EF) for accurate real-time inference. This necessitates high-capacity nvCIM macros to store floating-point (FP) neural network (NN) data (e.g., BF16) and perform MAC operations with short latency (T_CD) and high EF. This paper presents an STT-MRAM nvCIM macro with lossless compression computation, a near-far aware readout scheme, and system-level CIM-friendly hybrid weight mapping. The proposed 22nm nonvolatile processor (nvProcessor) with 47.25-Mb STT-MRAM nvCIM achieved high EF (21.4TFLOPS/W), short T_WR (428.58μs), and high macro-level EF (27.6TFLOPS/W). Keywords: multiply-and-accumulate (MAC), nonvolatile memory (NVM), nonvolatile compute-in-memory (nvCIM)
MINOTAUR: An Edge Transformer Inference and Training Accelerator with 12 MBytes On-Chip Resistive RAM and Fine-Grained Spatiotemporal Power Gating
2024
MINOTAUR is the first energy-efficient edge SoC for inference and training of Transformers (and other networks, e.g., CNNs) with all memory on-chip. MINOTAUR leverages a configurable 8-bit posit-based accelerator, fine-grained spatiotemporal power gating enabled by on-chip resistive-RAM (RRAM) for dynamically adjustable bandwidth, and on-chip fine-tuning through full-network low-rank adaptation (LoRA). MINOTAUR achieves an average utilization of 93% and 74% and energy of 8.1 mJ and 8.2 mJ on ResNet-18 and MobileBERT Tiny inference respectively, and on-chip fine-tuning within 1.7% of offline training without RRAM-induced energy limitations or endurance degradations.
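For readers unfamiliar with low-rank adaptation, the sketch below shows the standard LoRA formulation in plain Python: the base weight matrix stays frozen, and only two small matrices `a` (rank × inputs) and `b` (outputs × rank) are trained, contributing a scaled low-rank update to the output. It illustrates the general technique only; MINOTAUR's on-chip implementation, its posit arithmetic, and its choice of rank are not described here, and all names and values are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha=1.0):
    """Compute w @ x + (alpha / r) * b @ (a @ x), where r is the adapter rank.

    w is the frozen base weight; a and b are the trainable low-rank factors.
    """
    r = len(a)                                   # rank = number of rows in a
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))           # two thin products, never the full b @ a
    return [y + (alpha / r) * d for y, d in zip(base, low_rank)]

w = [[1.0, 0.0], [0.0, 1.0]]                     # frozen 2x2 base weight
a = [[1.0, 1.0]]                                 # rank-1 factor, shape 1x2
b = [[1.0], [2.0]]                               # rank-1 factor, shape 2x1
print(lora_forward(w, a, b, [1.0, 2.0]))
```

With `b` initialized to zeros, the adapted layer reproduces the base layer exactly, which is why fine-tuning can start from the pretrained behavior while updating far fewer parameters than the full weight matrix.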
A 22nm 8Mb STT-MRAM Near-Memory-Computing Macro with 8b-Precision and 46.4-160.1TOPS/W for Edge-AI Devices
2023
Nonvolatile-memory-based computing in memory (nvCIM) [1–6] is ideal for low-power edge-AI devices requiring neural network (NN) parameter storage in the power-off mode, a rapid response to device wake-up, and high energy efficiency for MAC operations (EF_MAC). Current analog nvCIMs impose a tradeoff between the signal margin (SM) and the number of accumulations (N_ACU) per cycle versus EF_MAC and computing latency (T_CD−MAC). Near-memory computing (NMC), with high precision for inputs (IN), weights (W), and outputs (OUT), and a high N_ACU, is a trend to improve EF_MAC, T_CD−MAC, and accuracy. A prior STT-MRAM NMC [1] uses vertical-weight mapping (VWM) to improve the EF_MAC; however, further improvement is challenging due to (1) the large energy consumption in reading repetitious weight data across multiple inputs for a single NN layer; (2) a high bitstream toggling rate (BTR) in digital MAC circuits (DC_MAC), which reduces EF_MAC; and (3) a limited SM and memory readout latency (T_CD−M) for memories with a small R-ratio (e.g. STT-MRAM, see Fig. 33.2.1). In developing an STT-MRAM nvCIM macro, this work moves beyond circuit-level novelty by using system-software-circuit co-design. This work achieves a high EF_MAC, a short T_CD−M, a high read bandwidth (R-BW), a high IN-W-OUT precision, and a high N_ACU by using the following novel schemes: (1) a hardware-based weight-feature aware read (WFAR) to reduce weight accesses and improve EF_MAC with a minimal area overhead; (2) toggling-aware weight-tuning (TAWT) to obtain fine-tuned weights (W_FT) with a low BTR, which is based on VWM to enhance the EF_MAC of the...
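The abstract does not define how the bitstream toggling rate is measured. One simple way to think about it, assumed here purely for illustration, is the fraction of bit positions that flip between consecutive words fed to a digital circuit; fewer flips generally mean less dynamic switching energy. The function name and the 8-bit word width below are hypothetical.

```python
def toggling_rate(words, bits=8):
    """Fraction of bit positions that flip between consecutive words.

    Assumed definition for illustration; the paper's exact BTR metric may differ.
    """
    mask = (1 << bits) - 1
    # XOR exposes the bits that differ; count them for each adjacent pair.
    flips = sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))
    total = bits * (len(words) - 1)
    return flips / total if total > 0 else 0.0

print(toggling_rate([0b00001111, 0b00001110, 0b00001100]))
print(toggling_rate([0b00000000, 0b11111111, 0b00000000]))
```

Under this definition, a weight sequence whose neighbors differ in only a few bits scores close to 0, while a sequence that alternates between all-zeros and all-ones scores 1.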
A 4nm 6163-TOPS/W/b 4790−TOPS/mm²/b SRAM Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update
2023
The computational load for accurate AI workloads is moving from large server clusters to edge devices, thus enabling richer and more personalized AI applications. Compute-in-memory (CIM) is beneficial for edge-AI workloads, specifically ones that are MAC-intensive. However, realizing better power-performance-area (PPA) and high accuracy is a major challenge for practical CIM implementation. Recent work examined tradeoffs between MAC throughput, energy efficiency, and accuracy for analog-based CIM [1–3]. On the other hand, digital-CIMs (DCIM), which use small, distributed SRAM banks and a customized MAC unit, have demonstrated massively-parallel computation with no accuracy loss and a higher PPA with technology scaling [4]. In this paper, we introduce a 4-nm SRAM-based DCIM macro that handles variable 8/12b-integer weights and 8/12/16b-integer inputs in a single macro. The proposed 8-transistor 2b OAI (or-and-invert) cell achieves an 11% smaller combined bit cell and multiplier area, and supports ultra-low voltage operation, down to 0.32V. Furthermore, the signed-extended carry-look-ahead adder (signed-CLA) and an adder tree pipeline are introduced to boost throughput. Figure 7.4.1 shows the implementation of the bit cell structure, and a neural network accuracy comparison with various bit precisions. Since we targeted concurrent write and MAC operations (ping-pong between weight updates and MAC operations), the array needs to have an even number of rows: a classical approach is to use two 12T bit cells and a 2-input NOR. The 12T cell supports simultaneous read and write operations, as its read and write ports are independent. The 2-input NOR is used for bitwise multiplication with input activations (XIN) and weights (W). On the other hand, two 8T cells and an OAI are used in the proposed SRAM-based DCIM macro. In the proposed bitcell topology, the 8T bitcells act as memory data storage and row selection for the write operation. The OAI performs row selection and bitwise mu...
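The NOR-based bitwise multiplication mentioned in the abstract can be sketched in software. A 1-bit product is a logical AND, and by De Morgan's law a NOR gate fed with complemented inputs produces exactly that AND. The sketch assumes the input activation and weight bits arrive in complemented form, which the abstract does not state; it covers unsigned operands only and does not model the OAI cell, row selection, or the signed-CLA adder tree. All function names are hypothetical.

```python
def nor(a, b):
    """2-input NOR on single bits (0 or 1)."""
    return 1 - (a | b)

def bit_mul(x_bit, w_bit):
    """1-bit product via NOR of complemented inputs: NOR(~x, ~w) == x AND w."""
    return nor(1 - x_bit, 1 - w_bit)

def dot_bitwise(xs, ws, bits=8):
    """Unsigned dot product built only from 1-bit products, shifts, and adds."""
    total = 0
    for x, w in zip(xs, ws):
        for i in range(bits):                    # bit position in the input
            for j in range(bits):                # bit position in the weight
                partial = bit_mul((x >> i) & 1, (w >> j) & 1)
                total += partial << (i + j)      # weight the partial product by 2**(i+j)
    return total

print(dot_bitwise([3, 5], [7, 2]))
print(3 * 7 + 5 * 2)
```

Both printed values are 31, showing that accumulating shifted 1-bit products reproduces an ordinary multiply-accumulate.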
