
A new technical paper titled “Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling” was published by researchers at Uppsala University. Abstract: “Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of...


Discover SRAM PUF’s security benefits and how Synopsys combines it with OTP memory for advanced, secure key storage in embedded systems.


Compute-in-SRAM architectures offer a promising approach to achieving higher performance and energy efficiency across a range of data-intensive applications. However, prior evaluations have largely relied on simulators or small prototypes, limiting the understanding of their real-world potential. In this work, we present a comprehensive performance and energy characterization of a commercial compute-in-SRAM device, the GSI APU, under realistic workloads. We compare the GSI APU against established architectures, including CPUs and GPUs, to quantify its energy efficiency and performance potential. We introduce an analytical framework for general-purpose compute-in-SRAM devices that reveals fundamental optimization principles by modeling performance trade-offs, thereby guiding program optimizations. Exploiting the fine-grained parallelism of tightly integrated memory-compute architectures requires careful data management. We address this by proposing three optimizations: communication-aware reduction mapping, coalesced DMA, and broadcast-friendly data layouts. When applied to retrieval-augmented generation (RAG) over large corpora (10GB–200GB), these optimizations enable our compute-in-SRAM system to accelerate retrieval by 4.8×–6.6× over an optimized CPU baseline, improving end-to-end RAG latency by 1.1×–1.8×. The shared off-chip memory bandwidth is modeled using a simulated HBM, while all other components are measured on the real compute-in-SRAM device. Critically, this system matches the performance of an NVIDIA A6000 GPU for RAG while being significantly more energy-efficient (54.4×–117.9× reduction). These findings validate the viability of compute-in-SRAM for complex, real-world applications and provide guidance for advancing the technology.
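The analytical framework the abstract mentions models performance trade-offs to guide optimization. A minimal roofline-style sketch of that idea is below; the function name, the peak-FLOPS and bandwidth figures, and the byte counts are all illustrative assumptions, not numbers from the paper.

```python
# Hypothetical roofline-style model: a kernel's time is bounded by the slower
# of its compute and its off-chip traffic. All constants are made up for
# illustration; they are not the paper's measured values.

def kernel_time_s(flops, dram_bytes, peak_flops=25e12, dram_bw=400e9):
    """Execution time (s) as the max of compute-bound and bandwidth-bound time."""
    return max(flops / peak_flops, dram_bytes / dram_bw)

# A reduction that ships raw vectors off-chip vs. one reduced near the data
# (the spirit of communication-aware reduction mapping):
naive = kernel_time_s(flops=1e9, dram_bytes=8e9)     # bandwidth-bound
mapped = kernel_time_s(flops=1e9, dram_bytes=0.1e9)  # far less off-chip traffic
print(f"speedup from cutting off-chip traffic: {naive / mapped:.1f}x")
```

Such a model makes the optimization principle explicit: once a kernel is bandwidth-bound, only reducing off-chip bytes (not adding compute) improves it.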

Fully integrated dToF modules and iToF VCSEL illuminators for short range applications. Laser sources for long range LIDAR systems.


The idea isn't novel, but it presents major challenges. Tensordyne thinks it has solved them, and promises massive speed and efficiency gains as a result.


A new technical paper titled “Energy-Accuracy Trade-Offs in Massive MIMO Signal Detection Using SRAM-Based In-Memory Computing” was published by researchers at the University of Illinois at Urbana–Champaign. Abstract: “This paper investigates the use of SRAM-based in-memory computing (IMC) architectures for designing energy efficient and accurate signal detectors for massive multi-input multi-output (MIMO) systems. SRAM-based IMCs...


Traditional von Neumann architectures suffer from fundamental bottlenecks due to continuous data movement between memory and processing units, a challenge that worsens with technology scaling as electrical interconnect delays become more significant. These limitations impede the performance and energy efficiency required for modern data-intensive applications. In contrast, photonic in-memory computing presents a promising alternative by harnessing the advantages of light, enabling ultra-fast data propagation without length-dependent impedance, thereby significantly reducing computational latency and energy consumption. This work proposes a novel differential photonic static random access memory (pSRAM) bitcell that facilitates electro-optic data storage while enabling ultra-fast in-memory Boolean XOR computation. By employing cross-coupled microring resonators and differential photodiodes, the XOR-augmented pSRAM (X-pSRAM) bitcell achieves at least 10 GHz read, write, and compute operations entirely in the optical domain. Additionally, wavelength-division multiplexing (WDM) enables n-bit XOR computation in a single-shot operation, supporting massively parallel processing and enhanced computational efficiency. Validated on GlobalFoundries' 45SPCLO node, the X-pSRAM consumed 13.2 fJ energy per bit for XOR computation, representing a significant advancement toward next-generation optical computing with applications in cryptography, hyperdimensional computing, and neural networks.
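The bitcell described above computes XOR from complementary (differential) signal rails, and WDM lets one such operation run per wavelength channel in parallel. A toy logic-level sketch of that idea follows; it models only the Boolean behavior (no optics), and the function names are illustrative, not from the paper.

```python
# Toy model of differential XOR: each bit is held as a complementary pair
# (b, not-b), and combining the true rail of one operand with the complement
# rail of the other yields XOR. WDM is modeled as one independent XOR per
# wavelength channel. Purely illustrative; no photonics are simulated.

def differential_xor(a: int, b: int) -> int:
    a_t, a_c = a, 1 - a          # true and complement rails of operand A
    b_t, b_c = b, 1 - b          # true and complement rails of operand B
    # XOR = A·not(B) + not(A)·B, realized from the differential rails
    return (a_t & b_c) | (a_c & b_t)

def wdm_xor(word_a, word_b):
    """n-bit XOR 'in one shot': one differential XOR per wavelength channel."""
    return [differential_xor(a, b) for a, b in zip(word_a, word_b)]

print(wdm_xor([1, 0, 1, 1], [1, 1, 0, 1]))  # -> [0, 1, 1, 0]
```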


Table of Contents: Motivation; Optimization goal of GPUs; Key concepts of GPUs - software and...

First-Time Silicon Success Plummets
27 Mar 2025
semiengineering.com

The number of designs that are late is increasing. Rapidly rising complexity is the leading cause, but tools, training, and workflows need to improve.


The move to nanosheet transistors is a boon for SRAM


After persistent rumors refused to recede, AMD steps in with a clear explanation of why dual-CCD V-Cache doesn't exist.

Why AI language models choke on too much text
22 Dec 2024
arstechnica.com

Compute costs scale with the square of the input size. That’s not great.
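The quadratic scaling the blurb refers to can be made concrete with a one-liner; the helper name below is illustrative.

```python
# Back-of-the-envelope view of the article's point: self-attention compares
# every token with every other token, so work grows with the square of input size.

def attention_pairs(n_tokens: int) -> int:
    """Number of token-pair comparisons one attention layer performs."""
    return n_tokens * n_tokens

# Going from a 2K context to a 128K context multiplies the attention work by
# (128_000 / 2_000) ** 2:
print(attention_pairs(128_000) // attention_pairs(2_000))  # -> 4096
```

So a 64x longer context costs 4096x more attention compute, which is why long-context models need algorithmic workarounds rather than just bigger hardware.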


The CCD stack with 3D V-Cache on the AMD Ryzen 7 9800X3D is only 40-45µm in total, but the rest of the layers add up to a whopping 750µm.


Large Language Models (LLMs) have become a cornerstone of artificial intelligence, driving advancements in natural language processing and decision-making tasks. However, their extensive power demands, resulting from high computational overhead and frequent external memory access, significantly hinder their scalability and deployment, especially in energy-constrained environments such as edge devices. This escalates the cost of operation while also limiting accessibility to these LLMs, which therefore calls for energy-efficient approaches designed to handle billion-parameter models. Current approaches to reduce the computational and memory needs of LLMs are based either on general-purpose processors or on GPUs, with a combination of weight quantization and

TSMC Lifts the Curtain on Nanosheet Transistors
15 Dec 2024
spectrum.ieee.org

And Intel shows how far these devices could go

Is In-Memory Compute Still Alive?
12 Dec 2024
semiengineering.com

It hasn’t achieved commercial success, but there is still plenty of development happening; analog IMC is getting a second chance.


Notes from the Latent Space paper club. Follow along or start your own! - eugeneyan/llm-paper-notes


As awareness of environmental, social, and governance (ESG) issues grows, companies are adopting strategies for sustainable operations.


There are many chip partitioning and placement tradeoffs when comparing top-tier smartphone processor designs.


Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Llama 3). However, despite its success, FlashAttention has yet to take advantage of new capabilities in modern hardware, with FlashAttention-2 achieving only 35% utilization of theoretical max FLOPs on the H100 GPU. In this blogpost, we describe three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for FP8 low-precision.
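The memory-saving core of the FlashAttention line of work is tiled attention with an online softmax, so the full n-by-n score matrix is never materialized. The NumPy sketch below shows only that algorithmic skeleton; it deliberately ignores the Hopper-specific techniques the post describes (warp specialization, TMA, FP8), and the function name is ours, not the library's API.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Tiled attention with online softmax: stream K/V block by block,
    keeping a running row-wise max and normalizer so earlier partial
    results can be rescaled as later blocks arrive."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)               # running row-wise max of scores
    l = np.zeros(n)                       # running softmax normalizer
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j+block].T / np.sqrt(d)        # scores for this K block
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale previous partials
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j+block]
        m = m_new
    return out / l[:, None]
```

The result matches standard softmax attention exactly (up to floating-point error), while only ever holding an n-by-block slice of scores in memory.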


The impact of quantum algorithms on different cryptographic techniques and what can be done about it.

How to Put a Data Center in a Shoebox
16 May 2024
spectrum.ieee.org

Imec’s plan to use superconductors to shrink computers

SRAM Security Concerns Grow
10 May 2024
semiengineering.com

Volatile memory threat increases as chips are disaggregated into chiplets, making it easier to isolate memory and slow data degradation.

ASAP5: A predictive PDK for the 5 nm node
24 Feb 2024
sciencedirect.com

We present a predictive process design kit (PDK) for the 5 nm technology node, the ASAP5 PDK. ASAP5 is not related to a particular foundry and the ass…

Grokking Groq’s Groqness
22 Feb 2024
blocksandfiles.com

Startup Groq has developed a machine learning processor that it claims blows GPUs away in large language model workloads – 10x faster than an Nvidia GPU at 10 percent of the cost, and needing a tenth of the electricity. Update: clarified Groq's model compilation time and the time from first access to getting it up and running. […]


Faster than Nvidia? Dissecting the economics

Downfall Attacks
9 Aug 2023
downfall.page

Downfall attacks target a critical weakness found in billions of modern processors used in personal and cloud computers.


Atom-thin layers of oxygen in a chip’s silicon can make devices speedier and more reliable

ELI5: FlashAttention
24 Jul 2023
gordicaleksa.medium.com

A step-by-step explanation of how one of the most important MLSys breakthroughs works — in gory detail.


A technical paper titled “Benchmarking and modeling of analog and digital SRAM in-memory computing architectures” was published by researchers at KU Leuven. Abstract: “In-memory-computing is emerging as an efficient hardware paradigm for deep neural network accelerators at the edge, enabling to break the memory wall and exploit massive computational parallelism. Two design models have surged:...


tldr; techniques to speed up training and inference of LLMs to use large context window up to 100K input tokens during training and…

AMD’s RX 7600: Small RDNA 3 Appears
5 Jun 2023
chipsandcheese.com

Editor’s Note (6/14/2023): We have a new article that reevaluates the cache latency of Navi 31, so please refer to that article for some new latency data.


Floorplanning plays a crucial role in the physical design of an SoC and lays the foundation for an efficient and high-performance ASIC layout. In this article, we will discuss ten essential floorplanning commandments that physical design engineers can follow to ensure a correct-by-construction design. Design Partitioning refers to dividing a large


New memory technologies have emerged to push the boundaries of conventional computer storage.


After dipping this year, the growth of 300mm semiconductor manufacturing capacity is set to gain momentum.


Sponsored Feature: Training an AI model takes an enormous amount of compute capacity coupled with high bandwidth memory. Because the model training can be

Taking a look at the ReRAM state of play
16 Mar 2023
blocksandfiles.com

ReRAM startup Intrinsic Semiconductor Technologies has raised $9.73 million to expand its engineering team and bring its product to market.


Security IP cores are blocks that provide security features for integrated circuits (ICs) and systems-on-chips (SoCs). They include encryption, decryption, authentication, and key management functions that protect against unauthorized access or hacking. The IP core can be integrated into a larger IC design to provide enhanced security for applications such as IoT devices, payment systems,

TSMC Might Cut 3nm Prices to Lure AMD, Nvidia
14 Jan 2023
tomshardware.com

Industry sources say TSMC is considering lowering 3nm prices to stimulate interest from chip designers


List of awesome open source hardware tools, generators, and reusable designs - aolofsson/awesome-opensource-hardware

Safeguarding SRAMs From IP Theft (Best Paper Award)
18 Dec 2022
semiengineering.com

A technical paper titled “Beware of Discarding Used SRAMs: Information is Stored Permanently” was published by researchers at Auburn University. The paper won “Best Paper Award” at the IEEE International Conference on Physical Assurance and Inspection of Electronics (PAINE) Oct. 25-27 in Huntsville. Abstract: “Data recovery has long been a focus of the electronics industry...


The world's largest chip scales to new heights.

How Memory Design Optimizes System Performance
26 Sep 2022
semiengineering.com

Changes are steady in the memory hierarchy, but how and where that memory is accessed is having a big impact.

Ultimate Guide: Clock Tree Synthesis
24 Sep 2022
anysilicon.com

A vast majority of modern digital integrated circuits are synchronous designs. They rely on storage elements called registers or flip-flops, all of which change their stored data in a lockstep manner with respect to a control signal called the clock. In many ways, the clock signal is like blood flowing through the veins of a

DRAM Thermal Issues Reach Crisis Point
18 Jul 2022
semiengineering.com

Increased transistor density and utilization are creating memory performance issues.

CXL: Protocol for Heterogenous Datacenters
8 Jul 2022
fabricatedknowledge.com

Let's learn more about the world's most important manufactured product. Meaningful insight, timely analysis, and an occasional investment idea.


There are two types of packaging that represent the future of computing, and both will have validity in certain domains: Wafer scale integration and


EE Times Compares SRAM vs. DRAM, Common Issues With Each Type Of Memory, And Takes A Look At The Future For Computer Memory.

SweRV - An Annotated Deep Dive
10 Dec 2021
tomverbeure.github.io

This blog post is in response to a recent topic on the Parallella forum regarding Adapteva’s chip cost efficiency (GFLOPS/$): [forum discussion thread]. I had to be a little vague on some poi…


Explore Synopsys Blog for the latest insights and trends in EDA, IP, and Systems Design. Stay updated with expert articles and industry news.


I have written a lot of articles looking at leading…

How to make your own deep learning accelerator chip!
3 Dec 2021
towardsdatascience.com

Currently there are more than 100 companies all over the world building ASICs (application-specific integrated circuits) or SoCs (System…

Using Memory Differently To Boost Speed
3 Dec 2021
semiengineering.com

Getting data in and out of memory faster is adding some unexpected challenges.

DRAM Tradeoffs: Speed Vs. Energy
3 Dec 2021
semiengineering.com

Experts at the Table: Which type of DRAM is best for different applications, and why performance and power can vary so much.

TOPS, Memory, Throughput And Inference Efficiency
3 Dec 2021
semiengineering.com

Evaluate inference accelerators to find the best throughput for the money.

Next-Gen Chips Will Be Powered From Below
28 Aug 2021
spectrum.ieee.org

Buried interconnects will help save Moore's Law

Impact Of GAA Transistors At 3/2nm
17 Aug 2021
semiengineering.com

Some things will get better from a design perspective, while others will be worse.

Bumps Vs. Hybrid Bonding For Advanced Packaging
23 Jun 2021
semiengineering.com

New interconnects offer speed improvements, but tradeoffs include higher cost, complexity, and new manufacturing challenges.

AMD 3D Stacks SRAM Bumplessly
12 Jun 2021
fuse.wikichip.org

AMD recently unveiled 3D V-Cache, their first 3D-stacked technology-based product. Leapfrogging contemporary 3D bonding technologies, AMD jumped directly into advanced packaging with direct bonding and an order of magnitude higher wire density.

11 Ways To Reduce AI Energy Consumption
13 May 2021
semiengineering.com

Pushing AI to the edge requires new architectures, tools, and approaches.

SVT: Six Stacked Vertical Transistors
18 Mar 2021
semiengineering.com

An introduction to the SRAM cell architecture, with an assessment of its design and process challenges.

This is a list of semiconductor fabrication plants. A semiconductor fabrication plant is where integrated circuits (ICs), also known as microchips, are manufactured. They are operated either by integrated device manufacturers (IDMs), which design and manufacture ICs in-house and may also manufacture designs from design-only (fabless) firms, or by pure-play foundries, which manufacture designs from fabless companies and do not design their own ICs. Some pure-play foundries, like TSMC, offer IC design services, and others, like Samsung, design and manufacture ICs for customers while also designing, manufacturing, and selling their own ICs.

New And Innovative Supply Chain Threats Emerging
5 Nov 2020
semiengineering.com

But so are better approaches to deal with thorny counterfeiting issues.


Looking at a typical SoC design today it's likely to…

Intel Networking: Not Just A Bag Of Parts
16 Oct 2020
nextplatform.com

What is the hardest job at Intel, excepting whoever is in charge of the development of chip etching processes and the foundries that implement it? We

TSMC Details 5 nm
23 Mar 2020
fuse.wikichip.org

TSMC details its 5-nanometer node for mobile and HPC applications. The process features the industry's highest density transistors with a high-mobility channel and highest-density SRAM cells.

96-Core Processor Made of Chiplets
19 Feb 2020
spectrum.ieee.org

"...Google's people analytics experts had been studying how to onboard new hires effectively. They came back with a list of tips. Here’s the one that jumped…


A look at Cerebras Wafer-Scale Engine (WSE), a chip the size of a wafer, packing over 400K tiny AI cores using 1.2 trillion transistors on a half square foot of silicon.

Building An MRAM Array
17 Oct 2019
semiengineering.com

Why MRAM is so attractive.

New chips for machine intelligence
7 Oct 2019
jameswhanlon.com
AI Inference Memory System Tradeoffs
29 Aug 2019
semiengineering.com

TOPS isn't all you need to know about an inference chip.


Researchers at imec explore strategy that could make memory more efficient and pack in more transistors

Startup Runs AI in Novel SRAM
22 Jul 2019
eetimes.com

Areanna claims that a custom SRAM delivers 100 TOPS/W on deep learning, but it’s early days for the startup.


How the wrong benchmark can lead to incorrect conclusions.


The previous post in this series (excerpted from the Objective Analysis and Coughlin Associates Emerging Memory report) explained why emerging memories are necessary. Oddly enough, this series will explain bit selectors before defining all of the emerging memory technologies themselves. The reason is that the bit selector determines how small a bit cell can be.

Processing In Memory
6 Sep 2018
semiengineering.com

Growing volume of data and limited improvements in performance create new opportunities for approaches that never got off the ground.

Imperfect Silicon, Near-Perfect Security
7 Feb 2018
semiengineering.com

Physically unclonable functions (PUFs) seem tailor-made for IoT security.


Katherine Bourzac / IEEE Spectrum: The Northwest-AI-Hub, which is researching hybrid gain cell memory that combines DRAM's density with SRAM's speed, gets a $16.3M CHIPS Act grant via the US DOD

Hybrid Memory Designed to Cut AI Energy Use
24 Oct 2024
spectrum.ieee.org

Researchers developing dense, speedy hybrid gain cell memory recently got a boost from CHIPS Act funding


Fab Cost, WFE Implications, Backside Power Details