attention-llms

Google's new compression algorithm cut memory stocks within hours of publication

25 Mar 2026

thenextweb.com

Google published a research blog post on Tuesday about a new compression algorithm for AI models. Within hours, memory stocks were falling. Micron dropped 3 per cent, Western Digital ...

A Visual Guide to Attention Variants in Modern LLMs

22 Mar 2026

open.substack.com

From MHA and GQA to MLA, sparse attention, and hybrid architectures

Topic 33: Slim Attention, KArAt, XAttention and Multi-Token Attention Explained – What’s Really Changing in Transformers?

7 Apr 2025

huggingface.co

A Blog post by Ksenia Se on Hugging Face

Multi-Head Latent Attention and Other KV Cache Tricks

29 Jan 2025

pyspur.dev

How a Key-Value (KV) cache reduces Transformer inference time by trading memory for computation

On MLA

28 Jan 2025

planetbanatt.net

Why AI language models choke on too much text

22 Dec 2024

arstechnica.com

Compute costs scale with the square of the input size. That’s not great.

Understanding Positional Embeddings in Transformers: From Absolute to Rotar

20 Jul 2024

towardsdatascience.com

A deep dive into absolute, relative, and rotary positional embeddings with code examples

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-preci

14 Jul 2024

pytorch.org

Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Llama 3). However, despite its success, FlashAttention has yet to take advantage of new capabilities in modern hardware, with FlashAttention-2 achieving only 35% utilization of theoretical max FLOPs on the H100 GPU. In this blogpost, we describe three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for FP8 low-precision.

Deep Learning Architectures From CNN, RNN, GAN, and Transformers To Encoder

15 Apr 2024

marktechpost.com

Deep learning architectures have revolutionized the field of artificial intelligence, offering innovative solutions for complex problems across various domains, including computer vision, natural language processing, speech recognition, and generative models. This article explores some of the most influential deep learning architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), Transformers, and Encoder-Decoder architectures, highlighting their unique features, applications, and how they compare against each other. Convolutional Neural Networks (CNNs): CNNs are specialized deep neural networks for processing data with a grid-like topology, such as images. A CNN automatically detects the important features without any human supervision.

How Chain-of-Thought Reasoning Helps Neural Networks Compute

29 Mar 2024

quantamagazine.org

Large language models do better at solving problems when they show their work. Researchers are beginning to understand why.

Attention for Vision Transformers, Explained

29 Feb 2024

towardsdatascience.com

The Math and the Code Behind Attention Layers in Computer Vision

How do transformers work?+Design a Multi-class Sentiment Analysis for Custo

22 Feb 2024

open.substack.com

We will dive deep into how transformer models like BERT work (a non-mathematical explanation, of course!), followed by a system design that uses a transformer to build a Sentiment Analysis

Text vs. Images: Which Content Format is Effective?

5 Feb 2024

noupe.com

When we talk about using different ways to share information, it's like picking the one that fits what you need! Words, pictures, and mixes of both have

FlashSigmoid: A Hardware-Aware and Memory-Efficient Implementation of Sigmoid Attention Yielding a 1

14 Sep 2024

marktechpost.com

Large Language Models (LLMs) have gained significant prominence in modern machine learning, largely due to the attention mechanism. This mechanism employs a sequence-to-sequence mapping to construct context-aware token representations. Traditionally, attention relies on the softmax function (SoftmaxAttn) to generate token representations as data-dependent convex combinations of values. However, despite its widespread adoption and effectiveness, SoftmaxAttn faces several challenges. One key issue is the tendency of the softmax function to concentrate attention on a limited number of features, potentially overlooking other informative aspects of the input data. Also, the application of SoftmaxAttn necessitates a row-wise reduction along the input sequence length,

FlexAttention: The Flexibility of PyTorch with the Performance of FlashAtte

7 Aug 2024

pytorch.org

attention-llms — my Raindrop.io articles