We insist that large language models repeatedly translate their mathematical processes into words. There may be a better way.
Large language models (LLMs) can understand and generate human-like text by encoding vast knowledge repositories within their parameters. This capacity enables them to perform complex reasoning tasks, adapt to various applications, and interact effectively with humans. However, despite their remarkable achievements, researchers continue to investigate the mechanisms underlying the storage and utilization of knowledge in these systems, aiming to further enhance their efficiency and reliability. A key challenge in using large language models is their propensity to generate inaccurate, biased, or hallucinatory outputs. These problems arise from a limited understanding of how such models organize and access knowledge. Without clear insight into these mechanisms, such failures are difficult to diagnose and correct.
Speed up your LLM inference
Natural Language Processing (NLP) has rapidly evolved in the last few years, with transformers emerging as a game-changing innovation. Yet, there are still notable challenges when using NLP tools to develop applications for tasks like semantic search, question answering, or document embedding. One key issue has been the need for models that not only perform well but also work efficiently on a range of devices, especially those with limited computational resources, such as CPUs. Models tend to require substantial processing power to yield high accuracy, and this trade-off often leaves developers choosing between performance and practicality. Additionally, deploying large models on resource-constrained hardware introduces further practical hurdles.
A deep dive into absolute, relative, and rotary positional embeddings with code examples
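The article above walks through these embedding schemes with code; as a taste, here is a minimal sketch of two of them. It assumes the standard sinusoidal formulation from "Attention Is All You Need" and the usual RoPE pairwise rotation; it is illustrative NumPy, not the article's own code, and assumes an even embedding dimension.

```python
# Minimal sketch (not the article's code): sinusoidal absolute positional
# encoding and rotary position embedding (RoPE), standard formulations.
# Assumes d_model is even.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of absolute positional encodings."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cos
    return pe

def apply_rope(x: np.ndarray) -> np.ndarray:
    """Rotate each (even, odd) feature pair of x (seq_len, d_model) by a
    position-dependent angle, so Q.K dot products depend only on offsets."""
    seq_len, d_model = x.shape
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    theta = positions / np.power(10000.0, dims / d_model)  # (seq_len, d/2)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * np.cos(theta) - x_odd * np.sin(theta)
    rotated[:, 1::2] = x_even * np.sin(theta) + x_odd * np.cos(theta)
    return rotated
```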
The Sohu AI chip by Etched is a striking breakthrough, boasting the title of the fastest AI chip to date. Its design is a testament to cutting-edge innovation, aiming to redefine the possibilities within AI computations and applications. At the center of Sohu's exceptional performance is its advanced processing capability, which enables it to handle complex computations at unprecedented speeds. Processing over 500,000 tokens per second on the Llama 70B model, the Sohu chip enables products that are unattainable with traditional GPUs. An 8xSohu server can effectively replace 160 H100 GPUs, showcasing its remarkable efficiency.
This is a visual guide (scroll story) to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks.
Is Attention all you need? Mamba, a novel AI model based on State Space Models (SSMs), emerges as a formidable alternative to the widely used Transformer models, addressing their inefficiency in processing long sequences.
The Math and the Code Behind Position Embeddings in Vision Transformers
The Math and the Code Behind Attention Layers in Computer Vision
A Full Walk-Through of Vision Transformers in PyTorch
In the past few years we have seen the meteoric rise of dozens of foundation models of the Transformer family, all of which have memorable, sometimes funny, but not self-explanatory names.
Welcome to this beginner-friendly tutorial on sentiment analysis using Hugging Face's transformers library.
A quick-start guide to using open-source LLMs
This article provides a series of techniques that can lower memory consumption in PyTorch (when training vision transformers and LLMs) by approximately 20x without sacrificing modeling performance and prediction accuracy.
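The blurb above does not enumerate the techniques, so here is a minimal sketch of two widely used memory-reduction methods of this kind, automatic mixed precision and gradient checkpointing, using standard PyTorch APIs. The model and hyperparameters are illustrative assumptions, not the article's recipe, and a CUDA device is assumed.

```python
# Minimal sketch (assumed techniques, not the article's exact recipe):
# automatic mixed precision plus gradient checkpointing in PyTorch,
# two standard ways to cut training memory. Assumes a CUDA device.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients numerically stable

def forward_with_checkpointing(x):
    # Recompute each layer's activations in the backward pass
    # instead of storing them, trading compute for memory.
    for layer in model:
        x = checkpoint(layer, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, device="cuda", requires_grad=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = forward_with_checkpointing(x).mean()

scaler.scale(loss).backward()           # fp16-safe backward pass
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)   # frees gradient memory between steps
```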
1) Reinforcement Learning with Human Feedback (RLHF), 2) the RLHF paper, and 3) the transformer reinforcement learning framework.
Facebook’s parent company is inviting researchers to pore over and pick apart the flaws in its version of GPT-3
Transformer models are one of the most exciting new developments in machine learning. They were introduced in the paper Attention Is All You Need. Transformers can be used to write stories, essays, and poems, answer questions, translate between languages, chat with humans, and even pass exams that are hard for humans! But what are they? You'll be happy to know that the architecture of transformer models is not that complex; it is simply a concatenation of some very useful components, each of which plays a well-defined role.
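To make the "concatenation of components" idea concrete, here is a minimal sketch of one encoder block in PyTorch: self-attention, a feed-forward network, residual connections, and layer normalization. The class name and hyperparameters are illustrative, not taken from the article.

```python
# A minimal sketch of one Transformer encoder block, illustrating how the
# architecture is assembled from simple components. Illustrative only.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sublayer with residual connection and layer norm.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward sublayer, same residual pattern.
        return self.norm2(x + self.ff(x))

block = EncoderBlock()
tokens = torch.randn(2, 16, 512)   # (batch, sequence, embedding)
print(block(tokens).shape)         # torch.Size([2, 16, 512])
```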
A Cross-Section of the Most Relevant Literature To Get Up to Speed
I explain what is so unique about the RWKV language model.
The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art models.
Many new Transformer architecture improvements have been proposed since my last post on "The Transformer Family" about three years ago. Here I did a big refactoring and enrichment of that 2020 post: restructured the hierarchy of sections and improved many sections with more recent papers. Version 2.0 is a superset of the old version and about twice the length. Notation: $d$ denotes the model size / hidden state dimension / positional encoding size.
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch - lucidrains/vit-pytorch
Attention, Self-Attention, Multi-head Attention, Masked Multi-head Attention, Transformers, BERT, and GPT
Attention, Self-Attention, Multi-head Attention, and Transformers
State-of-the-art transformers for Ruby.
Summary: We have released GPT-J-6B, a 6B-parameter JAX-based (Mesh) Transformer LM (GitHub). GPT-J-6B performs nearly on par with 6.7B GPT-3 (or Curie) on various zero-shot downstream tasks. You can try it out…
This repository contains demos I made with the Transformers library by HuggingFace. - NielsRogge/Transformers-Tutorials
In the previous post, we looked at attention, a ubiquitous method in modern deep learning models and a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer, a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization; it is in fact Google Cloud's recommendation to use The Transformer as a reference model for their Cloud TPU offering. So let's try to break the model apart and look at how it functions. The Transformer was proposed in the paper Attention Is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package, and Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make them easier to understand for people without in-depth knowledge of the subject matter. Let's begin by looking at the model as a single black box: in a machine translation application, it would take a sentence in one language and output its translation in another.
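Since the post builds the whole model on top of attention, here is a minimal sketch of the core operation, scaled dot-product attention, in plain PyTorch. It implements the standard formula $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ from "Attention Is All You Need"; it is not code from the post itself.

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
# Standard formulation, not code from the post.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k: (seq_len, d_k); v: (seq_len, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v

q = torch.randn(5, 64)
k = torch.randn(5, 64)
v = torch.randn(5, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([5, 64])
```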
How this novel neural network architecture changes the way we analyze complex data types, and powers revolutionary models like GPT-3 and BERT.
An intuitive understanding of Transformers and how they are used in machine translation. After analyzing all the subcomponents one by one, such as self-attention and positional encodings, we explain the principles behind the Encoder and Decoder and why Transformers work so well.