cpus | Perfectly Awesome

Performance Analysis and Tuning on Modern CPUs

Notes on the Pentium's microcode circuitry

Most people think of machine instructions as the fundamental steps that a computer performs. However, many processors have another layer of ...

AMD's Strix Halo - Under the Hood

Hello you fine Internet folks,

The Pentium contains a complicated circuit to multiply by three

In 1993, Intel released the high-performance Pentium processor, the start of the long-running Pentium line. I've been examining the Pentium'...

The Road Ahead For Datacenter Compute Engines: The CPUs

It is often said that companies – particularly large companies with enormous IT budgets – do not buy products, they buy roadmaps. No one wants to go to

Pi in the Pentium: reverse-engineering the constants in its floating-point unit

Intel released the powerful Pentium processor in 1993, establishing a long-running brand of high-performance processors. The Pentium incl...

AMD Reveals Real Reason It Won't Put 3D V-Cache On Multiple CCDs

After persistent rumors refused to recede, AMD steps in with a clear explanation why dual-CCD V-Cache doesn't exist.

Intel's $475 million error: the silicon behind the Pentium division bug

In 1993, Intel released the high-performance Pentium processor, the start of the long-running Pentium line. The Pentium had many improvement...

AMD Ryzen 7 9800X3D Uses A Thick Dummy Silicon That Comprises 93% Of The CCD Stack And Has No Performance Purpose

The CCD stack with 3D V-Cache on the AMD Ryzen 7 9800X3D is only 40-45µm in total, but the rest of the layers add up to a whopping 750µm.

Antenna diodes in the Pentium processor

I was studying the silicon die of the Pentium processor and noticed some puzzling structures where signal lines were connected to the silico...

AMD Disables Zen 4's Loop Buffer

A loop buffer sits at a CPU's frontend, where it holds a small number of previously fetched instructions.

Why Intel Lost Its CPU Crown To AMD (And How Ryzen Changed The Game) - SlashGear

Intel was a dominant leader in the CPU market for the better part of a decade, but AMD has seen massive success in recent years thanks to its Ryzen chips.

Amazon’s Cloud Crisis: How AWS Will Lose The Future Of Computing

Nitro, Graviton, EFA, Inferentia, Trainium, Nvidia Cloud, Microsoft Azure, Google Cloud, Oracle Cloud, Handicapping Infrastructure, AI As A Service, Enterprise Automation, Meta, Coreweave, TCO

HPC Gets A Reconfigurable Dataflow Engine To Take On CPUs And GPUs

No matter how elegant and clever the design is for a compute engine, the difficulty and cost of moving existing – and sometimes very old – code from the

Intel’s Redwood Cove: Baby Steps are Still Steps

Intel’s Meteor Lake chip signaled a change in Intel’s mobile strategy, moving away from the monolithic designs that had characterized Intel’s client designs for more than a decade.

Intel Core Ultra 200 “Arrow Lake” Desktop CPU Specs Leak: Core Ultra 9 285K & Ultra 7 265K With 250W

Intel's Core Ultra 200 "Arrow Lake" Desktop CPU specifications have now been finalized and we are just a month away from the official launch.

Report: Intel Meteor Lake In Short Supply Due to Yield Issues, Intel Runnin

Surveying the Landscape of Smartphone Processors

There are many chip partitioning and placement tradeoffs when comparing top-tier smartphone processor designs.

Zen 5’s 2-Ahead Branch Predictor Unit: How a 30 Year Old Idea Allows for Ne

When I recently interviewed Mike Clark, he told me, “…you’ll see the actual foundational lift play out in the future on Zen 6, even though it was really Zen 5 that set the table for that.” And at that same Zen 5 architecture event, AMD’s Chief Technology Officer Mark Papermaster said, “Zen 5 is a ground-up redesign of the Zen architecture,” which has brought numerous and impactful changes to the design of the core.

Testing AMD’s Bergamo: Zen 4c Spam

Server CPUs have pushed high core counts for a long time, though the way they got high core counts has varied.

Flow claims it can 100x any CPU’s power with its companion chip and some el

A Finnish startup called Flow Computing is making one of the wildest claims ever heard in silicon engineering: by adding its proprietary companion chip,

AMD announces 3nm EPYC Turin with 192 cores and 384 threads — 5.4X faster t

192 cores, 384 threads, socket compatibility. What's not to like?

Half of Russian-Made Chips Are Defective

Anton Shilov reports via Tom's Hardware: About half of the processors packaged in Russia are defective. This has prompted Baikal Electronics, a Russian processor developer, to expand the number of packaging partners in the country, according to a report in Vedomosti, a Russian-language business dai...

“Downfall” bug affects years of Intel CPUs, can leak encryption keys and mo

Researchers also disclosed a separate bug called “Inception” for newer AMD CPUs.

Downfall Attacks

Downfall attacks target a critical weakness found in billions of modern processors used in personal and cloud computers.

Calculate Computational Efficiency of Deep Learning Models with FLOPs and M

In this article we will learn the definitions of FLOPs and MACs, how they differ, and how to calculate them using Python packages.
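
As a rough illustration of the two metrics, the sketch below counts them by hand for a single fully connected layer. It is not taken from the linked article: the function name `dense_layer_cost` and the counting convention (one MAC counted as two FLOPs, plus one add per output for the bias) are assumptions, and profiling packages may count differently.

```python
def dense_layer_cost(in_features: int, out_features: int, bias: bool = True):
    """Return (macs, flops) for one forward pass of a dense layer on one input vector.

    Hypothetical helper for illustration only; conventions vary between tools.
    """
    # Every output element needs one multiply-accumulate per input element.
    macs = in_features * out_features
    # Assumed convention: one MAC = one multiply + one add = two FLOPs.
    flops = 2 * macs
    if bias:
        # Adding the bias costs one extra addition per output element.
        flops += out_features
    return macs, flops


if __name__ == "__main__":
    macs, flops = dense_layer_cost(784, 128)
    print(f"MACs: {macs}, FLOPs: {flops}")
```

Under these assumptions a 784-to-128 layer costs 100352 MACs and 200832 FLOPs per input vector.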

Gallery of Processor Cache Effects

Kryo: Qualcomm’s Last In-House Mobile Core

CPU design is hard.

AI Server Cost Analysis – Memory Is The Biggest Loser

Micron $MU looks very weak in AI

Intel Is All-In on Back-Side Power Delivery

The company’s PowerVia interconnect tech demonstrated a 6 percent performance gain

The Case for Running AI on CPUs Isn’t Dead Yet

GPUs may dominate, but CPUs could be perfect for smaller AI models

ARM’s Cortex A53: Tiny But Important

Tech enthusiasts probably know ARM as a company that develops reasonably performant CPU architectures with a focus on power efficiency.

Intel CPU Die Topology - by Jason Rahman - Delayed Branch

Over the past 10-15 years, per-core throughput increases have slowed, and in response CPU designers have scaled up core counts and socket counts to continue increasing performance across generations of new CPU models.

Reverse-engineering the division microcode in the Intel 8086 processor

While programmers today take division for granted, most microprocessors in the 1970s could only add and subtract — division required a sl...
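
To make the idea of building division out of simpler operations concrete, here is a minimal sketch of textbook shift-and-subtract (restoring) division for unsigned integers. It is a generic illustration, not the 8086's actual microcode; the function name and the default 16-bit width are assumptions.

```python
def shift_subtract_divide(dividend: int, divisor: int, bits: int = 16):
    """Unsigned restoring division using only shifts, compares and subtraction.

    Generic textbook algorithm for illustration; returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << bits):
        raise ValueError("dividend does not fit in the given bit width")

    quotient = 0
    remainder = 0
    # Walk the dividend from its most significant bit to its least.
    for i in range(bits - 1, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # If the divisor fits, subtract it and record a 1 in the quotient.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


if __name__ == "__main__":
    print(shift_subtract_divide(1000, 7))
```

Each loop iteration produces one quotient bit, so the cost grows with the operand width rather than with the size of the quotient.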

Hacker News

While microprocessors are used in various applications, they are precluded from the use in high-energy physics applications due to the harsh radiation present. To overcome this limitation a...

Interconnect Under the Spotlight as Core Counts Accelerate - SemiWiki

In the march to more capable, faster, smaller, and lower…

Why AI Inference Will Remain Largely On The CPU

Sponsored Feature: Training an AI model takes an enormous amount of compute capacity coupled with high bandwidth memory. Because the model training can be

RISC-V In The Datacenter Is No Risky Proposition

It was only a matter of time, perhaps, but the skyrocketing costs of designing chips are colliding with the ever-increasing need for performance,

China’s flagship CPU designer puts on a brave face amid US sanctions

Chinese chip designer Loongson, which has tried to reduce the country’s reliance on Intel and AMD, is developing its own general-purpose GPU despite being added to a US trade blacklist.

Make your sklearn models up to 100 times faster

How to considerably reduce training time by changing only 1 line of code

The basics of Arm64 Assembly - by Diego Crespo

Just one instruction at a time!

Google increases server life to six years, will save billions of dollars

While Meta ups to five years

More CPU Cores Isn’t Always Better, Especially In HPC

If a few cores are good, then a lot of cores ought to be better. But when it comes to HPC this isn’t always the case, despite what the Top500 ranking –

Inside the 8086 processor's instruction prefetch circuitry

The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and se...

https://squeaky.ai/blog/development/how-switching-to-aws-graviton-slashed-our-infrastructure-bill-by-35-percent

Four Cornerstones of CPU Performance.

Monolithic Sapphire Rapids

Absolute Reticle Limit

Performance Benefits of Using Huge Pages for Code. | Easyperf

Use One Big Server - Speculative Branches

New working speculative execution attack sends Intel and AMD scrambling

Both companies are rolling out mitigations, but they add overhead of 12 to 28 percent.

A new vulnerability in Intel and AMD CPUs lets hackers steal encryption keys

Hertzbleed attack targets power-conservation feature found on virtually all modern CPUs.

HPC-oriented Latency Numbers Every Programmer Should Know

HPC-oriented Latency Numbers Every Programmer Should Know · GitHub

5.5 mm in 1.25 nanoseconds | Random ASCII – tech blog of Bruce Dawson

In 2004 I was working for Microsoft in the Xbox group, and a new console was being created. I got a copy of the detailed descriptions of the Xbox 360 CPU and I read it through multiple times and su…

Top-Down performance analysis methodology. | Easyperf

You Won’t Believe This One Weird CPU Instruction! - Vaibhav Sagar

larsbrinkhoff/awesome-cpus: All CPU and MCU documentation in one place

All CPU and MCU documentation in one place.

How FPGAs Can Take On GPUs And Knights Landing

Nallatech doesn't make FPGAs, but it does have several decades of experience turning FPGAs into devices and systems that companies can deploy to solve

Analysis and Comparison of Performance and Power Consumption of Neural Netw

In this work, we analyze the performance of neural networks on a variety of heterogenous platforms. We strive to find the best platform in terms of raw benchmark performance, performance per watt a…

The Story of the IBM Pentium 4 64-bit CPU | The CPU Shack Museum

Intel Processor Trace Part4. Better profiling experience. | Easyperf

Asplos 17 cam 📄

Intel Processor Trace Part3. Analyzing performance glitches. | Easyperf

Domain-Specific Hardware Accelerators – Communications of the ACM

Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection

Twitter

Precise timing of machine code with Linux perf. | Easyperf

Getting started with bare-metal assembly

Microarchitecture 📄

For Better Computing, Liberate CPUs From Garbage Collection

An accelerator unit improves both the performance and efficiency of a system by taking over one simple task

openhwgroup/cva6: The CORE-V CVA6 is an Application class 6-stage RISC-V CPU capable of booting Linux

The CORE-V CVA6 is an Application class 6-stage RISC-V CPU capable of booting Linux - openhwgroup/cva6

cirosantilli/x86-bare-metal-examples: Dozens of minimal operating systems to learn x86 system programming

Dozens of minimal operating systems to learn x86 system programming. Tested on Ubuntu 17.10 host in QEMU 2.10 and real hardware. Userland cheat at: https://github.com/cirosantilli/linux-kernel-module-cheat#userland-assembly ARM baremetal setup at: https://github.com/cirosantilli/linux-kernel-module-cheat#baremetal-setup

Eecs 2016 1 📄

To reinvent the processor

A detailed, critical, technical essay on upcoming CPU architectures.

1804 📄

Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X

Software optimization manuals for C++ and assembly code. Intel and AMD x86 microprocessors. Windows, Linux, BSD, Mac OS X. 16, 32 and 64 bit systems. Detailed descriptions of microarchitectures.

Estimating branch probability using Intel LBR feature. | Easyperf

How does Tomasulo's algorithm work

Does an AMD Chiplet Have a Core Count Limit?

Did IBM Just Preview The Future of Caches?

Gutting Decades Of Architecture To Build A New Kind Of Processor

There are some features in any architecture that are essential, foundational, and non-negotiable. Right up to the moment that some clever architect shows

AMD 3D Stacks SRAM Bumplessly

AMD recently unveiled 3D V-Cache, their first 3D-stacked technology-based product. Leapfrogging contemporary 3D bonding technologies, AMD jumped directly into advanced packaging with direct bonding and an order of magnitude higher wire density.

Intel: AMD Threat Is Finished (NASDAQ:INTC)

Although competition from Arm is increasing, AMD remains Intel’s biggest competitor, as concerns of losing market share weigh on Intel’s valuation.

New 'Morpheus' CPU Design Defeats Hundreds of Hackers in DARPA Tests - Extr

A new CPU design has won accolades for defeating the hacking efforts of nearly 600 experts during a DARPA challenge. Its approach could help us close side-channel vulnerabilities in the future.

Apple's M1 Positioning Mocks the Entire x86 Business Model

Apple is positioning its M1 quite differently from any CPU Intel or AMD has released. The long-term impact on the PC market could be significant.

Sapphire Rapids CPU Leak: Up to 56 Cores, 64GB of Onboard HBM2

Sapphire Rapids, Intel's next server architecture, looks like a large leap over the just-launched Ice Lake SP.

CPU-based algorithm trains deep neural nets up to 15 times faster than top

Rice University computer scientists have demonstrated artificial intelligence (AI) software that runs on commodity processors and trains deep neural networks 15 times faster than platforms based on graphics ...

The MIPS R4000, part 9: Stupid branch delay slot tricks

Technically legal, but strange.

Deep Dive Into AMD’s “Milan” Epyc 7003 Architecture

The “Milan” Epyc 7003 processors, the third generation of AMD’s revitalized server CPUs, are now in the field, and we await the entry of the “Ice Lake”

The Rise, Fall and Revival of AMD (2020)

AMD is one of the oldest designers of large scale microprocessors and has been the subject of polarizing debate among technology enthusiasts for nearly 50 years. Its...

The Third Time Charm Of AMD’s Milan Epyc Processors

With every passing year, as AMD first talked about its plans to re-enter the server processor arena and give Intel some real, much needed, and very direct

AMD's Reliance on TSMC Isn't Harming the Company's Growth Prospects - Extre

Intel Processor Names, Numbers and Generation List

Understanding Intel® processor names and numbers helps identify the best laptop, desktop, or mobile device CPU for your computing needs.

Threadripper 3990X: The Quest To Compile 1 BILLION Lines Of C++ On 64 Cores

Intel Core i9-10850K Review: The Real Intel Flagship

Use `nproc` and not grep /proc/cpuinfo

There’s something really quite subtle about how the nproc utility from GNU coreutils works. If you look at the man page, it’s even the very first sentence: Print the number of processin…
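
The distinction between the CPUs installed in a machine and the CPUs a process may actually run on can be sketched in Python. This is an illustration under assumptions, not code from the linked post: the helper name `cpu_counts` is hypothetical, and `os.sched_getaffinity` exists only on some platforms (notably Linux), so the sketch falls back to the installed count elsewhere.

```python
import os


def cpu_counts():
    """Return (installed, available) logical CPU counts for this process.

    Hypothetical helper: 'installed' is what the system reports overall,
    'available' respects the CPU affinity mask of the current process.
    """
    installed = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Size of the set of CPUs this process is allowed to run on.
        available = len(os.sched_getaffinity(0))
    else:
        # No affinity API on this platform; assume all CPUs are usable.
        available = installed
    return installed, available


if __name__ == "__main__":
    installed, available = cpu_counts()
    print(f"installed: {installed}, available to this process: {available}")
```

Running the script under `taskset -c 0 python3 script.py` on Linux should report a single available CPU while the installed count stays unchanged, which is the kind of difference that counting entries in /proc/cpuinfo does not reveal.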

How Debuggers Work: Getting and Setting x86 Registers

In this article, I would like to briefly describe the methods used to dump and restore the different kinds of registers on 32-bit and 64-bit x86 CPUs. The first part will focus on General Purpose Registers, Debug Registers and Floating-Point Registers up to the XMM registers provided by the SSE extension. I will explain how their values can be obtained via the ptrace(2) interface.

TamaGo - bare metal Go for ARM SoCs

TamaGo - ARM/RISC-V bare metal Go.

Performance analysis & tuning on modern CPU - DEV Community

They say "performance is king"... It was true a decade ago and it certainly is now. With more and mor...

An ex-ARM engineer critiques RISC-V

RISC-V.md · GitHub

Optimizing 128-bit Division

When it comes to hashing, sometimes 64 bits are not enough, for example, because of the birthday paradox: the hacker can iterate through $2^{32}$ random entities and it can be proven that wi…
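
The birthday-paradox concern can be quantified with the standard approximation p ≈ 1 − exp(−n(n−1)/(2N)) for n random samples drawn from N possible hash values. The sketch below is an illustration, not code from the linked post; the function name is an assumption, and the formula is the usual approximation rather than an exact probability.

```python
import math


def collision_probability(samples: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision among random hashes.

    Uses the standard birthday approximation; hypothetical helper name.
    """
    space = 2.0 ** hash_bits
    exponent = samples * (samples - 1) / (2.0 * space)
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    n = 2 ** 32
    print(f"64-bit hash:  {collision_probability(n, 64):.4f}")
    print(f"128-bit hash: {collision_probability(n, 128):.3e}")
```

With 2^32 samples the approximation gives roughly a 39% chance of a collision for a 64-bit hash, while for a 128-bit hash the chance is negligible, which is the usual motivation for wider hashes and hence for 128-bit arithmetic.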

x86 instruction listings

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

Fujitsu Begins Shipping Supercomputer Fugaku - Fujitsu Global

Fujitsu Limited today announced that it began shipping the supercomputer Fugaku, which is jointly developed with RIKEN and promoted by the Ministry of Education, Culture, Sports, Science and Technology with the aim of starting general operation between 2021 and 2022. The first machine to be shipped this time is one of the computer units of Fugaku, a supercomputer system comprised of over 150,000 high-performance CPUs connected together. Fujitsu will continue to deliver the units to RIKEN Center for Computational Science in Kobe, Japan, for installation and tuning.

Let’s Build a Simple Interpreter. Part 18: Executing Procedure Calls

Do the best you can until you know better. Then when you know better, do better. ― Maya Angelou

Undocumented CPU Behavior: Analyzing Undocumented Opcodes on Intel x86-64 📄

96-Core Processor Made of Chiplets

64 Core Threadripper 3990X CPU Review

bhive/README.md at master · ithemal/bhive

It’s a Cascade of 14nm CPUs: AnandTech’s Intel Core i9-10980XE Review

Counting FLOPS and other CPU counters in Python

On the Linux command line it is fairly easy to use the perf command to measure the number of floating point operations (or other performance metrics); see, for example, this old blog post. With this approach, however, it is not easy to get a fine-grained view of the different stages of processing within a single process. In this short note I describe how the python-papi package can be used to measure the FLOP requirements of any section of a Python program.

Intel 10th Gen Comet Lake CPU Family Leaks With 10-Core, 20-Thread LGA-1200

Recent leaks may shed some light on Intel's upcoming mainstream desktop Comet Lake-S CPUs.

Intel Tremont CPU Microarchitecture: Power Efficient, High-Performance x86

Intel's Tremont CPU microarchitecture will be the foundation of next-generation, low-power processors that target a wide variety of products across

Intel's new Atom Microarchitecture: The Tremont Core in Lakefield

RISC-V from scratch 2: Hardware layouts, linker scripts, and C runtimes

A post describing how C programs get to the main function. Devicetree layouts, linker scripts, minimal C runtimes, GDB and QEMU, basic RISC-V assembly, and other topics are reviewed along the way.

“Essentials of Garbage Collectors” full course is now available

Course overview: Memory leaks and dangling pointers are the main issues of manual memory management. You delete a parent node in a linked list, forgetting to delete all its children first -- and your

Avoiding Instruction Cache Misses

Excessive instruction cache misses are the kind of performance problem that appears only in larger codebases. In this article, I describe some ideas on how to deal with this issue.

Amp

PrincetonUniversity/accelerator-wall: Repository for the tools and non-comm

Repository for the tools and non-commercial data used for the "Accelerator wall" paper. - PrincetonUniversity/accelerator-wall

Benchmarking Amazon's ARM Graviton CPU With EC2's A1 Instances

Monday night Amazon announced the new 'A1' instance type for the Elastic Compute Cloud (EC2) that is powered by their own 'Graviton' ARMv8 processors.

ARM is the NNSA’s New Secret Weapon

It might have been difficult to see this happening a mere few years ago, but the National Nuclear Security Administration and one of its key

Google’s new Bristlecone processor brings it one step closer to quantum sup

Every major tech company is looking at quantum computers as the next big breakthrough in computing. Teams at Google, Microsoft, Intel, IBM and various

CPU DB - Looking At 40 Years of Processor Improvements | A complete databas