benchmarks

15 Mar 2026

cemrehancavdar.com

Python loses every public benchmark by 21-875x. I took the exact problems people use to dunk on Python and climbed every rung of the optimization ladder -- from CPython version upgrades to Rust. Real numbers, real code, real effort costs.

CanIRun.ai — Can your machine run AI models?

14 Mar 2026

canirun.ai

Detect your hardware and find out which AI models you can run locally. GPU, CPU, and RAM analysis in your browser.

Cloud VM benchmarks 2026: performance / price

8 Mar 2026

devblog.ecuadors.net

nvidia's B200 does 1,760 TFLOPS on a square GEMM and 4 TFLOPS on a skinny one. the shape of your matmul is the single biggest variable in GPU performance. but why? same hardware, same operation… | Emilio Andere

23 Feb 2026

linkedin.com

nvidia's B200 does 1,760 TFLOPS on a square GEMM and 4 TFLOPS on a skinny one.

Alibaba unveils new Qwen3.5 model for 'agentic AI era'

16 Feb 2026

reuters.com

Alibaba on Monday unveiled a new artificial intelligence model Qwen 3.5 designed to execute complex tasks independently, with big improvements in performance and cost that the Chinese tech giant claims beat major U.S. rival models on several benchmarks.

Autonomous Driving: Assessment Of YOLO Algorithms (RMIT et al.)

11 Feb 2026

semiengineering.com

A new technical paper titled “Advances in You Only Look Once (YOLO) algorithms for lane and object detection in autonomous vehicles” was published by RMIT University, Kyungpook National University, Deakin University and the RCA Robotics Laboratory, Royal College of Art. Abstract “Ensuring the safety and efficiency of Autonomous Vehicles (AVs) necessitates highly accurate perception, especially...

'Observational memory' cuts AI agent costs 10x and outscores RAG on long-context benchmarks

11 Feb 2026

venturebeat.com

As AI agents move into production, teams are rethinking memory. Mastra’s open-source observational memory shows how stable context can outperform RAG while cutting token costs.

LLM Research Papers: The 2025 List (July to December)

2 Jan 2026

magazine.sebastianraschka.com

In June, I shared a bonus article with my curated and bookmarked research paper lists to the paid subscribers who make this Substack possible.

LLM Research Papers: The 2025 List (July to December)

31 Dec 2025

sebastianraschka.com

A curated list of LLM research papers from July–December 2025, organized by reasoning models, inference-time scaling, architectures, training efficiency, and...

GraphQL: the enterprise honeymoon is over

15 Dec 2025

johnjames.blog

A production-tested take on GraphQL in enterprise systems, why the honeymoon phase fades, and when its complexity outweighs the benefits.

The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call

11 Dec 2025

venturebeat.com

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

4 Nov 2025

marktechpost.com

Compare the top 7 large language models and systems for coding in 2025. Discover which ones excel for software engineering tasks.

The Big LLM Architecture Comparison

28 Oct 2025

open.substack.com

From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

5 Oct 2025

magazine.sebastianraschka.com

Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples

319 Top Ecommerce Sites Ranked by User Experience Performance – Baymard

20 Aug 2025

baymard.com

See the ranked UX performance of the 319 leading ecommerce sites in the US and Europe. The chart summarizes 100,000+ UX performance ratings.

AI Model & API Providers Analysis | Artificial Analysis

14 Aug 2025

artificialanalysis.ai

Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.

A Technical Roadmap to Context Engineering in LLMs: Mechanisms, Benchmarks, and Open Challenges

3 Aug 2025

marktechpost.com

Context engineering for large language models—frameworks, architectures, and strategies to optimize AI reasoning, and scalability

Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots

1 May 2025

lmarena.ai

The Leaderboard Illusion

30 Apr 2025

arxiv.org

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field

QOpt / QOBLIB - Quantum Optimization Benchmarking Library · GitLab

26 Apr 2025

git.zib.de

This is the ZIB GitLab instance

The Man Out to Prove How Dumb AI Still Is

10 Apr 2025

theatlantic.com

François Chollet has constructed the ultimate test for the bots.

A look at the ARC-AGI exam designed by French computer scientist François Chollet to show the gulf between AI models' memorized answers and “fluid intelligence”

7 Apr 2025

techmeme.com

By Matteo Wong / The Atlantic. View the full context on Techmeme.

LLM Benchmarking: Fundamental Concepts | NVIDIA Technical Blog

2 Apr 2025

developer.nvidia.com

The past few years have witnessed the rise in popularity of generative AI and large language models (LLMs), as part of a broad AI revolution.

What is METEOR score? - Dataconomy

2 Apr 2025

dataconomy.com

METEOR Score is a metric used to evaluate the quality of machine translation based on precision, recall, word alignment, and linguistic flexibility.

LLM Leaderboard

21 Feb 2025

artificialanalysis.ai

Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others.

aidanmclaughlin/AidanBench: Aidan Bench attempts to measure in LLMs.

1 Feb 2025

github.com

Aidan Bench attempts to measure in LLMs. - aidanmclaughlin/AidanBench

AMD’s Instinct MI300X AI Throughput Performance & Latency Improved By 7x Wi

3 Jul 2024

wccftech.com

Nscale has tested AMD's flagship Instinct MI300X AI accelerator utilizing the GEMM tuning framework, achieving 7x faster performance.

Key Metrics for Evaluating Large Language Models (LLMs)

20 Jun 2024

marktechpost.com

Evaluating Large Language Models (LLMs) is a challenging problem in language modeling, as real-world problems are complex and variable. Conventional benchmarks frequently fail to fully represent LLMs' all-encompassing performance. A recent LinkedIn post has emphasized a number of important measures that are essential to comprehend how well new models function, which are as follows. MixEval Achieving a balance between thorough user inquiries and effective grading systems is necessary for evaluating LLMs. Conventional standards based on ground truth and LLM-as-judge benchmarks encounter difficulties such as biases in grading and possible contamination over time. MixEval solves these problems by combining real-world user

Nvidia Conquers Latest AI Tests

13 Jun 2024

spectrum.ieee.org

GPU maker tops new MLPerf benchmarks on graph neural nets and LLM fine-tuning

Open LLM Leaderboard : a Hugging Face Space by HuggingFaceH4

25 Sep 2023

huggingface.co

Track, rank and evaluate open LLMs and chatbots

Asking 60+ LLMs a set of 20 questions

25 Sep 2023

benchmarks.llmonitor.com

Human-readable benchmarks of 60+ open-source and proprietary LLMs.

A Deep Dive Into LLaMA, Falcon, Llama 2 and Their Remarkable Fine-Tuned Ver

28 Jul 2023

turingpost.com

Exploring the Development of the 3 Leading Open LLMs and Their Chatbot Derivatives

Calculate Computational Efficiency of Deep Learning Models with FLOPs and M

24 Jul 2023

kdnuggets.com

In this article we will learn about its definition, differences and how to calculate FLOPs and MACs using Python packages.

Towards a Benchmarking Suite for Kernel Tuners

19 Mar 2023

hgpu.org

As computing systems become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many archite…

How to Choose the Best Machine Learning Technique: Comparison Table

23 Nov 2022

datasciencecentral.com

7 Steps to Benchmark Your Product’s UX

17 Jan 2022

nngroup.com

Benchmark your UX by first determining appropriate metrics and a study methodology. Then track these metrics across different releases of your product by running studies that follow the same established methodology.

12 Twitter Sentiment Analysis Algorithms Compared

1 Feb 2021

towardsdatascience.com

12 sentiment analysis algorithms were compared on the accuracy of tweet classification. The fastText deep learning system was the winner.

Benchmark functions | BenchmarkFcns

1 Jan 2021

benchmarkfcns.xyz

AI and Efficiency

19 May 2020

openai.com

We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months. Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11x cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.

Machine Learning Benchmarking: You’re Doing It Wrong

1 Apr 2020

blog.bigml.com

I’m not going to bury the lede: Most machine learning benchmarks are bad. And not just kinda-sorta nit-picky bad, but catastrophically and fundamentally flawed. TL;DR: Please, for the love of sta…

Benchmark Work | Benchmarks MLCommons

30 Mar 2020

mlperf.org

MLCommons ML benchmarks help balance the benefits and risks of AI through quantitative tools that guide responsible AI development.

Inference Results – MLPerf

7 Nov 2019

mlperf.org

MLCommons ML benchmarks help balance the benefits and risks of AI through quantitative tools that guide responsible AI development.

One Deep Learning Benchmark to Rule Them All

30 Aug 2018

nextplatform.com

Over the last few years we have detailed the explosion in new machine learning systems with the influx of novel architectures from deep learning chip

Start With Gradient Boosting, Results from Comparing 13 Algorithms on 165 D

1 Apr 2018

machinelearningmastery.com

Which machine learning algorithm should you use? It is a central question in applied machine learning. In a recent paper by Randal Olson and others, they attempt to answer it and give you a guide for algorithms and parameters to try on your problem first, before spot checking a broader suite of algorithms. In this post, you will discover a…

machine learning benchmarks - Google Search

27 Dec 2017

google.com

Why Python is Slow: Looking Under the Hood | Pythonic Perambulations

25 Oct 2017

jakevdp.github.io

benchmarks — my Raindrop.io articles