prob-stats

In this article, we'll explore 10 Python one-liners that showcase the progression from basic statistical tests to sophisticated analyses.

Let’s break down seven statistical concepts that even seasoned machine learning engineers often trip over — and why getting them right matters more than you think.

So, if you’ve ever asked, “How long until X happens?” and wanted to back that up with solid data, you’re in the right place.

This article explains its features, installation, and how to use it with examples.

Let's clarify this important statistical pattern and understand its significance in analysis.

The Poisson distribution is a discrete probability distribution that expresses the likelihood of a specific number of events occurring within a fixed time or space interval.
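As a minimal sketch of that definition (the rate λ = 2 events per interval is a made-up example), the Poisson pmf P(X = k) = λ^k e^(−λ) / k! can be computed directly:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam**k * exp(-lam) / factorial(k)

# Probability of exactly 3 events when the average rate is 2 per interval
p3 = poisson_pmf(3, 2.0)  # ~0.1804

# The pmf sums to 1 over the non-negative integers (checked on a long prefix)
total = sum(poisson_pmf(k, 2.0) for k in range(100))
```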

Hellinger Distance is a statistical measure that quantifies the similarity between two probability distributions, useful in data analysis and machine learning applications.
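For discrete distributions, the Hellinger distance is H(P, Q) = (1/√2)·‖√P − √Q‖₂, which is 0 for identical distributions and 1 for distributions with disjoint support. A minimal sketch (the example vectors are made up):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (arrays summing to 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

d_same = hellinger([0.5, 0.5], [0.5, 0.5])      # 0.0: identical distributions
d_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # 1.0: disjoint support
```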

The Kaplan-Meier survival curve estimator is one of the most cited ideas in science

Leverage helps us identify observations that could significantly influence our regression results, even in ways that aren't immediately obvious.

Heteroscedasticity might seem like just the opposite of homoscedasticity, but understanding it in its own right is crucial for any data analyst.

Homoscedasticity stands as one of those statistical terms that can seem unnecessarily complex at first glance.

Still, the exact application, challenges and shortcuts related to this technique are relatively unknown, and that’s what this article seeks to change.

Statistical power might be the most frequently misunderstood concept in research design. While many researchers know they "need" it, few truly understand…

Degrees of freedom (df) represent the number of independent values in a dataset that are free to vary while still satisfying the statistical constraints imposed on the data.
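A quick illustration of the constraint idea, with made-up numbers: deviations from the sample mean always sum to zero, so only n − 1 of them are free to vary, which is why the sample variance divides by n − 1:

```python
import numpy as np

x = np.array([4.0, 7.0, 9.0, 12.0])
deviations = x - x.mean()

# The constraint: deviations from the sample mean sum to zero,
# leaving n - 1 independent values -> df = n - 1
n = len(x)
var_unbiased = np.sum(deviations**2) / (n - 1)  # same as np.var(x, ddof=1)
```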

Have a look at Statology's most popular articles of the year!

Masked diffusion has emerged as a promising alternative to autoregressive models for the generative modeling of discrete data. Despite its potential, existing research has been constrained by overly complex model formulations and ambiguous relationships between different theoretical perspectives. These limitations have resulted in suboptimal parameterization and training objectives, often requiring ad hoc adjustments to address inherent challenges. Diffusion models have rapidly evolved since their inception, becoming a dominant approach for generative media and achieving state-of-the-art performance across various domains. Significant breakthroughs have been particularly notable in image synthesis, audio generation, and video production, demonstrating the transformative potential of this innovative…

The calculators in this guide follow a natural progression, starting with basic probabilities and z-scores, moving through hypothesis testing tools, and concluding with specialized distributions.

A brief numerical and graphical check on a 3Blue1Brown video.

Let's have a closer look at EVT, its applications, and its challenges.

In this tutorial, we’ll learn more about the Cauchy distribution, visualize its probability density function, and learn how to use it in Python.
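As a small sketch of the Python side, the standard Cauchy density f(x) = 1/(π(1 + x²)) can be checked against SciPy's implementation; note the distribution has no finite mean or variance, so the median is the natural location summary:

```python
import numpy as np
from scipy import stats

xs = np.linspace(-5, 5, 201)
pdf = stats.cauchy.pdf(xs)                # SciPy's standard Cauchy density
manual = 1.0 / (np.pi * (1.0 + xs**2))    # f(x) = 1 / (pi * (1 + x^2))

# Heavy tails mean no finite mean/variance; quantiles are used instead
median = stats.cauchy.median()            # 0 for the standard Cauchy
```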

A deep-dive into how and why Statsmodels uses numerical optimization instead of closed-form formulas

A Discussion of the go-to methods for 5 Types of A/B Metrics

While Fisher’s exact test is a convenient tool for A/B testing, the idea and results of the test are often hard to grasp and difficult to…

An in-depth guide to the state-of-the-art variance reduction technique for A/B tests

How not to fail your online controlled experimentation

Discover why Welch’s t-Test is the go-to method for accurate statistical comparison, even when variances differ.
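A minimal sketch with simulated groups (the group sizes and variances are made-up illustrations): passing `equal_var=False` to `scipy.stats.ttest_ind` selects Welch's t-test, which does not assume equal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)  # group A: unit variance
b = rng.normal(loc=0.8, scale=3.0, size=40)  # group B: much larger variance

# equal_var=False -> Welch's t-test (unequal-variance t-test)
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
```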

The most common test of statistical significance originated from the Guinness brewery. Here’s how it works

Your Guide to Choosing the Right Test for Your Data

Understanding probability distributions in data science is crucial. They provide a mathematical framework for modeling and analyzing data.

Discover the origins, theory and uses behind the famous t-distribution

Total Productive Maintenance

How a Scientist Playing Solitaire Forever Changed the Game of Statistics

On the method’s advantages and disadvantages, demonstrated with the synthdid package in R

A primer on the math, logic, and pragmatic application of JS Divergence — including how it is best used in drift monitoring

Think of your last card game – euchre, poker, Go Fish, whatever it was. Would you believe every time you gave the whole deck a proper shuffle, you were holding a sequence of cards which had never…

Complete Guideline to Find Dependencies among Categorical Variables with Chi-Square Test
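As a sketch of the mechanics (the 2×3 contingency table below is hypothetical), `scipy.stats.chi2_contingency` returns the statistic, p-value, degrees of freedom, and the expected counts under independence:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: one categorical variable, columns: another (made-up counts)
table = np.array([[30, 20, 10],
                  [20, 30, 40]])

chi2, p, dof, expected = chi2_contingency(table)
# dof = (rows - 1) * (cols - 1) = 1 * 2 = 2
```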

How to use the bootstrap for tests or confidence intervals and why it works
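A minimal percentile-bootstrap sketch (the sample size and replicate count are arbitrary choices): resample with replacement, recompute the statistic on each resample, and take empirical percentiles as the confidence interval:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)  # skewed data, true mean = 2

# Percentile bootstrap for the mean: resample with replacement 2000 times
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # ~95% CI for the mean
```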

Articles, software, calculators, and opinions.

How to select control variables for causal inference using Directed Acyclic Graphs

Understanding the model’s output plays a major role in business-driven projects, and Sobol can help

A concise explanation of confidence intervals.

An introduction to the Student’s t-distribution and the Student’s t-test

How to calculate μ & σ, the mode, mean, median & variance

Top 30 Probability and Statistics Interview Questions that can help you sharpen your skills to ace your data science interview

P-values & ice cream consumption simply explained.

How to Model random Processes with Distributions and Fit them to Observational Data

In 1996, Appleton, French, and Vanderpump conducted an experiment to study the effect of smoking on a sample of people. The study was conducted over twenty years and included 1314 English women…

Free books, lectures, blogs, papers, and more for a causal inference crash course

The smart trick to choose the right model

Strip charts are extremely useful to make heads or tails of dozens (and up to several hundred) of time series over very long periods of…

Part one of a series on how we will measure discrepancies in Airbnb guest acceptance rates using anonymized perceived demographic data.

A Log-Normal Distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.
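A quick numerical check of that definition (the parameters μ = 1, σ = 0.5 are chosen arbitrarily): taking the log of log-normal draws should recover the underlying normal parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# X is log-normal with parameters (mu, sigma)  <=>  log(X) ~ Normal(mu, sigma)
mu, sigma = 1.0, 0.5
x = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

logs = np.log(x)
est_mu, est_sigma = logs.mean(), logs.std()  # should be close to (1.0, 0.5)
```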

This article introduces important subcategories of inferential statistical tests and discusses descriptive statistical measures related to the normal distribution.

Intuitive explanations for the Normal, Bernoulli, Binomial, Poisson, Exponential, Gamma and Weibull distribution — with Python example code

Normality tests to check if a variable or sample has a normal distribution.
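As a brief sketch (the sample sizes are arbitrary), the Shapiro-Wilk test in SciPy illustrates the idea: the null hypothesis is that the sample comes from a normal distribution, so a tiny p-value rejects normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
gaussian = rng.normal(size=500)
uniform = rng.uniform(size=500)

# H0: "the sample is drawn from a normal distribution"
_, p_gauss = stats.shapiro(gaussian)  # typically a large p: no evidence against H0
_, p_unif = stats.shapiro(uniform)    # tiny p: normality clearly rejected
```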

It is often desirable to quantify the difference between probability distributions for a given random variable. This occurs frequently in machine learning, when we may be interested in calculating the difference between an actual and observed probability distribution. This can be achieved using techniques from information theory, such as the Kullback-Leibler Divergence (KL divergence), or relative entropy, and the Jensen-Shannon…
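As a minimal sketch for discrete distributions (the example vectors are made up), both divergences can be computed directly; KL is asymmetric, while Jensen-Shannon is symmetric and bounded by log 2 in nats:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture M = (P + Q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]
# kl_divergence(p, q) != kl_divergence(q, p), but JS is symmetric
```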

Assumptions, relationships, simulations, and so on

With so many types of data distributions to consider in data science, how do you choose the right one to model your data? This guide will overview the most important distributions you should be familiar with in your work.

An easy-to-follow guide next time you forget how to do power calculations

At first glance, the Lognormal, Weibull, and Gamma distributions look quite similar to each other. Selecting between the three models is “quite difficult” (Siswadi & Quesenberry), and the problem of testing which distribution is the best fit for data has been studied by a multitude of researchers. If all the models fit the data fairly well… Read more: Lognormal, Weibull, and Gamma distribution in One Picture.

Data Science & Machine Learning Interviews

From Controlling for Testing Errors to Selecting the Right Test

A Must Know Topic For Data Scientists Who Work With Data And Statistical Inference

I got a customer ticket the other day that said they weren’t worried about response time because “New Relic is showing our average response time to be sub 20...

One of the most important concepts for Data Scientists

By Winnifred Louis, Associate Professor, Social Psychology, The University of Queensland, and Cassandra Chapman, PhD Candidate in Social Psychology, The University of Queensland. Here are the 7 sins: assuming small differences are meaningful; equating statistical significance with real-world significance; neglecting to look at extremes; trusting coincidence; getting causation backwards; forgetting to consider outside causes; and deceptive graphs. Read more: The seven deadly sins of statistical misinterpretation, and how to avoid them.

Solving real-world problems with probabilities

This post is about various evaluation metrics and how and when to use them.

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC. The full series is accessible here. Read more: 29 Statistical Concepts Explained in Simple English – Part 3.

Quick-reference guide to the 17 statistical hypothesis tests that you need in applied machine learning, with sample code in Python. Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project. In this post, you will discover a cheat sheet for the…

This post explains how those numbers were derived in the hope that they can be more interpretable for your future endeavors.

During my years as a Consultant Data Scientist I have received many requests from my clients to provide frequency distribution

Kurtosis and Skewness are very close relatives of the “data normalized statistical moment” family – Kurtosis being the fourth and Skewness the third moment, and yet they are often used to detect very different phenomena in data. At the same time, it is typically advisable to analyse the outputs of…
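A small numerical sketch of those moments (the distributions are chosen only for illustration): for normal data both skewness and excess kurtosis are near zero, while exponential data is strongly right-skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal = rng.normal(size=100_000)
skewed = rng.exponential(size=100_000)  # right-skewed distribution

# Skewness = 3rd standardized moment; scipy's kurtosis is "excess"
# kurtosis (4th standardized moment minus 3), so both are ~0 for normal data
s_norm, k_norm = stats.skew(normal), stats.kurtosis(normal)
s_exp = stats.skew(skewed)              # exponential: theoretical skewness = 2
```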

I recently ran across this bloom filter post by Michael Schmatz and it inspired me to write about a neat variation on the bloom filter that…

Standard Deviation is one of the most underrated statistical tools out there. It’s an extremely useful metric that most people know how to calculate but very few know how to use effectively.

This article was written by Sunil Ray. Sunil is a Business Analytics and Intelligence professional with deep experience. Introduction – the difference in mindset: I started my career as an MIS professional and then made my way into Business Intelligence (BI), followed by Business Analytics, statistical modeling and, more recently, machine learning. Each of these transitions has required… Read more: Your Guide to Master Hypothesis Testing in Statistics.

The author presents 10 statistical techniques which a data scientist needs to master. Build up your toolbox of data science tools by having a look at this great overview post.

In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independent tests bearing upon the same overall hypothesis (H0).
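A minimal sketch with hypothetical p-values: `scipy.stats.combine_pvalues` implements Fisher's method, computing −2·Σ log pᵢ and referring it to a χ² distribution with 2k degrees of freedom under H0:

```python
from scipy import stats

# k = 3 independent tests of the same overall H0, each with its own p-value
p_values = [0.04, 0.10, 0.08]

# Fisher's statistic: -2 * sum(log p_i) ~ chi-squared with 2k = 6 df under H0
stat, combined_p = stats.combine_pvalues(p_values, method='fisher')
```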

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal…