
In this article, we'll explore 10 Python one-liners that showcase the progression from basic statistical tests to sophisticated analyses.
Let’s break down seven statistical concepts that even seasoned machine learning engineers often trip over — and why getting them right matters more than you think.
So, if you’ve ever asked, “How long until X happens?” and wanted to back that up with solid data, you’re in the right place.
This article explains its features, installation, and how to use it with examples.
Let's clarify this important statistical pattern and understand its significance in analysis.
The Poisson distribution is a discrete probability distribution that expresses the likelihood of a specific number of events occurring within a fixed time or space interval.
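As a quick illustration of that definition, here is a minimal scipy.stats sketch (the event rate is a made-up example):

```python
from scipy.stats import poisson

# Hypothetical example: a help desk receives on average 4 calls per hour
rate = 4

print(poisson.pmf(6, mu=rate))  # P(exactly 6 calls in an hour) ~ 0.104
print(poisson.cdf(2, mu=rate))  # P(at most 2 calls in an hour)  ~ 0.238
```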
Hellinger Distance is a statistical measure that quantifies the similarity between two probability distributions, useful in data analysis and machine learning applications.
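For discrete distributions the formula is H(P, Q) = (1/√2)·‖√P − √Q‖₂, which translates to a few lines of NumPy; the two distributions below are made up for illustration:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = [0.1, 0.4, 0.5]
q = [0.2, 0.3, 0.5]
print(hellinger(p, q))
```

The result is bounded between 0 (identical distributions) and 1 (disjoint support), which is what makes it convenient as a similarity or drift metric.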
The Kaplan-Meier survival curve estimator is one of the most cited ideas in science.
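A minimal sketch of the estimator in practice, using the third-party lifelines package with made-up durations and censoring flags (not code from the article itself):

```python
from lifelines import KaplanMeierFitter

# Made-up survival data: durations and event flags (1 = observed, 0 = censored)
T = [5, 6, 6, 2, 4, 4, 7, 8, 3, 9]
E = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)
print(kmf.survival_function_)   # step-function estimate of S(t)
```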
Leverage helps us identify observations that could significantly influence our regression results, even in ways that aren't immediately obvious.
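Concretely, the leverage of observation i is the i-th diagonal entry of the hat matrix H = X(XᵀX)⁻¹Xᵀ; a small NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
x[0] = 8.0                                  # one extreme predictor value
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
leverage = np.diag(H)
print(leverage[0], leverage.mean())         # outlier's leverage vs. the average
```

The average leverage is always p/n (here 2/20 = 0.1), so points far above that deserve a closer look.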
Heteroscedasticity might seem like just the opposite of homoscedasticity, but understanding it in its own right is crucial for any data analyst.
Homoscedasticity stands as one of those statistical terms that can seem unnecessarily complex at first glance.
Still, the exact application, challenges and shortcuts related to this technique are relatively unknown, and that’s what this article seeks to change.
Statistical power might be the most frequently misunderstood concept in research design. While many researchers know they "need" it, few truly understand what it actually measures.
Degrees of freedom (df) represent the number of independent values in a dataset that are free to vary while still satisfying the statistical constraints imposed on the data.
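The classic example is the sample variance: once the sample mean is fixed, only n − 1 residuals can vary freely, which is exactly what NumPy's ddof argument encodes:

```python
import numpy as np

data = np.array([4.0, 7.0, 6.0, 3.0, 5.0])

# The residuals around the mean must sum to zero (one constraint),
# so dividing by n - 1 rather than n gives the unbiased sample variance.
print(np.var(data))           # ddof=0: divide by n
print(np.var(data, ddof=1))   # ddof=1: divide by n - 1
```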
Have a look at Statology's most popular articles of the year!
Masked diffusion has emerged as a promising alternative to autoregressive models for the generative modeling of discrete data. Despite its potential, existing research has been constrained by overly complex model formulations and ambiguous relationships between different theoretical perspectives. These limitations have resulted in suboptimal parameterization and training objectives, often requiring ad hoc adjustments to address inherent challenges.
Diffusion models have rapidly evolved since their inception, becoming a dominant approach for generative media and achieving state-of-the-art performance across various domains. Significant breakthroughs have been particularly notable in image synthesis, audio generation, and video production, demonstrating the transformative potential of this innovative approach.
The calculators in this guide follow a natural progression, starting with basic probabilities and z-scores, moving through hypothesis testing tools, and concluding with specialized distributions.
Let's have a closer look at EVT, its applications, and its challenges.
In this tutorial, we’ll learn more about the Cauchy distribution, visualize its probability density function, and learn how to use it in Python.
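As a taste of what that looks like, a minimal scipy.stats sketch (location and scale chosen arbitrarily):

```python
import numpy as np
from scipy.stats import cauchy

print(cauchy.pdf(np.linspace(-10, 10, 5), loc=0, scale=1))  # heavy-tailed density

samples = cauchy.rvs(loc=0, scale=1, size=10_000, random_state=0)
# The Cauchy distribution has no finite mean, so the sample mean never settles;
# the median is the meaningful location estimate.
print(np.median(samples))
```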
A deep-dive into how and why Statsmodels uses numerical optimization instead of closed-form formulas
A Discussion of the go-to methods for 5 Types of A/B Metrics
While Fisher’s exact test is a convenient tool for A/B testing, the idea and results of the test are often hard to grasp and difficult to…
An in-depth guide to the state-of-the-art variance reduction technique for A/B tests
Discover why Welch’s t-Test is the go-to method for accurate statistical comparison, even when variances differ.
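In SciPy, Welch's test is simply the equal_var=False variant of the two-sample t-test; a minimal sketch on synthetic groups with unequal variances:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(loc=10.0, scale=1.0, size=30)   # low-variance group
b = rng.normal(loc=10.8, scale=4.0, size=30)   # high-variance group

# equal_var=False selects Welch's t-test (no pooled-variance assumption)
t_stat, p_value = ttest_ind(a, b, equal_var=False)
print(t_stat, p_value)
```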
The most common test of statistical significance originated from the Guinness brewery. Here’s how it works
Understanding probability distributions in data science is crucial. They provide a mathematical framework for modeling and analyzing data.
Discover the origins, theory and uses behind the famous t-distribution
How a Scientist Playing Solitaire Forever Changed the Game of Statistics
On the method’s advantages and disadvantages, demonstrated with the synthdid package in R
A primer on the math, logic, and pragmatic application of JS Divergence — including how it is best used in drift monitoring
Think of your last card game – euchre, poker, Go Fish, whatever it was. Would you believe every time you gave the whole deck a proper shuffle, you were holding a sequence of cards which had never existed before?
Complete Guideline to Find Dependencies among Categorical Variables with Chi-Square Test
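The workhorse here is scipy.stats.chi2_contingency; the contingency table below is made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical table: rows = device type, columns = churned / retained
table = np.array([[120,  90],
                  [ 60, 130]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)   # a small p-value suggests the variables are dependent
```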
How to use the bootstrap for tests or confidence intervals and why it works
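A minimal percentile-bootstrap sketch for a confidence interval on the mean (the skewed sample is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)   # synthetic skewed sample

# Resample with replacement many times and recompute the statistic each time
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.3f}, {hi:.3f})")
```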
How to select control variables for causal inference using Directed Acyclic Graphs
Understanding the model’s output plays a major role in business-driven projects, and Sobol can help
An introduction to the Student’s t-distribution and the Student’s t-test
How to calculate μ & σ, the mode, mean, median & variance
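In code, all five reduce to one-liners (note that scipy.stats.mode's return type has changed across SciPy versions; recent releases return scalars):

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(np.mean(data))            # mean (mu)
print(np.median(data))          # median
print(stats.mode(data).mode)    # mode (most frequent value)
print(np.var(data, ddof=1))     # sample variance
print(np.std(data, ddof=1))     # sample standard deviation (sigma)
```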
Top 30 Probability and Statistics Interview Questions that can help you sharpen your skills to ace your data science interview
How to Model Random Processes with Distributions and Fit Them to Observational Data
In 1996, Appleton, French, and Vanderpump conducted an experiment to study the effect of smoking on a sample of people. The study was conducted over twenty years and included 1314 English women…
Free books, lectures, blogs, papers, and more for a causal inference crash course
Strip charts are extremely useful to make heads or tails of dozens (and up to several hundred) of time series over very long periods of…
Part one of a series on how we will measure discrepancies in Airbnb guest acceptance rates using anonymized perceived demographic data.
A Log-Normal Distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.
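That definition checks out directly in code: if X is log-normal, then log(X) should look normal with the same parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5

x = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# By definition, the log of a log-normal variable is normally distributed
logs = np.log(x)
print(logs.mean(), logs.std())   # close to mu = 1.0 and sigma = 0.5
```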
This article introduces important subcategories of inferential statistical tests and discusses descriptive statistical measures related to the normal distribution.
Intuitive explanations for the Normal, Bernoulli, Binomial, Poisson, Exponential, Gamma and Weibull distribution — with Python example code
Normality tests to check if a variable or sample has a normal distribution.
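Two of the most common ones are a couple of lines in scipy.stats (the sample below is synthetic, so both tests should fail to reject):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(size=200)

print(stats.shapiro(sample))      # Shapiro-Wilk test
print(stats.normaltest(sample))   # D'Agostino-Pearson K^2 test
```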
It is often desirable to quantify the difference between probability distributions for a given random variable. This occurs frequently in machine learning, when we may be interested in calculating the difference between an actual and observed probability distribution. This can be achieved using techniques from information theory, such as the Kullback-Leibler Divergence (KL divergence), or relative entropy, and the Jensen-Shannon…
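Both divergences are one-liners in SciPy; the two distributions below are made up, and note that SciPy's jensenshannon returns the square root of the JS divergence:

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

p = np.array([0.10, 0.40, 0.50])   # "actual" distribution (made up)
q = np.array([0.80, 0.15, 0.05])   # "observed" distribution (made up)

print(entropy(p, q))               # KL divergence D(P || Q); asymmetric
print(jensenshannon(p, q) ** 2)    # JS divergence (symmetric, bounded)
```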
Assumptions, relationships, simulations, and so on
With so many types of data distributions to consider in data science, how do you choose the right one to model your data? This guide will overview the most important distributions you should be familiar with in your work.
An easy-to-follow guide next time you forget how to do power calculations
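For the common two-sample t-test case, statsmodels can solve for any one of power, sample size, or effect size; a sketch with conventional (assumed) inputs:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power, two-sided two-sample t-test
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n)   # roughly 64 per group
```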
At first glance, the Lognormal, Weibull, and Gamma distributions look quite similar to each other. Selecting between the three models is “quite difficult” (Siswadi & Quesenberry), and the problem of testing which distribution best fits the data has been studied by a multitude of researchers. If all the models fit the data fairly well…
From Controlling for Testing Errors to Selecting the Right Test
A Must Know Topic For Data Scientists Who Work With Data And Statistical Inference
I got a customer ticket the other day that said they weren’t worried about response time because “New Relic is showing our average response time to be sub 20...
One of the most important concepts for Data Scientists
By Winnifred Louis, Associate Professor, Social Psychology, The University of Queensland, and Cassandra Chapman, PhD Candidate in Social Psychology, The University of Queensland. Here are the 7 sins: assuming small differences are meaningful; equating statistical significance with real-world significance; neglecting to look at extremes; trusting coincidence; getting causation backwards; forgetting to consider outside causes; and deceptive graphs.
Data Science, Machine Learning, AI & Analytics
Solving real-world problems with probabilities
This post is about various evaluation metrics and how and when to use them.
This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. This installment: 29 Statistical Concepts Explained in Simple English – Part 3.
Quick-reference guide to the 17 statistical hypothesis tests that you need in applied machine learning, with sample code in Python. Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project. In this post, you will discover a cheat sheet for the…
This post explains how those numbers were derived in the hope that they can be more interpretable for your future endeavors.
During my years as a Consultant Data Scientist I have received many requests from my clients to provide frequency distributions
Kurtosis and Skewness are very close relatives of the “standardized statistical moment” family – Kurtosis being the fourth and Skewness the third moment – and yet they are often used to detect very different phenomena in data. At the same time, it is typically advisable to analyse the outputs of…
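Both moments are one-liners in scipy.stats; an exponential sample is used below because it is both right-skewed and heavy-tailed:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
x = rng.exponential(size=10_000)   # right-skewed, heavy-tailed sample

print(skew(x))       # third standardized moment; ~2 for an exponential
print(kurtosis(x))   # excess kurtosis (0 for a normal); ~6 for an exponential
```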
I recently ran across this bloom filter post by Michael Schmatz and it inspired me to write about a neat variation on the bloom filter that…
Standard Deviation is one of the most underrated statistical tools out there. It’s an extremely useful metric that most people know how to calculate but very few know how to use effectively.
This article was written by Sunil Ray. Sunil is a Business Analytics and Intelligence professional with deep experience. Introduction – the difference in mindset: I started my career as an MIS professional and then made my way into Business Intelligence (BI), followed by Business Analytics, statistical modeling, and more recently machine learning. Each of these transitions has required…
The author presents 10 statistical techniques which a data scientist needs to master. Build up your toolbox of data science tools by having a look at this great overview post.
In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independent tests bearing upon the same overall hypothesis (H0).
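SciPy exposes Fisher's method directly via combine_pvalues; the p-values below are made up:

```python
from scipy.stats import combine_pvalues

# p-values from several independent tests of the same null hypothesis
pvals = [0.08, 0.12, 0.05, 0.20]

stat, p = combine_pvalues(pvals, method='fisher')
print(stat, p)   # combined evidence can be stronger than any single test
```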
Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal…