The science of collecting, analyzing, interpreting, and presenting data. Statistics and probability provide the essential toolkit for making informed decisions under uncertainty, uncovering patterns in data, and drawing reliable conclusions from evidence.
Introduction to Statistics
Statistics is the branch of mathematics devoted to the collection, organization, analysis, interpretation, and presentation of data. In an age overflowing with information, statistics empowers us to transform raw numbers into actionable knowledge — from predicting election outcomes and testing new medications to optimizing business strategies and training machine-learning models.
Statistics is broadly divided into two major areas:
Descriptive Statistics: Summarizing and describing the features of a dataset — through measures like averages, variability, and visual displays.
Inferential Statistics: Drawing conclusions about a larger population based on a smaller sample, using probability theory as the bridge.
A population is the entire set of individuals or observations you want to study. A sample is a subset drawn from that population. Since studying entire populations is usually impractical, statistics relies on samples to make inferences about the whole.
Key terminology you will encounter throughout this page:
Variable: A characteristic that can take on different values (e.g., height, test score, color).
Quantitative variable: Numerical data — either discrete (countable values) or continuous (measured values on a continuum).
Parameter: A numerical summary of a population (e.g., population mean μ).
Statistic: A numerical summary of a sample (e.g., sample mean x̄).
Statistics is foundational to nearly every empirical discipline — medicine, psychology, economics, physics, engineering, ecology, sports analytics, and data science all depend on statistical reasoning daily.
Descriptive Statistics
Descriptive statistics condense large datasets into a handful of meaningful numbers. We typically describe data using measures of central tendency (where is the center?) and measures of spread (how spread out are the values?).
Measures of Central Tendency
Mean (Arithmetic Average)
The mean is the sum of all values divided by the number of values. It is the most common measure of center and is sensitive to every data point — including outliers.
Population mean: μ = (Σ xᵢ) / N
Sample mean: x̄ = (Σ xᵢ) / n
Median
The median is the middle value when data is sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values. The median is resistant to outliers, making it preferable for skewed distributions.
Example: Find the median
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
Step 1: Sort the data: 3, 4, 5, 5, 6, 7, 8, 9
Step 2: Since n = 8 (even), the median is the average of the 4th and 5th values.
Step 3: Median = (5 + 6) / 2 = 5.5
Mode
The mode is the value that appears most frequently. A dataset can be unimodal (one mode), bimodal (two modes), multimodal (multiple modes), or have no mode (all values equally frequent).
Example: Find the mode
Dataset: 4, 8, 6, 5, 3, 7, 9, 5
The value 5 appears twice; all other values appear once.
Mode = 5
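As a quick check, the three measures of center above can be computed with Python's standard-library statistics module, using the example dataset:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 5]

mean = statistics.mean(data)      # 47 / 8 = 5.875
median = statistics.median(data)  # average of the 4th and 5th sorted values = 5.5
mode = statistics.mode(data)      # 5 appears most often
```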
Measures of Spread (Dispersion)
Range
The range is the difference between the largest and smallest values. It is simple but highly sensitive to outliers.
Range = Maximum − Minimum
Example: Find the range
Dataset: 3, 4, 5, 5, 6, 7, 8, 9
Range = 9 − 3 = 6
Variance
Variance measures the average squared deviation from the mean. Squaring ensures that deviations above and below the mean don't cancel out. A larger variance indicates data more spread out from the center.
Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Why divide by (n − 1) instead of n for the sample variance? This is called Bessel's correction. Using (n − 1) makes s² an unbiased estimator of the population variance σ². The sample tends to underestimate variability because it is less likely to capture the extreme values of the population.
Standard Deviation
The standard deviation is the square root of the variance. It is expressed in the same units as the original data, which makes it far more interpretable than variance.
Population: σ = √(σ²)
Sample: s = √(s²)
Example: Standard deviation
For a sample with mean x̄ = 5.2 and sample variance s² = 3.70:
s = √3.70 ≈ 1.924
Interpretation: On average, data values deviate about 1.924 units from the mean of 5.2.
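For a concrete check, here is a small illustrative sample chosen to have mean 5.2 and sample variance 3.70 (matching the numbers above); Python's statistics module applies the (n − 1) divisor automatically:

```python
import statistics

data = [3, 4, 5, 6, 8]  # illustrative sample: mean 5.2, sample variance 3.70

s2 = statistics.variance(data)  # squared deviations sum to 14.8; 14.8 / (5 - 1) = 3.70
s = statistics.stdev(data)      # square root of the variance ≈ 1.924
```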
Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of data. It is resistant to outliers and is defined as:
IQR = Q₃ − Q₁
where Q₁ is the 25th percentile (first quartile) and Q₃ is the 75th percentile (third quartile).
Example: Calculate Q₁, Q₃, and IQR
Sorted dataset: 2, 4, 5, 7, 8, 10, 12, 15
Q₁: Median of the lower half (2, 4, 5, 7) = (4 + 5) / 2 = 4.5
Q₃: Median of the upper half (8, 10, 12, 15) = (10 + 12) / 2 = 11
IQR = 11 − 4.5 = 6.5
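The quartile convention used above (median of the lower and upper halves) can be sketched directly; note that library helpers such as statistics.quantiles use interpolation rules that can give slightly different quartiles:

```python
import statistics

def tukey_quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]
    upper = xs[n // 2 + (n % 2):]  # skip the middle value when n is odd
    return statistics.median(lower), statistics.median(upper)

q1, q3 = tukey_quartiles([2, 4, 5, 7, 8, 10, 12, 15])
iqr = q3 - q1  # 11 - 4.5 = 6.5
```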
Data Visualization
Visualizing data is essential for understanding its shape, identifying patterns, detecting outliers, and communicating findings effectively. Here we cover the most important types of statistical plots.
Histograms
A histogram displays the distribution of a continuous variable by dividing the data range into non-overlapping bins (intervals) and plotting a bar whose height represents the frequency (or relative frequency) of observations in each bin.
Bars are adjacent (no gaps), reflecting the continuous nature of the data.
The shape of a histogram reveals whether data is symmetric, left-skewed, right-skewed, uniform, or bimodal.
Choosing the right number of bins matters — too few bins hide detail; too many create noise.
A common rule of thumb for the number of bins is Sturges' rule: k = 1 + 3.322 · log₁₀(n), where n is the number of data points. Another popular choice is the square-root rule: k ≈ √n.
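Both bin-count rules are one-liners; a minimal sketch, rounding up (one common convention):

```python
import math

def sturges_bins(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

def sqrt_bins(n):
    """Square-root rule: k is approximately sqrt(n), rounded up."""
    return math.ceil(math.sqrt(n))

# For n = 100 data points, Sturges gives 8 bins; the square-root rule gives 10.
```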
Box Plots (Box-and-Whisker Plots)
A box plot provides a concise five-number summary of a dataset:
Minimum (or lower fence)
Q₁ (first quartile — 25th percentile)
Median (Q₂ — 50th percentile)
Q₃ (third quartile — 75th percentile)
Maximum (or upper fence)
The "box" spans from Q₁ to Q₃ (the IQR), with a line at the median. "Whiskers" extend to the most extreme data points within 1.5 × IQR of the quartiles. Points beyond the whiskers are plotted individually as potential outliers.
Example: Identifying outliers with fences
Suppose a dataset has Q₁ = 5 and Q₃ = 16, so IQR = 11, with minimum value 1 and maximum value 50.
Lower fence = 5 − 1.5(11) = −11.5 → The minimum value 1 is within the fence.
Upper fence = 16 + 1.5(11) = 32.5 → The value 50 exceeds 32.5, so 50 is an outlier.
Whiskers extend from 1 to 18 (the most extreme values within the fences); the point 50 is plotted as an individual outlier dot.
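The fence rule is easy to express in code. A sketch using the quartiles from the example (the full dataset is not shown above, so the candidate values checked here are just the ones mentioned):

```python
def tukey_fences(q1, q3):
    """Outlier fences at 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = tukey_fences(5, 16)                             # (-11.5, 32.5)
outliers = [x for x in (1, 18, 50) if x < low or x > high]  # only 50 falls outside
```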
Scatter Plots
A scatter plot displays the relationship between two quantitative variables by plotting each observation as a point on a coordinate plane. Scatter plots reveal:
Direction: Positive (upward trend), negative (downward), or no association.
Form: Linear, curved, or clustered.
Strength: How tightly the points follow the pattern (tight = strong, spread = weak).
Outliers: Points that deviate markedly from the overall pattern.
Scatter plots should always be examined before computing correlation or fitting a regression line. A strong correlation coefficient can be misleading if the relationship is actually non-linear or if outliers dominate.
Probability Fundamentals
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to events — where 0 means the event is impossible and 1 means the event is certain.
Sample Spaces and Events
The sample space (S) is the set of all possible outcomes of a random experiment. An event (A) is any subset of the sample space.
Example: Rolling a die
Sample space: S = {1, 2, 3, 4, 5, 6}
Event A = "rolling an even number" = {2, 4, 6}
Event B = "rolling a number greater than 4" = {5, 6}
Axioms of Probability (Kolmogorov's Axioms)
All of probability theory is built upon three axioms:
Non-negativity: P(A) ≥ 0 for any event A.
Normalization: P(S) = 1 — the probability of the entire sample space is 1.
Additivity: If A and B are mutually exclusive events (A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).
From these axioms, we can derive all the fundamental rules of probability.
The Complement Rule
P(A') = 1 − P(A)
The probability that event A does not occur equals 1 minus the probability that it does.
Example: Complement Rule
If the probability of rain tomorrow is P(Rain) = 0.35, then:
P(No rain) = 1 − 0.35 = 0.65
The Addition Rule
For any two events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
We subtract P(A ∩ B) to avoid double-counting outcomes that belong to both events. If A and B are mutually exclusive (cannot occur simultaneously), then P(A ∩ B) = 0, and the formula simplifies to P(A ∪ B) = P(A) + P(B).
Example: Addition Rule
A standard deck has 52 cards. What is the probability of drawing a King or a Heart?
P(King) = 4/52, P(Heart) = 13/52, P(King ∩ Heart) = P(King of Hearts) = 1/52
P(King ∪ Heart) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13 ≈ 0.308
The Multiplication Rule
For any two events A and B:
P(A ∩ B) = P(A | B) · P(B)
If A and B are independent (the occurrence of one does not affect the other), this simplifies to:
P(A ∩ B) = P(A) · P(B) (independent events)
Example: Multiplication Rule
A bag contains 5 red and 3 blue marbles. You draw two marbles without replacement. What is P(both red)?
P(1st red) = 5/8
P(2nd red | 1st red) = 4/7 (one red removed, 4 red remain out of 7 total)
P(both red) = (5/8)(4/7) = 20/56 = 5/14 ≈ 0.357
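The marble calculation can be done in exact arithmetic with the standard-library fractions module:

```python
from fractions import Fraction

p_first_red = Fraction(5, 8)           # 5 red out of 8 marbles
p_second_given_first = Fraction(4, 7)  # 4 red left out of 7 after one red removed

p_both_red = p_first_red * p_second_given_first  # 20/56 = 5/14
```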
Conditional Probability and Bayes' Theorem
Conditional probability measures the likelihood of an event given that another event has already occurred. It is a cornerstone of probabilistic reasoning and is essential for medical testing, spam filtering, forensic analysis, and machine learning.
Conditional Probability
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0
Read as: "the probability of A given B." We restrict the sample space to outcomes where B has occurred and ask how likely A is within that restricted space.
Example: Conditional Probability
In a class of 40 students, 15 take French, 10 take Spanish, and 5 take both.
What is the probability a student takes French, given they take Spanish?
P(French | Spanish) = P(French ∩ Spanish) / P(Spanish) = (5/40) / (10/40) = 5/10 = 0.5
Independence
Two events A and B are independent if:
P(A | B) = P(A) (equivalently, P(A ∩ B) = P(A) · P(B))
Knowing B occurred gives no information about A.
The Law of Total Probability
If B₁, B₂, …, Bₙ form a partition of the sample space (mutually exclusive, collectively exhaustive), then for any event A:
P(A) = Σ P(A | Bᵢ) · P(Bᵢ)
This is invaluable when the probability of A is hard to compute directly but easy to compute within each partition piece.
Bayes' Theorem
Bayes' theorem allows us to reverse conditional probabilities — to update our belief about a cause after observing evidence.
P(A | B) = P(B | A) · P(A) / P(B)
In words: the posterior probability of A given B equals the likelihood P(B | A) times the prior P(A), divided by the marginal likelihood P(B). Using the law of total probability for P(B):
P(A | B) = P(B | A) · P(A) / [P(B | A) · P(A) + P(B | A') · P(A')]
Example: Medical Testing with Bayes' Theorem
A disease affects 1% of a population. A test has a 95% sensitivity (P(positive | disease) = 0.95) and a 90% specificity (P(negative | no disease) = 0.90). If a person tests positive, what is the probability they actually have the disease?
Step 1: Define events. D = has disease, + = tests positive. P(D) = 0.01, P(+ | D) = 0.95, P(+ | D') = 1 − 0.90 = 0.10.
Step 2: Law of total probability: P(+) = (0.95)(0.01) + (0.10)(0.99) = 0.0095 + 0.099 = 0.1085
Step 3: Bayes' theorem: P(D | +) = 0.0095 / 0.1085 ≈ 0.088
Interpretation: Even with a positive test, there is only an 8.8% chance the person truly has the disease. This counterintuitive result arises because the disease is rare — the large number of false positives from the healthy majority overwhelms the true positives.
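The same computation in code, using the numbers from the example:

```python
prior = 0.01   # P(D): disease prevalence
sens = 0.95    # P(+ | D): sensitivity
spec = 0.90    # P(- | D'): specificity

# Law of total probability: P(+) summed over diseased and healthy groups
p_positive = sens * prior + (1 - spec) * (1 - prior)

# Bayes' theorem: P(D | +)
posterior = sens * prior / p_positive
```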
Bayes' Theorem is the foundation of Bayesian statistics, a branch of statistics that treats probability as a degree of belief and continuously updates this belief as new data arrives. It is central to spam filters, recommendation engines, and many machine-learning algorithms.
Random Variables and Distributions
A random variable is a numerical quantity whose value is determined by the outcome of a random experiment. It is the bridge between probability theory and data: it assigns numbers to outcomes so we can use mathematical tools to analyze them.
Discrete vs. Continuous Random Variables
Discrete: Takes on a countable number of distinct values (e.g., number of heads in 10 coin flips, the number of customers in a queue).
Continuous: Takes on any value within an interval or continuum (e.g., height, temperature, time).
Probability Mass Function (PMF) — Discrete
For a discrete random variable X, the PMF gives the probability that X equals each possible value:
p(x) = P(X = x)
Requirements: p(x) ≥ 0 for all x, and Σ p(x) = 1.
Example: PMF of a fair die
X = number shown on a fair six-sided die.
p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
Σ p(x) = 6 × (1/6) = 1 ✓
Probability Density Function (PDF) — Continuous
For a continuous random variable X, the PDF f(x) describes the relative likelihood of X being near a specific value. The probability that X falls within an interval [a, b] is the area under the PDF curve over that interval:
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Requirements: f(x) ≥ 0 for all x, and ∫₋∞⁺∞ f(x) dx = 1.
For a continuous random variable, P(X = any single value) = 0. Probability is defined only over intervals. This is because there are uncountably many possible values, so the probability of any single exact value is zero.
Cumulative Distribution Function (CDF)
The CDF applies to both discrete and continuous random variables and gives the probability that X is less than or equal to x:
F(x) = P(X ≤ x)
Properties of the CDF:
F(x) is non-decreasing.
lim(x → −∞) F(x) = 0 and lim(x → +∞) F(x) = 1.
For continuous variables: F'(x) = f(x) (the PDF is the derivative of the CDF).
Expected Value and Variance of a Random Variable
The expected value (mean) of a random variable is the long-run average value:
Discrete: E(X) = Σ x · p(x)
Continuous: E(X) = ∫₋∞⁺∞ x · f(x) dx
The variance measures how much X deviates from its expected value:
Var(X) = E[(X − μ)²] = E(X²) − [E(X)]²
Example: Expected value and variance of a die roll
X = number on a fair die. Each outcome has probability 1/6.
E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5
E(X²) = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6 ≈ 15.167
Var(X) = E(X²) − [E(X)]² = 91/6 − (3.5)² = 35/12 ≈ 2.917
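The expected value and variance of a fair die can be computed in exact arithmetic directly from the PMF:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair six-sided die

ex = sum(x * p for x, p in pmf.items())         # E(X) = 7/2 = 3.5
ex2 = sum(x * x * p for x, p in pmf.items())    # E(X^2) = 91/6
var = ex2 - ex ** 2                             # Var(X) = 35/12 ≈ 2.917
```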
Poisson Distribution
The Poisson distribution is the limiting case of the binomial distribution when n → ∞ and p → 0 such that np = λ remains constant. It is widely used to model rare events: arrivals at a server, radioactive decay, typos on a page, and traffic accidents.
Normal (Gaussian) Distribution
The most important distribution in statistics. The famous "bell curve" arises naturally in countless phenomena and is the basis of the Central Limit Theorem. Its density is
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
with parameters μ (mean) and σ² (variance): E(X) = μ, Var(X) = σ².
Exponential Distribution
Models the time between events in a Poisson process. If events occur at a constant average rate λ, the waiting time between successive events follows an exponential distribution.
Parameter: λ (rate parameter).
f(x) = λ · e^(−λx), x ≥ 0
E(X) = 1/λ, Var(X) = 1/λ²
Example: Exponential Distribution
Customers arrive at a store at an average rate of 3 per hour. What is the probability that the next customer arrives within 10 minutes (1/6 hour)?
Using the CDF F(x) = 1 − e^(−λx):
P(X ≤ 1/6) = 1 − e^(−3 · 1/6) = 1 − e^(−0.5) ≈ 1 − 0.6065 = 0.3935
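A one-line check of the exponential example, using the CDF F(x) = 1 − e^(−λx):

```python
import math

lam = 3.0                   # average arrivals per hour
t = 1 / 6                   # 10 minutes, in hours
p = 1 - math.exp(-lam * t)  # P(next arrival within 10 minutes) ≈ 0.3935
```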
The exponential distribution has the memoryless property: P(X > s + t | X > s) = P(X > t). No matter how long you have already waited, the probability of waiting at least t more units is the same. This is a unique and defining characteristic of the exponential distribution (among continuous distributions).
Example: Uniform Distribution
For a Uniform(a, b) variable, f(x) = 1 / (b − a) on [a, b]. A random number generator produces values uniformly between 0 and 10. What is P(3 ≤ X ≤ 7)?
P(3 ≤ X ≤ 7) = (7 − 3) / (10 − 0) = 4/10 = 0.4
Sampling and the Central Limit Theorem
In practice, we almost never know the true population parameters. Instead, we collect samples and use sample statistics to estimate population parameters. The reliability of these estimates depends critically on how we sample and how many observations we take.
Sampling Methods
Simple Random Sampling: Every member of the population has an equal chance of being selected. This is the gold standard.
Stratified Sampling: The population is divided into subgroups (strata) based on a characteristic, and random samples are drawn from each stratum.
Cluster Sampling: The population is divided into clusters (often geographic), and entire clusters are randomly selected.
Systematic Sampling: Every kth individual is selected from a list after a random starting point.
Convenience Sampling: Selecting whoever is easiest to reach (prone to bias — avoid when possible).
Sampling Distribution of the Mean
If we repeatedly draw random samples of size n from a population with mean μ and standard deviation σ, the distribution of sample means (x̄) has special properties:
E(X̄) = μ (the sample mean is an unbiased estimator of the population mean)
SD(X̄) = σ / √n (called the standard error)
The standard error decreases as the sample size increases. Quadrupling the sample size cuts the standard error in half. This is why larger samples give more precise estimates.
The Central Limit Theorem (CLT)
The Central Limit Theorem is arguably the most important theorem in all of statistics. It states:
For a random sample of size n drawn from any population with mean μ and finite standard deviation σ,
the sampling distribution of x̄ approaches a normal distribution as n → ∞: X̄ ≈ N(μ, σ²/n).
This holds regardless of the shape of the original population distribution — even if the population is skewed, uniform, bimodal, or otherwise non-normal. This is what makes the CLT so powerful: it justifies using normal-distribution-based methods (like z-tests and confidence intervals) for sample means, even when the underlying data isn't normal.
How large must n be? A common guideline is n ≥ 30, but this depends on how non-normal the population is. For roughly symmetric populations, n ≥ 15 may suffice. For highly skewed populations, larger samples (n ≥ 40 or more) may be needed.
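The CLT is easy to see by simulation. A sketch that averages uniform draws (any non-normal population would work just as well):

```python
import random

random.seed(42)  # reproducible demo

# 2000 sample means, each from n = 30 Uniform(0, 1) draws
means = [sum(random.random() for _ in range(30)) / 30 for _ in range(2000)]

grand_mean = sum(means) / len(means)  # close to the population mean 0.5
se = (sum((m - grand_mean) ** 2 for m in means) / (len(means) - 1)) ** 0.5
# se is close to sigma / sqrt(n) = (1 / sqrt(12)) / sqrt(30), about 0.0527
```

A histogram of `means` looks bell-shaped even though the individual draws are uniform.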
Example: Central Limit Theorem in Action
A factory produces bolts whose lengths have μ = 5.00 cm and σ = 0.10 cm. The distribution of individual bolt lengths is unknown (not necessarily normal). A quality inspector measures a random sample of 36 bolts.
By the CLT: X̄ ~ N(5.00, (0.10)²/36) = N(5.00, 0.000278)
Standard error: SE = 0.10/√36 = 0.10/6 ≈ 0.01667
What is the probability the sample mean is between 4.97 and 5.03?
z = (4.97 − 5.00) / 0.01667 = −1.8 and z = (5.03 − 5.00) / 0.01667 = 1.8
P(4.97 ≤ X̄ ≤ 5.03) = P(−1.8 ≤ Z ≤ 1.8) = 2Φ(1.8) − 1 ≈ 2(0.9641) − 1 ≈ 0.928
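The normal probability can be computed with math.erf, which gives the standard normal CDF without external libraries:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

se = 0.10 / math.sqrt(36)                              # standard error ≈ 0.01667
p = phi((5.03 - 5.00) / se) - phi((4.97 - 5.00) / se)  # z = ±1.8, p ≈ 0.928
```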
Hypothesis Testing
Hypothesis testing is a formal procedure for using data to decide between two competing claims about a population parameter. It is the backbone of scientific inference, clinical trials, A/B testing, and quality control.
The Framework
State the hypotheses:
Null hypothesis (H₀): The default claim — typically "no effect" or "no difference." We assume H₀ is true until the data provides strong evidence against it.
Alternative hypothesis (H₁ or Hₐ): The claim we are trying to find evidence for. It can be one-sided (e.g., μ > μ₀) or two-sided (μ ≠ μ₀).
Choose a significance level (α): The threshold for "strong evidence." Common choices are α = 0.05 (5%) or α = 0.01 (1%).
Compute the test statistic: A number summarizing how far the sample result is from what H₀ predicts.
Determine the p-value: The probability of observing a test statistic as extreme as (or more extreme than) the one calculated, assuming H₀ is true.
Make a decision:
If p-value ≤ α: reject H₀ (the data provides sufficient evidence for H₁).
If p-value > α: fail to reject H₀ (insufficient evidence to support H₁).
"Fail to reject H₀" is not the same as "accept H₀." We never prove the null hypothesis — we only assess whether the evidence is strong enough to reject it. The absence of evidence is not evidence of absence.
Z-Test (for large samples or known σ)
When the population standard deviation σ is known and the sample is large (n ≥ 30), the test statistic for the population mean is:
Z = (x̄ − μ₀) / (σ / √n)
Example: One-Sample Z-Test
A company claims its light bulbs last μ₀ = 1000 hours on average. A consumer group tests 50 bulbs and finds x̄ = 985 hours. The known population standard deviation is σ = 40 hours. Test at α = 0.05 (two-sided).
H₀: μ = 1000 H₁: μ ≠ 1000
Test statistic: Z = (985 − 1000) / (40/√50) = −15 / 5.657 ≈ −2.65
p-value: 2 · P(Z ≤ −2.65) ≈ 2(0.0040) = 0.008
Decision: p-value = 0.008 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the true mean lifespan differs from 1000 hours.
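The z statistic and two-sided p-value from this example, computed with the standard library:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = (985 - 1000) / (40 / math.sqrt(50))  # ≈ -2.65
p_value = 2 * phi(-abs(z))               # two-sided p-value ≈ 0.008
```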
T-Test (for small samples or unknown σ)
When the population standard deviation is unknown (the usual case) and we use the sample standard deviation s, the test statistic follows a t-distribution with (n − 1) degrees of freedom:
t = (x̄ − μ₀) / (s / √n), df = n − 1
The t-distribution looks similar to the standard normal but has heavier tails, especially for small n. As n increases, the t-distribution approaches the standard normal.
Example: One-Sample T-Test
A nutritionist claims that a new diet reduces cholesterol by μ₀ = 20 mg/dL. A study of 12 patients shows a mean reduction of x̄ = 24.5 mg/dL with s = 8.2 mg/dL. Test at α = 0.05 (one-sided: H₁: μ > 20).
H₀: μ = 20 H₁: μ > 20
Test statistic: t = (24.5 − 20) / (8.2/√12) = 4.5 / 2.367 ≈ 1.901
Degrees of freedom: df = 12 − 1 = 11
p-value: P(t₁₁ ≥ 1.901) ≈ 0.042
Decision: p-value = 0.042 < 0.05 = α → Reject H₀
Conclusion: There is statistically significant evidence that the mean cholesterol reduction exceeds 20 mg/dL.
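The t statistic itself is a one-liner; the p-value requires a t-distribution CDF (e.g. from scipy.stats), which is outside the standard library:

```python
import math

t = (24.5 - 20) / (8.2 / math.sqrt(12))  # ≈ 1.901
df = 12 - 1                              # degrees of freedom
```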
Type I and Type II Errors
Hypothesis testing can lead to two types of errors:
Decision vs. reality:
Reject H₀ when H₀ is true: Type I error (α) — "false positive"
Reject H₀ when H₀ is false: Correct decision (Power = 1 − β)
Fail to reject H₀ when H₀ is true: Correct decision
Fail to reject H₀ when H₀ is false: Type II error (β) — "false negative"
Type I error (α): Rejecting H₀ when it is actually true. The significance level α is the maximum tolerable probability of a Type I error.
Type II error (β): Failing to reject H₀ when it is actually false. The power of a test (1 − β) is the probability of correctly rejecting a false H₀.
There is a trade-off between Type I and Type II errors. Lowering α (making it harder to reject H₀) decreases the chance of a false positive but increases the chance of a false negative. The best way to reduce both error types simultaneously is to increase the sample size.
Confidence Intervals
A confidence interval provides a range of plausible values for a population parameter. A 95% confidence interval means: if we repeated the sampling process many times, about 95% of our intervals would contain the true parameter.
CI for μ (known σ): x̄ ± z* · (σ / √n)
CI for μ (unknown σ): x̄ ± t* · (s / √n)
where z* and t* are the critical values for the desired confidence level.
Example: 95% Confidence Interval
A sample of n = 25 students has x̄ = 82 and s = 6. Construct a 95% confidence interval for the population mean score.
df = 24, t* ≈ 2.064 (from t-table for 95% CI with 24 df)
Margin of error: E = 2.064 · (6/√25) = 2.064 · 1.2 = 2.477
CI: 82 ± 2.477 = (79.52, 84.48)
We are 95% confident that the true population mean lies between 79.52 and 84.48.
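The same interval in code (the critical value t* is read from a t-table, as above):

```python
import math

xbar, s, n = 82, 6, 25
t_star = 2.064                       # t critical value for 95% CI, df = 24
margin = t_star * s / math.sqrt(n)   # ≈ 2.477
ci = (xbar - margin, xbar + margin)  # ≈ (79.52, 84.48)
```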
Regression and Correlation
Regression and correlation are tools for exploring and quantifying relationships between variables. They are among the most widely used statistical techniques in science, business, and engineering.
Correlation Coefficient (Pearson's r)
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables, X and Y:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
It always satisfies −1 ≤ r ≤ 1:
r = +1: perfect positive linear relationship; r = −1: perfect negative linear relationship.
r = 0: no linear relationship (but there may still be a non-linear one!)
|r| near 1 indicates a strong linear association; |r| near 0 indicates a weak one.
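Pearson's r translates directly from its defining formula; a minimal sketch:

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Perfectly linear data gives r = ±1
r_pos = pearson_r([1, 2, 3], [2, 4, 6])  # 1.0
r_neg = pearson_r([1, 2, 3], [6, 4, 2])  # -1.0
```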
Correlation does not imply causation. Two variables can be strongly correlated without one causing the other. The correlation might be due to a lurking (confounding) variable, reverse causation, or pure coincidence. Establishing causation requires controlled experiments or careful observational study designs.
Example: For a dataset of hours studied (x) and exam scores (y), the computed correlation is r ≈ 0.992 — a very strong positive linear relationship between hours studied and exam score.
Simple Linear Regression
Simple linear regression fits a straight line through the data to predict Y from X. The equation of the least-squares regression line (the line that minimizes the sum of squared residuals) is:
ŷ = b₀ + b₁x, with slope b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and intercept b₀ = ȳ − b₁x̄
For the hours-studied data, the fitted line is ŷ = 57.055 + 4.269x. Interpretation: For each additional hour of study, the predicted exam score increases by about 4.27 points.
Prediction: If a student studies for 6 hours: ŷ = 57.055 + 4.269(6) = 57.055 + 25.614 = 82.67
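The least-squares formulas translate to a short function (checked here on toy data, not the exam-score dataset from the example):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = fit_line([1, 2, 3], [2, 4, 6])  # exact fit: y = 0 + 2x
```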
Coefficient of Determination (R²)
R² measures the proportion of variance in Y that is explained by the linear relationship with X. For simple linear regression, R² = r².
R² = 1 − (SS_res / SS_tot)
where:
SS_res = Σ(yᵢ − ŷᵢ)² (residual sum of squares)
SS_tot = Σ(yᵢ − ȳ)² (total sum of squares)
R² = 0: The model explains none of the variability in Y.
R² = 1: The model explains all of the variability in Y (perfect fit).
R² = 0.85 means 85% of the variation in Y is explained by the linear relationship with X.
Example: Calculating R²
From our previous example, r ≈ 0.992.
R² = (0.992)² ≈ 0.984
Interpretation: About 98.4% of the variation in exam scores can be explained by the linear relationship with hours studied. Only 1.6% is due to other factors or random variation.
Assumptions of Linear Regression
For the results of linear regression to be valid, several assumptions must be satisfied (often remembered by the acronym LINE):
Linearity: The relationship between X and Y is linear.
Independence: Observations are independent of one another.
Normality: The residuals are approximately normally distributed.
Equal variance (Homoscedasticity): The spread of residuals is roughly constant across all values of X.
Always check these assumptions by examining residual plots. A plot of residuals vs. fitted values should show no obvious patterns — just a random scatter of points around zero. If you see a curve, fan shape, or other structure, the assumptions may be violated and alternative methods (e.g., transformations, non-linear regression) should be considered.
Multiple Linear Regression
In practice, outcomes are rarely determined by a single predictor. Multiple linear regression extends simple linear regression to include multiple predictors:
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
Each coefficient bᵢ represents the effect of predictor xᵢ on Y, holding all other predictors constant. The interpretation and assumptions are analogous to simple regression, but with additional complexity around multicollinearity (predictors being correlated with each other).
Bayesian Statistics (Advanced)
In contrast to the frequentist approach (which treats parameters as fixed unknowns), Bayesian statistics treats parameters as random variables with probability distributions. This allows you to incorporate prior knowledge and update your beliefs as new data arrives.
Example: Suppose we flip a coin 10 times and observe 7 heads. With a uniform Beta(1, 1) prior on the probability of heads p, the likelihood is proportional to p⁷(1 − p)³.
Posterior: P(p | data) ∝ p⁷(1-p)³ — this is a Beta(8, 4) distribution.
The posterior mean is 8/(8+4) = 0.667, suggesting the coin is likely biased toward heads.
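The beta-binomial update reduces to adding the observed counts to the prior parameters; a sketch (assuming a uniform Beta(1, 1) prior, consistent with the Beta(8, 4) posterior above):

```python
a_prior, b_prior = 1, 1  # uniform Beta(1, 1) prior on p (assumed)
heads, tails = 7, 3      # observed data

a_post = a_prior + heads  # Beta(8, 4) posterior
b_post = b_prior + tails
post_mean = a_post / (a_post + b_post)  # 8/12 ≈ 0.667
```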
Credible Intervals
The Bayesian equivalent of a confidence interval is a credible interval — an interval that contains the parameter with a specified posterior probability. Unlike frequentist confidence intervals, a 95% credible interval literally means "there is a 95% probability the parameter lies in this interval."
Bayesian methods are increasingly popular in data science, machine learning, and scientific research. They naturally handle uncertainty, allow incorporation of expert knowledge, and provide intuitive probabilistic interpretations. The main challenge is computational — computing posterior distributions often requires simulation methods like MCMC (Markov Chain Monte Carlo).
Important Distributions Summary
A reference for the most commonly encountered probability distributions:
Discrete Distributions
Bernoulli(p): Single trial with success probability p. Mean = p, Variance = p(1-p)
Binomial(n, p): Number of successes in n independent trials. Mean = np, Var = np(1-p)
Poisson(λ): Number of events in a fixed interval. Mean = Var = λ
Geometric(p): Number of trials until first success. Mean = 1/p, Var = (1-p)/p²
Negative Binomial(r, p): Trials until r successes. Mean = r/p, Var = r(1-p)/p²
Hypergeometric(N, K, n): Successes in n draws without replacement from N items with K successes. Mean = nK/N
Continuous Distributions
Uniform(a, b): Equal probability over [a, b]. Mean = (a+b)/2, Var = (b-a)²/12
Normal(μ, σ²): The bell curve. Mean = μ, Var = σ²
Exponential(λ): Time between Poisson events. Mean = 1/λ, Var = 1/λ²
Gamma(α, β): Sum of α exponential variables (for integer α). Mean = α/β, Var = α/β²
Beta(α, β): Distribution on [0,1]. Mean = α/(α+β). Used for probabilities.
Chi-squared(k): Sum of k squared standard normals. Mean = k, Var = 2k
Student's t(ν): Like normal but heavier tails. Used when σ is unknown.
F(d₁, d₂): Ratio of two chi-squared distributions. Used in ANOVA.
The Central Limit Theorem explains why the normal distribution appears everywhere: the sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distribution. This is why heights, test scores, measurement errors, and countless other quantities are approximately normally distributed.