Overview
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It provides tools and methods to make sense of complex data sets and draw meaningful conclusions.
When it comes to machine learning, statistics is the backbone that supports various algorithms and models. It helps in understanding data distributions, relationships between variables, and making predictions.
Probability Theory
Probability theory is the mathematical foundation of statistics. It deals with the likelihood of events occurring and provides a framework for making predictions based on data:
In simple terms, probability is what we commonly refer to as “chance” or “likelihood.”
It is a crucial component for both descriptive and inferential statistics, as without it, we cannot quantify uncertainty or make informed decisions based on data.
Probability Distributions
A probability distribution is a concept within probability theory that describes how the values of a random variable are distributed. Distributions can be classified into two main types, based on the nature of the random variable:
- Discrete: Distributions that deal with discrete random variables, which can take on a finite or countable number of values (0, 1, 2, 3…). For example, the number of spam emails a user receives in a day.
- Continuous: Distributions that deal with continuous random variables, which can take on an infinite number of values within a given range (0.0 to 1.0, any decimal). For example, the probability that a given email is spam.
So, in short, a probability distribution assigns probabilities to outcomes. Sticking with the email spam example, rather than providing a simple binary classification like “spam” or “not spam,” a spam classifier might output probabilities of 95% spam and 5% not spam for a given email.
Why this matters: Understanding probability distributions is crucial because most ML algorithms inherently use them to make predictions. This allows you to express uncertainty in predictions and set thresholds based on confidence levels. More importantly, understanding the distribution of your data helps you select the right model for your problem and evaluate whether it’s performing as expected.
For example:
- Predicting spam (yes/no) is a classification problem, often modeled with a Bernoulli or Binomial distribution. Algorithms like Logistic Regression or Naive Bayes are suitable because they naturally output probabilities.
- Predicting house prices is a regression problem, often modeled with a Normal (Gaussian) distribution. Algorithms like Linear Regression are appropriate for continuous outputs.
These concepts apply to real-world scenarios like:
- Decision thresholds — Adjust based on business needs (e.g., >95% confidence for surgery, >30% for marketing)
- Uncertainty handling — Route uncertain predictions to human review
- Model calibration — A 70% probability should occur ~70% of the time in real data
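To make this concrete, here is a minimal Python sketch (using scikit-learn and synthetic data, both assumptions of mine rather than part of any real spam system) of how a classifier's predicted probabilities can be turned into threshold-based decisions and a human-review queue:

```python
# Minimal sketch: predicted probabilities + business-driven thresholds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                     # 5 made-up features per email
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # 1 = "spam"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

proba_spam = clf.predict_proba(X_test)[:, 1]       # P(spam) for each email

THRESHOLD = 0.95                                   # high-confidence auto-flag threshold
flag_as_spam = proba_spam >= THRESHOLD
needs_review = (proba_spam >= 0.30) & (proba_spam < THRESHOLD)  # route uncertain cases to a human

print(f"auto-flagged: {flag_as_spam.sum()}, sent to review: {needs_review.sum()}")
```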
TIP
Understanding probability distributions helps you select appropriate models, interpret their outputs effectively, and make confidence-based decisions.
Bayes’ Theorem
Bayes’ Theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis based on new evidence. It is expressed mathematically as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

This theorem allows us to calculate the probability of event A occurring given that event B has occurred, using the conditional probability of B given A, the prior probability of A, and the marginal probability of B.
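As a quick illustration, here is a small Python sketch that applies the theorem to a made-up spam-filter scenario; all of the probabilities below are invented for the example:

```python
# Hedged example: Bayes' Theorem applied to a made-up spam scenario.
p_spam = 0.20                 # P(A): prior probability that any email is spam
p_word_given_spam = 0.60      # P(B|A): probability the word "free" appears in spam
p_word_given_ham = 0.05       # probability the word "free" appears in non-spam

# P(B): marginal probability of seeing the word, via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability the email is spam given that it contains the word
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.2%}")   # ~75%
```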
Sampling
Sampling is the process of selecting a subset of individuals or observations from a larger population to make inferences about that population. It is a crucial step in statistical analysis, as it allows us to gather data without having to study the entire population.
Some common sampling methods include:
- Random Sampling: Every individual in the population has an equal chance of being selected, helping to eliminate bias.
- Stratified Sampling: The population is divided into subgroups (strata) based on certain characteristics, and samples are drawn from each stratum, ensuring representation from all groups.
There are many other methods, each with its own advantages and disadvantages depending on the context of the study.
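To illustrate the difference between the two methods above, the following Python sketch draws a simple random sample and a stratified sample from a synthetic population (the data and the 90/10 group split are assumptions for the example):

```python
# Simple random sampling vs. stratified sampling on a made-up population.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
population = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=10_000),
    "group": rng.choice(["A", "B"], size=10_000, p=[0.9, 0.1]),  # 10% minority group
})

# Simple random sample: every row has the same chance of being picked
random_sample = population.sample(n=500, random_state=0)

# Stratified sample: preserve the A/B proportions in the sample
stratified_sample, _ = train_test_split(
    population, train_size=500, stratify=population["group"], random_state=0
)

print(random_sample["group"].value_counts(normalize=True))
print(stratified_sample["group"].value_counts(normalize=True))
```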
One way to calculate the required sample size for a study is by using the formula:

$$n = \frac{Z^2 \, p \, (1 - p)}{E^2}$$

Where:
- n = required sample size
- Z = Z-score (based on the desired confidence level)
- p = estimated proportion of the population
- E = margin of error
This way we can ensure that our sample is large enough to represent the population and to estimate the quantity of interest within the desired margin of error.
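A small Python sketch of this calculation, assuming a 95% confidence level, a conservative 50% proportion estimate, and a 5% margin of error:

```python
# Sample-size formula with commonly quoted values (assumptions for the example).
import math
from scipy.stats import norm

confidence = 0.95
Z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided Z-score, ~1.96 for 95%
p = 0.5                                  # most conservative proportion estimate
E = 0.05                                 # desired margin of error

n = (Z**2 * p * (1 - p)) / E**2
print(math.ceil(n))                      # ~385 respondents
```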
Branches of Statistics
Statistics can be broadly categorized into two main branches:
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics
Descriptive statistics involves summarizing and describing the main features of a data set. It provides simple summaries of the sample and the measures calculated from it.
Common descriptive statistics include:
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Dispersion: Range, Variance, Standard Deviation
- Data Visualization: Histograms, Box Plots, Scatter Plots
These tools help in understanding the distribution and spread of data, identifying patterns, and detecting outliers.
TIP
Think of descriptive statistics as a way to “describe” your data in a concise manner, either through numerical summaries or visual representations.
Measures and Formulas
Here are some common measures used in descriptive statistics along with their formulas:
Mean
The sum of all data points divided by the number of data points.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance
A measure of how much the data points differ from the mean. It is calculated as the average of the squared differences from the mean.

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Standard Deviation
The square root of the variance, representing the average amount of variability in the data set.

$$\sigma = \sqrt{\sigma^2}$$

Correlation Coefficient
A measure of the strength and direction of the linear relationship between two variables.

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}$$

Range
The difference between the maximum and minimum values in a data set.

$$\text{Range} = x_{\max} - x_{\min}$$

Quartiles
Quartiles divide a ranked data set into four equal parts. The first quartile (Q1) is the median of the lower half, the second quartile (Q2) is the median, and the third quartile (Q3) is the median of the upper half.

Interquartile Range (IQR)
The difference between the third quartile (Q3) and the first quartile (Q1). It measures the spread of the middle 50% of the data.

$$\text{IQR} = Q_3 - Q_1$$
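The following Python sketch computes these measures with NumPy on a small made-up data set:

```python
# Descriptive measures computed with NumPy on invented data.
import numpy as np

data = np.array([12, 15, 17, 19, 21, 22, 25, 30, 31, 40])
other = np.array([3, 4, 5, 5, 6, 7, 8, 9, 10, 13])     # second variable for correlation

print("mean:", data.mean())
print("variance:", data.var())                  # population variance (divides by n)
print("std dev:", data.std())
print("range:", data.max() - data.min())

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print("quartiles:", q1, q2, q3)
print("IQR:", q3 - q1)

print("correlation:", np.corrcoef(data, other)[0, 1])  # Pearson's r
```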
Data Visualization
Data visualization is a powerful tool in descriptive statistics that helps to communicate information clearly and effectively through graphical representations. Common types of data visualizations include:
- Histograms: Used to represent the distribution of a single continuous variable by dividing the data into bins and counting the number of observations in each bin.
- Box Plots: Provide a visual summary of the distribution of a dataset, highlighting the median, quartiles, and potential outliers.
- Scatter Plots: Show the relationship between two continuous variables by plotting data points on a Cartesian plane.
These visualizations help in identifying patterns, trends, and outliers in the data, making it easier to interpret and analyze.
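Here is a brief matplotlib sketch of the three plot types, using randomly generated data purely for illustration:

```python
# Histogram, box plot, and scatter plot on synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
values = rng.normal(loc=50, scale=10, size=500)
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(scale=2, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(values, bins=20)          # distribution of a single variable
axes[0].set_title("Histogram")
axes[1].boxplot(values)                # median, quartiles, potential outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=10)            # relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```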
Inferential Statistics
Inferential statistics involves making predictions or inferences about a population based on a sample of data drawn from that population. It allows us to generalize findings from the sample to the larger population and assess the reliability of those findings.
There are some concepts that are fundamental to inferential statistics, which I will cover below.
Hypothesis Testing
A hypothesis is a statement about a population parameter that we want to test, for example “the average height of adults in a city is 170 cm.” We can use hypothesis testing to determine whether there is enough evidence in our sample data to support or reject this claim.
The process of hypothesis testing involves several steps:
- Formulate the Hypotheses: We start by stating the null hypothesis (H0) and the alternative hypothesis (H1). The null hypothesis represents the default assumption (e.g., “the average height is 170 cm”), while the alternative hypothesis represents what we want to prove (e.g., “the average height is not 170 cm”).
- Choose a Significance Level (α): This is the threshold for determining whether we reject the null hypothesis. Common values for α are 0.05 (5%) or 0.01 (1%).
- Collect Data: We gather a sample of data from the population, using sampling methods.
- Calculate the Test Statistic: Depending on the type of data and hypothesis, we calculate a test statistic (e.g., t-statistic, z-statistic) that summarizes the information in the sample.
- Determine the p-value: The p-value represents the probability of observing the data (or something more extreme) if the null hypothesis is true.
- Make a Decision: If the p-value is less than or equal to the significance level (p ≤ α), we reject the null hypothesis in favor of the alternative hypothesis. Otherwise, we fail to reject the null hypothesis.
When testing hypotheses, we need to be aware of two types of errors:
- Type I Error: This occurs when we reject the null hypothesis when it is actually true (false positive). The probability of making a Type I error is equal to the significance level (α).
- Type II Error: This occurs when we fail to reject the null hypothesis when it is actually false (false negative). The probability of making a Type II error is denoted by β.
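As an illustration, the height example can be run as a one-sample t-test with SciPy; the sample below is simulated, so the exact statistics will vary:

```python
# One-sample t-test for the "average height is 170 cm" hypothesis, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=172, scale=8, size=50)   # 50 measured heights (cm)

alpha = 0.05
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)   # H0: mean height = 170 cm

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value <= alpha:
    print("Reject H0: the average height differs from 170 cm")
else:
    print("Fail to reject H0")
```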
Confidence Intervals
A confidence interval is a range of values, derived from a data sample, that is likely to contain the true population parameter (such as the mean) with a specified level of confidence.
For example, if we are interested in the average height of all adults in a city, instead of measuring every single adult, we can take a sample and calculate a confidence interval to estimate the average height for the entire population. The formula for calculating a confidence interval for a population mean is:

$$CI = \bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$$

Where:
- CI = Confidence Interval
- $\bar{x}$ = Sample Mean
- $Z$ = Z-score corresponding to the desired confidence level
- $\sigma$ = Population Standard Deviation (or sample standard deviation if the population standard deviation is unknown)
- n = Sample Size
This way we can say, for example, that we are 95% confident that the true average height of all adults in the city falls within our calculated interval.
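A short Python sketch of this calculation, using simulated heights and the sample standard deviation in place of the population value:

```python
# 95% confidence interval for a mean, matching the formula above (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=170, scale=9, size=100)

mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))      # standard error of the mean
z = stats.norm.ppf(0.975)                            # Z-score for 95% confidence

ci_low, ci_high = mean - z * sem, mean + z * sem
print(f"95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}] cm")
```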
Correlation and Regression
Correlation and regression are two important concepts in statistics that help us understand the relationship between variables.
Correlation
Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient (r) ranges from -1 to 1, where:
- r = 1 indicates a perfect positive correlation (as one variable increases, the other also increases)
- r = -1 indicates a perfect negative correlation (as one variable increases, the other decreases)
- r = 0 indicates no correlation (no linear relationship between the variables)
This is known as Pearson’s correlation coefficient, and it is calculated using the following formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where:
- $x_i$ and $y_i$ are the individual data points for variables X and Y
- $\bar{x}$ and $\bar{y}$ are the means of variables X and Y
One example could be the correlation between hours studied and exam scores. A positive correlation would suggest that as the number of hours studied increases, exam scores also tend to increase, so the correlation coefficient would be greater than 0.
NOTE
It is important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.
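To see the coefficient in practice, here is a small SciPy sketch of the hours-studied vs. exam-scores example with invented values:

```python
# Pearson's r for made-up study-hours and exam-score data.
import numpy as np
from scipy import stats

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 58, 61, 65, 70, 74, 79, 85])   # invented values

r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.2f}")        # close to +1: strong positive linear relationship
```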
Regression
Regression analysis is a statistical technique used to model and analyze the relationship between a dependent variable and one or more independent variables. The most common type of regression is linear regression, which assumes a linear relationship between the variables. The equation for a simple linear regression model is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where:
- y = Dependent variable
- x = Independent variable
- $\beta_0$ = Intercept (the value of y when x = 0)
- $\beta_1$ = Slope (the change in y for a one-unit change in x)
- $\varepsilon$ = Error term (the difference between the observed and predicted values)

In multiple linear regression, we can have more than one independent variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
In machine learning, these variables can represent features (independent variables) and the target variable (dependent variable) we want to predict.
An example of regression analysis could be predicting house prices based on features such as square footage, number of bedrooms, and location. By fitting a regression model to historical data, we can estimate the coefficients ($\beta$ values) that best explain the relationship between these features and house prices, allowing us to make predictions for new houses based on their characteristics.
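Here is a minimal scikit-learn sketch of that idea, fit on synthetic data (the feature set and the "true" coefficients used to generate the prices are assumptions for the example):

```python
# Multiple linear regression for house prices on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 200
sqft = rng.uniform(500, 3000, size=n)        # square footage
bedrooms = rng.integers(1, 6, size=n)        # number of bedrooms
X = np.column_stack([sqft, bedrooms])

# "True" relationship used only to generate example prices
price = 200 * sqft + 15_000 * bedrooms + rng.normal(scale=20_000, size=n)

model = LinearRegression().fit(X, price)
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1, beta_2):", model.coef_)

new_house = np.array([[1500, 3]])            # 1500 sqft, 3 bedrooms
print("predicted price:", model.predict(new_house)[0])
```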
TIP
Understanding correlation and regression is essential for feature selection, model building, and interpreting the relationships between variables in machine learning tasks.