Supervised Learning - A Probabilistic Definition
A rigorous probabilistic exploration of the supervised learning problem (Python notebook included)

\( \newcommand{\exp}{\mathbb{E}} \newcommand{\var}{\mathbb{Var}} \newcommand{\std}{\sigma} \newcommand{\prob}{\mathrm{Pr}} \newcommand{\distr}{P} \)

Introduction & motivation

One of the main problems in the field of machine learning (a.k.a. statistical learning) is the supervised learning problem (SLP). It has been explained in numerous books, papers and blog posts, at different levels of formality and targeted at various audiences. What I have observed is that many people think of it as just function approximation (including me a few years ago). Although this is a perfectly correct way to think about it, in my opinion this simple view misses the intrinsic probabilistic nature of the problem.

There is randomness both in the supervised learning methods we use (e.g. stochastic gradient descent) and in the data they are trained on (i.e. noisy data). For instance, the bias-variance tradeoff is a very important concept that both practitioners and theorists need to be aware of, and it relies on exactly this randomness: to understand it, one has to consider the probabilistic nature of the SLP.

Since my background is highly mathematical, I was eager to understand the SLP in the most formal way possible. Unfortunately, most of the machine learning books I have looked at cover the problem in a less formal way and quickly move on to explore the rich variety of methods. Machine learning is in general a problem-driven field and often breakthroughs don’t necessarily come with formal proofs or even explanations. Nevertheless, the very abstract definition of supervised learning could be presented in a probabilistically formal way.

The aim of this post is to present a motivated, step-by-step probabilistic definition of the SLP. I really hope there are other people like me who were either looking for such an explanation of the topic, or would just be happy to read through this post. As I mentioned, machine learning is mostly an applied field, and since the best way to learn something is by doing, I've also included a Python notebook that simulates the discussed randomness and shows the theoretical concepts in practice. I'd highly recommend that the reader go over the code as well, experiment with different simple SLPs and apply various methods to them. It might even be worth rewriting the code from scratch in order to solidify the theory (personally, this has helped me a lot).

Enjoy!

Prerequisites: conditional probability, conditional expectation, conditional variance, independence, law of total expectation
Good to know: K-Nearest Neighbors (KNN), Ordinary Least Squares (OLS)

The problem

In supervised learning we are investigating some existing and unknown conditional probability

$$ \distr (Y|X) $$

of the two random variables \( X \) and \( Y \). This conditional distribution is called the underlying model (we can also call it the target distribution). Keep in mind that the actual marginal \( \distr (X) \) is not of interest (at least not directly). The random variable \( Y \) can either be continuous or discrete. The supervised learning problem is respectively called a regression or a classification problem. To start with, we will first explore the regression problem setting.

Interested in prediction

In practice we are interested in getting a good estimate of \( Y \) for a given realization of \( X \). This problem is known as prediction. Let’s say we know the value of \( X \) and call it \( x_0 \). Let’s also assume our current estimate for \( Y \) is \( \hat{y}_0 \) (i.e. \( \hat{y}_0 \) is our estimate for \( Y|X=x_0 \)). How can we tell how good our estimate actually is?

Measuring accuracy

In order to judge how far the estimate is from the actual value of \( Y \) we will use the following non-negative distance function (a.k.a. error/loss function); the squared error is the conventional choice:

$$ d(y_1, y_2) = (y_1 - y_2)^2 $$

Since \( Y \) is a random variable, the distance \( d(Y,\hat{y}_0) \) will depend on the realization of \( Y \) and be random as well. In order to account for this, we calculate the expected error (i.e. the expected distance) as follows:

$$ \begin{equation} \begin{split} & \; \exp \left[ d(Y, \hat{y}_0)|X=x_0 \right] \\ = & \; \exp \left[ (Y - \hat{y}_0)^2|X=x_0 \right] \\ = & \; \exp \left[ \left( (Y - \exp [Y|X=x_0]) + (\exp [Y|X=x_0] - \hat{y}_0) \right) ^2|X=x_0 \right] \\ = & \; \exp \left[ (Y - \exp [Y|X=x_0])^2 | X=x_0 \right] + \exp \left[ (\exp [Y|X=x_0] - \hat{y}_0)^2|X=x_0 \right] \\ = & \; \var [Y|X=x_0] + \left( \exp [Y|X=x_0] - \hat{y}_0 \right)^2 \end{split} \end{equation} $$

Note that the cross term \( 2 \exp \left[ (Y - \exp [Y|X=x_0])(\exp [Y|X=x_0] - \hat{y}_0) | X=x_0 \right] \) vanishes, since \( \exp [Y - \exp [Y|X=x_0] | X=x_0] = 0 \). Both terms in the last expression are non-negative, and the only adjustable variable in it is \( \hat{y}_0 \). Thus, the expected error is minimized when \( \hat{y}_0 = \exp [Y|X=x_0] \), which makes the second term \( 0 \). In simple words, the best estimate for \( Y \) when \( X=x_0 \) is the average value of \( Y \) conditioned on \( X=x_0 \), which is quite intuitive.
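This is easy to check numerically. In the sketch below (a hypothetical setup, not taken from the notebook) \( Y|X=x_0 \) is assumed to be normal with mean \( 3 \) and standard deviation \( 1.5 \), and the expected squared error is estimated over a grid of candidate estimates \( \hat{y}_0 \):

```python
# Check that E[(Y - y_hat)^2 | X = x0] is smallest when y_hat is the conditional mean.
# Hypothetical setup: Y | X = x0 ~ Normal(mean=3.0, std=1.5).
import numpy as np

rng = np.random.default_rng(0)
y_samples = rng.normal(loc=3.0, scale=1.5, size=100_000)   # draws of Y | X = x0

candidates = np.linspace(0.0, 6.0, 61)                      # candidate estimates y_hat
expected_errors = [np.mean((y_samples - y_hat) ** 2) for y_hat in candidates]

best = candidates[np.argmin(expected_errors)]
print(f"conditional mean ~ {y_samples.mean():.3f}, best y_hat ~ {best:.3f}")
# The best candidate lands next to the conditional mean, and the minimal
# expected error approaches Var[Y | X = x0] = 1.5**2 = 2.25.
```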

The regression function

The function \( f(x_0) = \exp [Y|X=x_0] \), which gives the best estimate for any value \( x_0 \), is called the regression function. The difference between \( Y \) and the average value of \( Y \), i.e. \( \epsilon = Y - f(X) \), is called the error of \( Y \) w.r.t. the regression function. This leads us to the decomposition:

$$ Y = f(X) + \epsilon $$

Example: In the following plot one can see an example of a regression function \( f(X) \) (in black) with error bands of one standard deviation of \( \epsilon \) around it.

Regression function \( f(X) \) and \( \std (\epsilon) \) error bands

We can now rewrite the expected distance in a different way in terms of the regression function:

$$ \exp \left[ d(Y, \hat{y}_0) | X=x_0 \right] = \left( f(x_0) - \hat{y}_0 \right)^2 + \var [\epsilon|X=x_0] $$

The estimator function

In practice we would like to make predictions for \( Y \) for any observed value of \( X \). Let’s denote by \( \hat{f}(x) \) our predicted value of \( Y \) for \( X=x \). Hence, \( \hat{f}(x_0) = \hat{y}_0 \). Since the best estimate for any given \( x \) should be as close as possible to \( f(x) \), the estimator function \( \hat{f}(x) \) should be as close as possible to the regression function \( f(x) \) for all values of \( x \). Thus, we can think of \( \hat{f} \) as a function that tries to estimate the function \( f \) (which may not have a ‘closed-form’ expression). The expected distance now becomes:

$$ \exp \left[ d(Y, \hat{f}(X)) | X=x_0 \right] = \left( f(x_0) - \hat{f}(x_0) \right)^2 + \var [\epsilon|X=x_0] $$

When we think about generating a random \( Y \) for a particular value of \( X \), we need to calculate \( f(X) \) and add an error \( \epsilon \) with mean \( 0 \). In general the error \( \epsilon \) might have different distributions for different values of \( X \). For simplicity we will assume \( \epsilon \) always has the same error distribution (i.e. we assume that \( X \) and \( \epsilon \) are independent). Accounting for that:

$$ \exp \left[ d(Y, \hat{f}(X)) | X=x_0 \right] = \underbrace{\left( f(x_0) - \hat{f}(x_0) \right)^2}_\text{reducible error}+\underbrace{\var [\epsilon]}_\text{irreducible error} $$

In the last equation the first term is called the reducible error, because in theory we could guess the correct function \( f \) and reduce it to \( 0 \). The second term is called the irreducible error: the variance of \( \epsilon \) cannot be reduced and is therefore a lower bound on the total expected error (i.e. \( X \) simply doesn't provide more information about \( Y \)).
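Here is a small numerical check of this split (with a hypothetical \( f \), \( \hat{f} \) and noise level, just for illustration): the simulated expected error at \( x_0 \) matches the sum of the reducible and irreducible parts.

```python
# The expected squared error at x0 splits into (f(x0) - f_hat(x0))^2 + Var[eps].
# Hypothetical f, f_hat and noise level, chosen only for this check.
import numpy as np

f = lambda x: 0.5 * x + np.sin(x)        # assumed regression function
f_hat = lambda x: 0.5 * x                # assumed (imperfect) estimator
sigma = 1.5                              # assumed std of the irreducible noise
x0 = 12.0

rng = np.random.default_rng(1)
y = f(x0) + rng.normal(0.0, sigma, size=200_000)   # draws of Y | X = x0

total = np.mean((y - f_hat(x0)) ** 2)
reducible = (f(x0) - f_hat(x0)) ** 2
irreducible = sigma ** 2
print(f"simulated total: {total:.3f}  vs  reducible + irreducible: {reducible + irreducible:.3f}")
```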

So far we were assuming that we are given the estimator function \( \hat{f} \) and we were assessing it for a single observed value of \( X=x_0 \). At this point we are left with two unanswered questions:

  1. How do we come up with \( \hat{f} \)?
  2. As we have already said, we might want to predict \( Y \) for other values of \( X \) apart from \( x_0 \). However, so far we only have a way of assessing the estimator function \( \hat{f} \) at a single value of \( X \). How do we assess \( \hat{f} \) over all possible values of \( X \)?

Let’s investigate the first question.

How do we come up with \( \hat{f} \)?

The dataset

In practice \( \hat{f} \) is not just a magic guess that falls upon us from the sky. We need some structured way of guessing. Here comes the data! We assume that we are given a dataset:

$$ d = \left\{ (x_1,y_1), (x_2, y_2), \dots , (x_n, y_n) \right\} $$

for which \( y_i \) is a realization of \( Y|X=x_i \). We assume that the \( x_i \) values are independently sampled from some distribution \( \distr _{dataset}(X) \).

Example: In the following image we can see an example of a dataset \( d \), for which \( \distr _{dataset}(X) \) is uniformly distributed from \( 10 \) to \( 20 \).

Dataset \( d = \{(x_1,y_1), (x_2, y_2), \dots , (x_n, y_n)\} \) generated from \( \distr (Y|X) \) and \( \distr _{dataset}(X) \)
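A minimal sketch of how such a dataset can be generated is shown below; the particular regression function and noise level are hypothetical stand-ins for the ones used in the notebook.

```python
# Generate a dataset d = {(x_i, y_i)} with x_i ~ P_dataset(X) = Uniform(10, 20)
# and y_i a realization of Y | X = x_i. The regression function f and the
# noise level are assumed choices for this sketch.
import numpy as np

def f(x):
    return 0.5 * x + 2.0 * np.sin(x)     # assumed regression function

def sample_dataset(n, rng):
    x = rng.uniform(10.0, 20.0, size=n)  # P_dataset(X): Uniform(10, 20)
    eps = rng.normal(0.0, 1.5, size=n)   # irreducible noise, assumed N(0, 1.5^2)
    y = f(x) + eps                       # realizations of Y | X = x_i
    return x, y

rng = np.random.default_rng(42)
x_train, y_train = sample_dataset(n=50, rng=rng)
```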

Usually \( \distr _{dataset}(X) \) is the same as \( \distr (X) \), or close enough. Sometimes, however, they might differ a lot. For instance, we might be training a digit classifier on the MNIST dataset, which contains nicely aligned handwritten digit images, but the actual images we classify might not be preprocessed in this way (i.e. they will have a different distribution). We want to avoid this since, as we'll see later, it will result in a higher total error!

The thing to remember is that the dataset has been sampled from the conditional distribution we are trying to learn. This means that we can think about the dataset \( D \) as random (i.e. a collection of random variables). Note that we have denoted the dataset with a lower case letter \( d \) since it is a realization of the random dataset:

$$ D = \left\{ (X_1,Y_1), (X_2, Y_2), \dots , (X_n, Y_n) \right\} $$

The estimator function \( \hat{f} \) is actually based on the dataset \( D \), according to some rule (this dependence is called the supervised learning method/model). Therefore we can think of \( \hat{f} \) as a function of \( D \):

$$ \hat{f} = \hat{f}_D $$

which makes \( \hat{f} \) random as well.
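To emphasize this point in code: a supervised learning method is just a rule that maps a dataset \( d \) to a prediction function \( \hat{f}_d \). The sketch below uses a deliberately trivial (hypothetical) rule, predicting the overall mean of the \( y \) values, just to show the structure.

```python
# A "supervised learning method" as a rule that maps a dataset d to an
# estimator f_hat_d. The rule here (predict the overall mean) is a trivial
# hypothetical example, not any particular real method.
import numpy as np

def method(x_data, y_data):
    """Maps a dataset d to an estimator f_hat_d (here: predict the overall mean)."""
    y_mean = np.mean(y_data)
    return lambda x: np.full_like(np.asarray(x, dtype=float), y_mean)

f_hat_d = method(np.array([11.0, 14.0, 19.0]), np.array([7.5, 9.0, 12.0]))
print(f_hat_d([10.0, 15.0]))   # constant predictions, fixed once d is fixed
```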

Example: The K-nearest neighbors (KNN) model with \( k=1 \) neighbour produces an estimate that is equal to the \( y \) value of the closest data point in the dataset. In the following image we can see this model trained on the dataset from above:

A KNN model trained on a dataset \( d \)
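To make this concrete, here is a minimal sketch of training the \( k=1 \) model on a freshly sampled dataset; it uses scikit-learn's KNeighborsRegressor and the same hypothetical data-generating process as above (the notebook may implement KNN differently).

```python
# Fit f_hat_d as a k=1 nearest-neighbour regressor on a sampled dataset d.
# Hypothetical data-generating process; scikit-learn used as one possible KNN implementation.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def f(x):
    return 0.5 * x + 2.0 * np.sin(x)                 # assumed regression function

rng = np.random.default_rng(42)
x_train = rng.uniform(10.0, 20.0, size=50)
y_train = f(x_train) + rng.normal(0.0, 1.5, size=50)

knn1 = KNeighborsRegressor(n_neighbors=1)
knn1.fit(x_train.reshape(-1, 1), y_train)            # f_hat_d is now fixed by d

x_grid = np.linspace(8.0, 22.0, 200).reshape(-1, 1)
y_pred = knn1.predict(x_grid)                        # step-like curve, as in the plot
```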

However, we have to remember that \( D \) is random and thus will not be the same dataset every time. This means that the estimator function \( \hat{f}_D \) will also change. For example, the same KNN model was trained on a different realization of \( D \) below:

A KNN model trained on a different dataset \( d \)

Bias-variance-(irreducible error) decomposition

For a moment, let's assume that the dataset is fixed to a particular realization \( d \). The expected prediction error at \( x_0 \) now becomes:

$$ \exp \left[ d( Y, \hat{f}_D(X) ) | X=x_0, D=d \right] = \left( f(x_0) - \hat{f}_d(x_0) \right)^2 + \var [\epsilon] $$

This expectation treats \( D \) as fixed and doesn’t account for its variability. If we take the expectation without conditioning on a specific \( D \) we get (using the law of total expectation):

$$ \begin{equation} \begin{split} & \exp \left[ d(Y, \hat{f}_D(X))|X=x_0 \right]\\ = \; & \exp \left[ \exp [d(Y, \hat{f}_D(X))|X=x_0, D] \right]\\ = \; & \exp \left[ \left( f(x_0) - \hat{f}_D(x_0) \right)^2 + \var [\epsilon] \right]\\ = \; & \exp \left[ \left( (f(x_0) - \exp [\hat{f}_D(x_0)]) + (\exp [\hat{f}_D(x_0)] - \hat{f}_D(x_0)) \right) ^2 \right] + \var [\epsilon]\\ = \; & \underbrace{ \left( f(x_0) - \exp [\hat{f}_D(x_0)] \right)^2 }_\text{bias}+ \underbrace{\var [\hat{f}_D(x_0)]}_\text{variance}+ \underbrace{\var [\epsilon]}_\text{irreducible error} \end{split} \end{equation} $$

Here the reducible error is divided further into two terms. The first is called the bias term and it indicates how far away, on average, our prediction is from the regression function. The second is called the variance term and it indicates how much our prediction changes due to the variability of the dataset \( D \).

Example: To illustrate this idea let's have a look at the following plot. As usual the black curve indicates the regression function \( f(X) \). To obtain the other curve, multiple realizations \( d \) of the random dataset \( D \) were sampled and a KNN model (with \( k=45 \)) was trained on each one of them. The solid blue line at a particular point \( x_0 \) represents the mean of all the estimates at this point (i.e. \( \exp [\hat{f}_D(x_0)] \)). Intervals where the mean is close to the regression function \( f(X) \) have a low bias, since on average the estimator function is guessing right, while intervals where the mean deviates from \( f(X) \) have a high bias. The radius of the error bands around the mean represents the standard deviation of the estimates at the same point (i.e. \( \std (\hat{f}_D(x_0)) = \sqrt{\var [\hat{f}_D(x_0)]} \)). Thus, the wider the error bands are, the more uncertain the estimates at that point are and the higher the variance is. The bottom line is that the solid line and the error bands represent the consensus of all the trained KNN models.

The mean \( \exp [\hat{f}_D(x_0)] \) and the standard deviation \( \std (\hat{f}_D(x_0)) \) for different values \( x_0 \) (KNN with \( k=45 \) )
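The consensus curve and the error bands above can be approximated with a simple Monte Carlo loop: sample many realizations of \( D \), retrain the same model on each, and look at the mean and standard deviation of the predictions over a grid of points. Below is a sketch of this, again with a hypothetical data-generating process standing in for the notebook's.

```python
# Estimate E[f_hat_D(x0)] and sigma(f_hat_D(x0)) at a grid of points by
# retraining the same KNN model on many freshly sampled datasets.
# Hypothetical data-generating process, as in the earlier sketches.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def f(x):
    return 0.5 * x + 2.0 * np.sin(x)                  # assumed regression function

def sample_dataset(n, rng):
    x = rng.uniform(10.0, 20.0, size=n)
    y = f(x) + rng.normal(0.0, 1.5, size=n)
    return x, y

rng = np.random.default_rng(0)
x_grid = np.linspace(8.0, 22.0, 100)
predictions = []
for _ in range(500):                                   # 500 realizations of D
    x_tr, y_tr = sample_dataset(50, rng)
    model = KNeighborsRegressor(n_neighbors=45).fit(x_tr.reshape(-1, 1), y_tr)
    predictions.append(model.predict(x_grid.reshape(-1, 1)))
predictions = np.array(predictions)                    # shape (500, 100)

mean_prediction = predictions.mean(axis=0)             # ~ E[f_hat_D(x0)]
std_prediction = predictions.std(axis=0)               # ~ sigma(f_hat_D(x0))
squared_bias = (f(x_grid) - mean_prediction) ** 2      # pointwise bias term
variance = predictions.var(axis=0)                     # pointwise variance term
```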

One can see that the bias and variance are smaller between \( 10 \) and \( 20 \), which is exactly the interval our training dataset is sampled from. In general, the more training samples we have in an area, the smaller its bias and variance are. To illustrate this phenomenon further, let's look at another dataset distribution \( \distr _{dataset}(X) \), which only produces values of \( X \) in the intervals \( [10,12] \) and \( [18,20] \):

A different dataset distribution \( \distr _{dataset}(X) \)

And now let's also have a look at the mean \( \exp [\hat{f}_D(x_0)] \) and the standard deviation \( \std (\hat{f}_D(x_0)) \) across different values \( x_0 \) (here we use polynomial OLS regression of order \( N=6 \) as the model):

The mean \( \exp [\hat{f}_D(x_0)] \) and the standard deviation \( \std (\hat{f}_D(x_0)) \) for different values \( x_0 \) (polynomial OLS with \( N=6 \) )

It is obvious that the interval \( [12,18] \), which doesn't contain training data points, has a slight bias and much more variance than the intervals with training data. Similarly, outside \( [10,20] \) both the bias and the variance explode. This illustrates well why it is so important to train on data from the actual distribution (i.e. \( \distr _{dataset}(X) = \distr (X) \)): the values of \( X \) that the model will predict on in practice are then well represented in the training dataset, and the total prediction error is decreased.

How do we assess \( \hat{f} \) on all the possible values of \( X \)?

Once we have the dataset \( d \) we would like to make new predictions for \( Y \) given \( X \) according to the prediction function \( \hat{f}_d \). At this point we will be given new values of \( X \) sampled from the distribution \( \distr (X) \) and we calculate the corresponding predictions \( \hat{f}_d(X) \). We already know how to calculate the expected error for a particular value of \( X \). In order to account for all the possible values of \( X \) we simply average the expected error also w.r.t. \( \distr (X) \) (again using the law of total expectation):

$$ \begin{equation} \begin{split} & \exp \left[ d(Y, \hat{f}_D(X)) \right]\\ = \; & \exp \left[ \exp [d(Y, \hat{f}_D(X))|X] \right]\\ = \; & \exp \left[ (f(X) - \exp [\hat{f}_D(X)|X])^2 + \var [\hat{f}_D(X)|X] + \var [\epsilon] \right]\\ = \; & \underbrace{\exp \left[ (f(X) - \exp [\hat{f}_D(X)|X])^2 \right]}_\text{bias} + \underbrace{\exp \left[ \var [\hat{f}_D(X)|X] \right]}_\text{variance} + \underbrace{\var [\epsilon]}_\text{irreducible error} \end{split} \end{equation} $$

The last equation presents the bias-variance decomposition of the expected error \( \exp [d(Y, \hat{f}_D(X))] \) accounting both for the variability in the irreducible error \( \epsilon \) and in the dataset \( D \) and averaged over all the possible values of \( X \). This will be our criterion from now on to assess how a certain method \( \hat{f} \) performs on the particular supervised learning problem.
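As a sanity check, this criterion can be estimated by Monte Carlo: sample many datasets, retrain the model, and average the pointwise bias and variance over draws of \( X \sim \distr (X) \). The sketch below does exactly that for a KNN model, with the same hypothetical data-generating process as before (the notebook computes the decomposition in its own way).

```python
# Monte Carlo estimate of the full bias-variance decomposition, averaged over
# X ~ P(X). Hypothetical data-generating process; here P(X) = P_dataset(X).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def f(x):
    return 0.5 * x + 2.0 * np.sin(x)                   # assumed regression function

sigma_eps = 1.5                                        # assumed noise std
rng = np.random.default_rng(0)
x_eval = rng.uniform(10.0, 20.0, size=2_000)           # draws of X ~ P(X)

preds = []
for _ in range(300):                                   # realizations of D
    x_tr = rng.uniform(10.0, 20.0, size=50)
    y_tr = f(x_tr) + rng.normal(0.0, sigma_eps, size=50)
    m = KNeighborsRegressor(n_neighbors=11).fit(x_tr.reshape(-1, 1), y_tr)
    preds.append(m.predict(x_eval.reshape(-1, 1)))
preds = np.array(preds)

bias_term = np.mean((f(x_eval) - preds.mean(axis=0)) ** 2)   # E[(f(X) - E[f_hat_D(X)|X])^2]
variance_term = np.mean(preds.var(axis=0))                   # E[Var[f_hat_D(X)|X]]
irreducible = sigma_eps ** 2
print(bias_term, variance_term, irreducible, bias_term + variance_term + irreducible)
```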

Class of models and hyperparameters

One of the main questions we face when dealing with a supervised learning problem is that of model selection (i.e. choosing the best prediction function/method \( \hat{f} \)). For a given \( \hat{f} \), we have shown a theoretical definition of the expected prediction error, which tells us how good the model is. However, \( \distr (Y|X) \) is unknown, so we cannot actually calculate the expected prediction error. There are ways to approximate it (e.g. with cross-validation), but those methods are outside the scope of this post. In practice, when selecting a model, we choose from a class of models:

$$ F = \left\{ \hat{f}^{(1)}, \hat{f}^{(2)}, \dots \right\} $$

Often those models are parameterized with a parameter \( h \) (a.k.a. model hyperparameter), which also indexes the class of models \( F \). Hence, the final prediction function depends both on the hyperparameter \( h \) and on the dataset \( d \):

$$ \hat{f} = \hat{f}_{h,d} $$

Example: In the case of KNN, the hyperparameter \( h \) is just the number of neighbours \( k \).
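As a tiny illustration (using scikit-learn estimators as the concrete \( \hat{f} \), which is just one possible choice), the class \( F \) indexed by \( h = k \) can be written down literally:

```python
# A class of models F indexed by the hyperparameter h = k.
# scikit-learn estimators are used as the concrete f_hat; an assumed choice.
from sklearn.neighbors import KNeighborsRegressor

model_class = {k: KNeighborsRegressor(n_neighbors=k) for k in range(1, 46, 2)}
# f_hat_{h, d} is obtained by calling model_class[h].fit(x_d, y_d)
```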

Model flexibility (or complexity)

As we have shown, the expected error decomposes into bias, variance and irreducible error. The three terms are non-negative, and changing the model \( \hat{f} \) can only affect the bias and the variance. Ideally, we would like both of them to be as close as possible to \( 0 \). In practice, the models that perform badly on a particular problem fall into two distinct groups.

Some models have a high bias (i.e. a high \( \exp [ (f(X) - \exp [\hat{f}_D(X)|X])^2 ] \)), which means that on average they predict values far away from the actual mean. Usually this is because the model assumes a more rigid (and thus wrong) structure of \( f \). These types of models are less flexible (a.k.a. less complex) and the effect of the training dataset on them is less significant: changing the dataset doesn't change the prediction function \( \hat{f} \) much, which means that the variance is low. A high total error dominated by the bias term is called underfitting.

On the opposite side, there are models for which the variance (i.e. \( \exp [ \var [\hat{f}_D(X)|X] ] \)) is significantly higher than the bias. Normally these are more flexible (a.k.a. more complex) models with less strict assumptions on the structure of \( f \). Due to this, the bias is usually lower. However, since the flexibility of the model is high, small changes in the dataset can lead to large fluctuations in the predictions, which significantly increases the variance: the model is too dependent on the training dataset. This problem is called overfitting.

The bottom line is that for a given class of models, increasing the flexibility decreases the bias, but increases the variance. Models with low flexibility have a dominating bias and high overall error. Models with high flexibility have a dominating variance and, again, high overall error. Between those two extremes there is a sweet spot of models that neither have high bias, nor high variance, and thus have the least overall error (among the models in this particular class).

Example: In the plot below three KNN models with different values for the hyperparameter \( k \) have been trained on our dataset. The prediction of each KNN model is the average of the \( Y \) values of the closest \( k \) training points. As one can see, when \( k=1 \), \( \hat{f} \) depends only on the closest training point and thus the predictions change a lot. On the other hand, when \( k=45 \) the predictions are less local and more fixed; due to this we can see higher deviations of \( \hat{f} \) towards the ends of the dataset. \( k=11 \) is the sweet spot, since then \( \hat{f} \) is not overfitted but still manages to follow the local curvature of \( f \). This actually results in the best model with the least total error.

Three trained K-Nearest Neighbors models

The three models we have explored are taken from a range of K-Nearest Neighbors models trained on the same supervised learning problem. The bias, variance, irreducible error and total error of all these models are plotted in the image below. When the hyperparameter \( k \) is higher the models are less flexible, and when \( k \) is lower the models are more flexible, so the flexibility increases along the x-axis.

Bias-variance decomposition for a range of K-Nearest Neighbors models

The plot also shows the dataset error (a.k.a. the training error) \( \underset{(x_i, y_i) \in D}{\operatorname{Avg}} d(y_i, \hat{f}_D(x_i)) \) which indicates how closely the model fits the training data. Normally, as the flexibility of the model increases, the dataset error decreases (even down to \( 0 \)). In fact, the most flexible models can ‘remember’ the dataset by interpolating through its data points, which we have seen is the case with KNN for \( k=1 \).
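The sweep behind the plot can be sketched as follows: for each \( k \), retrain on fresh datasets and record both the expected prediction error and the dataset error. The data-generating process below is again a hypothetical stand-in, not the notebook's exact setup.

```python
# For each k, estimate the expected prediction error and the dataset (training)
# error by retraining on freshly sampled datasets. Hypothetical setup, as before.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def f(x):
    return 0.5 * x + 2.0 * np.sin(x)

sigma_eps, rng = 1.5, np.random.default_rng(0)
x_eval = rng.uniform(10.0, 20.0, size=1_000)

for k in [45, 11, 1]:
    test_errors, train_errors = [], []
    for _ in range(200):
        x_tr = rng.uniform(10.0, 20.0, size=50)
        y_tr = f(x_tr) + rng.normal(0.0, sigma_eps, size=50)
        m = KNeighborsRegressor(n_neighbors=k).fit(x_tr.reshape(-1, 1), y_tr)
        y_eval = f(x_eval) + rng.normal(0.0, sigma_eps, size=x_eval.size)
        test_errors.append(np.mean((y_eval - m.predict(x_eval.reshape(-1, 1))) ** 2))
        train_errors.append(np.mean((y_tr - m.predict(x_tr.reshape(-1, 1))) ** 2))
    print(f"k={k:2d}  expected error ~ {np.mean(test_errors):.2f}  dataset error ~ {np.mean(train_errors):.2f}")
# For k=1 the dataset error is ~0 (the model interpolates its training points),
# while the expected error is inflated by variance; very large k under-fits.
```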

In order to make the concept of ‘model flexibility’ clearer, I have created a Python notebook that calculates the bias-variance decomposition for a range of models on simple supervised learning tasks. The above images have been created with the help of this code. I highly recommend playing with the code in order to understand, through concrete examples, the theoretical concepts we have explored.

The classification setting

In classification problems the random variable \( Y \) is discrete. Let’s assume it can take values among the set of classes \( \{1, 2, \dots, K\} \) (a.k.a. labels). When predicting the value of \( Y \) we either guess the correct class or not, so the squared error function \( (y_1 - y_2)^2 \) that we used for regression is no longer applicable. Instead we use the following distance function (based on the indicator function, a.k.a. the 0-1 loss), which outputs \( 0 \) for a correct classification and \( 1 \) for a misclassification:

$$ d(y_1, y_2) = I(y_1 \neq y_2) = \begin{cases} 0 & y_1 = y_2 \\ 1 & y_1 \neq y_2 \end{cases} $$

For a fixed \( X=x_0 \) and prediction value \( \hat{y}_0 \) the expected error is:

$$ \begin{equation} \begin{split} & \exp \left[ d(Y, \hat{y}_0) | X = x_0 \right]\\ = \; & \exp \left[ I(Y \neq \hat{y}_0) | X = x_0 \right]\\ = \; & \prob (Y \neq \hat{y}_0 | X = x_0)\\ = \; & 1 - \prob (Y=\hat{y}_0 | X = x_0) \end{split} \end{equation} $$

The only adjustable term in this equation is the prediction class \( \hat{y}_0 \). In order to minimize the expected error we have to choose \( \hat{y}_0 \) which maximizes \( \prob (Y=\hat{y}_0 | X=x_0) \). In other words, we have to choose the most likely class given the current value of \( X \). The classifier that chooses the most likely class for \( Y \) and reaches the minimal possible expected error is called the Bayes classifier:

$$ \hat{f}_{bayes}(x) = \underset{y}{\operatorname{argmax}} \prob (Y=y | X=x) $$

The corresponding minimum achievable expected error is called the Bayes error:

$$ \exp [ I(Y \neq \hat{f}_{bayes}(X)) | X = x_0] = 1 - \underset{y}{\operatorname{max}} \prob (Y=y | X=x_0) $$

Accounting for all possible values of \( X \) the Bayes error is:

$$ \begin{equation} \begin{split} & \exp [ I(Y \neq \hat{f}_{bayes}(X))]\\ = \; & \exp \left[ \exp [ I(Y \neq \hat{f}_{bayes}(X)) | X] \right]\\ = \; & \exp \left[1 - \underset{y}{\operatorname{max}} \prob (Y=y | X) \right]\\ = \; & 1 - \exp \left[ \underset{y}{\operatorname{max}} \prob (Y=y | X) \right] \end{split} \end{equation} $$

The Bayes error is analogous to the irreducible error in the regression setting, since it’s a minimum we cannot decrease further.
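To make the classification formulas concrete, here is a toy two-class problem where \( \distr (Y|X) \) is known explicitly, so the Bayes classifier and the Bayes error can be computed directly (the logistic form of \( \prob (Y=1|X=x) \) and the uniform \( \distr (X) \) are just assumptions for this example):

```python
# A toy two-class problem with a known P(Y|X), used to compute the Bayes
# classifier and the Bayes error directly. The logistic form of P(Y=1|X=x)
# and the uniform P(X) are assumed choices for illustration.
import numpy as np

def p_class1(x):
    """P(Y = 1 | X = x) for classes {1, 2}; an assumed logistic model."""
    return 1.0 / (1.0 + np.exp(-(x - 15.0)))

def bayes_classifier(x):
    # argmax_y P(Y = y | X = x): pick class 1 wherever it is the more likely one
    return np.where(p_class1(x) >= 0.5, 1, 2)

rng = np.random.default_rng(0)
x = rng.uniform(10.0, 20.0, size=100_000)                   # X ~ P(X), assumed uniform

pointwise_bayes_error = 1.0 - np.maximum(p_class1(x), 1.0 - p_class1(x))
print(f"Bayes error ~ {pointwise_bayes_error.mean():.3f}")  # E[1 - max_y P(Y=y|X)]
```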

Conclusion

Hopefully you managed to take something from this post that helps you understand supervised learning in a deeper way. If this is the case, please let me know in the comments section below; I would be extremely happy to know that there are such people. Even if that’s not the case, I am open to any thoughts, remarks and feedback in general.


Last modified on 2021-04-04