(by J. M. Morales, translated Spanish to English by PBA)

Understanding the distinct sources of error that produce bias and variance in our predictions can help us build better models.

Imagine that we can repeat the following process many times: collect data, fit a model to the data, and make predictions. In each of those replicates we have a different random sample, so the models we fit will generate a range of predictions. Bias measures how far those predictions are, on average, from the true values. Variance measures how much the predictions vary around a reference point (such as the mean prediction).

Suppose we want to predict a variable \(Y\) based on a covariate \(X\). We can assume a data-generating model like:

\[ Y = f \left( X \right) + \epsilon \]

where the error \(\epsilon\) follows a normal distribution with mean zero: \(\epsilon \sim \text{N} \left(0, \sigma_{\epsilon} \right)\).

With a regression or some other statistical method, we can estimate a model \(\hat{f} \left(X \right)\) to represent \(f\left(X \right)\). In this case, the expected squared prediction error at one point \(x\) is:

\[ \text{Error} \left( x \right) = \text{E} \left[ \left( Y − \hat{f} \left(x \right) \right)^2 \right] \]

Using the rules of probability, we can decompose that error into components of bias and variance:

\[ \text{Error} \left( x \right) = \left( \text{E} \left[\hat{f} \left( x \right) \right] - f \left(x \right) \right)^2 + \text{E} \left[ \left( \hat{f} \left(x \right) - \text{E}\left[ \hat{f} \left(x \right) \right] \right)^2 \right] + \sigma^2_{\epsilon} \]

Equivalently, \[ \text{Error} \left( x \right) = \text{Bias}^2 + \text{Variance} + \text{irreducible error} \]

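For reference, the decomposition follows from adding and subtracting \(\text{E} \left[ \hat{f} \left( x \right) \right]\), using the fact that \(\epsilon\) has mean zero and is independent of \(\hat{f} \left( x \right)\):

\[
\begin{aligned}
\text{E} \left[ \left( Y - \hat{f} \left( x \right) \right)^2 \right]
&= \text{E} \left[ \left( f \left( x \right) - \hat{f} \left( x \right) + \epsilon \right)^2 \right] \\
&= \text{E} \left[ \left( f \left( x \right) - \hat{f} \left( x \right) \right)^2 \right] + \sigma^2_{\epsilon} \\
&= \left( f \left( x \right) - \text{E} \left[ \hat{f} \left( x \right) \right] \right)^2 + \text{E} \left[ \left( \hat{f} \left( x \right) - \text{E} \left[ \hat{f} \left( x \right) \right] \right)^2 \right] + \sigma^2_{\epsilon}
\end{aligned}
\]

where the cross terms vanish because \(\text{E} \left[ \epsilon \right] = 0\) and \(\text{E} \left[ \hat{f} \left( x \right) - \text{E} \left[ \hat{f} \left( x \right) \right] \right] = 0\).
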
The irreducible error represents variability unexplained by the model, even when we have perfect knowledge of \(f\left(x\right)\). This variability is outside our control. In contrast, the bias and variance components could be reduced if we knew the data-generating model and had lots of data. But normally we work with few data and imperfect models, so we face a compromise between bias and variance (the data contain a finite amount of information that we have to divide among the parameters we want to estimate).

Here is an example with simulated data:

set.seed(123)
n <- 20
a <- 0
b1 <- 0.15
b2 <- -0.4
s <- 1

x1 <- runif(n, -3, 3)
x1 <- sort(x1) # for graphing
x2 <- runif(n, -3, 3)

m <- a + b1 * x1 + b2 * x2
y <- rnorm(n, mean = m, sd = s)

# simulate additional "noise" variables (not used to generate y)
x3 <- runif(n, -3, 3)
x4 <- runif(n, -3, 3)

# fit linear models of increasing complexity
m1 <- lm(y ~ 1)
m2 <- lm(y ~ x1)
m3 <- lm(y ~ x1 + x2)
m4 <- lm(y ~ x1 + x2 + x3)
m5 <- lm(y ~ x1 + x2 + x3 + x4)

plot(x1, y)
lines(x1, predict(m1), col = rgb(0, 0, 0, 0.5), lwd = 2)
lines(x1, predict(m2), col = rgb(0.25, 0, 0.6, 0.5), lwd = 2)
lines(x1, predict(m3), col = rgb(0.5, 0, 0.6, 0.5), lwd = 2)
lines(x1, predict(m4), col = rgb(0.75, 0, 0.6, 0.5), lwd = 2)
lines(x1, predict(m5), col = rgb(1, 0, 0, 0.6), lwd = 2)

We can calculate the “training error” of each model and verify that, as model complexity increases, this error decreases.

tr_e = numeric(5)
tr_e[1] = mean((y - predict(m1))^2)
tr_e[2] = mean((y - predict(m2))^2)
tr_e[3] = mean((y - predict(m3))^2)
tr_e[4] = mean((y - predict(m4))^2)
tr_e[5] = mean((y - predict(m5))^2)

plot(tr_e, type = "o", xlab = "model complexity", ylab = "error")

Now let's see what happens if we simulate new data sets (new response values), retaining the original values of the predictors:

nrep <- 1000

f_hat <- array(NA, dim = c(nrep, n, 5))
pr_e <- matrix(NA, nrep, 5)

for(i in 1:nrep){

  # keep the original values of the predictors; simulate new responses only
  y <- rnorm(n, mean = m, sd = s)      # new "out-of-sample" observations
  y_rep <- rnorm(n, mean = m, sd = s)  # new "training" observations

  # refit the models to the new training data
  m1 <- lm(y_rep ~ 1)
  m2 <- lm(y_rep ~ x1)
  m3 <- lm(y_rep ~ x1 + x2)
  m4 <- lm(y_rep ~ x1 + x2 + x3)
  m5 <- lm(y_rep ~ x1 + x2 + x3 + x4)
  
  f_hat[i, , 1] <- predict(m1)
  f_hat[i, , 2] <- predict(m2)
  f_hat[i, , 3] <- predict(m3)
  f_hat[i, , 4] <- predict(m4)
  f_hat[i, , 5] <- predict(m5)
  
  pr_e[i, 1] <- mean((y - predict(m1))^2)
  pr_e[i, 2] <- mean((y - predict(m2))^2)
  pr_e[i, 3] <- mean((y - predict(m3))^2)
  pr_e[i, 4] <- mean((y - predict(m4))^2)
  pr_e[i, 5] <- mean((y - predict(m5))^2)
}

# calculate the (squared) bias and the variance of the predictions at each point

B <- numeric(5)   # squared bias of each model
V <- numeric(5)   # variance of the predictions of each model
Ef <- matrix(NA, n, 5)
for(i in 1:5){
  Ef[ , i] <- colMeans(f_hat[, , i])   # average prediction at each x
  tmp <- matrix(NA, nrep, n)
  for(j in 1:nrep){
    tmp[j, ] <- (f_hat[j, , i] - Ef[, i])^2
  }
  B[i] <- mean((Ef[, i] - m)^2)
  V[i] <- mean(tmp)
}

We can see how the (squared) bias and the variance change with the complexity of the model:


plot(B, type = "l", lwd = 2, xlab = "model complexity", ylab = "", ylim = c(0, 1))
lines(V, col = 2, lwd = 2)
legend("top", c("bias^2", "variance"), lty = 1, col = c(1, 2))

And here we compare the training (in-sample) error with the average out-of-sample prediction error of our models:


plot(tr_e, ylim = c(0.5,2), type = "l", lwd = 3, xlab = "model complexity", ylab = "error")
lines(colMeans(pr_e), col = 2, lwd = 3)
legend("top",c("in","out"),lty=1,col=c(1,2))

Another way of thinking about this problem is that we have to balance the risk of underfitting the model against the risk of overfitting it. Models that underfit are insensitive to the addition of new data, while models that overfit are overly sensitive to it.


op <- par(mfrow = c(3, 1))

# intercept-only model (underfit)
plot(NULL, xlim = c(-3, 3), ylim = c(-2, 2), xlab = "x1", ylab = "prediction")
for(i in 1:50){
  lines(x1, f_hat[i, , 1], col = rgb(0, 0, 0, 0.2))
}
# model with x1 + x2 (the true structure)
plot(NULL, xlim = c(-3, 3), ylim = c(-2, 2), xlab = "x1", ylab = "prediction")
for(i in 1:50){
  lines(x1, f_hat[i, , 3], col = rgb(0, 0, 0, 0.2))
}
# most complex model (overfit)
plot(NULL, xlim = c(-3, 3), ylim = c(-2, 2), xlab = "x1", ylab = "prediction")
for(i in 1:50){
  lines(x1, f_hat[i, , 5], col = rgb(0, 0, 0, 0.2))
}

par(op)

Information theory and AIC

How do we find the optimal point between over- and underfitting? First, we have to define a measure of how good or bad the model is. In the previous example, we focused on prediction error. But there are good reasons to consider measures based on information theory.

Information theory provides a good way to measure the distance between two probability distributions.

Information entropy: \[ H \left( p \right) = - \text{E} \left[ \log \left( p_i \right) \right] = - \sum_{i=1}^{n} p_i \log \left( p_i \right) \]

The uncertainty contained in a probability distribution is (minus) the average log-probability of the events.

Divergence (Kullback-Leibler): The additional uncertainty generated by using one probability distribution to describe another.

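A quick numerical sketch may help; the two discrete distributions below are made up just for illustration:

p <- c(0.3, 0.7)      # a "true" distribution (hypothetical)
q <- c(0.25, 0.75)    # a distribution used to approximate p (hypothetical)

-sum(p * log(p))      # information entropy of p
sum(p * log(p / q))   # Kullback-Leibler divergence of describing p with q
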
To estimate a model's (relative) divergence we use the deviance, where \(p\left(y_i\right)\) is the model's likelihood for observation \(i\):

\[ D = -2 \sum_{i} \log \left( p \left( y_i \right) \right) \]

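As a quick check (a small made-up example, not from the original post), the deviance computed from the Gaussian densities matches R's \(-2 \times\) logLik() for a linear model, as long as we plug in the maximum-likelihood estimate of \(\sigma\):

z <- rnorm(30)
w <- rnorm(30, mean = 1 + 0.5 * z)
fit <- lm(w ~ z)
s_ml <- sqrt(mean(residuals(fit)^2)) # ML estimate of sigma (divides by n, not n - 2)

-2 * sum(dnorm(w, mean = fitted(fit), sd = s_ml, log = TRUE))
-2 * logLik(fit)
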
Let’s see what happens to the deviance when we use estimates from one sample to predict new data.

set.seed(123)
n <- 20
a <- 0
b1 <- 0.15
b2 <- -0.4
s <- 1

nrep <- 1000

dev_in  <- matrix(NA, nrep, 5)
dev_out <- matrix(NA, nrep, 5)

for(i in 1:nrep){

  # in-sample and out-of-sample predictors and responses
  X_in  <- matrix(runif(n*4, -3, 3), n, 4)
  X_out <- matrix(runif(n*4, -3, 3), n, 4)

  y_in  <- rnorm(n, mean = a + b1 * X_in[,1] + b2 * X_in[,2], sd = s)
  y_out <- rnorm(n, mean = a + b1 * X_out[,1] + b2 * X_out[,2], sd = s)

  # models of increasing complexity, fit to the in-sample data
  m1 <- lm(y_in ~ 1)
  m2 <- lm(y_in ~ X_in[,1])
  m3 <- lm(y_in ~ X_in[,1:2])
  m4 <- lm(y_in ~ X_in[,1:3])
  m5 <- lm(y_in ~ X_in[,1:4])
  
  # in-sample deviance (using the known sigma)
  dev_in[i, 1] <- -2 * sum(dnorm(y_in, coef(m1), sd = s, log = TRUE))
  dev_in[i, 2] <- -2 * sum(dnorm(y_in, cbind(numeric(n)+1, X_in[,1]) %*% coef(m2), sd = s, log = TRUE))
  dev_in[i, 3] <- -2 * sum(dnorm(y_in, cbind(numeric(n)+1, X_in[,1:2]) %*% coef(m3), sd = s, log = TRUE))
  dev_in[i, 4] <- -2 * sum(dnorm(y_in, cbind(numeric(n)+1, X_in[,1:3]) %*% coef(m4), sd = s, log = TRUE))
  dev_in[i, 5] <- -2 * sum(dnorm(y_in, cbind(numeric(n)+1, X_in[,1:4]) %*% coef(m5), sd = s, log = TRUE))
  
  # out-of-sample deviance (same coefficients, applied to new data)
  dev_out[i, 1] <- -2 * sum(dnorm(y_out, coef(m1), sd = s, log = TRUE))
  dev_out[i, 2] <- -2 * sum(dnorm(y_out, cbind(numeric(n)+1, X_out[,1]) %*% coef(m2), sd = s, log = TRUE))
  dev_out[i, 3] <- -2 * sum(dnorm(y_out, cbind(numeric(n)+1, X_out[,1:2]) %*% coef(m3), sd = s, log = TRUE))
  dev_out[i, 4] <- -2 * sum(dnorm(y_out, cbind(numeric(n)+1, X_out[,1:3]) %*% coef(m4), sd = s, log = TRUE))
  dev_out[i, 5] <- -2 * sum(dnorm(y_out, cbind(numeric(n)+1, X_out[,1:4]) %*% coef(m5), sd = s, log = TRUE))
}

xs = 1:5
sd_in <- numeric(5)
for(i in 1:5) sd_in[i] <- sd(dev_in[,i])
sd_out <- numeric(5)
for(i in 1:5) sd_out[i] <- sd(dev_out[,i])

plot(xs, colMeans(dev_in), type = "o", xlim = c(1,5.5), ylim = c(40,90), xlab = "Complexity", ylab = "Deviance")

lines(xs+0.2, colMeans(dev_out), type = "o", pch = 19)

segments(xs, colMeans(dev_in)-sd_in, xs, colMeans(dev_in)+sd_in)
segments(xs+0.2, colMeans(dev_out)-sd_out, xs+0.2, colMeans(dev_out)+sd_out)

We see that the in-sample deviance decreases steadily with model complexity, while the out-of-sample deviance has a minimum at model \(3\). In addition, the difference between these two deviances is approximately 2 \(\times\) the number of parameters in the model. AIC uses exactly that correction: \(\text{AIC} = \text{deviance} + 2 k\), where \(k\) is the number of parameters in the model. Keep in mind, however, that AIC approximates an expected (average) result; between random samples there is a lot of variability (as the error bars show).

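For example, we can ask R for the AIC of the five models fit in the last iteration of the loop above. Note that AIC() uses the estimated residual variance (and counts it as an extra parameter), so the values will not exactly match the fixed-\(\sigma\) deviances computed above:

sapply(list(m1, m2, m3, m4, m5), AIC)
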
The problem of overfitting occurs because the model gets “too excited” about the sample.

M and S errors

Another manifestation of the overfitting problem is the occurrence of M (for magnitude) and S (for sign) errors. If a real effect is weak and we try to measure it with small, noisy samples (measurements with poor precision), we will only detect statistically significant effects when the magnitude of the effect is exaggerated (type M error), and this can even happen when the estimated sign of the effect is opposite to the true one (type S error). A good reference on this topic is here.

Let’s try to see this with some simulations. We will suppose that there is a true effect of a predictor variable on a response, but the magnitude of the effect is small. We can simulate this with a simple regression:

set.seed(123)

n = 20
b0 = 0.5
b1 = 0.1
sigma = 1 

x = runif(n, -3,3)
y = rnorm(n, b0 + b1 * x, sd = sigma)

plot(x, y)
curve(b0 + b1 * x, add = TRUE, lwd = 2, col = 2)
abline(lm(y~x))

We can do this with many replicates, save just the replicates that reveal a statistically significant estimate of the slope (b1), and then graph these estimated relationships between \(x\) and \(y\).

n.reps = 1000
n = 20
res = matrix(NA, n.reps, 2)
se = numeric(n.reps) # to store the standard errors of the slope
ps = numeric(n.reps) # to store the p-values of the slope

for(i in 1: n.reps){
  x = runif(n, -3,3)
  y = rnorm(n, b0 + b1 * x, sd = sigma)
  tmp = summary(lm(y~x))$coefficients
  se[i] = tmp[2,2]
  ps[i] = tmp[2,4]
  res[i, ] = tmp[,1]
}

plot(NULL, xlim = c(-3, 3), ylim = c(-1, 3), xlab = "x", ylab="y")
for(i in 1: n.reps){
  if(ps[i] <= 0.05){
    curve(res[i,1] + res[i,2] * x, add = TRUE, col=rgb(0,0,0, 0.2))
  } 
} 

curve(b0 + b1 * x, add = TRUE, col = 2, lwd = 3)

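With the objects saved in the loop we can also quantify these errors directly (a small addition to the original analysis): the proportion of “significant” replicates, the average exaggeration of the slope among them (type M error), and how often the estimated slope has the wrong sign (type S error).

sig <- ps <= 0.05             # replicates with a "significant" slope
mean(sig)                     # proportion of significant replicates (power)
mean(abs(res[sig, 2])) / b1   # average exaggeration factor (type M error)
mean(res[sig, 2] < 0)         # proportion with the wrong sign (type S error)
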
Regularization techniques try to minimize overfitting by imposing various kinds of penalties or constraints on the model coefficients.
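
As a minimal sketch (not part of the original post), here is ridge regression coded by hand for the simulation above; the function name ridge_fit and the value of lambda are arbitrary choices for illustration. The L2 penalty shrinks the slope estimate toward zero, which counteracts the kind of exaggerated estimates we just saw.

# ridge estimate: minimize sum((y - X %*% b)^2) + lambda * sum(b[-1]^2)
ridge_fit <- function(x, y, lambda) {
  X <- cbind(1, x)    # design matrix with intercept
  P <- diag(c(0, 1))  # penalize the slope but not the intercept
  solve(t(X) %*% X + lambda * P, t(X) %*% y)
}

x <- runif(n, -3, 3)
y <- rnorm(n, b0 + b1 * x, sd = sigma)
coef(lm(y ~ x))               # ordinary least-squares estimates
ridge_fit(x, y, lambda = 10)  # shrunken (regularized) estimates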