Bayesian analysis

(written by Juan Manuel Morales, translated to English by PBA)

Bayesian analysis is based on probability theory. In general, we work with random variables:

\[ P\left(X = x_i \right) = p_i \]

The random variable \(X\) takes the value \(x_i\) with probability \(p_i\). We use probability distributions to describe how these probabilities, and they can be continuous or discrete.

The discrete functions are defined by a probability mass function \(f(x)\) that defines the probability \(p_i\) for each possible value of the variable, \(x_i\). By definition, these probabilities are non-negative and never greater than \(1\). The sum of the probabilities of all possible values of \(x\) has to equal \(1\). The cumulative probability function \(F(x)\) gives us the probability that a random variables may be less than or equal to a particular value \(F(x_i) = P(X \leq x_i)\). An important difference between discrete and continuous distributions as that for continuous variables we have infinite possible values. So we don’t talk about the probability of obtaining one value in particular but rather of the probability density around a specific value. In general, we don’t pay much attention to the specific values of the probability density. Instead, we focus on relative comparisons. The probability density function is defined as \(f(x) = d F(x) / d x\). The probability densities of continuous variables have to be non-negative, but may be less than \(1\), and should integrate to \(1\) over the possible values of \(X\).

When we have more than one random variable, we can think about joint probabilities \(P(x,y)\) that give us the probability of obtaining \(x\) and \(y\) simultaneously. If the variables are independent, then that probability is equal to the product \(P(x) P(y)\).

We can define the marginal probability of the the variable \(X\) as

\[ P(x) = \int P(x,y) dy \]

And we define the joint probability of \(X\) and \(Y\) as their conditional probability

\[ P(x,y) = P(x|y)P(y) \] \[ P(x,y) = P(y|x)P(x) \]

Using these conditional relationships, we can derive Bayes’ rule:

\[ P(x,y) = P(x|y)P(y) = P(y|x)P(x) \]

\[ P(x|y) = \frac{P(y|x)P(x)}{P(y)} \]

Bayesian analyses consider the parameters of a model (\(\theta\)) as random variables, and characterizes the probability distributions of the parameters conditional on the observed data:

\[ P(\theta | y) = \frac{P(y| \theta)P(\theta)}{P(y)} \]

In other words, the posterior of the parameters \(\theta\) given the observations, or data, \(y\), is equal to the probability of the data conditional on the parameters (the likelihood) multiplied by the priors and divided by the total probability of the data. The likelihood function gives us the probability of observing the data conditional on the value of the parameters \(P(y \lvert \theta)\). The prior distribution of the parameters \(P(\theta)\) reflects the possible values of the parameters given our prior “beliefs,” or results of previous studies, or what seems to make sense for the study system (e.g., on the basis of prior information). The important thing is that we define the priors before we see the data. Finally, the total probability of the data is obtained by integrating the likelihood function over the possible values of the parameters:

\[ P(y) = \int P(y \vert \theta) d \theta \]

Bayesian analysis combined with numerical methods allows us to analyze models with many parameters, levels of variability, missing values, unobserved values, etc., but first we will begin with simple cases where we can calculate the posteriors directly.

To gain a good understanding of the whole process, let’s simulate some data. Imagine that we want to study the removal of fruit from \(10\) plants, each with \(100\) fruits, and suppose that a good model for these data is a Binomial distribution with a fixed probability of success:

set.seed(123)
nobs <- 10    # number of observations (plants)
fruits <- rep(100, nobs) # available fruits
p.rem <- 0.2  # probability of removal for each fruit
removals <- rbinom(nobs, size = fruits, prob = p.rem)

In this case, because the data model (how many fruits are removed) is a Binomial, if we use a Beta distribution as the prior for the parameter representing the probability of success (p.rem), it is possible to obtain an analytical result for the posterior. The posterior is another Beta distribution, but with its parameters informed by the observations. The Beta distribution is the conjugate of the Binomial. If the prior of the fruit removal rate is a Beta distribution with parameters \(\alpha\) and \(\beta\), we calculate the values of \(\alpha\) and \(\beta\) based on the number of observed fruit removals. The posterior of the fruit removal rate is then a Beta with \(\alpha = \sum y\), \(\beta = \sum (n-y)\) where \(y\) represents the number of the \(n\) available fruits that were removed. Here is how we do that in R:

# priors
alpha <- 1
beta  <- 1

pos.alpha <- alpha + sum(removals)
pos.beta  <- beta + sum(fruits - removals)

# expected value of the posterior
pos.alpha / (pos.alpha + pos.beta)

# quantiles of the posterior
qbeta(c(0.025, 0.975), pos.alpha, pos.beta)

# now a figure

n = nobs

curve(dbeta(x, alpha + sum(removals[1:n]) , 
            beta + sum(fruits[1:n] - removals[1:n])), 
      lwd = 2, 
      ylab = "Probability density", 
      xlab = "Probability of removal")
curve(dbeta(x, alpha, beta), lwd = 2, col = "gray", add = TRUE)
text(0.6, 2.5, "prior")
text(0.4, 9, "posterior")

With this script, we can see how the posterior changes if we change the prior, or if we change the number of observations. In general, the more data we have, the less relevant, or influential, the priors.

MCMC

We saw above that to calculate the marginal probability of the data we have to calculate \[ \int P(y \vert \theta) d \theta \]

In general, we can’t calculate these integrals for the models we want to fit. For example, in a simple regression with an intercept, a slope, and a variance term, we would have to solve a triple integral. So we use Markov Chain Monte Carlo to generate samples from the posterior distribution.

The links below show animations (made by Chi Feng) of some MCMC methods: