Overview

The purpose of this assignment is to practice visualizing and quantifying forecast accuracy, and using Monte Carlo simulations to characterize forecast uncertainty. For our case study we will use data on the Yellowstone National Park bison herd. We will ask whether we can use (observed) weather data to skillfully hindcast population dynamics. Because I will be asking you to consider various sources of uncertainty, we will deviate from the ARIMA and forecast package workflow, and code our own models and predictions.

The data

The data come from Hobbs et al. (2015), though I downloaded a more recent version of these data from Andrew Tredennick’s Github site. You can download the data from the course website here.

We will use associated climate data from PRISM, which you can download here.

Formatting the data

Getting the data formatted can be the most time-consuming part of any analysis, so I am giving you a bunch of code.

Here is what I did to get the bison data ready:

library(forecast)
library(tidyverse)  # loads dplyr, ggplot2, and tidyr

bison = read.csv("data/YNP_bison_counts.csv", stringsAsFactors = FALSE)
bison = select(bison,c(year,count.mean)) # drop non-essential columns
names(bison) = c("year","N") # rename columns

# the next few lines set up a lag N variable
tmp = bison    
tmp$year = tmp$year + 1
names(tmp)[2] = "lagN"
bison = full_join(bison,tmp)
bison = filter(bison,year > min(year) & year < max(year))  # drop incomplete observations
rm(tmp)

# add log transformed variables
bison = mutate(bison, logN = log(N))
bison = mutate(bison, loglagN = log(lagN))

For the weather data, I wanted to set things up so I could aggregate variables by “climate year” (often referred to as “water year”) rather than calendar year. Bison counts are completed by August each year; weather in September - December really shouldn’t affect the population in the year of the count, but it might affect the population in the following calendar year. So I start my “climate year” in September (month 9):

weather = read.csv("data/YNP_prism.csv", stringsAsFactors = FALSE)
weather = weather %>% separate(Date,c("year","month"), sep = '-')
weather$year = as.numeric(weather$year)
weather$month = as.numeric(weather$month)
weather$clim_year = ifelse(weather$month < 9, weather$year, weather$year + 1)
weather$clim_month = ifelse(weather$month < 9, weather$month + 4, weather$month - 8)
head(weather)
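If you want to convince yourself that the climate-year recoding above does what it should, here is a quick sanity check using a few hypothetical year-month pairs (the values below are made up for illustration): September 1980 should open climate year 1981 as month 1, and August 1981 should close it as month 12.

```r
# Hypothetical months spanning one climate year (Sep 1980 - Aug 1981)
month <- c(9, 12, 1, 8)
year  <- c(1980, 1980, 1981, 1981)

# Same recoding as in the weather data frame above
clim_year  <- ifelse(month < 9, year, year + 1)
clim_month <- ifelse(month < 9, month + 4, month - 8)

stopifnot(all(clim_year == 1981))
stopifnot(all(clim_month == c(1, 4, 5, 12)))
```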

Next, to make model fitting easier, I want to have a version of the data with all the weather covariates that correspond to one bison count in one row. Here is how I did that:

precip_wide = weather %>% 
                select(c(clim_year,clim_month,ppt_in)) %>%  # drop all the other climate variables
                spread(clim_month,ppt_in) # 'spread' is where the magic happens

# rename months (optional, but I think it makes the data frame easier to understand)
names(precip_wide)[2:13] = paste0("ppt_",c("Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug"))
head(precip_wide)

# aggregate by season
precip_wide = mutate(precip_wide, ppt_Fall = rowSums(precip_wide[,c("ppt_Sep","ppt_Oct","ppt_Nov")]))
precip_wide = mutate(precip_wide, ppt_Win = rowSums(precip_wide[,c("ppt_Dec","ppt_Jan","ppt_Feb")]))
precip_wide = mutate(precip_wide, ppt_Spr = rowSums(precip_wide[,c("ppt_Mar","ppt_Apr","ppt_May")]))
precip_wide = mutate(precip_wide, ppt_Sum = rowSums(precip_wide[,c("ppt_Jun","ppt_Jul","ppt_Aug")]))
head(precip_wide)

# merge with bison
bison_wide = left_join(bison,precip_wide,by=c("year" = "clim_year"))

My example does this just for the precipitation data. You could repeat the same steps if you also want to play with the temperature or vapor pressure deficit variables.

Modeling steps

  1. Split the full dataset into a training set, including all years up through 2011, and a test set, for years 2012-2017.

  2. Fit a simple Gompertz population model to the training data set, with no climate covariates. It should look something like:

null_model = lm(logN ~ loglagN, data = train_data)

  3. Fit an alternative model that includes at least one weather covariate (feel free to fancy it up if you like). Based on my previous explorations of this data set, I suggest including precipitation from one or more winter months (deep snow makes life hard for bison). So this model might look something like:
clim_model = lm(logN ~ loglagN + ppt_Jan, data = train_data) 

  4. Set up a Monte Carlo simulation to a) generate predictions for the test data period and b) simulate uncertainty due to both parameter error and process error. Why not just use the forecast() function? Because we fit the models using lm() instead of Arima(), and forecast() won’t realize that the lagged density term is autoregressive. Also, forecast() won’t include parameter error. Assume that future (test set) climate covariates are known, but that “true” lag population size is known only in year 2012 (the 2011 population size). After that, use your predicted population sizes as the lag values in subsequent years.

Hint: For these Monte Carlo simulations, you will need to adapt code from Simulating prediction intervals. The logistic growth example shows how to nest a time (year) loop inside a loop for parameter draws. The next example shows how to work with correlated parameters. You will have to figure out how to add the process error as well.
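To make the structure of that simulation concrete, here is a minimal, self-contained sketch. The toy train_data and test_data frames below are stand-ins for the real objects you built above (swap yours in), and the number of draws is arbitrary; the key pieces are the correlated parameter draws (parameter error), the rnorm() step (process error), and feeding each prediction forward as the next year’s lag value.

```r
library(MASS)  # mvrnorm() for correlated parameter draws

set.seed(1)
# --- toy stand-ins; replace with your real train_data / test_data ---
train_data <- data.frame(logN    = log(c(2500, 2700, 3000, 3300, 3600, 3900)),
                         loglagN = log(c(2300, 2500, 2700, 3000, 3300, 3600)),
                         ppt_Jan = c(1.2, 0.8, 1.5, 1.1, 0.9, 1.3))
test_data  <- data.frame(year = 2012:2014, ppt_Jan = c(1.0, 1.4, 0.7))

clim_model <- lm(logN ~ loglagN + ppt_Jan, data = train_data)

n_sims  <- 2000
n_years <- nrow(test_data)
sigma   <- summary(clim_model)$sigma       # residual SD = process error

# parameter error: draws from the fitted model's sampling distribution
par_draws <- mvrnorm(n_sims, mu = coef(clim_model), Sigma = vcov(clim_model))

sims <- matrix(NA, n_sims, n_years)
for (i in 1:n_sims) {
  b      <- par_draws[i, ]                 # intercept, loglagN, ppt_Jan
  loglag <- tail(train_data$logN, 1)       # last observed log count
  for (t in 1:n_years) {
    mu         <- b[1] + b[2] * loglag + b[3] * test_data$ppt_Jan[t]
    sims[i, t] <- rnorm(1, mu, sigma)      # add process error
    loglag     <- sims[i, t]               # feed prediction forward as lag
  }
}

pred_mean <- exp(colMeans(sims))           # point forecasts on the count scale
pred_CI   <- exp(apply(sims, 2, quantile, probs = c(0.025, 0.975)))
```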

  5. Evaluate the forecasts from the null model and the climate model using accuracy() from the forecast package. Syntax should look like:
accuracy(predictions, observations)

Lab report

Your assignment is to answer the question: How much (if at all) do the climate covariates improve your forecast? Your answer should be based on two complementary ways of comparing the two models:

  1. The accuracy of the point forecasts for the test data. By “point forecast,” I mean the mean or median of the Monte Carlo simulations of bison population size for the test data set years.

  2. Visualize the forecasts. Plot the observed population counts for the training and test data, the predictions for the test data, and the uncertainty around the predictions. You will probably want to make separate figures for the null model and the climate model.
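For both comparisons, it may help to see the scoring and plotting steps together. In the sketch below, sims is a stand-in for your matrix of Monte Carlo draws (rows = simulations, columns = test-set years, on the log scale), and obs is a stand-in for the observed test-set counts; both are fabricated here just so the code runs.

```r
library(forecast)
library(ggplot2)

set.seed(1)
obs  <- c(4200, 4900, 4600)                  # stand-in test-set counts
sims <- matrix(rnorm(2000 * 3, mean = log(4500), sd = 0.15), ncol = 3)

point_fc <- exp(colMeans(sims))              # point forecast: MC mean
accuracy(point_fc, obs)                      # ME, RMSE, MAE, MPE, MAPE

plot_df <- data.frame(year = 2012:2014,
                      obs  = obs,
                      pred = point_fc,
                      lo   = exp(apply(sims, 2, quantile, 0.025)),
                      hi   = exp(apply(sims, 2, quantile, 0.975)))

ggplot(plot_df, aes(x = year)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), fill = "grey80") +
  geom_line(aes(y = pred), color = "blue") +
  geom_point(aes(y = obs)) +
  labs(y = "Bison count")
```

For your report you would also add the training-period counts to the plot, and repeat the figure for each model.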

Bonus exercises

  1. Partition the overall uncertainty into contributions from parameter error and process error.

  2. Calculate coverage for the null model and the climate model.
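For the coverage exercise, the calculation is just the fraction of test-set observations that fall inside your Monte Carlo interval. Again, sims and obs below are fabricated stand-ins for your simulation matrix (log scale) and the observed counts:

```r
set.seed(1)
obs  <- c(4200, 4900, 4600)                          # stand-in counts
sims <- matrix(rnorm(2000 * 3, log(4500), 0.15), ncol = 3)

lo <- exp(apply(sims, 2, quantile, 0.025))           # 95% interval bounds
hi <- exp(apply(sims, 2, quantile, 0.975))
coverage <- mean(obs >= lo & obs <= hi)              # ideally near 0.95
```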

As usual, you should turn in both the report (text and figures) and your R script(s).