---
title: "Analysis of CO2 data with Gaussian Processes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysis of CO2 data with Gaussian Processes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

## Learner Outcomes: 

This vignette demonstrates a simplified version of the Mauna Loa case study presented by @Rasmussen2006 [p. 119-121],  using `zoomerGP`. The analysis involves fitting an additive GP model to co2 data collected at the [Mauna Loa Observatory](https://scrippsco2.ucsd.edu/data/atmospheric_co2/) from from 1959 to 1997. The data are included with R in the `co2` dataset. 

After completing the vignette, you should be able to:

1. Be able to write and fit a basic GP model with Empirical Bayes using `zoomerGP`
2. Extract predictions from this model and plot them
3. Use `zoomerGP` to generate predictions from sub-components of the model and examine them


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Getting Started: Setup and Data Processing

First, we load the libraries we will use for data analysis (`zoomerGP` for fitting the GP models and `ggplot2` for plotting the raw data). We then transform the built-in co2 dataset to a tidy dataframe with one row per observation.

```{r setup}
library(zoomerGP)
library(ggplot2)

data <- data.frame(
  year = (1:length(co2) - 1)/12 + 1959,
  co2 = c(co2)
)

head(data)
```

Using `ggplot2`, we can then plot the raw data:


```{r, out.width="100%",  fig.width=8, fig.height=5}
ggplot(data, aes(x=year, y=co2)) +
  geom_point() + 
  theme_bw() + 
  xlab("Year") + 
  ylab("co2 concentration (parts per million)")
```
## Specifying and Fitting the Model

The data appear to be reflective of a long-term process combined with a seasonal process which repeats every 12 months. This motivates a simplified version of the model described by Rassmusen and Williams (2005), which describes the co2 concentration observed at a given time $(y_i)$ as follows:

\[
y_i = \alpha + f_1(t) + f_2(t) + \varepsilon_i
\]

Where $\alpha$ is an intercept term, and $\varepsilon_i$ is a noise term. $f_1$ and $f_2$ are mean-zero Gaussian processes that absorb the long-term and periodic components, respectively. Specifically, we parameterize the kernels as described below. The formulas for the RBF and Periodic kernels are are taken from *Bayesian Data Analysis*, 3rd edition [@gelmanbda p. 515].

* $f_1$ absorbs the long-term trend and is parameterized with an RBF kernel: $f_1 \sim GP(0, k_1), \quad k_2(t,t') = \sigma_1^2 \exp{\bigg(-\frac{(t-t')^2}{2l_1^2}\bigg)}$
* $f_2$ absorbs the periodic trend, and is parameterized by an RBF kernel multiplied by a periodic kernel with a period of one year: $f_2 \sim GP(0, k_2), \quad k_2(t,t') = \sigma_2^2 \exp{\bigg(-\frac{(t-t')^2}{2l_2^2}\bigg)} \sigma_3^2 \exp{\bigg(-\frac{\sin^2(\pi (t-t'))}{l_3^2}\bigg)}$. The inclusion of the RBF kernel allows for the periodic component to change slightly over time. 

`zoomerGP` provides a convenient formula syntax to describe this model, which we then fit using Empirical Bayes. 

```{r}
set.seed(1)
gp_model <- gaussian_process(
  co2 ~ periodic(year, periods = c(1.0))*rbf(year) + rbf(year), 
  data = data, 
  sparse = F, 
  )
gp_model
```

Examining the model fit object, we can see the log-likelihood, r-squared and. 

```{r, out.width="100%",  fig.width=8, fig.height=5}
prediction_data <- data.frame(year = 1959 + seq(0,90, length.out = 500))
prediction_data$preds <- predict(gp_model, prediction_data)[,1]
prediction_data$pred_var <- predict(gp_model, prediction_data)[,3]

prediction_data$ci_lo <- prediction_data$preds + 2*sqrt(prediction_data$pred_var)
prediction_data$ci_hi <- prediction_data$preds - 2*sqrt(prediction_data$pred_var)

ggplot() + 
  geom_point(data = data, aes(x=year, y=co2)) + 
  geom_line(data = prediction_data, aes(x=year, y=preds)) + 
  geom_line(data = prediction_data, aes(x=year, y=ci_hi), linetype = 2, col = "darkgrey") + 
  geom_line(data = prediction_data, aes(x=year, y=ci_lo), linetype = 2, col = "darkgrey") 
```


If we want to extract the periodic component, we can easily do so by isolating the ""