class: center, middle, inverse, title-slide # Causal Inference on Time Series Data ## with CausalImpact ### Brandon Beidel ### 2020/09/24 --- class: inverse middle .center[  ] <br> <small>Imagine a world where you rarely leave home, have little separation between your work and home life, have lost track of time, and haven't quite mastered a regular sleep cycle. __Great, I'm right there with you.__</small> --- class: inverse ## Was there a decline in sleep quality in March? <!-- --> <small>With a fitness tracker, we have an imperfect proxy. **At first glance, it looks like there is a shift in late March, were there any major changes?** If we identify a potential cause, can estimate how much it has affected sleep quality?</small> --- class: inverse ## Why might that be? Oh right...  --- ## What are our limitations when estimating the impact? .large[ 1. We only have historical data for a single subject 2. Too expensive to recreate an experiment 3. Time travel not yet available ] .pull-bottom[ **Our challenge:** We need to estimate the outcome as if the 'treatment' never occurred. (i.e., the counterfactual) ] --- ## An easy-to-estimate counterfactual `Y(0)`  --- ## Estimate `Y(0)` with `CausalImpact`  `CausalImpact` is an R package for causal inference using Bayesian structural time-series models. --- ## Why build this model? Our goal is to estimate the `counterfactual` `\(Y(0)\)` (i.e, a synthetic control). That `counterfactual` represents a belief about observed data `\(Y(1)\)` if the treatment had never occurred. ### What are the inputs and assumptions? 1. Observed Data `\(Y(1)\)` 2. A `treatment` with a well defined start and end date. 3. A series of `covariate` time-series that **are assumed to be**: - Co-occuring during the same date range as `\(Y(1)\)` - Correlated to `\(Y(1)\)` in the defined `pre-period` - Have an relationship with the observed `\(Y(1)\)` established in the `pre-period` which remains stable through the `post-period` - Are NOT themselves impacted by the `treatment` --- ```r library(zoo); set.seed(1); x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100) y <- 1.2 * (x1) + rnorm(100) y[71:100] <- y[71:100] + 10 start <- as.Date("2020-01-01") time.points <- seq.Date(start, by = 1, length.out = 100) data <- zoo::zoo(cbind(y, x1), time.points) zoo::autoplot.zoo(data, facets = NULL) ``` <!-- --> --- ```r pre.period <- as.Date(c("2020-01-01", "2020-03-11")) post.period <- as.Date(c("2020-03-12", "2020-04-09")) impact <- CausalImpact(data, pre.period, post.period) plot(impact) ``` <!-- --> --- ```r summary(impact) ``` ``` ## Posterior inference {CausalImpact} ## ## Average Cumulative ## Actual 117 3394 ## Prediction (s.d.) 107 (0.45) 3099 (13.00) ## 95% CI [106, 108] [3074, 3125] ## ## Absolute effect (s.d.) 10 (0.45) 295 (13.00) ## 95% CI [9.3, 11] [268.7, 321] ## ## Relative effect (s.d.) 9.5% (0.42%) 9.5% (0.42%) ## 95% CI [8.7%, 10%] [8.7%, 10%] ## ## Posterior tail-area probability p: 0.00101 ## Posterior prob. of a causal effect: 99.89889% ## ## For more details, type: summary(impact, "report") ``` --- ## Finding Relevant Covariates ```r library(gtrendsR) keywords <- c("insomnia", "pet adoption", "wfh", "sourdough") trends <- gtrendsR::gtrends(keywords, time ="2020-01-01 2020-09-01") plot(trends) ``` <!-- --> .pull-bottom[ _Note:_ For our sleep problem, these are likely poor choices because they likely violate the assumption that the covariates are unaffected by the treatment, COVID19 and the government response in the United States. ] --- ## Can we use the weather for our covariates instead? <!-- --> --- ```r pre.period <- as.Date(c("2019-09-04", "2020-03-27")) post.period <- as.Date(c("2020-03-28", "2020-09-02")) s_impact <- CausalImpact(sleep_data, pre.period, post.period) plot(s_impact) ``` <!-- --> --- ```r summary(s_impact) ``` ``` ## Posterior inference {CausalImpact} ## ## Average Cumulative ## Actual 79 12544 ## Prediction (s.d.) 78 (1.1) 12468 (171.7) ## 95% CI [76, 81] [12119, 12804] ## ## Absolute effect (s.d.) 0.48 (1.1) 76.60 (171.7) ## 95% CI [-1.6, 2.7] [-260.0, 425.0] ## ## Relative effect (s.d.) 0.61% (1.4%) 0.61% (1.4%) ## 95% CI [-2.1%, 3.4%] [-2.1%, 3.4%] ## ## Posterior tail-area probability p: 0.327 ## Posterior prob. of a causal effect: 67% ## ## For more details, type: summary(impact, "report") ``` --- ## Inconclusive? Some popular reasons... 1. There was no impact. 2. The true treatment occurred at another time. 3. An inappropriate prior was chosen. 4. Multiple, 'treatments' occurred in same time period. 5. Reliance on too few covariates (spurious correlation overtook model) 6. Covariant time-series are not correlated with target `pre.period` .pull-bottom[ __In the case of the prior analysis, many of these are probably true.__ ] --- ## Started WFH on March 16th, is it the 'treatment' we are looking for? ```r pre.period <- as.Date(c("2019-09-04", "2020-03-15")) post.period <- as.Date(c("2020-03-16", "2020-09-02")) s_impact <- CausalImpact(sleep_data, pre.period, post.period) plot(s_impact) ``` <!-- --> --- ## Outputs are sensitive to period, prior, and covariates. ```r summary(s_impact) ``` ``` ## Posterior inference {CausalImpact} ## ## Average Cumulative ## Actual 78 13411 ## Prediction (s.d.) 81 (1) 13845 (179) ## 95% CI [79, 83] [13493, 14203] ## ## Absolute effect (s.d.) -2.5 (1) -434.4 (179) ## 95% CI [-4.6, -0.48] [-791.7, -82.07] ## ## Relative effect (s.d.) -3.1% (1.3%) -3.1% (1.3%) ## 95% CI [-5.7%, -0.59%] [-5.7%, -0.59%] ## ## Posterior tail-area probability p: 0.01025 ## Posterior prob. of a causal effect: 98.975% ## ## For more details, type: summary(impact, "report") ``` --- ## Under the hood: State Space Model A state space model, defined by an observation equation (1), linking observed data `\(y_{t}\)` to a latent state `\(\alpha_{t}\)` over time. And a state equation (2), dictating the change in state `\(\alpha_{t}\)` over time. `$$y_{t} = Z_{t}^{T} \alpha_{t} + \epsilon_{t} \\ \alpha_{t+1} = T_{t}\alpha_{t} + R_{t}\eta_{t}$$` - scalar observation: `\(y_{t}\)` - output vector: `\(Z_{t}\)` - state vector: `\(\alpha_{t}\)` - observation error: `\(\epsilon_{t}\)` - transition matrix: `\(T_{t}\)` - control matrix: `\(R_{t}\)` - system error vector: `\(\eta_{t}\)` --- ## State has mulitiple components in the model The state vector `\(\alpha_{t}\)` is modular, and allows for the inclusion of elements from different aggregates. - Local Linear Trends - Multiple levels of seasonality - Contemporaneous covariates - with static coefficient - with dynamic coefficient .pull-bottom[ Given too many covariates, the model uses a _spike-and-slab_ technique to help determine which covariates to include, and how strongly they should influence the predictions. ] --- ## Simulating the counterfactual 1. For the training period, simulate draws of the state vector `\(\alpha_{t}\)` and model parameters given `\(y_{t}\)` 2. Simulate outcomes the counterfactual period `\(\bar{y}_{n+1:m}\)` given pre-intervention `\(y_{1:n}\)` 3. Use posterior samples to calculate point-wise impact `\(y_{t}-\bar{y}_{t}\)` for each `\(t = 1,...,m\)`, aggregate for cumulative impact estimate. --- class: center middle inverse ## Demo --- ## Additional Resources - Brodersen KH, Gallusser F, Koehler J, Remy N, Scott SL. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 2015, Vol. 9, No. 1, 247-274. <http://research.google.com/pubs/pub41854.html> - Brodersen, Kay H., and Alain Hauser. “CausalImpact.” Google, <https://google.github.io/CausalImpact/CausalImpact.html>. Accessed 1 Sept. 2020. - Big Things Conference. “Inferring the Effect of an Event Using CausalImpact by Kay Brodersen.” YouTube, 13 Dec. 2016, <https://www.youtube.com/watch?v=GTgZfCltMm8>. Accessed 1 Sept. 2020.