_Copyright (c) Microsoft Corporation._
_Licensed under the MIT License._ ```{r, echo=FALSE, results="hide", message=FALSE} library(tidyr) library(dplyr) library(tsibble) library(feasts) library(fable) library(urca) ``` This notebook builds on the output from "Basic models" by including regressor variables in the ARIMA model(s). We fit the following model types: - `ar_trend` includes only a linear trend over time. - `ar_reg` allows stepwise selection of independent regressors. - `ar_reg_price`: rather than allowing the algorithm to select from the 11 price variables, we use only the price relevant to each brand. This is to guard against possible overfitting, something that classical stepwise procedures are wont to do. - `ar_reg_price_trend` is the same as `ar_reg_price`, but including a linear trend. As part of the modelling, we also compute a new independent variable `maxpricediff`, the log-ratio of the price of this brand compared to the best competing price. A positive `maxpricediff` means this brand is cheaper than all the other brands, and a negative `maxpricediff` means it is more expensive. ```{r} srcdir <- here::here("R_utils") for(src in dir(srcdir, full.names=TRUE)) source(src) load_objects("grocery_sales", "data.Rdata") cl <- make_cluster(libs=c("tidyr", "dplyr", "fable", "tsibble", "feasts")) # add extra regression variables to training and test datasets add_regvars <- function(df) { df %>% group_by(store, brand) %>% group_modify(~ { pricevars <- grep("price", names(.x), value=TRUE) thispricevar <- unique(paste0("price", .y$brand)) best_other_price <- do.call(pmin, .x[setdiff(pricevars, thispricevar)]) .x$price <- .x[[thispricevar]] .x$maxpricediff <- log(best_other_price/.x$price) .x }) %>% ungroup() %>% mutate(week=yearweek(week)) %>% # need to recreate this variable because of tsibble/vctrs issues as_tsibble(week, key=c(store, brand)) } oj_trainreg <- parallel::parLapply(cl, oj_train, add_regvars) oj_testreg <- parallel::parLapply(cl, oj_test, add_regvars) save_objects(oj_trainreg, oj_testreg, example="grocery_sales", file="data_reg.Rdata") oj_modelset_reg <- parallel::parLapply(cl, oj_trainreg, function(df) { model(df, ar_trend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend()), ar_reg=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + deal + feat + maxpricediff + price1 + price2 + price3 + price4 + price5 + price6 + price7 + price8 + price9 + price10 + price11), ar_reg_price=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + deal + feat + maxpricediff + price), ar_reg_price_trend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend() + deal + feat + maxpricediff + price), .safely=FALSE ) }) oj_fcast_reg <- parallel::clusterMap(cl, get_forecasts, oj_modelset_reg, oj_testreg) destroy_cluster(cl) save_objects(oj_modelset_reg, oj_fcast_reg, example="grocery_sales", file="model_reg.Rdata") oj_fcast_reg %>% bind_rows() %>% mutate_at(-(1:3), exp) %>% eval_forecasts() ``` This shows that the models incorporating price are a significant improvement over the previous naive models. The model that uses stepwise selection to choose the best price variable does worse than the one where we choose the price beforehand, confirming the suspicion that stepwise leads to overfitting in this case.