Copyright (c) Microsoft Corporation.
Licensed under the MIT License.
In this notebook, we generate the datasets that will be used for model training and validation.
The orange juice dataset comes from the bayesm package, and gives pricing and sales figures over time for a variety of orange juice brands in several stores in Florida. Rather than installing the entire package (which is very complex), we download the dataset itself from the GitHub mirror of the CRAN repository.
# download the data from the GitHub mirror of the bayesm package source
ojfile <- tempfile(fileext=".rda")
download.file("https://github.com/cran/bayesm/raw/master/data/orangeJuice.rda", ojfile)
load(ojfile)
file.remove(ojfile)
The dataset generation parameters are obtained from the file ojdata_forecast_settings.yaml; you can modify that file to vary the experimental setup. The settings are:
|Parameter|Description|Value|
|---|---|---|
|N_SPLITS|The number of splits to make.|10|
|HORIZON|The forecast horizon for the test dataset for each split.|2|
|GAP|The gap in weeks from the end of the training period to the start of the testing period; see below.|2|
||The first week of data to use.|40|
|LAST_WEEK|The last week of data to use.|156|
|START_DATE|The actual calendar date for the start of the first week in the data.||
A complicating factor is that the data does not include every possible combination of store, brand and date, so we pad out the missing rows using the complete function from tidyr. In addition, one store/brand combination has no data beyond week 156; we therefore end the analysis at this week. We do not fill in the missing values themselves, as many of the modelling functions in the fable package can handle them natively.
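To illustrate the padding step, here is a minimal sketch on a made-up data frame (the values are hypothetical and not taken from the orange juice data). It assumes only that tidyr is installed. The call to complete adds a row for every missing store/brand/week combination and leaves the other columns as NA.

```r
library(tidyr)

# Hypothetical toy data: store 2 / brand 1 has no row for week 41
toy <- data.frame(
    store=c(1, 1, 2),
    brand=c(1, 1, 1),
    week=c(40, 41, 40),
    logmove=c(9.0, 8.7, 8.3)
)

# Expand to every combination of store, brand and week;
# the newly created row gets logmove = NA
padded <- complete(toy, store, brand, week)
print(padded)
```

The padded result has four rows, one of which has a missing logmove. This is the form in which the missing observations are passed on to the modelling functions.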
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)

settings <- yaml::read_yaml(here::here("examples/grocery_sales/R/forecast_settings.yaml"))

start_date <- as.Date(settings$START_DATE)
train_periods <- seq(to=settings$LAST_WEEK - settings$HORIZON - settings$GAP + 1,
                     by=settings$HORIZON,
                     length.out=settings$N_SPLITS)

oj_data <- orangeJuice$yx %>%
    complete(store, brand, week) %>%
    mutate(week=yearweek(start_date + week*7)) %>%
    as_tsibble(index=week, key=c(store, brand))
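As a worked example of the split calculation, the sketch below evaluates the train_periods expression with the values listed in the settings table (10 splits, horizon 2, gap 2, last week 156). The test_start and test_end columns are an assumption about how the gap and horizon are applied: the test period is taken to start GAP weeks after the last training week and to run for HORIZON weeks. That reading is consistent with the final test period ending at week 156.

```r
# Values copied from the settings table; hard-coded here so the
# example runs without the YAML file
settings <- list(N_SPLITS=10, HORIZON=2, GAP=2, LAST_WEEK=156)

# Last training week for each split (same expression as in the notebook)
train_periods <- seq(to=settings$LAST_WEEK - settings$HORIZON - settings$GAP + 1,
                     by=settings$HORIZON,
                     length.out=settings$N_SPLITS)

# Assumed layout of the test periods relative to the training periods
test_start <- train_periods + settings$GAP
test_end <- test_start + settings$HORIZON - 1

splits <- data.frame(
    split=seq_along(train_periods),
    train_end=train_periods,
    test_start=test_start,
    test_end=test_end
)
print(splits)
```

Under these assumptions, the first split trains on data up to week 135 and tests on weeks 137 to 138, while the last split trains up to week 153 and tests on weeks 155 to 156.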
Here are some glimpses of what the data looks like. The dependent variable is logmove, the logarithm of the total sales for a given brand and store, in a particular week.
The time series plots for a small subset of brands and stores are shown below. We can make the following observations:
library(ggplot2)

oj_data %>%
    filter(store < 25, brand < 5) %>%
    ggplot(aes(x=week, y=logmove)) +
        geom_line() +
        scale_x_date(labels=NULL) +
        facet_grid(vars(store), vars(brand), labeller="label_both")