Copyright (c) Microsoft Corporation.
Licensed under the MIT License.

In this notebook, we generate the datasets that will be used for model training and validation.

The orange juice dataset comes from the bayesm package, and gives pricing and sales figures over time for a variety of orange juice brands in several stores in Florida. Rather than installing the entire package (which is very complex), we download the dataset itself from the GitHub mirror of the CRAN repository.

# download the data from the GitHub mirror of the bayesm package source
ojfile <- tempfile(fileext=".rda")
download.file("", ojfile)
# load the downloaded file; this restores the orangeJuice object used below
load(ojfile)
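An .rda file restores its objects under the names they were saved with (the later code expects an object called orangeJuice). A minimal sketch of loading such a file into its own environment and checking what it contains is given here; it writes a small stand-in object to a temporary file instead of using the real download, and the names demo_obj, demo_file and rda_env are hypothetical.

```r
# create a stand-in .rda file; `demo_obj` is a hypothetical object, not the real dataset
demo_obj <- list(yx=data.frame(store=1, brand=1, week=40, logmove=9.0))
demo_file <- tempfile(fileext=".rda")
save(demo_obj, file=demo_file)

# load into a fresh environment so nothing in the workspace is overwritten
rda_env <- new.env()
loaded_names <- load(demo_file, envir=rda_env)

# load() returns the names of the objects it restored
print(loaded_names)
str(rda_env$demo_obj, max.level=1)

# clean up the temporary file
removed <- file.remove(demo_file)
```

Loading into a separate environment is optional; it only makes it easier to see exactly which objects a file provides before they are used.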

The dataset generation parameters are obtained from the file ojdata_forecast_settings.yaml; you can modify that file to vary the experimental setup. The settings are:

| Parameter | Description | Default |
|---|---|---|
| N_SPLITS | The number of splits to make. | 10 |
| HORIZON | The forecast horizon for the test dataset for each split. | 2 |
| GAP | The gap in weeks from the end of the training period to the start of the testing period; see below. | 2 |
| FIRST_WEEK | The first week of data to use. | 40 |
| LAST_WEEK | The last week of data to use. | 156 |
| START_DATE | The actual calendar date for the start of the first week in the data. | 1989-09-14 |
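To make the interaction between N_SPLITS, HORIZON and GAP concrete, here is a minimal base-R sketch using the default values. It rests on two assumptions that are not stated in the text: that successive splits end HORIZON weeks apart, and that each test window starts GAP weeks after the last training week. Under that reading, the expression the notebook uses for the final training period makes the last test window end exactly at LAST_WEEK.

```r
# defaults copied from the settings; in the notebook these come from the YAML file
settings <- list(N_SPLITS=10, HORIZON=2, GAP=2, FIRST_WEEK=40, LAST_WEEK=156)

# last training week of the final split, using the notebook's expression
last_train_end <- settings$LAST_WEEK - settings$HORIZON - settings$GAP + 1

# assumption: successive splits end HORIZON weeks apart
train_end <- seq(to=last_train_end, by=settings$HORIZON, length.out=settings$N_SPLITS)

# assumption: testing starts GAP weeks after the last training week
# and runs for HORIZON weeks
splits <- data.frame(
    split=seq_along(train_end),
    train_start=settings$FIRST_WEEK,
    train_end=train_end,
    test_start=train_end + settings$GAP,
    test_end=train_end + settings$GAP + settings$HORIZON - 1
)
print(splits)
```

Changing GAP or HORIZON in the list shifts every window accordingly, which is a quick way to check a new experimental setup before regenerating the datasets.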

A complicating factor is that the data does not include every possible combination of store, brand and date, so we pad out the missing rows using the complete function from the tidyr package. In addition, one store/brand combination has no data beyond week 156; we therefore end the analysis at this week. We do not fill in the resulting missing values, as many of the modelling functions in the fable package can handle them natively.
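The padding step can be illustrated on a toy data frame. This is a sketch only: the values are made up, and the column names simply mirror those used in the notebook.

```r
library(tidyr)

# toy data: store 1 / brand 2 has no row for week 41
toy <- data.frame(
    store=c(1, 1, 1),
    brand=c(1, 1, 2),
    week=c(40, 41, 40),
    logmove=c(9.1, 8.7, 9.5)
)

# expand to every store/brand/week combination; new rows get NA in the other columns
padded <- complete(toy, store, brand, week)
print(padded)
```

The added row keeps its key columns (store, brand, week) and has a missing logmove, which is the form the missing observations take after padding.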


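library(dplyr)
library(tidyr)
library(tsibble)

settings <- yaml::read_yaml(here::here("examples/grocery_sales/R/forecast_settings.yaml"))
start_date <- as.Date(settings$START_DATE)
# training period for each split; the `by` and `length.out` arguments are an assumption:
# one split per HORIZON-week step, N_SPLITS splits in total
train_periods <- seq(to=settings$LAST_WEEK - settings$HORIZON - settings$GAP + 1,
                     by=settings$HORIZON,
                     length.out=settings$N_SPLITS)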

oj_data <- orangeJuice$yx %>%
    complete(store, brand, week) %>%
    mutate(week=yearweek(start_date + week*7)) %>%
    as_tsibble(index=week, key=c(store, brand))
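The conversion inside the mutate step is plain date arithmetic: each integer week index is shifted by 7 days per week from the start date, and yearweek then turns the resulting date into a weekly index. A minimal base-R sketch of the arithmetic part, using the default START_DATE and leaving out the yearweek step so that it runs without tsibble:

```r
# START_DATE default from the settings
start_date <- as.Date("1989-09-14")

# FIRST_WEEK, the following week, and LAST_WEEK
weeks <- c(40, 41, 156)

# each week index corresponds to a date 7 days after the previous one
week_dates <- start_date + weeks*7
print(data.frame(week=weeks, date=week_dates))

# consecutive week indices are exactly 7 days apart
print(diff(week_dates[1:2]))
```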

Here are some glimpses of what the data looks like. The dependent variable is logmove, the logarithm of the total sales for a given brand and store, in a particular week.


The time series plots for a small subset of brands and stores are shown below. We can make the following observations:


library(ggplot2)

# plot weekly log sales, with stores as panel rows and brands as panel columns
oj_data %>%
    filter(store < 25, brand < 5) %>%
    mutate(week=as.Date(week)) %>%
    ggplot(aes(x=week, y=logmove)) +
        geom_line() +
        scale_x_date(labels=NULL) +
        facet_grid(vars(store), vars(brand), labeller="label_both")