Prep Data — prep_data • finnts

Prepares data by applying feature engineering recipes to create features before models are trained.

prep_data(
  run_info,
  input_data,
  combo_variables,
  target_variable,
  date_type,
  forecast_horizon,
  external_regressors = NULL,
  hist_start_date = NULL,
  hist_end_date = NULL,
  combo_cleanup_date = NULL,
  fiscal_year_start = 1,
  clean_missing_values = TRUE,
  clean_outliers = FALSE,
  box_cox = FALSE,
  stationary = TRUE,
  forecast_approach = "bottoms_up",
  parallel_processing = NULL,
  num_cores = NULL,
  fourier_periods = NULL,
  lag_periods = NULL,
  rolling_window_periods = NULL,
  recipes_to_run = NULL,
  multistep_horizon = FALSE
)

Arguments

run_info: Run info using set_run_info()
input_data: A standard data frame, tibble, or spark data frame using sparklyr of historical time series data. Can also include external regressors for both historical and future data.
combo_variables: List of column headers within input data to be used to separate individual time series.
target_variable: The column header formatted as a character value within input data you want to forecast.
date_type: The date granularity of the input data. Finn accepts the following as a character string: day, week, month, quarter, year.
forecast_horizon: Number of periods to forecast into the future.
external_regressors: List of column headers within input data to be used as features in multivariate models.
hist_start_date: Date value of when your input_data starts. Default of NULL uses earliest date value in input_data.
hist_end_date: Date value of when your input_data ends. Default of NULL uses the latest date value in input_data.
combo_cleanup_date: Date value to remove individual time series that don't contain non-zero values after that specified date. Default of NULL is to not remove any time series and attempt to forecast all time series.
fiscal_year_start: Month number of start of fiscal year of input data, aids in building out date features. Formatted as a numeric value. Default of 1 assumes fiscal year starts in January.
clean_missing_values: If TRUE, cleans missing values. Values are imputed only for missing data within an existing series. No new values are added onto the beginning or end of a series; those values are instead given a value of 0.
clean_outliers: If TRUE, outliers are cleaned and imputed with values more in line with historical data.
box_cox: If TRUE, applies a Box-Cox transformation to normalize variance in the data.
stationary: If TRUE, applies differencing to make the data stationary.
forecast_approach: How the forecast is created. The default of 'bottoms_up' trains models for each individual time series. A value of 'grouped_hierarchy' creates a grouped time series to forecast, while 'standard_hierarchy' creates a more traditional hierarchical time series to forecast. Both are based on the hts package.
parallel_processing: Default of NULL runs no parallel processing and forecasts each individual time series one after another. A value of 'local_machine' leverages all cores on the machine Finn is running on. A value of 'spark' runs time series in parallel on a Spark cluster in Azure Databricks/Synapse.
num_cores: Number of cores to use when parallel processing is set up. Used when running parallel computations on a local machine or within Azure. Default of NULL uses the total number of cores on the machine minus one. Can't be greater than the number of cores on the machine minus one.
fourier_periods: List of values to use in creating fourier series as features. Default of NULL automatically chooses these values based on the date_type.
lag_periods: List of values to use in creating lag features. Default of NULL automatically chooses these values based on date_type.
rolling_window_periods: List of values to use in creating rolling window features. Default of NULL automatically chooses these values based on date_type.
recipes_to_run: List of recipes to run on multivariate models that can run different recipes. A value of NULL runs all recipes, but only runs the R1 recipe for weekly and daily date types. A value of "all" runs all recipes, regardless of date type. A list like c("R1") or c("R2") would only run models with the R1 or R2 recipe.
multistep_horizon: If TRUE, uses a multistep horizon approach when training multivariate models with the R1 recipe.
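
To make the feature engineering arguments above more concrete, here is a minimal base R sketch of the kinds of transformations and features that box_cox, stationary, lag_periods, and rolling_window_periods refer to. It is not the finnts implementation. The helper names (box_cox_transform, first_difference, lag_feature, rolling_mean), the toy data, and the choice of a first difference and a trailing mean are all assumptions made for illustration.

```r
# Toy monthly series (hypothetical values, for illustration only).
sales <- c(100, 120, 130, 125, 140, 160, 155, 170)

# Box-Cox transform with a fixed lambda; lambda = 0 reduces to log().
# (Hypothetical helper, not a finnts function.)
box_cox_transform <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First difference, padded with NA so the result aligns with the input.
first_difference <- function(x) c(NA, diff(x))

# Shift a series back by k periods, padding the start with NA.
# Assumes k is smaller than the length of x.
lag_feature <- function(x, k) c(rep(NA, k), head(x, -k))

# Trailing rolling mean over `window` periods; NA until the window is full.
rolling_mean <- function(x, window) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
}

features <- data.frame(
  sales        = sales,
  sales_log    = box_cox_transform(sales, lambda = 0),
  sales_diff   = first_difference(sales),
  sales_lag_3  = lag_feature(sales, 3),
  sales_roll_3 = rolling_mean(sales, 3)
)
print(features)
```

In this sketch the first rows of the lag and rolling columns are NA because there is not enough history to fill them. How finnts itself handles those rows is not covered here.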

Value

No return object. Feature engineered data is written to disk based on the output locations provided in set_run_info().

Examples

# \donttest{
data_tbl <- timetk::m4_monthly %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id)) %>%
  dplyr::filter(
    Date >= "2013-01-01",
    Date <= "2015-06-01"
  )

run_info <- set_run_info()
#> Finn Submission Info
#> • Project Name: finn_project
#> • Run Name: finn_fcst-20251218T182140Z
#> 

prep_data(run_info,
  input_data = data_tbl,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R1"
)
#> ℹ Prepping Data
#> ✔ Prepping Data [1.2s]
#> 
# }