Feature Engineering • finnts

Automated feature engineering is a cornerstone of the package. Below are some of the techniques we use in multivariate machine learning models, and the outside packages that make it possible.

Missing Data and Outliers

Missing data is filled in using the pad_by_time function from the timetk package. First, each time series is grouped and padded between its existing start and end dates, with missing values padded as NA. Then the same process is run again, this time padding data back to the hist_start_date from forecast_time_series(), with missing values filled in with zero. This ensures that missing data before a time series starts are all zeroes, while missing periods within the existing time series data are flagged as NA so they can be imputed with new values in the next step.
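The two padding passes can be sketched as below. The toy series, its dates, and the hist_start value are made up for illustration, and the exact arguments finnts passes to pad_by_time internally are an assumption.

```r
library(dplyr)
library(timetk)

# Toy series with a gap: May 2020 is missing (illustrative data only)
sales_tbl <- dplyr::tibble(
  id    = "A",
  date  = as.Date(c("2020-03-01", "2020-04-01", "2020-06-01")),
  value = c(10, 12, 15)
)

# Hypothetical stand-in for hist_start_date
hist_start <- as.Date("2020-01-01")

padded_tbl <- sales_tbl %>%
  dplyr::group_by(id) %>%
  # Pass 1: pad inside the series' own date range, new rows get NA
  timetk::pad_by_time(date, .by = "month", .pad_value = NA) %>%
  # Pass 2: pad back to the history start, new rows get zero
  timetk::pad_by_time(
    date,
    .by         = "month",
    .pad_value  = 0,
    .start_date = hist_start
  ) %>%
  dplyr::ungroup()

print(padded_tbl)
```

In this sketch the rows added before March 2020 should hold zero, while the gap in May 2020 should stay NA so that it can be imputed later.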

After missing data is padded, the ts_impute_vec function from the timetk package is called to impute any NA values. This only happens if the clean_missing_values input from forecast_time_series() is set to TRUE; otherwise, NA values are replaced with zero.
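A minimal sketch of the two outcomes on a toy vector follows. The period value is an assumption chosen to keep the example non-seasonal; finnts may pass a different period depending on the date type.

```r
library(timetk)

# Series with one missing value (illustrative data only)
x <- c(1, 2, NA, 4, 5)

# clean_missing_values = TRUE: impute the gap
# period = 1 requests non-seasonal interpolation
x_imputed <- timetk::ts_impute_vec(x, period = 1)

# clean_missing_values = FALSE: replace NA with zero instead
x_zero_filled <- ifelse(is.na(x), 0, x)

print(x_imputed)
print(x_zero_filled)
```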

Outliers are handled using the ts_clean_vec function from the timetk package. Outlier replacement happens after the missing data process and only runs if the clean_outliers input from forecast_time_series() is set to TRUE.
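To see what outlier cleaning does, the sketch below runs ts_clean_vec on a toy series with a single spike. The data and the period value are illustrative assumptions, not the settings finnts uses.

```r
library(timetk)

# Mostly flat series with one obvious spike (illustrative data only)
x <- c(10, 11, 12, 11, 10, 11, 12, 200, 11, 10, 11, 12, 11, 10, 11, 12)

# period = 1 treats the series as non-seasonal
x_clean <- timetk::ts_clean_vec(x, period = 1)

# Compare original and cleaned values side by side
print(data.frame(original = x, cleaned = x_clean))
```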

Important Note: Missing values and outliers are replaced for the target variable and any numeric external regressors.

Understanding “_original” Columns

When you call get_prepped_data() to view your feature engineered data, you may notice columns with “_original” appended to their names (e.g. Price_original). These columns are created during the preprocessing pipeline to preserve the untransformed external regressor values. This happens after missing value and outlier cleaning, but before any transformations like Box-Cox or differencing are applied.

TimeGPT and Chronos2 use these “_original” columns for external regressors as they expect data in its original scale and distribution.

Models like XGBoost, LightGBM, Random Forest, etc., ignore the “_original” columns and use the transformed versions instead.

Important Note: Avoid naming your external regressors with “original” or “Original” in their names (e.g., original_price, Price_Original) as this can cause conflicts with the internal “_original” columns created by finnts.
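A quick pre-flight check for this naming conflict could look like the sketch below. It is not part of finnts, and the regressor names are hypothetical.

```r
# Hypothetical external regressor names
external_regressors <- c("Price", "Promo_Flag", "Price_Original")

# Flag any name containing "original", regardless of case
conflicts <- external_regressors[
  grepl("original", external_regressors, ignore.case = TRUE)
]

if (length(conflicts) > 0) {
  message("Consider renaming: ", paste(conflicts, collapse = ", "))
}
```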

Box-Cox

The Box-Cox transformation stabilizes the variance in each time series using the box_cox_vec function from the timetk package. It is applied to both the target variable and any external regressors before other transformations like differencing. You can control this within prep_models().
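The sketch below shows the transformation and its inverse on a toy vector. A fixed lambda is used here only to keep the example deterministic; how finnts chooses lambda is not shown and should be treated as an assumption.

```r
library(timetk)

# Strictly positive values (illustrative data only)
x <- c(120, 150, 90, 300, 450, 800)

# Fixed lambda for a reproducible example;
# lambda = "auto" would let timetk estimate it instead
lambda <- 0.5
x_bc <- timetk::box_cox_vec(x, lambda = lambda)

# The inverse transform returns values to the original scale
x_back <- timetk::box_cox_inv_vec(x_bc, lambda = lambda)

print(x_bc)
print(x_back)
```

For a nonzero lambda the transform is (x^lambda - 1) / lambda, so the same lambda has to be kept to convert forecasts back to the original scale.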

Differencing

finnts uses the feasts package to check whether each time series is stationary and applies the differencing required (up to two standard differences with lag one) to make it stationary. The differencing itself is done with the diff_vec function from the timetk package. This is applied to the target variable and any external regressors before other features are created. Data is undifferenced before training for univariate models like ARIMA, but differenced data is used for all multivariate models. You can control the differencing within prep_models().
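As a rough illustration, the sketch below differences a toy vector with diff_vec and reverses it with diff_inv_vec. The optional last step uses feasts::unitroot_ndiffs() to suggest a number of differences; whether finnts calls that exact function is an assumption, and it also needs the urca package installed.

```r
library(timetk)

# Illustrative data only
x <- c(1, 3, 6, 10, 15)

# One standard difference with lag one; the first value becomes NA
x_diff <- timetk::diff_vec(x, lag = 1, difference = 1, silent = TRUE)

# Undo the differencing by supplying the first original value
x_back <- timetk::diff_inv_vec(
  x_diff,
  lag            = 1,
  difference     = 1,
  initial_values = x[1]
)

# Optional: ask feasts how many differences a random walk needs
if (requireNamespace("feasts", quietly = TRUE) &&
    requireNamespace("urca", quietly = TRUE)) {
  set.seed(123)
  random_walk <- cumsum(rnorm(100))
  print(feasts::unitroot_ndiffs(random_walk))
}
```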

Date Features

The tk_augment_timeseries_signature function from the timetk package extracts various date features from the time stamp. The function doesn’t differentiate between date types, so features need to be removed depending on the date type. For example, features related to week and day are automatically removed for a monthly forecast.

Fourier series are also added using the tk_augment_fourier function from timetk.

library(dplyr)
library(timetk)

m4_monthly %>%
  timetk::tk_augment_timeseries_signature(date) %>%
  dplyr::group_by(id) %>%
  timetk::tk_augment_fourier(date, .periods = c(3, 6, 12), .K = 1) %>%
  dplyr::ungroup()
#> # A tibble: 1,574 × 37
#>    id    date       value index.num    diff  year year.iso  half quarter month
#>    <fct> <date>     <dbl>     <dbl>   <dbl> <int>    <int> <int>   <int> <int>
#>  1 M1    1976-06-01  8000 202435200      NA  1976     1976     1       2     6
#>  2 M1    1976-07-01  8350 205027200 2592000  1976     1976     2       3     7
#>  3 M1    1976-08-01  8570 207705600 2678400  1976     1976     2       3     8
#>  4 M1    1976-09-01  7700 210384000 2678400  1976     1976     2       3     9
#>  5 M1    1976-10-01  7080 212976000 2592000  1976     1976     2       4    10
#>  6 M1    1976-11-01  6520 215654400 2678400  1976     1976     2       4    11
#>  7 M1    1976-12-01  6070 218246400 2592000  1976     1976     2       4    12
#>  8 M1    1977-01-01  6650 220924800 2678400  1977     1976     1       1     1
#>  9 M1    1977-02-01  6830 223603200 2678400  1977     1977     1       1     2
#> 10 M1    1977-03-01  5710 226022400 2419200  1977     1977     1       1     3
#> # ℹ 1,564 more rows
#> # ℹ 27 more variables: month.xts <int>, month.lbl <ord>, day <int>, hour <int>,
#> #   minute <int>, second <int>, hour12 <int>, am.pm <int>, wday <int>,
#> #   wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>, yday <int>,
#> #   mweek <int>, week <int>, week.iso <int>, week2 <int>, week3 <int>,
#> #   week4 <int>, mday7 <int>, date_sin3_K1 <dbl>, date_cos3_K1 <dbl>,
#> #   date_sin6_K1 <dbl>, date_cos6_K1 <dbl>, date_sin12_K1 <dbl>, …

Lags, Rolling Windows, and Polynomial Transformations

Lags of the target variable and external regressors are created using the tk_augment_lags function from timetk.

Rolling window calculations of the target variable are created using the tk_augment_slidify function from timetk. The below calculations are created over various window values.

sum
mean
standard deviation
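A small sketch of lag and rolling window features on a toy table is shown below. The lag and window values are illustrative, and computing the rolling mean on top of a lagged column is an assumption based on column names such as Target_lag3_roll3_Avg in the prepped data.

```r
library(dplyr)
library(timetk)

# Illustrative data only
target_tbl <- dplyr::tibble(
  date  = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  value = c(1, 2, 3, 4, 5, 6)
)

features_tbl <- target_tbl %>%
  # Lagged copies of the target: value_lag1 and value_lag2
  timetk::tk_augment_lags(value, .lags = c(1, 2)) %>%
  # 3-period trailing mean computed on the lag-1 column
  timetk::tk_augment_slidify(
    value_lag1,
    .period  = 3,
    .f       = mean,
    .align   = "right",
    .partial = FALSE
  )

print(features_tbl)
```

Swapping mean for sum or sd in .f gives the other two calculations listed above.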

Polynomial transformations are created for the target variable, and lags are then created on top of them. The below transformations are created.

squared
cubed
log
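The polynomial step can be sketched as below. The column names and the lag value are hypothetical, and any handling finnts applies to non-positive values before taking logs is not shown.

```r
library(dplyr)
library(timetk)

# Positive values so that log() is defined (illustrative data only)
target_tbl <- dplyr::tibble(value = c(2, 4, 8, 16, 32))

poly_tbl <- target_tbl %>%
  dplyr::mutate(
    value_squared = value^2,
    value_cubed   = value^3,
    value_log     = log(value)
  ) %>%
  # Lags are created on top of the transformed columns
  timetk::tk_augment_lags(
    c(value_squared, value_cubed, value_log),
    .lags = 1
  )

print(poly_tbl)
```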

Custom Approaches

In addition to the standard approaches above, finnts also provides two different recipes for preparing features for a multivariate machine learning model.

The first recipe, referred to as “R1” in default finnts models, takes a single-step horizon approach by default. This means all of the engineered target and external regressor features are used, but the lags cannot be less than the forecast horizon. For example, for a monthly data set with a forecast horizon of 3, finnts will take engineered features like lags and rolling window features but only use those lags that are for periods equal to or greater than 3.

You can also run a multistep horizon approach by setting multistep_horizon to TRUE in prep_data(), the same argument that is set to FALSE in the example below. The multistep approach will create features that can be used by specific multivariate models that optimize for each period in a forecast horizon. More on this in the “models used in finnts” vignette.

Recursive forecasting is not supported in finnts multivariate machine learning models, since feeding forecast outputs as features to create another forecast adds complex layers of uncertainty that can easily spiral out of control and produce poor forecasts.

NA values created by generating lag features are filled “up”. This results in the first few periods of a time series having some data leakage, but the effect should be small if the time series is long enough.

library(finnts)

hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

run_info <- set_run_info(
  project_name = "finnts_fcst",
  run_name = "R1_run"
)

prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R1",
  multistep_horizon = FALSE
)

R1_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R1"
)

print(R1_prepped_data_tbl)
#> # A tibble: 45 × 79
#>    Date       Combo id    Target Date_index.num Date_diff Date_year Date_half
#>    <date>     <chr> <chr>  <dbl>          <dbl>     <dbl>     <dbl>     <dbl>
#>  1 2012-01-01 M2    M2       260     1325376000         0      2012         1
#>  2 2012-02-01 M2    M2       260     1328054400   2678400      2012         1
#>  3 2012-03-01 M2    M2       300     1330560000   2505600      2012         1
#>  4 2012-04-01 M2    M2      -440     1333238400   2678400      2012         1
#>  5 2012-05-01 M2    M2       430     1335830400   2592000      2012         1
#>  6 2012-06-01 M2    M2        60     1338508800   2678400      2012         1
#>  7 2012-07-01 M2    M2      -160     1341100800   2592000      2012         2
#>  8 2012-08-01 M2    M2       -40     1343779200   2678400      2012         2
#>  9 2012-09-01 M2    M2      -120     1346457600   2678400      2012         2
#> 10 2012-10-01 M2    M2       110     1349049600   2592000      2012         2
#> # ℹ 35 more rows
#> # ℹ 71 more variables: Date_quarter <dbl>, Date_month <dbl>,
#> #   Date_month.lbl <chr>, Target_lag3 <dbl>, Target_lag6 <dbl>,
#> #   Target_lag9 <dbl>, Target_lag12 <dbl>, Target_lag3_roll3_Avg <dbl>,
#> #   Target_lag6_roll3_Avg <dbl>, Target_lag9_roll3_Avg <dbl>,
#> #   Target_lag12_roll3_Avg <dbl>, Target_lag3_roll6_Avg <dbl>,
#> #   Target_lag6_roll6_Avg <dbl>, Target_lag9_roll6_Avg <dbl>, …
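The R1 lag constraint and the “up” fill can be illustrated with the sketch below. The candidate lag set is hypothetical, chosen so the filtered result lines up with the Target_lag3 through Target_lag12 columns in the output above.

```r
library(dplyr)
library(tidyr)

forecast_horizon <- 3

# Hypothetical candidate lags
candidate_lags <- c(1, 2, 3, 6, 9, 12)

# Keep only lags at or beyond the forecast horizon
r1_lags <- candidate_lags[candidate_lags >= forecast_horizon]
print(r1_lags)

# Leading NA values from a lag are filled "up" with the next known value
lag_tbl <- dplyr::tibble(target = c(260, 260, 300, -440, 430)) %>%
  dplyr::mutate(target_lag3 = dplyr::lag(target, n = 3)) %>%
  tidyr::fill(target_lag3, .direction = "up")

print(lag_tbl)
```

The first three rows of target_lag3 end up holding a value borrowed from the first period with a known lag, which is the small leakage mentioned above.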

The second recipe is referred to as “R2” in default finnts models. It takes a very different approach from the “R1” recipe. For a 3 month forecast horizon on a monthly dataset, target and rolling window features are created depending on the horizon period. They are also constrained to be equal to or less than the forecast horizon. In the below example, “Origin” and “Horizon” features are created for each time period. This results in duplicating rows in the original data set to create new features that are now specific to each horizon period. This helps the default finnts models find new unique relationships to model, compared with the more formal approach in “R1”. NA values created by generating lag features are filled “up”.

library(finnts)

hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))

run_info <- set_run_info(
  project_name = "finnts_fcst",
  run_name = "R2_run"
)

prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R2"
)

R2_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R2"
)

print(R2_prepped_data_tbl)
#> # A tibble: 135 × 107
#>    Date       Combo id    Target Date_index.num Date_diff Date_year Date_half
#>    <date>     <chr> <chr>  <dbl>          <dbl>     <dbl>     <dbl>     <dbl>
#>  1 2012-01-01 M2    M2       260     1325376000         0      2012         1
#>  2 2012-02-01 M2    M2       260     1328054400   2678400      2012         1
#>  3 2012-03-01 M2    M2       300     1330560000   2505600      2012         1
#>  4 2012-04-01 M2    M2      -440     1333238400   2678400      2012         1
#>  5 2012-05-01 M2    M2       430     1335830400   2592000      2012         1
#>  6 2012-06-01 M2    M2        60     1338508800   2678400      2012         1
#>  7 2012-07-01 M2    M2      -160     1341100800   2592000      2012         2
#>  8 2012-08-01 M2    M2       -40     1343779200   2678400      2012         2
#>  9 2012-09-01 M2    M2      -120     1346457600   2678400      2012         2
#> 10 2012-10-01 M2    M2       110     1349049600   2592000      2012         2
#> # ℹ 125 more rows
#> # ℹ 99 more variables: Date_quarter <dbl>, Date_month <dbl>,
#> #   Date_month.lbl <chr>, Horizon <dbl>, Origin <dbl>, Target_lag1 <dbl>,
#> #   Target_lag2 <dbl>, Target_lag3 <dbl>, Target_lag6 <dbl>, Target_lag9 <dbl>,
#> #   Target_lag12 <dbl>, Target_lag1_roll3_Avg <dbl>,
#> #   Target_lag2_roll3_Avg <dbl>, Target_lag3_roll3_Avg <dbl>,
#> #   Target_lag6_roll3_Avg <dbl>, Target_lag9_roll3_Avg <dbl>, …
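The row duplication in R2 can be pictured with the rough sketch below, which stacks one copy of a toy history per horizon period. This is only an illustration of why the row count grows (45 rows for R1 versus 135 for R2 with a horizon of 3); the way finnts actually builds the Horizon and Origin columns and the horizon-specific lags is not shown and is not assumed here.

```r
forecast_horizon <- 3

# Illustrative data only
hist_tbl <- data.frame(
  Date   = seq(as.Date("2012-01-01"), by = "month", length.out = 4),
  Target = c(260, 260, 300, -440)
)

n_periods <- nrow(hist_tbl)

# Stack one copy of the history per horizon period
stacked_tbl <- hist_tbl[rep(seq_len(n_periods), times = forecast_horizon), ]
stacked_tbl$Horizon <- rep(seq_len(forecast_horizon), each = n_periods)
rownames(stacked_tbl) <- NULL

# Row count is n_periods * forecast_horizon
print(nrow(stacked_tbl))
```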

Model Specific Preprocessing

In addition to everything called out above, some models have their own specific transformations that need to be applied before training a model. For example, the “glmnet” model needs to transform categorical variables into continuous variables and center/scale the data before training. Each default model in finnts has its own preprocessing steps that ensure the data fed into the model has the best chance of producing a high quality forecast. The recipes package is used to easily apply various preprocessing transformations needed before training a model.

Additionally, TimeGPT automatically pads data with zeros when using Azure AI endpoints just before the API call to meet minimum data size requirements based on data frequency, and Chronos2 pads combos with fewer than 3 observations to meet its minimum data requirements (see the Forecasting with GenAI vignette for details).
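As an illustration of the kind of preprocessing a model like “glmnet” needs, the sketch below builds a small recipe that centers and scales numeric predictors and converts a categorical predictor into dummy variables. The data, column names, and choice of steps are assumptions for illustration and are not the exact recipe finnts applies.

```r
library(recipes)

# Illustrative training data with one numeric and one categorical predictor
train_tbl <- data.frame(
  Target      = c(10, 12, 9, 14, 13, 15),
  Target_lag3 = c(8, 10, 12, 9, 14, 13),
  Month_lbl   = factor(c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"))
)

glmnet_style_recipe <- recipes::recipe(Target ~ ., data = train_tbl) %>%
  # Center and scale numeric predictors
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  # Turn categorical predictors into numeric dummy columns
  recipes::step_dummy(recipes::all_nominal_predictors())

# Estimate the steps on the training data and return the processed result
baked_tbl <- glmnet_style_recipe %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)

print(baked_tbl)
```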