Fit a partition on some data, optionally finding the best lambda using cross-validation (CV) and then re-fitting on the full data.

fit_partition(
  y,
  X,
  d = NULL,
  X_aux = NULL,
  d_aux = NULL,
  max_splits = Inf,
  max_cells = Inf,
  min_size = 3,
  cv_folds = 2,
  verbosity = 0,
  breaks_per_dim = NULL,
  potential_lambdas = NULL,
  X_range = NULL,
  bucket_min_n = NA,
  bucket_min_d_var = FALSE,
  obj_fn,
  est_plan,
  partition_i = NA,
  pr_cl = NULL,
  bump_samples = 0,
  bump_ratio = 1,
  ...
)

Arguments

y

Nx1 matrix of outcome (label/target) data. With multiple core estimates see Details below.

X

NxK matrix of features (covariates). With multiple core estimates see Details below.

d

(Optional) NxP matrix (with colnames) of treatment data. If all treatments are equally important, they should be normalized to have the same variance. With multiple core estimates see Details below.
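The variance normalization mentioned above can be done in base R before calling fit_partition. The following is a minimal sketch, not part of the package; the matrix `d` and its column names are made up for illustration.

```r
# Hypothetical treatment matrix: N = 100 observations, P = 2 treatments
set.seed(1)
d <- cbind(d1 = rnorm(100, sd = 1), d2 = rnorm(100, sd = 5))

# Divide each column by its standard deviation so every column has variance 1.
# center = FALSE leaves the location of each treatment unchanged.
d_norm <- scale(d, center = FALSE, scale = apply(d, 2, sd))

# Check: both column variances are 1 (up to floating point)
apply(d_norm, 2, var)
```

Scaling to unit variance is only one way to equalize variances; any common target variance would serve the same purpose.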

X_aux

Auxiliary X sample on which to compute statistics (OOS data).

d_aux

Auxiliary d sample on which to compute statistics (OOS data).

max_splits

Maximum number of splits, even if further splits would continue to improve OOS fit.

max_cells

Maximum number of cells, even if further splits would continue to improve OOS fit.

min_size

Minimum cell size when building the full grid. With F folds, the CV training sample (cv_tr) uses (F-1)/F*min_size; the CV test sample (cv_te) applies no minimum.

cv_folds

Number of CV folds or a vector of foldids. If m_mode==DS.MULTI_SAMPLE, then a list of foldids, one per dataset.
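As an illustration of passing a vector of foldids instead of a fold count, here is a minimal base R sketch. It assumes foldids are integers in 1..F giving each observation's held-out fold, which the description above does not spell out; `n_folds` and `cv_tr_min_size` are names introduced only for this example.

```r
set.seed(1)
N <- 20
n_folds <- 4  # hypothetical number of folds (F)

# Balanced random assignment: each observation gets a fold id in 1..n_folds
# (assumed encoding of foldids)
foldids <- sample(rep(seq_len(n_folds), length.out = N))
table(foldids)

# Minimum cell size applied in the CV training sample, following the
# (F-1)/F*min_size formula from the min_size description
min_size <- 3
cv_tr_min_size <- (n_folds - 1) / n_folds * min_size
cv_tr_min_size  # 2.25
```

Whether the package rounds this scaled minimum is not stated here, so the value is shown unrounded.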

verbosity

0 prints no messages. 1 prints a progress bar for high-level loops. 2 prints detailed output for high-level loops. Nested operations decrease verbosity by 1.

breaks_per_dim

NULL (for all possible breaks); a K-length vector with the number of breaks per dimension (chosen by quantiles); or a K-length list of vectors giving potential split points for non-categorical variables (can put c(0) for categorical). Similar to 'discrete splitting' in CausalTree, though there they use separate split points for treated and controls.
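The list form of breaks_per_dim can be built by hand. The sketch below uses base R quantiles to produce candidate split points for two continuous columns and the c(0) placeholder for a third column that stands in for a categorical variable. The data, `n_breaks`, and the choice of evenly spaced interior quantiles are assumptions for illustration; the package's own quantile rule may differ.

```r
set.seed(1)
# Hypothetical features: two continuous columns and one 0/1 column that
# stands in for a categorical variable
X <- cbind(x1 = rnorm(200), x2 = runif(200), x3 = rbinom(200, 1, 0.5))

n_breaks <- 5
# Evenly spaced interior probabilities (drop 0 and 1 so the extremes
# are not offered as split points)
probs <- seq(0, 1, length.out = n_breaks + 2)[-c(1, n_breaks + 2)]

breaks_per_dim <- list(
  unname(quantile(X[, "x1"], probs)),
  unname(quantile(X[, "x2"], probs)),
  c(0)  # placeholder entry for the categorical column
)
str(breaks_per_dim)
```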

potential_lambdas

Potential lambdas to search through in CV.

X_range

List of min/max for each dimension (e.g., from get_X_range).

bucket_min_n

Minimum number of observations needed between different split checks

bucket_min_d_var

Ensure positive variance of d for the observations between different split checks

obj_fn

Objective function. Default is eval_mse_hat. A user-provided function must accept the same signature.

est_plan

EstimatorPlan.

partition_i

Default NA. Supply the index of the partition to use in order to skip CV.

pr_cl

Default NULL. Parallel cluster. Used for:

  1. CVing the optimal lambda,

  2. fitting full tree (at each split going across dimensions),

  3. fitting trees over the bumped samples

bump_samples

Number of bump bootstraps (default 0), or a list of that length where each item is a bootstrap sample. If m_mode==DS.MULTI_SAMPLE, then each item is a sublist with one bootstrap sample per dataset.

bump_ratio

For bootstraps, the ratio of the bootstrap sample size to the full sample size (between 0 and 1, default 1).
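To make the interaction of bump_samples and bump_ratio concrete, the following base R sketch draws bootstrap samples by hand. It assumes each bootstrap sample is a vector of row indices drawn with replacement, which is a guess at the expected format and not confirmed by this page; `bump_B` and `n_draw` are hypothetical names.

```r
set.seed(1)
N <- 50
bump_B <- 10       # hypothetical number of bump bootstraps
bump_ratio <- 0.8  # each bootstrap draws 80% of the sample size

# Number of rows drawn per bootstrap sample
n_draw <- round(bump_ratio * N)

# One index vector per bootstrap, sampled with replacement
# (assumed representation of a "bootstrap sample")
bump_samples <- lapply(seq_len(bump_B), function(b) {
  sample.int(N, size = n_draw, replace = TRUE)
})

lengths(bump_samples)  # each sample has n_draw = 40 indices
```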

...

Additional params.

Value

An object with the following elements:

partition

Grid Partition (type=grid_partition)

is_obj_val_seq

Full sequence of in-sample objective function values

complexity_seq

Full sequence of partition complexities (num_cells - 1)

partition_i

Index of partition chosen

partition_seq

Full sequence of Grid Partitions

split_seq

Full sequence of splits (type=partition_split)

lambda

lambda chosen

folds_index_out

List of the held-out observations for each fold (e.g., we might have generated them)

Details

Returns the partition and information about the fitting process.

Multiple estimates

With multiple core estimates (M) there are 3 options (the last two use the same sample across treatment effects).

  1. DS.MULTI_SAMPLE: Multiple pairs of (Y_m,W_m). y,X,d are then lists of length M. Each element then has the typical size. The N_m may differ across m. The number of columns of X will be the same across m.

  2. DS.MULTI_D: Multiple treatments and a single outcome. d is then a NxM matrix.

  3. DS.MULTI_Y: A single treatment and multiple outcomes. y is then a NxM matrix.
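The three input layouts listed above can be sketched with simulated data in base R. This only illustrates the shapes of y, X, and d; the variable names, sizes, and binary treatments are assumptions, and the objects are not passed to fit_partition here.

```r
set.seed(1)
N <- 30; K <- 2; M <- 2

# 1. DS.MULTI_SAMPLE: y, X, d are lists of length M.
#    N_m may differ across m; the number of X columns (K) is common.
N_m <- c(30, 20)
X_ms <- lapply(N_m, function(n) matrix(rnorm(n * K), n, K))
y_ms <- lapply(N_m, function(n) matrix(rnorm(n), n, 1))
d_ms <- lapply(N_m, function(n) {
  matrix(rbinom(n, 1, 0.5), n, 1, dimnames = list(NULL, "d"))
})

# 2. DS.MULTI_D: a single outcome and an N x M treatment matrix
X <- matrix(rnorm(N * K), N, K)
y_md <- matrix(rnorm(N), N, 1)
d_md <- matrix(rbinom(N * M, 1, 0.5), N, M,
               dimnames = list(NULL, c("d1", "d2")))

# 3. DS.MULTI_Y: a single treatment and an N x M outcome matrix
y_my <- matrix(rnorm(N * M), N, M)
d_my <- matrix(rbinom(N, 1, 0.5), N, 1, dimnames = list(NULL, "d"))
```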