Production Deployment

This page walks through the train → save → reload → predict on new data lifecycle for FLAML models, with a focus on the gotchas that surface in production but not in the quick-start tutorials. Each section is a self-contained pattern with runnable code and a pointer to the issue or PR that motivated it.

Scope

You have called AutoML.fit(...) once on training data and now need to:

Serialize the trained model so that a separate process can load and use it.
Score new (unseen) input rows that may contain categorical features, new categorical values, or a slightly different class distribution.
Reach into individual ensemble component models (automl.model.estimators_[i]).
Pass sample weights at training time, and understand what predict() does (and does not) accept at inference time.
Avoid the common silent-correctness bugs reported in #1101 (categorical encoding drift) and #1136 (ensemble component prediction).

What this page does not cover: training-time configuration (see Task-Oriented AutoML), zero-shot estimators (see Zero-Shot AutoML), or distributed/Spark deployment.

1. Save and reload the trained model

1.1 `automl.pickle()` — recommended default

automl.pickle() writes the entire AutoML instance, including the data transformer, the best estimator, and the search history. AutoML.load_pickle() restores it in another process. This is the simplest reliable path for FLAML.

import numpy as np
import pandas as pd
from flaml import AutoML

X = pd.DataFrame(
    {
        "age": np.random.randint(20, 70, 400),
        "income": np.random.normal(50000, 15000, 400),
        "gender": np.random.choice(["M", "F"], 400),
        "education": np.random.choice(["HS", "BS", "MS", "PhD"], 400),
    }
)
y = (X["age"] > 40).astype(int)

automl = AutoML()
automl.fit(X, y, task="classification", time_budget=5, estimator_list=["lgbm"])
automl.pickle("automl.pkl")

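# In a different process you would run only the load_pickle line.
# The comparison is shown inline here so the round trip can be verified.
loaded = AutoML.load_pickle("automl.pkl")
assert np.array_equal(automl.predict(X), loaded.predict(X))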

Use automl.pickle() whenever possible. It is the only path that preserves everything needed at inference time (data transformer included), so the categorical-encoding behavior described in section 3 is reproduced correctly.

1.2 MLflow logging — for MLflow-managed deployments

If your serving stack is built around MLflow, log the trained AutoML instance explicitly via the sklearn flavor. This works because the AutoML object exposes a sklearn-compatible predict/predict_proba API.

import mlflow
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlflow.set_experiment("flaml_prod")
automl = AutoML()
with mlflow.start_run() as run:
    automl.fit(
        X_train, y_train, task="classification", time_budget=5, mlflow_logging=False
    )
    mlflow.sklearn.log_model(automl, artifact_path="flaml_model")
    run_id = run.info.run_id

loaded = mlflow.sklearn.load_model(f"runs:/{run_id}/flaml_model")
assert np.array_equal(automl.predict(X_test), loaded.predict(X_test))

Two practical notes:

mlflow_logging=False disables FLAML's built-in MLflow autologging path inside fit. With it enabled, MLflow auto-saves an artifact under runs:/{run_id}/model, but on recent MLflow versions reloading that artifact via mlflow.sklearn.load_model can return an unfitted Pipeline. The explicit mlflow.sklearn.log_model(automl, ...) call above sidesteps that issue.
The argument is artifact_path= (not name=) in MLflow 2.x.

1.3 Pickling just the best estimator — lean serving

If you do not need the data transformer (because your serving pipeline preprocesses upstream and only needs to call the bare ML model), you can pickle automl.model instead of the whole AutoML. Use this only if you can guarantee that inference-time inputs match what FLAML produced after its data transformer ran — otherwise you will hit the categorical and ensemble issues in sections 3 and 4.
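
As a rough sketch of the lean path, the bare estimator can be written with the standard pickle module. The helper names below (save_estimator, load_estimator) are hypothetical and not part of FLAML; they accept any picklable object, such as automl.model.

```python
import pickle
from pathlib import Path


def save_estimator(estimator, path):
    """Serialize only the estimator object (no AutoML wrapper, no data transformer)."""
    Path(path).write_bytes(pickle.dumps(estimator, protocol=pickle.HIGHEST_PROTOCOL))


def load_estimator(path):
    """Restore the estimator. Only unpickle files you produced and trust."""
    return pickle.loads(Path(path).read_bytes())
```

With a fitted AutoML instance this would be called as save_estimator(automl.model, "best_model.pkl"). The loaded object then expects inputs that already look like FLAML's post-transformer data, which is the constraint described above.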

2. The public `automl.preprocess(X)` API

FLAML applies two layers of preprocessing inside automl.predict(X):

Task-level preprocessing — handled by the internal DataTransformer: type coercions, NaN handling, categorical encoding, datetime expansion.
Estimator-level preprocessing — handled by the estimator wrapper itself (e.g., Normalizer for the SGDEstimator, sparse-input conversion for XGBoost).

Calling automl.predict(X) chains both layers automatically. When you need to reach a single ensemble component or write a custom inference pipeline, call them explicitly:

# Task-level preprocessing, accessible since #1497
X_pre = automl.preprocess(X_test)

# Estimator-level preprocessing on top of the task-level output
X_full = automl.model.preprocess(X_pre)

For most consumers, automl.preprocess(X_test) is all you need before delegating to a single estimator. Section 4 walks through the canonical use case.

3. Categorical features at inference time

This section is the answer to issue #1101 and the silent-correctness bug fixed in PR #1561.

3.1 What FLAML does at fit time

When X is a pandas DataFrame containing object, string, or category columns, DataTransformer.fit_transform records the per-column category list seen at fit time and pins it on the transformer instance. Each known category gets a stable integer code; an extra reserved slot is held for the "__NAN__" sentinel that future inference batches may need.

3.2 What `transform` does at predict time

DataTransformer.transform re-uses the pinned category list, so the integer code assigned to each known category at predict time is identical to the one assigned at fit time — regardless of which values happen to appear in the predict-time DataFrame.

import pandas as pd
import numpy as np
from flaml.automl.data import DataTransformer
from flaml.automl.task.factory import task_factory

rng = np.random.RandomState(0)
fit_df = pd.DataFrame(
    {
        "a": rng.randn(120),
        "gender": rng.choice(["M", "F"], 120),
    }
)
fit_y = pd.Series(rng.randn(120))

transformer = DataTransformer()
transformer.fit_transform(
    fit_df.copy(), fit_y, task_factory("regression", fit_df, fit_y)
)

# Predict-time DataFrame contains only the "M" category
predict_df = pd.DataFrame({"a": np.zeros(20), "gender": ["M"] * 20})
X_pred = transformer.transform(predict_df.copy())

# The integer code assigned to "M" is the same as at fit time — no drift.
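
To see why pinning matters, the sketch below uses plain pandas (not FLAML's internal code) to contrast codes inferred from a predict-time batch with codes taken from a category list recorded at fit time. The variable names are illustrative.

```python
import pandas as pd

fit_values = pd.Series(["M", "F", "F", "M"])
# Category list recorded once, at fit time.
pinned = sorted(fit_values.unique())  # ["F", "M"]

predict_values = pd.Series(["M", "M", "M"])

# Inferring categories from the batch itself: "M" is the only category, so it gets code 0.
drifted = predict_values.astype("category").cat.codes.tolist()

# Re-using the pinned list: "M" keeps the code it had at fit time.
stable = pd.Categorical(predict_values, categories=pinned).codes.tolist()

print(drifted)  # [0, 0, 0]
print(stable)   # [1, 1, 1]
```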

3.3 Unseen categories

If predict-time data contains values that were not seen at fit time, FLAML now emits a UserWarning and encodes those rows as the "__NAN__" sentinel. Capture that warning in your serving code and decide how to react (log, alert, reject the batch, etc.).

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    predict_df = pd.DataFrame({"a": np.zeros(5), "gender": ["M", "F", "X", "Y", "M"]})
    X_pred = transformer.transform(predict_df.copy())

unseen = [
    w
    for w in caught
    if issubclass(w.category, UserWarning) and "unseen at fit time" in str(w.message)
]
if unseen:
    # In production this is where you raise an alert / reject the batch /
    # fall back to a default category.
    print(unseen[0].message)

The model still produces a prediction for rows mapped to "__NAN__", but those predictions are unreliable: the model was not trained on that category. Treat unseen-category warnings as a deployment health signal, not background noise.
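
One possible way to turn that health signal into a hard policy is a small wrapper that rejects the batch outright. This is a sketch only: transform_or_reject and UnseenCategoryError are hypothetical names, and the marker string is assumed to match the warning text used in the filter above.

```python
import warnings


class UnseenCategoryError(RuntimeError):
    """Raised when a batch contains categorical values not seen at fit time."""


def transform_or_reject(transform, batch, marker="unseen at fit time"):
    """Run transform(batch) and raise if an unseen-category warning fires."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = transform(batch)

    unseen = [
        w
        for w in caught
        if issubclass(w.category, UserWarning) and marker in str(w.message)
    ]
    # Re-emit unrelated warnings so that recording does not swallow them.
    for w in caught:
        if w not in unseen:
            warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)

    if unseen:
        raise UnseenCategoryError("; ".join(str(w.message) for w in unseen))
    return result
```

In a serving handler, the callable passed in would be something like transformer.transform or automl.predict, and the exception would map to whatever rejection path your service uses.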

3.4 Recommended workflow

If your production data may legitimately introduce new categorical values over time (a new product code, a new geography), pin the category list upstream of FLAML using sklearn's OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1) or an equivalent component, and pass the encoded DataFrame into AutoML.fit. This makes encoding consistency an explicit part of your pipeline rather than relying on FLAML's defensive sentinel.
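
A minimal sketch of that upstream encoding step, using a toy single-column frame. The column name and values are illustrative; in a real pipeline the fitted encoder would be persisted and reused at serving time.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"gender": ["M", "F", "F", "M"]})
serve = pd.DataFrame({"gender": ["M", "X", "F"]})  # "X" never appeared in training

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["gender"]])  # categories are pinned here, in sorted order

train_codes = encoder.transform(train[["gender"]]).ravel().tolist()
serve_codes = encoder.transform(serve[["gender"]]).ravel().tolist()

print(train_codes)  # [1.0, 0.0, 0.0, 1.0]
print(serve_codes)  # [1.0, -1.0, 0.0]
```

The unknown value "X" maps to -1 instead of shifting the codes of the known categories, so the model sees a consistent encoding for "M" and "F" regardless of what else appears in the batch.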

4. Ensemble component access

This is the canonical pattern for issue #1136 (closed by PR #1558).

When AutoML.fit(..., ensemble=True) is used, automl.model is a sklearn StackingClassifier/StackingRegressor whose estimators_ were trained on data that has already passed through FLAML's task-level preprocessing. As a result, calling automl.model.estimators_[i].predict(X_raw) directly raises a confusing error from the underlying estimator (LightGBM: train and valid dataset categorical_feature do not match, XGBoost: DataFrame.dtypes must be int/float/bool/category, etc.).

The fix is to preprocess raw input via automl.preprocess(X) first:

automl = AutoML()
automl.fit(
    X,
    y,
    task="classification",
    ensemble=True,
    estimator_list=["lgbm", "xgboost", "rf"],
    time_budget=10,
)

# Direct call on raw input — DOES NOT WORK:
# automl.model.estimators_[0].predict(X)   # raises ValueError on categorical input

# Correct pattern — preprocess first:
X_pre = automl.preprocess(X)
component_preds = [est.predict(X_pre) for est in automl.model.estimators_]

This is intentionally a two-step process. automl.predict(X) does both steps for you; component-level access is for cases where you need per-component scores, predictions, or feature attributions, in which case you handle the preprocessing call site explicitly.
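
If per-component predictions are needed in more than one place, the two steps can be wrapped in a helper. The sketch below is duck-typed and assumes only what this section describes: an object with preprocess(X) and a model.estimators_ list of fitted components. Both function names are hypothetical.

```python
import numpy as np


def component_predictions(automl, X_raw):
    """Return one prediction array per fitted ensemble component."""
    X_pre = automl.preprocess(X_raw)  # preprocess once, reuse for every component
    return [np.asarray(est.predict(X_pre)) for est in automl.model.estimators_]


def pairwise_agreement(preds):
    """Fraction of rows on which each pair of components predicts the same label."""
    return {
        (i, j): float(np.mean(preds[i] == preds[j]))
        for i in range(len(preds))
        for j in range(i + 1, len(preds))
    }
```

Low agreement between components on a production batch can be a useful drift signal, although what counts as "low" depends on the task.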

5. Sample weights and cost-sensitive learning

Pass sample_weight at fit time to perform cost-sensitive training. FLAML honors the weights in both the holdout and CV evaluation paths.

import numpy as np
from flaml import AutoML

# Assumes X_train / y_train hold a binary-classification training split.
# 5x weight on the minority (positive) class
sample_weight = np.where(y_train == 1, 5.0, 1.0)
automl = AutoML()
automl.fit(
    X_train,
    y_train,
    sample_weight=sample_weight,
    task="classification",
    time_budget=5,
)

Compatibility notes:

split_type="time" + sample_weight works correctly after PR #1554 (closes #887).
predict() does not take a sample_weight argument — weights apply only during training. For weighted evaluation on new data, compute the metric outside FLAML (e.g., sklearn.metrics.f1_score(y_test, automl.predict(X_test), sample_weight=test_weight)).
class_weight is passed through to the underlying estimator unchanged if your chosen estimator accepts it (e.g., LightGBM, XGBoost sklearn API).
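
The weighted-evaluation note can be illustrated with a small worked example. The arrays are toy data, and y_pred stands in for the output of automl.predict(X_test).

```python
import numpy as np
from sklearn.metrics import f1_score

y_test = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1])  # stand-in for automl.predict(X_test)

# Same 5x weighting on the positive class that was used at training time.
test_weight = np.where(y_test == 1, 5.0, 1.0)

unweighted = f1_score(y_test, y_pred)
weighted = f1_score(y_test, y_pred, sample_weight=test_weight)

print(round(unweighted, 3))  # 0.5
print(round(weighted, 3))    # 0.625
```

With weights, the true positive and the false negative each count 5 while the false positive counts 1, so precision rises to 5/6 while recall stays at 0.5, giving F1 = 0.625.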

For severe class imbalance, see also issue #1200 on adding a resampler= integration. The current recommendation is to apply SMOTE (or your resampler of choice) upstream of AutoML.fit; see the imbalanced-learn documentation for the canonical pattern.

6. Multi-output regression

For multi-target regression today, wrap an AutoML(task="regression", ...) instance in sklearn's MultiOutputRegressor or RegressorChain, each of which fits a fresh clone of it per target:

from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from flaml import AutoML

X, y = make_regression(n_samples=200, n_targets=3, random_state=42)
model = MultiOutputRegressor(
    AutoML(task="regression", time_budget=1, estimator_list=["lgbm"])
)
model.fit(X[:150], y[:150])
preds = model.predict(X[150:])

Known limitation: passing X_val and y_val through MultiOutputRegressor does not flow into each inner AutoML.fit (#1115). Workaround: concatenate train + val into a single dataset and use a custom splitter, or call AutoML per target manually.
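
The "call AutoML per target manually" workaround might look like the sketch below. The helper names are hypothetical, and the code assumes only that the model's fit accepts X_val and y_val keyword arguments, as named in the limitation above. It takes a model factory so that each target gets a fresh instance.

```python
import numpy as np


def fit_per_target(make_model, X_train, Y_train, X_val, Y_val, **fit_kwargs):
    """Fit one model per target column, forwarding that column's validation data."""
    Y_train, Y_val = np.asarray(Y_train), np.asarray(Y_val)
    models = []
    for j in range(Y_train.shape[1]):
        model = make_model()  # fresh, unfitted model for each target
        model.fit(X_train, Y_train[:, j], X_val=X_val, y_val=Y_val[:, j], **fit_kwargs)
        models.append(model)
    return models


def predict_per_target(models, X):
    """Stack the per-target predictions into an (n_samples, n_targets) array."""
    return np.column_stack([m.predict(X) for m in models])
```

With FLAML, a call could look like fit_per_target(AutoML, X[:120], y[:120], X[120:150], y[120:150], task="regression", time_budget=1, estimator_list=["lgbm"]), followed by predict_per_target(models, X[150:]).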

Native multi-target support is being tracked in #1301; when it lands, prefer the native path.

7. Versioning and reproducibility

Two pieces matter for reproducible predictions in production:

The FLAML random seed: pass it via automl.fit(..., seed=N) to make the search deterministic. The 2026-05 reproducibility audit (closes #1540) standardized how every audited estimator honors this seed; see #1541 (SGD), #1546 (LRL1), #1547 (RandomForest/ExtraTrees), #1549 (XGBoost sklearn), #1551 (XGBoost native), #1552 (LRL2), #1556 (LRL penalty/n_jobs deprecations).
Pinned library versions: flaml, scikit-learn, lightgbm, xgboost, catboost, pandas, and numpy should all be pinned in your serving environment. Mismatches between training-time and serving-time versions of any of these can produce silently divergent predictions even with the same seed.

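A minimal training-environment requirements.txt snippet. The ranges below are illustrative; in the serving environment, replace each range with the exact version used at training time: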

flaml==2.6.0
scikit-learn==1.8.0
lightgbm>=4.0,<5.0
xgboost>=2.0,<3.0
pandas>=2.0,<3.0
numpy>=1.26,<3.0

When you ship a model, ship the corresponding requirements.txt (or conda-lock.yml) alongside the pickle/MLflow artifact and use the same versions to instantiate the serving environment.
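
One way to produce those exact pins at training time is to read the installed versions with the standard library. This is a sketch, not a FLAML feature: the function names and the package list are assumptions, so adjust the list to whatever your estimator_list actually pulls in.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["flaml", "scikit-learn", "lightgbm", "xgboost", "catboost", "pandas", "numpy"]


def freeze_versions(packages=PACKAGES):
    """Return exact name==version pins for the packages installed in this environment."""
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here (for example, catboost was never used); leave it out.
            continue
    return pins


def write_requirements(path, packages=PACKAGES):
    """Write the pins next to the model artifact so serving can recreate the environment."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(freeze_versions(packages)) + "\n")
```

Calling write_requirements("requirements.txt") right after automl.pickle(...) keeps the artifact and its environment description in the same commit or upload.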

8. Common gotchas — quick reference

Symptom	Cause	Fix
`predict()` on a DataFrame encodes categories with different integer codes than at fit time	Predict-time DataFrame had a different subset of categorical values	Upgrade to a FLAML release that includes #1561; or pin categories upstream via `OrdinalEncoder`
`UserWarning: Column '...' contains values unseen at fit time`	New category at inference time	Decide policy: alert, retrain, or fall back to default
`automl.model.estimators_[i].predict(X)` raises on categorical input	Component model expects preprocessed input	Call `automl.preprocess(X)` first
`MultiOutputRegressor(AutoML(...))` ignores `X_val`	Per-target inner `AutoML.fit` doesn't see validation kwargs	Use a custom splitter on the concatenated dataset
`AttributeError: 'AutoMLState' has no attribute 'sample_weight_all'` on `retrain_full=True`	Pre-#1554 bug	Upgrade FLAML past #1554
MLflow autolog'd model loads as an unfitted `Pipeline`	Older example assumed an autolog artifact path that no longer reliably reloads	Use the explicit `mlflow.sklearn.log_model(automl, artifact_path=...)` pattern in §1.2
