Production Deployment
This page walks through the train → save → reload → predict on new data lifecycle for FLAML models, with a focus on the gotchas that surface in production but not in the quick-start tutorials. Each section is a self-contained pattern with runnable code and a pointer to the issue or PR that motivated it.
Scope
You have called AutoML.fit(...) once on training data and now need to:
- Serialize the trained model so that a separate process can load and use it.
- Score new (unseen) input rows that may contain categorical features, new categorical values, or a slightly different class distribution.
- Reach into individual ensemble component models (automl.model.estimators_[i]).
- Pass sample weights at training time, and understand what predict() does (and does not) accept at inference time.
- Avoid the common silent-correctness bugs reported in #1101 (categorical encoding drift) and #1136 (ensemble component prediction).
What this page does not cover: training-time configuration (see Task-Oriented AutoML), zero-shot estimators (see Zero-Shot AutoML), or distributed/Spark deployment.
1. Save and reload the trained model
1.1 automl.pickle() — recommended default
automl.pickle() writes the entire AutoML instance, including the data transformer, the best estimator, and the search history. AutoML.load_pickle() restores it in another process. This is the simplest reliable path for FLAML.
```python
import numpy as np
import pandas as pd
from flaml import AutoML

X = pd.DataFrame(
    {
        "age": np.random.randint(20, 70, 400),
        "income": np.random.normal(50000, 15000, 400),
        "gender": np.random.choice(["M", "F"], 400),
        "education": np.random.choice(["HS", "BS", "MS", "PhD"], 400),
    }
)
y = (X["age"] > 40).astype(int)

automl = AutoML()
automl.fit(X, y, task="classification", time_budget=5, estimator_list=["lgbm"])
automl.pickle("automl.pkl")

# In a different process:
loaded = AutoML.load_pickle("automl.pkl")
assert np.array_equal(automl.predict(X), loaded.predict(X))
```
Use automl.pickle() whenever possible. It keeps everything needed at inference time in one artifact (data transformer included), so the categorical-encoding behavior described in section 3 is reproduced correctly.
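One way to catch a bad artifact early is to store a small reference batch and its expected predictions next to the model, then replay that batch right after loading. The sketch below uses only the standard library and a hypothetical ThresholdModel stand-in in place of a fitted AutoML instance; save_with_golden_batch and load_and_verify are illustrative helper names, not FLAML APIs.

```python
import os
import pickle
import tempfile


class ThresholdModel:
    """Hypothetical stand-in for a fitted model; only predict() matters here."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [int(row["age"] > self.threshold) for row in rows]


def save_with_golden_batch(model, golden_rows, path):
    """Pickle the model together with a reference batch and its predictions."""
    bundle = {
        "model": model,
        "golden_rows": golden_rows,
        "golden_preds": model.predict(golden_rows),
    }
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_and_verify(path):
    """Reload the bundle and fail fast if the reference predictions changed."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    model = bundle["model"]
    if model.predict(bundle["golden_rows"]) != bundle["golden_preds"]:
        raise RuntimeError("reloaded model disagrees with its golden batch")
    return model


golden = [{"age": 25}, {"age": 41}, {"age": 67}]
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "bundle.pkl")
    save_with_golden_batch(ThresholdModel(40), golden, bundle_path)
    reloaded = load_and_verify(bundle_path)
```

The same idea should carry over to a real deployment by keeping a few representative rows and the training-time predictions beside the pickle, and running the comparison as a startup check in the serving process.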
1.2 MLflow logging — for MLflow-managed deployments
If your serving stack is built around MLflow, log the trained AutoML instance explicitly via the sklearn flavor. This works because the AutoML object exposes a sklearn-compatible predict/predict_proba API.
```python
import mlflow
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlflow.set_experiment("flaml_prod")
automl = AutoML()
with mlflow.start_run() as run:
    automl.fit(
        X_train, y_train, task="classification", time_budget=5, mlflow_logging=False
    )
    mlflow.sklearn.log_model(automl, artifact_path="flaml_model")
    run_id = run.info.run_id

loaded = mlflow.sklearn.load_model(f"runs:/{run_id}/flaml_model")
assert np.array_equal(automl.predict(X_test), loaded.predict(X_test))
```
Two practical notes:
- mlflow_logging=False disables FLAML's built-in MLflow autologging path inside fit. With it enabled, MLflow auto-saves an artifact under runs:/{run_id}/model, but on recent MLflow versions reloading that artifact via mlflow.sklearn.load_model can return an unfitted Pipeline. The explicit mlflow.sklearn.log_model(automl, ...) call above sidesteps that issue.
- The argument is artifact_path= (not name=) in MLflow 2.x.
1.3 Pickling just the best estimator — lean serving
If you do not need the data transformer (because your serving pipeline preprocesses upstream and only needs to call the bare ML model), you can pickle automl.model instead of the whole AutoML. Use this only if you can guarantee that inference-time inputs match what FLAML produced after its data transformer ran — otherwise you will hit the categorical and ensemble issues in sections 3 and 4.
2. The public automl.preprocess(X) API
FLAML applies two layers of preprocessing inside automl.predict(X):
- Task-level preprocessing: handled by the internal DataTransformer (type coercions, NaN handling, categorical encoding, datetime expansion).
- Estimator-level preprocessing: handled by the estimator wrapper itself (e.g., Normalizer for the SGDEstimator, sparse-input conversion for XGBoost).
Calling automl.predict(X) chains both layers automatically. When you need to reach a single ensemble component or write a custom inference pipeline, call them explicitly:
```python
# Task-level preprocessing, accessible since #1497
X_pre = automl.preprocess(X_test)

# Estimator-level preprocessing on top of the task-level output
X_full = automl.model.preprocess(X_pre)
```
For most consumers, automl.preprocess(X_test) is all you need before delegating to a single estimator. Section 4 walks through the canonical use case.
3. Categorical features at inference time
This section is the answer to issue #1101 and the silent-correctness bug fixed in PR #1561.
3.1 What FLAML does at fit time
When X is a pandas DataFrame containing object, string, or category columns, DataTransformer.fit_transform records the per-column category list seen at fit time and pins it on the transformer instance. Each known category gets a stable integer code; an extra reserved slot is held for the "__NAN__" sentinel that future inference batches may need.
3.2 What transform does at predict time
DataTransformer.transform re-uses the pinned category list, so the integer code assigned to each known category at predict time is identical to the one assigned at fit time — regardless of which values happen to appear in the predict-time DataFrame.
```python
import pandas as pd
import numpy as np
from flaml.automl.data import DataTransformer
from flaml.automl.task.factory import task_factory

rng = np.random.RandomState(0)
fit_df = pd.DataFrame(
    {
        "a": rng.randn(120),
        "gender": rng.choice(["M", "F"], 120),
    }
)
fit_y = pd.Series(rng.randn(120))

transformer = DataTransformer()
transformer.fit_transform(
    fit_df.copy(), fit_y, task_factory("regression", fit_df, fit_y)
)

# Predict-time DataFrame contains only the "M" category
predict_df = pd.DataFrame({"a": np.zeros(20), "gender": ["M"] * 20})
X_pred = transformer.transform(predict_df.copy())
# The integer code assigned to "M" is the same as at fit time: no drift.
```
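To see why pinning matters, the following sketch reproduces the drift with plain pandas, independent of FLAML. It illustrates the failure mode rather than DataTransformer's internals: naive per-batch encoding re-derives codes from whatever values are present, while passing an explicit categories list keeps them fixed.

```python
import pandas as pd

fit_values = pd.Series(["M", "F", "M", "F"])
predict_values = pd.Series(["M", "M"])  # only one category shows up

# Naive encoding: categories are inferred per batch, so codes can shift.
naive_fit_codes = fit_values.astype("category").cat.codes.tolist()
naive_predict_codes = predict_values.astype("category").cat.codes.tolist()

# Pinned encoding: reuse the category list recorded at fit time.
pinned_categories = sorted(fit_values.unique())
pinned_predict_codes = pd.Categorical(
    predict_values, categories=pinned_categories
).codes.tolist()

print(naive_fit_codes)       # "M" is encoded as 1 at fit time
print(naive_predict_codes)   # "M" becomes 0 when "F" is absent
print(pinned_predict_codes)  # "M" stays 1 with the pinned list
```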
3.3 Unseen categories
If predict-time data contains values that were not seen at fit time, FLAML now emits a UserWarning and encodes those rows as the "__NAN__" sentinel. Capture that warning in your serving code and decide how to react (log, alert, reject the batch, etc.).
```python
import warnings

# Reuses pd, np, and the fitted transformer from section 3.2.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    predict_df = pd.DataFrame({"a": np.zeros(5), "gender": ["M", "F", "X", "Y", "M"]})
    X_pred = transformer.transform(predict_df.copy())

unseen = [
    w
    for w in caught
    if issubclass(w.category, UserWarning) and "unseen at fit time" in str(w.message)
]
if unseen:
    # In production this is where you raise an alert / reject the batch /
    # fall back to a default category.
    print(unseen[0].message)
```
The model still produces a prediction for rows mapped to "__NAN__", but those predictions are unreliable: the model was not trained on that category. Treat unseen-category warnings as a deployment health signal, not background noise.
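If the chosen policy is to reject the batch outright, one option is to promote the warning to an exception with the standard warnings filter. The sketch below is an assumption-laden illustration: transform_with_possible_warning is a hypothetical stand-in for the real transform call, and the filter matches on the "unseen at fit time" message fragment used in the snippet above.

```python
import warnings


def transform_with_possible_warning(values, known):
    """Hypothetical stand-in for a transform that warns on unseen values."""
    unseen = sorted(set(values) - set(known))
    if unseen:
        warnings.warn(
            f"Column 'gender' contains values unseen at fit time: {unseen}",
            UserWarning,
        )
    return [v if v in known else "__NAN__" for v in values]


def strict_transform(values, known):
    """Reject the batch by promoting the unseen-category warning to an error."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "error", message=".*unseen at fit time.*", category=UserWarning
        )
        return transform_with_possible_warning(values, known)


ok = strict_transform(["M", "F"], known={"M", "F"})
try:
    strict_transform(["M", "X"], known={"M", "F"})
    rejected = False
except UserWarning as exc:
    rejected = True
    reason = str(exc)
```

Scoping the filter inside catch_warnings keeps the stricter behavior local to the serving call, so unrelated warnings elsewhere in the process are unaffected.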
3.4 Recommended workflow
If your production data may legitimately introduce new categorical values over time (a new product code, a new geography), pin the category list upstream of FLAML using sklearn's OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1) or an equivalent component, and pass the encoded DataFrame into AutoML.fit. This makes encoding consistency an explicit part of your pipeline rather than relying on FLAML's defensive sentinel.
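A minimal sketch of that upstream encoding, using scikit-learn only. The column name and the explicit count of -1 codes are illustrative choices; the encoded frame would then be what you pass to AutoML.fit and, later, to predict.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"education": ["HS", "BS", "MS", "PhD", "BS"]})
serve = pd.DataFrame({"education": ["MS", "JD"]})  # "JD" was never seen

# Unknown values map to -1 instead of raising or shifting existing codes.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["education"]])
serve_codes = encoder.transform(serve[["education"]])

# Count unknowns explicitly instead of relying on a downstream warning.
n_unknown = int((serve_codes == -1).sum())
print(serve_codes.ravel().tolist(), n_unknown)
```

Persist the fitted encoder together with the model so that both sides of the pipeline share one category list.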
4. Ensemble component access
This is the canonical pattern for issue #1136 (closed by PR #1558).
When AutoML.fit(..., ensemble=True) is used, automl.model is a sklearn StackingClassifier/StackingRegressor whose estimators_ were trained on data that has already passed through FLAML's task-level preprocessing. As a result, calling automl.model.estimators_[i].predict(X_raw) directly raises a confusing error from the underlying estimator (LightGBM: train and valid dataset categorical_feature do not match, XGBoost: DataFrame.dtypes must be int/float/bool/category, etc.).
The fix is to preprocess raw input via automl.preprocess(X) first:
```python
automl = AutoML()
automl.fit(
    X,
    y,
    task="classification",
    ensemble=True,
    estimator_list=["lgbm", "xgboost", "rf"],
    time_budget=10,
)

# Direct call on raw input (DOES NOT WORK):
# automl.model.estimators_[0].predict(X)  # raises ValueError on categorical input

# Correct pattern: preprocess first.
X_pre = automl.preprocess(X)
component_preds = [est.predict(X_pre) for est in automl.model.estimators_]
```
This is intentionally a two-step process. automl.predict(X) does both steps for you; component-level access is for cases where you need per-component scores, predictions, or feature attributions, in which case you handle the preprocessing call site explicitly.
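For readers unfamiliar with the sklearn stacking attributes involved, here is a small sketch on a plain StackingClassifier, with no FLAML in the loop. The data is synthetic and already numeric, standing in for the output of automl.preprocess(X); the agreement metric is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in for preprocessed features and labels.
X_pre, y = make_classification(n_samples=200, n_features=6, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_pre, y)

# named_estimators_ maps each component name to its fitted estimator.
component_preds = {
    name: est.predict(X_pre) for name, est in stack.named_estimators_.items()
}

# Fraction of rows where each component matches the final stacked prediction.
final = stack.predict(X_pre)
agreement = {
    name: float((preds == final).mean()) for name, preds in component_preds.items()
}
```

One detail worth checking in your own setup: plain sklearn fits stacking components on label-encoded targets, so with non-integer class labels the component predictions are indices into stack.classes_ rather than the original labels.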
5. Sample weights and cost-sensitive learning
Pass sample_weight at fit time to perform cost-sensitive training. FLAML honors the weight inside both the holdout and CV evaluation paths.
```python
import numpy as np
from flaml import AutoML

# 5x weight on the minority (positive) class
sample_weight = np.where(y_train == 1, 5.0, 1.0)

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    sample_weight=sample_weight,
    task="classification",
    time_budget=5,
)
```
Compatibility notes:
- split_type="time" + sample_weight works correctly after PR #1554 (closes #887).
- predict() does not take a sample_weight argument; weights apply only during training. For weighted evaluation on new data, compute the metric outside FLAML (e.g., sklearn.metrics.f1_score(y_test, automl.predict(X_test), sample_weight=test_weight)).
- class_weight is passed through to the underlying estimator unchanged if your chosen estimator accepts it (e.g., LightGBM, XGBoost sklearn API).
For severe class imbalance, see also issue #1200 on adding a resampler= integration. The current recommendation is to apply SMOTE (or your resampler of choice) upstream of AutoML.fit; see the imbalanced-learn documentation for the canonical pattern.
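The weighted-evaluation note above can be made concrete with a tiny worked example. The arrays are made up for illustration, and y_pred stands in for automl.predict(X_test); only scikit-learn's f1_score is used.

```python
import numpy as np
from sklearn.metrics import f1_score

y_test = np.array([1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1])  # stand-in for automl.predict(X_test)

# Same 5x weighting on the positive class as used at training time.
test_weight = np.where(y_test == 1, 5.0, 1.0)

unweighted_f1 = f1_score(y_test, y_pred)
weighted_f1 = f1_score(y_test, y_pred, sample_weight=test_weight)

print(unweighted_f1)  # 0.5
print(weighted_f1)    # 0.625
```

With weights, the true positive and the false negative each count 5 while the false positive counts 1, so F1 becomes 2·5 / (2·5 + 1 + 5) = 0.625 instead of 0.5.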
6. Multi-output regression
For multi-target regression today, wrap a fresh AutoML(task="regression", ...) per target with sklearn's MultiOutputRegressor or RegressorChain:
```python
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from flaml import AutoML

X, y = make_regression(n_samples=200, n_targets=3, random_state=42)

model = MultiOutputRegressor(
    AutoML(task="regression", time_budget=1, estimator_list=["lgbm"])
)
model.fit(X[:150], y[:150])
preds = model.predict(X[150:])
```
Known limitation: passing X_val and y_val through MultiOutputRegressor does not flow into each inner AutoML.fit (#1115). Workaround: concatenate train + val into a single dataset and use a custom splitter, or call AutoML per target manually.
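The "call AutoML per target manually" workaround might look roughly like the loop below. To keep the sketch runnable without FLAML, it uses a hypothetical MeanModel stand-in whose fit() accepts X_val and y_val keywords; fit_per_target and predict_per_target are illustrative helper names. Swapping the factory for one that returns a fresh AutoML instance is the assumed real-world use.

```python
import numpy as np


class MeanModel:
    """Hypothetical stand-in whose fit() accepts X_val/y_val keywords."""

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        self.mean_ = float(np.mean(y_train))
        # Record whether validation data actually reached this model.
        self.saw_validation_ = X_val is not None and y_val is not None
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def fit_per_target(make_model, X_train, Y_train, X_val, Y_val):
    """Fit one model per target column, forwarding that column's validation split."""
    models = []
    for j in range(Y_train.shape[1]):
        model = make_model()
        model.fit(X_train, Y_train[:, j], X_val=X_val, y_val=Y_val[:, j])
        models.append(model)
    return models


def predict_per_target(models, X):
    """Stack per-target predictions into an (n_samples, n_targets) array."""
    return np.column_stack([m.predict(X) for m in models])


rng = np.random.RandomState(0)
X = rng.randn(100, 4)
Y = rng.randn(100, 3)

models = fit_per_target(MeanModel, X[:70], Y[:70], X[70:85], Y[70:85])
preds = predict_per_target(models, X[85:])
```

Because each inner fit call is written out explicitly, the validation split reaches every per-target model, which is the part MultiOutputRegressor does not forward.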
Native multi-target support is being tracked in #1301; when it lands, prefer the native path.
7. Versioning and reproducibility
Two pieces matter for reproducible predictions in production:
- The FLAML random seed: pass it via automl.fit(..., seed=N) to make the search deterministic. The 2026-05 reproducibility audit (closes #1540) standardized how every audited estimator honors this seed; see #1541 (SGD), #1546 (LRL1), #1547 (RandomForest/ExtraTrees), #1549 (XGBoost sklearn), #1551 (XGBoost native), #1552 (LRL2), #1556 (LRL penalty/n_jobs deprecations).
- Pinned library versions: flaml, scikit-learn, lightgbm, xgboost, catboost, pandas, and numpy should all be pinned in your serving environment. Mismatches between training-time and serving-time versions of any of these can produce silently divergent predictions even with the same seed.
A minimal training-environment requirements.txt snippet:
```text
flaml==2.6.0
scikit-learn==1.8.0
lightgbm>=4.0,<5.0
xgboost>=2.0,<3.0
pandas>=2.0,<3.0
numpy>=1.26,<3.0
```
When you ship a model, ship the corresponding requirements.txt (or conda-lock.yml) alongside the pickle/MLflow artifact and use the same versions to instantiate the serving environment.
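A lightweight complement to shipping requirements.txt is to record the versions actually installed at training time and compare them when the serving process starts. The sketch below uses only the standard library; snapshot_versions and diff_versions are hypothetical helper names, and the package list simply mirrors the one above.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(expected, actual):
    """Map each mismatched package to (training version, serving version)."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


PACKAGES = [
    "flaml",
    "scikit-learn",
    "lightgbm",
    "xgboost",
    "catboost",
    "pandas",
    "numpy",
]

# At training time: store this dict next to the model artifact (e.g. as JSON).
training_versions = snapshot_versions(PACKAGES)

# At serving time: recompute and compare before accepting traffic.
mismatches = diff_versions(training_versions, snapshot_versions(PACKAGES))
```

An empty mismatches dict means the serving environment matches the recorded one; anything else is a reason to stop and rebuild the environment before scoring.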
8. Common gotchas — quick reference
| Symptom | Cause | Fix |
|---|---|---|
| predict() on a DataFrame returns different codes than at fit time | Predict-time DataFrame had a different subset of categorical values | Upgrade to a FLAML release that includes #1561, or pin categories upstream via OrdinalEncoder |
| UserWarning: Column '...' contains values unseen at fit time | New category at inference time | Decide policy: alert, retrain, or fall back to default |
| automl.model.estimators_[i].predict(X) raises on categorical input | Component model expects preprocessed input | Call automl.preprocess(X) first |
| MultiOutputRegressor(AutoML(...)) ignores X_val | Per-target inner AutoML.fit doesn't see validation kwargs | Use a custom splitter on the concatenated dataset |
| AttributeError: 'AutoMLState' has no attribute 'sample_weight_all' with retrain_full=True | Pre-#1554 bug | Upgrade FLAML past #1554 |
| MLflow autolog'd model loads as an unfitted Pipeline | Older example assumed an autolog artifact path that no longer reliably reloads | Use the explicit mlflow.sklearn.log_model(automl, artifact_path=...) pattern in §1.2 |
See also: Best-Practices, Task-Oriented AutoML, FAQ.