# Frequently Asked Questions

### [Guidelines on how to set a hyperparameter search space](Use-Cases/Tune-User-Defined-Function#details-and-guidelines-on-hyperparameter-search-space)

### [Guidelines on parallel vs sequential tuning](Use-Cases/Task-Oriented-AutoML#guidelines-on-parallel-vs-sequential-tuning)

### [Guidelines on creating and tuning a custom estimator](Use-Cases/Task-Oriented-AutoML#guidelines-on-tuning-a-custom-estimator)

### About `low_cost_partial_config` in `tune`

- Definition and purpose: The `low_cost_partial_config` is a dictionary of a subset of the hyperparameter coordinates whose value corresponds to a configuration with known low cost (i.e., low computation cost for training the corresponding model). The concept of low/high cost is meaningful in the case where a subset of the hyperparameters to tune directly affects the computation cost for training the model. For example, `n_estimators` and `max_leaves` are known to affect the training cost of tree-based learners. We call this subset of hyperparameters *cost-related hyperparameters*. In such scenarios, if you are aware of low-cost configurations for the cost-related hyperparameters, you are recommended to set them as the `low_cost_partial_config`. Using the tree-based method example again, since we know that small `n_estimators` and `max_leaves` generally correspond to simpler models and thus lower cost, we set `{'n_estimators': 4, 'max_leaves': 4}` as the `low_cost_partial_config` by default (note that `4` is the lower bound of the search space for these two hyperparameters), e.g., in [LGBM](https://github.com/microsoft/FLAML/blob/main/flaml/model.py#L215). Configuring `low_cost_partial_config` helps the search algorithms make more cost-efficient choices.

  In AutoML, the `low_cost_init_value` in the `search_space()` function for each estimator serves the same role.

- Usage in practice: It is recommended to configure it if there are cost-related hyperparameters in your tuning task and you happen to know their low-cost values, but it is not required (it is fine to leave it at its default value, i.e., `None`).

- How does it work: `low_cost_partial_config`, if configured, is used as an initial point of the search. It also affects the search trajectory. For more details on how it plays a role in the search algorithms, please refer to the papers about the search algorithms used: Section 2 of [Frugal Optimization for Cost-related Hyperparameters (CFO)](https://arxiv.org/pdf/2005.01571.pdf) and Section 3 of [Economical Hyperparameter Optimization with Blended Search Strategy (BlendSearch)](https://openreview.net/pdf?id=VbLH04pRA3). A minimal usage sketch follows this list.
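
A minimal sketch of passing `low_cost_partial_config` to `flaml.tune.run` is shown below; the objective function and the search-space bounds are illustrative placeholders, not FLAML defaults.

```python
from flaml import tune


def evaluate_config(config):
    # Illustrative objective: larger n_estimators/max_leaves would cost more to train.
    score = 1 / (1 + config["n_estimators"] * config["max_leaves"])
    return {"score": score}


analysis = tune.run(
    evaluate_config,
    config={
        "n_estimators": tune.lograndint(lower=4, upper=32768),
        "max_leaves": tune.lograndint(lower=4, upper=32768),
    },
    # Known cheap values of the cost-related hyperparameters
    low_cost_partial_config={"n_estimators": 4, "max_leaves": 4},
    metric="score",
    mode="min",
    num_samples=10,
)
print(analysis.best_config)
```
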
### How does FLAML handle missing values?

FLAML automatically preprocesses missing values in the input data through its `DataTransformer` class (for classification/regression tasks) and `DataTransformerTS` class (for time series tasks). The preprocessing behavior differs based on the column type:

**Automatic Missing Value Preprocessing:**

FLAML performs the following preprocessing automatically when you call `AutoML.fit()`:

1. **Numerical/Continuous Columns**: Missing values (NaN) in numerical columns are imputed using `sklearn.impute.SimpleImputer` with the **median strategy**. This preprocessing is applied in the `DataTransformer.fit_transform()` method (see `flaml/automl/data.py` lines 357-369 and `flaml/automl/time_series/ts_data.py` lines 429-440).

1. **Categorical Columns**: Missing values in categorical columns (object, category, or string dtypes) are filled with a special placeholder value `"__NAN__"`, which is treated as a distinct category.

**Example of automatic preprocessing:**

```python
from flaml import AutoML
import pandas as pd
import numpy as np

# Data with missing values
X_train = pd.DataFrame(
    {
        "num_feature": [1.0, 2.0, np.nan, 4.0, 5.0],
        "cat_feature": ["A", "B", None, "A", "B"],
    }
)
y_train = [0, 1, 0, 1, 0]

# FLAML automatically handles missing values
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)
# Numerical NaNs are imputed with median, categorical None becomes "__NAN__"
```

**Estimator-Specific Native Handling:**

After FLAML's preprocessing, some estimators have additional native missing value handling capabilities:

- **`lgbm`** (LightGBM): After preprocessing, can still handle any remaining NaN values natively by learning optimal split directions.
- **`xgboost`** (XGBoost): After preprocessing, can handle remaining NaN values by learning the best direction during training.
- **`xgb_limitdepth`** (XGBoost with depth limit): Same as `xgboost`.
- **`catboost`** (CatBoost): After preprocessing, has additional sophisticated missing value handling strategies. See [CatBoost documentation](https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing).
- **`histgb`** (HistGradientBoosting): After preprocessing, can still handle NaN values natively.

**Estimators that rely on preprocessing:**

These estimators rely on FLAML's automatic preprocessing since they cannot handle missing values directly:

- **`rf`** (RandomForest): Requires preprocessing (automatically done by FLAML).
- **`extra_tree`** (ExtraTrees): Requires preprocessing (automatically done by FLAML).
- **`lrl1`**, **`lrl2`** (LogisticRegression): Require preprocessing (automatically done by FLAML).
- **`kneighbor`** (KNeighbors): Requires preprocessing (automatically done by FLAML).
- **`sgd`** (SGDClassifier/Regressor): Requires preprocessing (automatically done by FLAML).

**Advanced: Customizing Missing Value Handling**

In most cases, FLAML's automatic preprocessing (median imputation for numerical columns, `"__NAN__"` for categorical columns) works well. However, if you need custom preprocessing:

1. **Skip automatic preprocessing** using the `skip_transform` parameter:

```python
from flaml import AutoML
from sklearn.impute import SimpleImputer

# X_train, X_test, y_train: your own data (e.g., as in the example above)
# Custom preprocessing with a different strategy
imputer = SimpleImputer(strategy="mean")  # Use mean instead of median
X_train_preprocessed = imputer.fit_transform(X_train)
X_test_preprocessed = imputer.transform(X_test)

# Skip FLAML's automatic preprocessing
automl = AutoML()
automl.fit(
    X_train_preprocessed,
    y_train,
    task="classification",
    time_budget=60,
    skip_transform=True,  # Skip automatic preprocessing
)
```

2. **Use sklearn Pipeline** for integrated custom preprocessing:

```python
from flaml import AutoML
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer

# Custom pipeline with KNN imputation
pipeline = Pipeline(
    [
        ("imputer", KNNImputer(n_neighbors=5)),  # Custom imputation strategy
        ("automl", AutoML()),
    ]
)

# AutoML settings are routed to the "automl" step via the step-name prefix
pipeline.fit(X_train, y_train, automl__task="classification", automl__time_budget=60)
```

**Note on time series forecasting**: For time series tasks (`ts_forecast`, `ts_forecast_panel`), the `DataTransformerTS` class applies the same preprocessing approach (median imputation for numerical columns, `"__NAN__"` for categorical columns). Handling missing values in the time dimension may require additional consideration depending on your specific forecasting model.

### How does FLAML handle imbalanced data (unequal distribution of target classes in classification task)?
Currently FLAML does several things for imbalanced data.

1. When a class contains fewer than 20 examples, we repeatedly add these examples to the training data until the count is at least 20.
1. We use stratified sampling when doing holdout and k-fold cross-validation (kf).
1. We make sure no class is empty in both training and holdout data.
1. We allow users to pass `sample_weight` to `AutoML.fit()` (see the sketch after the example below).
1. Users can customize the weight of each class by setting the `custom_hp` or `fit_kwargs_by_estimator` arguments. For example, the following code sets `scale_pos_weight` for the XGBoost estimator and balanced class weights for the RandomForest estimator:

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "estimator_list": ["rf", "xgboost"],
}

automl_settings["custom_hp"] = {
    "xgboost": {
        "scale_pos_weight": {
            "domain": 0.5,
            "init_value": 0.5,
        }
    },
    "rf": {"class_weight": {"domain": "balanced", "init_value": "balanced"}},
}
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
print(automl.model)
```
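
As a minimal sketch of item 4 above, per-example weights can be passed directly to `AutoML.fit()`; building the weight array with sklearn's `compute_sample_weight` is just one option, not something FLAML requires.

```python
from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.utils.class_weight import compute_sample_weight

X_train, y_train = load_iris(return_X_y=True)
# One per-sample weight per training example
weights = compute_sample_weight(class_weight="balanced", y=y_train)

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    time_budget=10,
    sample_weight=weights,
)
```
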
### How to interpret model performance? Is it possible for me to visualize feature importance, SHAP values, optimization history?
You can use `automl.model.estimator.feature_importances_` to get the `feature_importances_` for the best model found by automl. See an [example](Examples/AutoML-for-XGBoost#plot-feature-importance).

Packages such as `azureml-interpret` and `sklearn.inspection.permutation_importance` can be used on `automl.model.estimator` to explain the selected model.
Model explanation is a frequently asked topic, and adding native support for it may be a good feature. Suggestions/contributions are welcome.
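
As a minimal sketch (assuming a fitted `automl` object and held-out `X_test`, `y_test` that are in the same, already preprocessed feature space the estimator was trained on):

```python
from sklearn.inspection import permutation_importance

# Built-in importances of the best model (available for tree-based estimators)
print(automl.model.estimator.feature_importances_)

# Model-agnostic permutation importance of the underlying estimator
result = permutation_importance(
    automl.model.estimator, X_test, y_test, n_repeats=5, random_state=0
)
print(result.importances_mean)
```
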
Optimization history can be checked from the [log](Use-Cases/Task-Oriented-AutoML#log-the-trials). You can also [retrieve the log and plot the learning curve](Use-Cases/Task-Oriented-AutoML#plot-learning-curve).
### How to resolve out-of-memory error in `AutoML.fit()`

- Set `free_mem_ratio` to a float between 0 and 1. For example, 0.2 means trying to keep free memory above 20% of total memory. Training may be stopped early for memory-consumption reasons when this is set (a combined sketch of these settings follows this list).
- Set `model_history` to False.
- If your data are already preprocessed, set `skip_transform` to True. If you can preprocess the data before the fit starts, this setting can save the memory needed for preprocessing in `fit`.
- If the OOM error only happens for some particular trials:
  - set `use_ray` to True. This will increase the overhead per trial but can keep the AutoML process running when a single trial fails due to an OOM error.
  - provide a more accurate [`size`](reference/automl/model#size) function for the memory bytes consumption of each config for the estimator causing this error.
  - modify the [search space](Use-Cases/Task-Oriented-AutoML#a-shortcut-to-override-the-search-space) for the estimators causing this error.
  - or remove this estimator from the `estimator_list`.
- If the OOM error happens when ensembling, consider disabling ensemble, or use a cheaper ensemble option ([Example](Use-Cases/Task-Oriented-AutoML#ensemble)).
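
A combined sketch of the memory-related settings above (assuming `X_train` and `y_train` are defined; the values are illustrative):

```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    time_budget=600,
    free_mem_ratio=0.2,  # try to keep at least 20% of memory free
    model_history=False,  # do not keep every trained model in memory
    ensemble=False,  # disable ensembling if it triggers OOM
)
```
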
### How to get the best config of an estimator and use it to train the original model outside FLAML?

When you have finished training an AutoML estimator, you may want to use the best model in other code without depending on FLAML. The `automl.best_config` contains FLAML's search-space parameters, which may differ from the original model's parameters (e.g., FLAML uses `log_max_bin` for LightGBM instead of `max_bin`). You need to convert them using the `config2params()` method.

**Method 1: Using the trained model instance**

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
settings = {"time_budget": 3}
automl = AutoML(**settings)
automl.fit(X, y)

print(f"{automl.best_estimator=}")
print(f"{automl.best_config=}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'log_max_bin': 8, ...}

# Convert to original model parameters
best_params = automl.model.config2params(automl.best_config)
print(f"params for best estimator: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'max_bin': 255, ...}  # log_max_bin -> max_bin
```

**Method 2: Using FLAML estimator classes directly**

If the `automl` instance is not accessible and you only have the `best_config`, you can convert it with the code below:

```python
from flaml.automl.model import LGBMEstimator

best_config = {
    "n_estimators": 4,
    "num_leaves": 4,
    "min_child_samples": 20,
    "learning_rate": 0.1,
    "log_max_bin": 8,  # FLAML-specific parameter
    "colsample_bytree": 1.0,
    "reg_alpha": 0.0009765625,
    "reg_lambda": 1.0,
}

# Create a FLAML estimator - this automatically converts the parameters
flaml_estimator = LGBMEstimator(task="classification", **best_config)
best_params = flaml_estimator.params  # Converted params ready for the original model
print(f"Converted params: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'max_bin': 255, 'verbose': -1, ...}
```

**Method 3: Using task_factory (for any estimator type)**

```python
from flaml.automl.task.factory import task_factory

task = "classification"
best_estimator = "rf"
best_config = {
    "n_estimators": 15,
    "max_features": 0.35807183923834934,
    "max_leaves": 12,
    "criterion": "gini",
}

model_class = task_factory(task).estimator_class_from_str(best_estimator)(task=task)
best_params = model_class.config2params(best_config)
```

Then you can use the converted parameters to train the original sklearn/LightGBM/XGBoost estimator directly (here, LightGBM with parameters converted from an `lgbm` config, e.g., from Method 1 or 2):

```python
from lightgbm import LGBMClassifier

# Using LightGBM directly with the converted parameters
model = LGBMClassifier(**best_params)
model.fit(X, y)
```

**Using best_config_per_estimator for multiple estimators**

```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

automl = AutoML()
automl.fit(
    X, y, task="classification", time_budget=30, estimator_list=["lgbm", "xgboost"]
)

# Get configs for all estimators
configs = automl.best_config_per_estimator
# Example: {'lgbm': {'n_estimators': 4, 'log_max_bin': 8, ...},
#           'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}

# Convert and use the LightGBM config
if configs.get("lgbm"):
    lgbm_config = configs["lgbm"].copy()
    lgbm_config.pop("FLAML_sample_size", None)  # Remove FLAML internal param if present
    flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
    lgbm_model = LGBMClassifier(**flaml_lgbm.params)
    lgbm_model.fit(X, y)

# Convert and use the XGBoost config
if configs.get("xgboost"):
    xgb_config = configs["xgboost"].copy()
    xgb_config.pop("FLAML_sample_size", None)  # Remove FLAML internal param if present
    flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
    xgb_model = XGBClassifier(**flaml_xgb.params)
    xgb_model.fit(X, y)
```

### How to save and load an AutoML object? (`pickle` / `load_pickle`)
FLAML provides `AutoML.pickle()` / `AutoML.load_pickle()` as a convenient and robust way to persist an AutoML run.

```python
from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)

# Save
automl.pickle("automl.pkl")

# Load
automl_loaded = AutoML.load_pickle("automl.pkl")
pred = automl_loaded.predict(X_test)
```

Notes:

- If you used Spark estimators, `AutoML.pickle()` externalizes Spark ML models into an adjacent artifact folder and keeps the pickle itself lightweight.
- If you want to skip re-loading externalized Spark models (e.g., in an environment without Spark), use:

  ```python
  automl_loaded = AutoML.load_pickle("automl.pkl", load_spark_models=False)
  ```

### How to list all available estimators for a task?

The available estimator set is task-dependent and can vary with optional dependencies. You can list the estimator keys that FLAML currently has registered in your environment:

```python
from flaml.automl.task.factory import task_factory

print(sorted(task_factory("classification").estimators.keys()))
print(sorted(task_factory("regression").estimators.keys()))
print(sorted(task_factory("forecast").estimators.keys()))
print(sorted(task_factory("rank").estimators.keys()))
```

### How to list supported built-in metrics?

```python
from flaml import AutoML

automl = AutoML()
sklearn_metrics, hf_metrics, spark_metrics = automl.supported_metrics
print(sorted(sklearn_metrics))
print(sorted(hf_metrics))
print(spark_metrics)
```