mirror of https://github.com/microsoft/FLAML.git synced 2026-02-09 02:09:16 +08:00


Frequently Asked Questions

Guidelines on how to set a hyperparameter search space

Guidelines on parallel vs sequential tuning

Guidelines on creating and tuning a custom estimator

About `low_cost_partial_config` in `tune`.

Definition and purpose: low_cost_partial_config is a dictionary that assigns values to a subset of the hyperparameters, where the values correspond to a configuration known to be low-cost (i.e., cheap to train). The low/high-cost distinction is meaningful when a subset of the hyperparameters being tuned directly affects the computation cost of training the model. For example, n_estimators and max_leaves are known to affect the training cost of tree-based learners. We call this subset the cost-related hyperparameters. In such scenarios, if you know low-cost values for the cost-related hyperparameters, we recommend setting them as the low_cost_partial_config. Continuing the tree-based example, since small n_estimators and max_leaves generally correspond to simpler models and thus lower cost, we set {'n_estimators': 4, 'max_leaves': 4} as the low_cost_partial_config by default, e.g., in LGBM (note that 4 is the lower bound of the search space for these two hyperparameters). Configuring low_cost_partial_config helps the search algorithms make more cost-efficient choices. In AutoML, the low_cost_init_value in each estimator's search_space() function serves the same role.
Usage in practice: We recommend configuring it if your tuning task has cost-related hyperparameters and you know low-cost values for them, but it is not required; it is fine to leave it at the default value, None.
How it works: If configured, low_cost_partial_config is used as an initial point of the search. It also affects the search trajectory. For details on the role it plays in the search algorithms, please refer to the papers on the algorithms used: Section 2 of Frugal Optimization for Cost-related Hyperparameters (CFO) and Section 3 of Economical Hyperparameter Optimization with Blended Search Strategy (BlendSearch).
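
As a rough illustration of the definition above, the sketch below checks that a candidate low_cost_partial_config only names hyperparameters that exist in the search space and that each value lies within that hyperparameter's range. The helper check_low_cost_partial_config and the bounds dictionary are hypothetical and written only for this FAQ; they are not part of FLAML. Only the lower bound 4 comes from the text above; the other bounds are made up.

```python
def check_low_cost_partial_config(bounds, low_cost_partial_config):
    """Validate a partial config and return the hyperparameters it leaves unset.

    bounds: dict mapping hyperparameter name -> (lower, upper), inclusive.
    low_cost_partial_config: dict mapping a subset of those names to a value.
    """
    for name, value in low_cost_partial_config.items():
        if name not in bounds:
            raise KeyError(f"{name!r} is not in the search space")
        lower, upper = bounds[name]
        if not lower <= value <= upper:
            raise ValueError(f"{name}={value} is outside [{lower}, {upper}]")
    return sorted(set(bounds) - set(low_cost_partial_config))


# Hypothetical search space; only the lower bound 4 is taken from the text.
bounds = {
    "n_estimators": (4, 32768),
    "max_leaves": (4, 32768),
    "learning_rate": (0.01, 1.0),
}
low_cost_partial_config = {"n_estimators": 4, "max_leaves": 4}

unset = check_low_cost_partial_config(bounds, low_cost_partial_config)
print(unset)
```

Hyperparameters that are not cost-related, such as learning_rate here, are simply left out of the dictionary. The dictionary itself is what you would pass as the low_cost_partial_config argument of tune.run.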

How does FLAML handle imbalanced data (unequal distribution of target classes in classification task)?

Currently FLAML does several things for imbalanced data.

When a class contains fewer than 20 examples, we repeatedly add these examples to the training data until the count is at least 20.
We use stratified sampling when doing holdout and k-fold cross-validation.
We make sure no class is empty in both training and holdout data.
We allow users to pass sample_weight to AutoML.fit().
Users can customize the weight of each class by setting the custom_hp or fit_kwargs_by_estimator arguments. For example, the following code fixes scale_pos_weight at 0.5 for the XGBoost estimator and class_weight at "balanced" for the RandomForest estimator:

from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True)
automl = AutoML()
automl_settings = {
    "time_budget": 2,
    "task": "classification",
    "log_file_name": "test/iris.log",
    "estimator_list": ["rf", "xgboost"],
}

automl_settings["custom_hp"] = {
    "xgboost": {
        "scale_pos_weight": {
            "domain": 0.5,
            "init_value": 0.5,
        }
    },
    "rf": {"class_weight": {"domain": "balanced", "init_value": "balanced"}},
}
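# Run the search with these settings so that the custom_hp values take effect
automl.fit(X_train, y_train, **automl_settings)
print(automl.model)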
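
The first item in the list above (repeating the examples of a class that has fewer than 20 of them) can be sketched in plain Python as follows. This is a simplified illustration under the assumption that the whole group of rare examples is appended repeatedly; it is not FLAML's actual implementation, and pad_rare_classes is a hypothetical helper name.

```python
from collections import Counter


def pad_rare_classes(X, y, min_count=20):
    """Append copies of each rare class until it has at least min_count rows."""
    X_out, y_out = list(X), list(y)
    for label, count in Counter(y).items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        total = count
        while total < min_count:
            # Add the whole group of rare examples once more
            X_out.extend(rows)
            y_out.extend([label] * len(rows))
            total += len(rows)
    return X_out, y_out


# Toy data: 30 examples of one class and 3 of another
X = [[i] for i in range(33)]
y = ["common"] * 30 + ["rare"] * 3

X_padded, y_padded = pad_rare_classes(X, y)
padded_counts = Counter(y_padded)
print(padded_counts)
```

Because whole groups are appended, the final count of a rare class can end up slightly above the threshold rather than exactly at it.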

How to interpret model performance? Is it possible for me to visualize feature importance, SHAP values, optimization history?

You can use automl.model.estimator.feature_importances_ to get the feature_importances_ for the best model found by automl. See an example.

Packages such as azureml-interpret and sklearn.inspection.permutation_importance can be used on automl.model.estimator to explain the selected model. Model explanation is frequently requested, and adding native support may be a good feature. Suggestions/contributions are welcome.

Optimization history can be checked from the log. You can also retrieve the log and plot the learning curve.
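
A learning curve in this sense is the best validation loss found so far, plotted against elapsed time. The sketch below computes that curve from a list of per-trial records. The records are made-up placeholder values, not real FLAML output; in practice the times and losses would be read from the log file.

```python
from itertools import accumulate

# Hypothetical (wall_clock_seconds, validation_loss) pairs, one per trial
trials = [(0.4, 0.30), (0.9, 0.35), (1.5, 0.22), (2.1, 0.25), (2.8, 0.18)]

times = [t for t, _ in trials]
losses = [loss for _, loss in trials]

# Running minimum: the best validation loss seen up to each trial
best_so_far = list(accumulate(losses, min))

for t, best in zip(times, best_so_far):
    print(f"{t:4.1f}s  best validation loss so far: {best:.2f}")
```

The times and best_so_far lists can then be passed to any plotting library, for example as a step plot.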

How to resolve out-of-memory error in `AutoML.fit()`

Set free_mem_ratio to a float between 0 and 1. For example, 0.2 means FLAML tries to keep free memory above 20% of total memory. When this is set, training may be stopped early because of memory consumption.
Set model_history to False.
If your data are already preprocessed, set skip_transform to True. If you can preprocess the data before fit starts, this setting saves the memory that preprocessing inside fit would otherwise need.
If the OOM error only happens for some particular trials:
- set use_ray to True. This increases the overhead per trial but keeps the AutoML process running when a single trial fails due to an OOM error.
- provide a more accurate size function for the memory bytes consumption of each config for the estimator causing this error.
- modify the search space for the estimators causing this error.
- or remove this estimator from the estimator_list.
If the OOM error happens when ensembling, consider disabling ensemble or using a cheaper ensemble option. (Example).
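
Several of the options above can be collected into one dictionary of keyword arguments for AutoML.fit(). This is a minimal sketch: the candidate estimator list and the choice of "xgboost" as the estimator that runs out of memory are hypothetical, and use_ray=True assumes that ray is installed.

```python
# Keyword arguments for AutoML.fit(), following the list above
oom_settings = {
    "free_mem_ratio": 0.2,  # try to keep free memory above 20% of total memory
    "model_history": False,  # do not keep the model history
    "use_ray": True,  # a trial that fails with OOM does not stop the whole run
    "ensemble": False,  # skip the ensembling step
}

# Hypothetical case: "xgboost" is the estimator whose trials run out of memory
candidates = ["lgbm", "xgboost", "rf"]
oom_settings["estimator_list"] = [name for name in candidates if name != "xgboost"]

print(oom_settings)

# Usage, assuming automl, X_train and y_train already exist:
# automl.fit(X_train, y_train, task="classification", **oom_settings)
```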

How to get the best config of an estimator and use it to train the original model outside FLAML?

After you finish training an AutoML estimator, you may want to use it in other code without depending on FLAML. automl.best_config contains FLAML's search space parameters, which may differ from the original model's parameters (e.g., FLAML uses log_max_bin for LightGBM instead of max_bin). You need to convert them using the config2params() method.
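
To make the log_max_bin example concrete, here is a plain-Python sketch of that one conversion. It assumes the mapping max_bin = 2**log_max_bin - 1, which is consistent with the example values in this answer (log_max_bin 8 becomes max_bin 255). The real config2params() may convert other parameters as well, and convert_log_max_bin is a hypothetical helper, not a FLAML function.

```python
def convert_log_max_bin(config):
    """Return a copy of config with log_max_bin replaced by max_bin.

    Assumes max_bin = 2**log_max_bin - 1 (an assumption, see the text above).
    """
    params = dict(config)  # leave the caller's dictionary untouched
    if "log_max_bin" in params:
        params["max_bin"] = (1 << params.pop("log_max_bin")) - 1
    return params


best_config = {"n_estimators": 4, "num_leaves": 4, "log_max_bin": 8}
best_params = convert_log_max_bin(best_config)
print(best_params)
```

In real code, prefer config2params() as described in the methods that follow, since it is maintained together with the search space.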

Method 1: Using the trained model instance

from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
settings = {"time_budget": 3}
automl = AutoML(**settings)
automl.fit(X, y)

print(f"{automl.best_estimator=}")
print(f"{automl.best_config=}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'log_max_bin': 8, ...}

# Convert to original model parameters
best_params = automl.model.config2params(automl.best_config)
print(f"params for best estimator: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'max_bin': 255, ...}  # log_max_bin -> max_bin

Method 2: Using FLAML estimator classes directly

If the automl instance is not accessible and you only have the best_config, you can convert it with the code below:

from flaml.automl.model import LGBMEstimator

best_config = {
    "n_estimators": 4,
    "num_leaves": 4,
    "min_child_samples": 20,
    "learning_rate": 0.1,
    "log_max_bin": 8,  # FLAML-specific parameter
    "colsample_bytree": 1.0,
    "reg_alpha": 0.0009765625,
    "reg_lambda": 1.0,
}

# Create FLAML estimator - this automatically converts parameters
flaml_estimator = LGBMEstimator(task="classification", **best_config)
best_params = flaml_estimator.params  # Converted params ready for original model
print(f"Converted params: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
#           'learning_rate': 0.1, 'max_bin': 255, 'verbose': -1, ...}

Method 3: Using task_factory (for any estimator type)

from flaml.automl.task.factory import task_factory

task = "classification"
best_estimator = "rf"
best_config = {
    "n_estimators": 15,
    "max_features": 0.35807183923834934,
    "max_leaves": 12,
    "criterion": "gini",
}

model_class = task_factory(task).estimator_class_from_str(best_estimator)(task=task)
best_params = model_class.config2params(best_config)

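Then you can use the converted parameters to train the sklearn/lightgbm/xgboost estimators directly. For example, with parameters converted from a LightGBM config (Method 1 or 2):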

from lightgbm import LGBMClassifier

# Using LightGBM directly with converted parameters
model = LGBMClassifier(**best_params)
model.fit(X, y)

Using best_config_per_estimator for multiple estimators

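from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

# Load a small example dataset so the snippet runs on its own
X, y = load_iris(return_X_y=True)

automl = AutoML()
automl.fit(
    X, y, task="classification", time_budget=30, estimator_list=["lgbm", "xgboost"]
)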

# Get configs for all estimators
configs = automl.best_config_per_estimator
# Example: {'lgbm': {'n_estimators': 4, 'log_max_bin': 8, ...},
#           'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}

# Convert and use LightGBM config
if configs.get("lgbm"):
    lgbm_config = configs["lgbm"].copy()
    lgbm_config.pop("FLAML_sample_size", None)  # Remove FLAML internal param if present
    flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
    lgbm_model = LGBMClassifier(**flaml_lgbm.params)
    lgbm_model.fit(X, y)

# Convert and use XGBoost config
if configs.get("xgboost"):
    xgb_config = configs["xgboost"].copy()
    xgb_config.pop("FLAML_sample_size", None)  # Remove FLAML internal param if present
    flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
    xgb_model = XGBClassifier(**flaml_xgb.params)
    xgb_model.fit(X, y)

How to save and load an AutoML object? (`pickle` / `load_pickle`)

FLAML provides AutoML.pickle() / AutoML.load_pickle() as a convenient and robust way to persist an AutoML run.

from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)

# Save
automl.pickle("automl.pkl")

# Load
automl_loaded = AutoML.load_pickle("automl.pkl")
pred = automl_loaded.predict(X_test)

Notes:

If you used Spark estimators, AutoML.pickle() externalizes Spark ML models into an adjacent artifact folder and keeps the pickle itself lightweight.
If you want to skip re-loading externalized Spark models (e.g., in an environment without Spark), use:

automl_loaded = AutoML.load_pickle("automl.pkl", load_spark_models=False)

How to list all available estimators for a task?

The available estimator set is task-dependent and can vary with optional dependencies. You can list the estimator keys that FLAML currently has registered in your environment:

from flaml.automl.task.factory import task_factory

print(sorted(task_factory("classification").estimators.keys()))
print(sorted(task_factory("regression").estimators.keys()))
print(sorted(task_factory("forecast").estimators.keys()))
print(sorted(task_factory("rank").estimators.keys()))

How to list supported built-in metrics?

from flaml import AutoML

automl = AutoML()
sklearn_metrics, hf_metrics, spark_metrics = automl.supported_metrics
print(sorted(sklearn_metrics))
print(sorted(hf_metrics))
print(spark_metrics)
