Fix: Preserve FLAML_sample_size in best_config_per_estimator (#1475)

* Initial plan

* Fix: Preserve FLAML_sample_size in best_config_per_estimator

Modified the best_config_per_estimator property to keep FLAML_sample_size when returning best configurations. Previously, AutoMLState.sanitize() removed this key, so the sample size information was lost when configurations from a previous run were reused as starting_points.
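
A minimal sketch of the behavior this change enables (the dataset, time budgets, and example config values are illustrative, not taken from this PR):

```python
from sklearn.datasets import load_iris

from flaml import AutoML

X, y = load_iris(return_X_y=True)

# First run: search for good configurations.
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=10)

configs = automl.best_config_per_estimator
print(configs)
# When subsampling was used during the search, a config may now also carry the sample size, e.g.
# {'lgbm': {'n_estimators': 4, 'num_leaves': 4, ..., 'FLAML_sample_size': 45000}, ...}
# (previously AutoMLState.sanitize() dropped the FLAML_sample_size key)

# Second run: warm-start from the previous best configs, including their sample sizes.
new_automl = AutoML()
new_automl.fit(X, y, task="classification", time_budget=10, starting_points=configs)
```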

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add a test to verify the performance improvement from starting_points

* Update documentation to reflect FLAML_sample_size preservation

Updated Task-Oriented-AutoML.md to document that best_config_per_estimator now preserves FLAML_sample_size:
- Added a note in the "Warm start" section explaining that FLAML_sample_size is preserved for effective warm-starting
- Added a note in the "Get best configuration" section with an example showing FLAML_sample_size in the output
- Explained why preserving the sample size matters for continuing optimization with the correct sample sizes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix unintended code change

* Improve docstrings and docs

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
Author: Copilot
Date: 2026-01-20 07:42:31 +08:00
Committed by: GitHub
Parent: 67bdcde4d5
Commit: 5f1aa2dda8
4 changed files with 281 additions and 12 deletions


@@ -503,18 +503,135 @@ class AutoML(BaseEstimator):
@property
def best_config(self):
"""A dictionary of the best configuration."""
"""A dictionary of the best configuration.
The returned config dictionary can be used to:
1. Pass as `starting_points` to a new AutoML run.
2. Initialize the corresponding FLAML estimator directly.
3. Initialize the original model (e.g., LightGBM, XGBoost) after converting
FLAML-specific parameters.
Note:
The config contains FLAML's search space parameters, which may differ from
the original model's parameters. For example, FLAML uses `log_max_bin` for
LightGBM instead of `max_bin`. Use the FLAML estimator's `config2params()`
method to convert to the original model's parameters.
Example:
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator
from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Train with AutoML
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=10)
# Get the best config
best_config = automl.best_config
print("Best config:", best_config)
# Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'log_max_bin': 8, ...}
# Option 1: Use FLAML estimator directly (handles parameter conversion internally)
flaml_estimator = LGBMEstimator(task="classification", **best_config)
flaml_estimator.fit(X, y)
# Option 2: Convert to original model parameters using config2params()
# This converts FLAML-specific params (e.g., log_max_bin -> max_bin)
original_params = flaml_estimator.params # or use flaml_estimator.config2params(best_config)
print("Original model params:", original_params)
# Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin converted to max_bin
# Now use with original LightGBM
lgbm_model = LGBMClassifier(**original_params)
lgbm_model.fit(X, y)
```
"""
state = self._search_states.get(self._best_estimator)
config = state and getattr(state, "best_config", None)
return config and AutoMLState.sanitize(config)
@property
def best_config_per_estimator(self):
"""A dictionary of all estimators' best configuration."""
return {
e: e_search_state.best_config and AutoMLState.sanitize(e_search_state.best_config)
for e, e_search_state in self._search_states.items()
}
"""A dictionary of all estimators' best configuration.
Returns a dictionary where keys are estimator names (e.g., 'lgbm', 'xgboost')
and values are the best hyperparameter configurations found for each estimator.
The config may include `FLAML_sample_size` which indicates the sample size used
during training.
This is useful for:
1. Passing as `starting_points` to a new AutoML run for warm-starting.
2. Comparing the best configurations across different estimators.
3. Initializing the original models after converting FLAML-specific parameters.
Note:
The configs contain FLAML's search space parameters, which may differ from
the original models' parameters. Use each estimator's `config2params()` method
to convert to the original model's parameters.
Example:
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Train with AutoML
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=30,
estimator_list=['lgbm', 'xgboost'])
# Get best configs for all estimators
configs = automl.best_config_per_estimator
print(configs)
# Example output: {'lgbm': {'n_estimators': 4, 'num_leaves': 4, 'log_max_bin': 8, ...},
# 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}
# Use as starting points for a new AutoML run (warm start)
new_automl = AutoML()
new_automl.fit(X, y, task="classification", time_budget=30,
starting_points=configs)
# Or convert to original model parameters for direct use
if configs.get('lgbm'):
lgbm_config = configs['lgbm'].copy()
lgbm_config.pop('FLAML_sample_size', None) # Remove FLAML internal param
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
original_lgbm_params = flaml_lgbm.params # Converted params (log_max_bin -> max_bin), or use flaml_lgbm.config2params(lgbm_config)
lgbm_model = LGBMClassifier(**original_lgbm_params)
lgbm_model.fit(X, y)
if configs.get('xgboost'):
xgb_config = configs['xgboost'].copy()
xgb_config.pop('FLAML_sample_size', None) # Remove FLAML internal param
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
original_xgb_params = flaml_xgb.params # Converted params
xgb_model = XGBClassifier(**original_xgb_params)
xgb_model.fit(X, y)
```
"""
result = {}
for e, e_search_state in self._search_states.items():
if e_search_state.best_config:
config = e_search_state.best_config.get("ml", e_search_state.best_config).copy()
# Remove internal keys that are not needed for starting_points, but keep FLAML_sample_size
config.pop("learner", None)
config.pop("_choice_", None)
result[e] = config
else:
result[e] = None
return result
@property
def best_loss_per_estimator(self):


@@ -535,6 +535,32 @@ class TestMultiClass(unittest.TestCase):
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))
def test_starting_points_should_improve_performance(self):
N = 10000 # a large N is needed to see the improvement
X_train, y_train = load_iris(return_X_y=True)
X_train = np.concatenate([X_train + 0.1 * i for i in range(N)], axis=0)
y_train = np.concatenate([y_train] * N, axis=0)
am1 = AutoML()
am1.fit(X_train, y_train, estimator_list=["lgbm"], time_budget=3, seed=11)
am2 = AutoML()
am2.fit(
X_train,
y_train,
estimator_list=["lgbm"],
time_budget=2,
seed=11,
starting_points=am1.best_config_per_estimator,
)
print(f"am1.best_loss: {am1.best_loss:.4f}")
print(f"am2.best_loss: {am2.best_loss:.4f}")
assert np.round(am2.best_loss, 4) <= np.round(
am1.best_loss, 4
), "Starting points should help improve the performance!"
if __name__ == "__main__":
unittest.main()


@@ -73,7 +73,9 @@ Optimization history can be checked from the [log](Use-Cases/Task-Oriented-AutoM
### How to get the best config of an estimator and use it to train the original model outside FLAML?
When you finished training an AutoML estimator, you may want to use it in other code w/o depending on FLAML. You can get the `automl.best_config` and convert it to the parameters of the original model with below code:
When you have finished training an AutoML estimator, you may want to use it in other code without depending on FLAML. `automl.best_config` contains FLAML's search space parameters, which may differ from the original model's parameters (e.g., FLAML uses `log_max_bin` for LightGBM instead of `max_bin`). You need to convert them using the `config2params()` method.
**Method 1: Using the trained model instance**
```python
from flaml import AutoML
@@ -86,10 +88,43 @@ automl.fit(X, y)
print(f"{automl.best_estimator=}")
print(f"{automl.best_config=}")
print(f"params for best estimator: {automl.model.config2params(automl.best_config)}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'log_max_bin': 8, ...}
# Convert to original model parameters
best_params = automl.model.config2params(automl.best_config)
print(f"params for best estimator: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin -> max_bin
```
If the automl instance is not accessible and you've the `best_config`. You can also convert it with below code:
**Method 2: Using FLAML estimator classes directly**
If the automl instance is not accessible and you only have the `best_config`, you can convert it with the code below:
```python
from flaml.automl.model import LGBMEstimator
best_config = {
"n_estimators": 4,
"num_leaves": 4,
"min_child_samples": 20,
"learning_rate": 0.1,
"log_max_bin": 8, # FLAML-specific parameter
"colsample_bytree": 1.0,
"reg_alpha": 0.0009765625,
"reg_lambda": 1.0,
}
# Create FLAML estimator - this automatically converts parameters
flaml_estimator = LGBMEstimator(task="classification", **best_config)
best_params = flaml_estimator.params # Converted params ready for original model
print(f"Converted params: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, 'verbose': -1, ...}
```
**Method 3: Using task_factory (for any estimator type)**
```python
from flaml.automl.task.factory import task_factory
@@ -107,15 +142,51 @@ model_class = task_factory(task).estimator_class_from_str(best_estimator)(task=t
best_params = model_class.config2params(best_config)
```
Then you can use it to train the sklearn estimators directly:
Then you can use it to train the sklearn/lightgbm/xgboost estimators directly:
```python
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
model = RandomForestClassifier(**best_params)
# Using LightGBM directly with converted parameters
model = LGBMClassifier(**best_params)
model.fit(X, y)
```
**Using best_config_per_estimator for multiple estimators**
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
automl = AutoML()
automl.fit(
X, y, task="classification", time_budget=30, estimator_list=["lgbm", "xgboost"]
)
# Get configs for all estimators
configs = automl.best_config_per_estimator
# Example: {'lgbm': {'n_estimators': 4, 'log_max_bin': 8, ...},
# 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}
# Convert and use LightGBM config
if configs.get("lgbm"):
lgbm_config = configs["lgbm"].copy()
lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
lgbm_model = LGBMClassifier(**flaml_lgbm.params)
lgbm_model.fit(X, y)
# Convert and use XGBoost config
if configs.get("xgboost"):
xgb_config = configs["xgboost"].copy()
xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
xgb_model = XGBClassifier(**flaml_xgb.params)
xgb_model.fit(X, y)
```
### How to save and load an AutoML object? (`pickle` / `load_pickle`)
FLAML provides `AutoML.pickle()` / `AutoML.load_pickle()` as a convenient and robust way to persist an AutoML run.


@@ -552,6 +552,8 @@ automl2.fit(
`starting_points` is a dictionary or a str to specify the starting hyperparameter config. (1) When it is a dictionary, the keys are the estimator names. If you do not need to specify starting points for an estimator, exclude its name from the dictionary. The value for each key can be either a dictionary or a list of dictionaries, corresponding to one hyperparameter configuration, or multiple hyperparameter configurations, respectively. (2) When it is a str: if "data", use data-dependent defaults; if "data:path", use data-dependent defaults which are stored at path; if "static", use data-independent defaults. Please find more details about data-dependent defaults in [zero shot AutoML](Zero-Shot-AutoML#combine-zero-shot-automl-and-hyperparameter-tuning).
**Note on sample size preservation**: When using `best_config_per_estimator` as starting points, the configurations now preserve `FLAML_sample_size` (if subsampling was used during the search). This ensures that the warm-started run continues optimization with the same sample sizes that produced the best results in the previous run, leading to more effective warm-starting.
### Log the trials
The trials are logged in a file if a `log_file_name` is passed.
@@ -664,6 +666,25 @@ print(automl.best_config)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}
```
**Note**: The config contains FLAML's search space parameters, which may differ from the original model's parameters. For example, FLAML uses `log_max_bin` for LightGBM instead of `max_bin`. To convert to the original model's parameters, use the `config2params()` method:
```python
from flaml.automl.model import LGBMEstimator
# Convert FLAML config to original model parameters
flaml_estimator = LGBMEstimator(task="classification", **automl.best_config)
original_params = flaml_estimator.params
print(original_params)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'max_bin': 255, ...}
# Note: 'log_max_bin': 8 is converted to 'max_bin': 255 (2^8 - 1)
# Now you can use original LightGBM directly
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(**original_params)
lgbm_model.fit(X_train, y_train)
```
We can also find the best configuration per estimator.
```python
@@ -673,6 +694,40 @@ print(automl.best_config_per_estimator)
The `None` value corresponds to the estimators which have not been tried.
**Converting configs for all estimators to original model parameters:**
```python
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
configs = automl.best_config_per_estimator
# Convert and use LightGBM config
if configs.get("lgbm"):
lgbm_config = configs["lgbm"].copy()
lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
lgbm_model = LGBMClassifier(**flaml_lgbm.params)
lgbm_model.fit(X_train, y_train)
# Convert and use XGBoost config
if configs.get("xgboost"):
xgb_config = configs["xgboost"].copy()
xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
xgb_model = XGBClassifier(**flaml_xgb.params)
xgb_model.fit(X_train, y_train)
```
**Note**: When subsampling is used during the search (e.g., with large datasets), the configurations may also include `FLAML_sample_size` to indicate the sample size used. For example:
```python
# {'lgbm': {'n_estimators': 729, 'num_leaves': 21, ..., 'FLAML_sample_size': 45000}, ...}
```
This information is preserved in `best_config_per_estimator` and is important for warm-starting subsequent runs with the correct sample sizes.
Other useful information:
```python