Fix: Preserve FLAML_sample_size in best_config_per_estimator (#1475)

* Initial plan

* Fix: Preserve FLAML_sample_size in best_config_per_estimator

Modified the best_config_per_estimator property to keep FLAML_sample_size when returning best configurations. Previously, AutoMLState.sanitize() removed this key, so the sample size information was lost when configurations from a previous run were reused as starting_points.
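
A minimal sketch of the behavior this change enables (the dataset, time budgets, and example config values are illustrative, not taken from this PR):

```python
from sklearn.datasets import load_iris

from flaml import AutoML

X, y = load_iris(return_X_y=True)

# First run: search for good configurations.
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=10)

configs = automl.best_config_per_estimator
print(configs)
# When subsampling was used during the search, a config may now also carry the sample size, e.g.
# {'lgbm': {'n_estimators': 4, 'num_leaves': 4, ..., 'FLAML_sample_size': 45000}, ...}
# (previously AutoMLState.sanitize() dropped the FLAML_sample_size key)

# Second run: warm-start from the previous best configs, including their sample sizes.
new_automl = AutoML()
new_automl.fit(X, y, task="classification", time_budget=10, starting_points=configs)
```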

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add a test to verify the performance improvement from starting_points

* Update documentation to reflect FLAML_sample_size preservation

Updated Task-Oriented-AutoML.md to document that best_config_per_estimator now preserves FLAML_sample_size:
- Added a note in the "Warm start" section explaining that FLAML_sample_size is preserved for effective warm-starting
- Added a note in the "Get best configuration" section with an example showing FLAML_sample_size in the output
- Explained why preserving the sample size matters for continuing optimization with the correct sample sizes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix unintended code change

* Improve docstrings and docs

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
Author: Copilot
Date: 2026-01-20 07:42:31 +08:00
Committed by: GitHub
Parent: 67bdcde4d5
Commit: 5f1aa2dda8
4 changed files with 281 additions and 12 deletions


@@ -503,18 +503,135 @@ class AutoML(BaseEstimator):
@property
def best_config(self):
"""A dictionary of the best configuration."""
"""A dictionary of the best configuration.
The returned config dictionary can be used to:
1. Pass as `starting_points` to a new AutoML run.
2. Initialize the corresponding FLAML estimator directly.
3. Initialize the original model (e.g., LightGBM, XGBoost) after converting
FLAML-specific parameters.
Note:
The config contains FLAML's search space parameters, which may differ from
the original model's parameters. For example, FLAML uses `log_max_bin` for
LightGBM instead of `max_bin`. Use the FLAML estimator's `config2params()`
method to convert to the original model's parameters.
Example:
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator
from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Train with AutoML
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=10)
# Get the best config
best_config = automl.best_config
print("Best config:", best_config)
# Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'log_max_bin': 8, ...}
# Option 1: Use FLAML estimator directly (handles parameter conversion internally)
flaml_estimator = LGBMEstimator(task="classification", **best_config)
flaml_estimator.fit(X, y)
# Option 2: Convert to original model parameters using config2params()
# This converts FLAML-specific params (e.g., log_max_bin -> max_bin)
original_params = flaml_estimator.params # or use flaml_estimator.config2params(best_config)
print("Original model params:", original_params)
# Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin converted to max_bin
# Now use with original LightGBM
lgbm_model = LGBMClassifier(**original_params)
lgbm_model.fit(X, y)
```
"""
state = self._search_states.get(self._best_estimator)
config = state and getattr(state, "best_config", None)
return config and AutoMLState.sanitize(config)
@property
def best_config_per_estimator(self):
"""A dictionary of all estimators' best configuration."""
return {
e: e_search_state.best_config and AutoMLState.sanitize(e_search_state.best_config)
for e, e_search_state in self._search_states.items()
}
"""A dictionary of all estimators' best configuration.
Returns a dictionary where keys are estimator names (e.g., 'lgbm', 'xgboost')
and values are the best hyperparameter configurations found for each estimator.
The config may include `FLAML_sample_size` which indicates the sample size used
during training.
This is useful for:
1. Passing as `starting_points` to a new AutoML run for warm-starting.
2. Comparing the best configurations across different estimators.
3. Initializing the original models after converting FLAML-specific parameters.
Note:
The configs contain FLAML's search space parameters, which may differ from
the original models' parameters. Use each estimator's `config2params()` method
to convert to the original model's parameters.
Example:
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Train with AutoML
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=30,
estimator_list=['lgbm', 'xgboost'])
# Get best configs for all estimators
configs = automl.best_config_per_estimator
print(configs)
# Example output: {'lgbm': {'n_estimators': 4, 'num_leaves': 4, 'log_max_bin': 8, ...},
# 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}
# Use as starting points for a new AutoML run (warm start)
new_automl = AutoML()
new_automl.fit(X, y, task="classification", time_budget=30,
starting_points=configs)
# Or convert to original model parameters for direct use
if configs.get('lgbm'):
lgbm_config = configs['lgbm'].copy()
lgbm_config.pop('FLAML_sample_size', None) # Remove FLAML internal param
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
original_lgbm_params = flaml_lgbm.params # Converted params (log_max_bin -> max_bin), or use flaml_lgbm.config2params(lgbm_config)
lgbm_model = LGBMClassifier(**original_lgbm_params)
lgbm_model.fit(X, y)
if configs.get('xgboost'):
xgb_config = configs['xgboost'].copy()
xgb_config.pop('FLAML_sample_size', None) # Remove FLAML internal param
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
original_xgb_params = flaml_xgb.params # Converted params
xgb_model = XGBClassifier(**original_xgb_params)
xgb_model.fit(X, y)
```
"""
result = {}
for e, e_search_state in self._search_states.items():
if e_search_state.best_config:
config = e_search_state.best_config.get("ml", e_search_state.best_config).copy()
# Remove internal keys that are not needed for starting_points, but keep FLAML_sample_size
config.pop("learner", None)
config.pop("_choice_", None)
result[e] = config
else:
result[e] = None
return result
@property
def best_loss_per_estimator(self):


@@ -535,6 +535,32 @@ class TestMultiClass(unittest.TestCase):
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))
def test_starting_points_should_improve_performance(self):
N = 10000 # a large N is needed to see the improvement
X_train, y_train = load_iris(return_X_y=True)
X_train = np.concatenate([X_train + 0.1 * i for i in range(N)], axis=0)
y_train = np.concatenate([y_train] * N, axis=0)
am1 = AutoML()
am1.fit(X_train, y_train, estimator_list=["lgbm"], time_budget=3, seed=11)
am2 = AutoML()
am2.fit(
X_train,
y_train,
estimator_list=["lgbm"],
time_budget=2,
seed=11,
starting_points=am1.best_config_per_estimator,
)
print(f"am1.best_loss: {am1.best_loss:.4f}")
print(f"am2.best_loss: {am2.best_loss:.4f}")
assert np.round(am2.best_loss, 4) <= np.round(
am1.best_loss, 4
), "Starting points should help improve the performance!"
if __name__ == "__main__":
unittest.main()


@@ -73,7 +73,9 @@ Optimization history can be checked from the [log](Use-Cases/Task-Oriented-AutoM
### How to get the best config of an estimator and use it to train the original model outside FLAML?
When you finished training an AutoML estimator, you may want to use it in other code w/o depending on FLAML. You can get the `automl.best_config` and convert it to the parameters of the original model with below code:
When you have finished training an AutoML estimator, you may want to use it in other code without depending on FLAML. `automl.best_config` contains FLAML's search space parameters, which may differ from the original model's parameters (e.g., FLAML uses `log_max_bin` for LightGBM instead of `max_bin`). You need to convert them using the `config2params()` method.
**Method 1: Using the trained model instance**
```python
from flaml import AutoML
@@ -86,10 +88,43 @@ automl.fit(X, y)
print(f"{automl.best_estimator=}")
print(f"{automl.best_config=}")
print(f"params for best estimator: {automl.model.config2params(automl.best_config)}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'log_max_bin': 8, ...}
# Convert to original model parameters
best_params = automl.model.config2params(automl.best_config)
print(f"params for best estimator: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin -> max_bin
```
If the automl instance is not accessible and you've the `best_config`. You can also convert it with below code:
**Method 2: Using FLAML estimator classes directly**
If the automl instance is not accessible and you only have the `best_config`, you can convert it with the code below:
```python
from flaml.automl.model import LGBMEstimator
best_config = {
"n_estimators": 4,
"num_leaves": 4,
"min_child_samples": 20,
"learning_rate": 0.1,
"log_max_bin": 8, # FLAML-specific parameter
"colsample_bytree": 1.0,
"reg_alpha": 0.0009765625,
"reg_lambda": 1.0,
}
# Create FLAML estimator - this automatically converts parameters
flaml_estimator = LGBMEstimator(task="classification", **best_config)
best_params = flaml_estimator.params # Converted params ready for original model
print(f"Converted params: {best_params}")
# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20,
# 'learning_rate': 0.1, 'max_bin': 255, 'verbose': -1, ...}
```
**Method 3: Using task_factory (for any estimator type)**
```python
from flaml.automl.task.factory import task_factory
@@ -107,15 +142,51 @@ model_class = task_factory(task).estimator_class_from_str(best_estimator)(task=t
best_params = model_class.config2params(best_config)
```
Then you can use it to train the sklearn estimators directly:
Then you can use it to train the sklearn/lightgbm/xgboost estimators directly:
```python
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
model = RandomForestClassifier(**best_params)
# Using LightGBM directly with converted parameters
model = LGBMClassifier(**best_params)
model.fit(X, y)
```
**Using best_config_per_estimator for multiple estimators**
```python
from flaml import AutoML
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
automl = AutoML()
automl.fit(
X, y, task="classification", time_budget=30, estimator_list=["lgbm", "xgboost"]
)
# Get configs for all estimators
configs = automl.best_config_per_estimator
# Example: {'lgbm': {'n_estimators': 4, 'log_max_bin': 8, ...},
# 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}}
# Convert and use LightGBM config
if configs.get("lgbm"):
lgbm_config = configs["lgbm"].copy()
lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
lgbm_model = LGBMClassifier(**flaml_lgbm.params)
lgbm_model.fit(X, y)
# Convert and use XGBoost config
if configs.get("xgboost"):
xgb_config = configs["xgboost"].copy()
xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
xgb_model = XGBClassifier(**flaml_xgb.params)
xgb_model.fit(X, y)
```
### How to save and load an AutoML object? (`pickle` / `load_pickle`)
FLAML provides `AutoML.pickle()` / `AutoML.load_pickle()` as a convenient and robust way to persist an AutoML run.


@@ -552,6 +552,8 @@ automl2.fit(
`starting_points` is a dictionary or a str to specify the starting hyperparameter config. (1) When it is a dictionary, the keys are the estimator names. If you do not need to specify starting points for an estimator, exclude its name from the dictionary. The value for each key can be either a dictionary or a list of dictionaries, corresponding to one hyperparameter configuration, or multiple hyperparameter configurations, respectively. (2) When it is a str: if "data", use data-dependent defaults; if "data:path", use data-dependent defaults which are stored at path; if "static", use data-independent defaults. Please find more details about data-dependent defaults in [zero shot AutoML](Zero-Shot-AutoML#combine-zero-shot-automl-and-hyperparameter-tuning).
**Note on sample size preservation**: When using `best_config_per_estimator` as starting points, the configurations now preserve `FLAML_sample_size` (if subsampling was used during the search). This ensures that the warm-started run continues optimization with the same sample sizes that produced the best results in the previous run, leading to more effective warm-starting.
### Log the trials
The trials are logged in a file if a `log_file_name` is passed.
@@ -664,6 +666,25 @@ print(automl.best_config)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}
```
**Note**: The config contains FLAML's search space parameters, which may differ from the original model's parameters. For example, FLAML uses `log_max_bin` for LightGBM instead of `max_bin`. To convert to the original model's parameters, use the `config2params()` method:
```python
from flaml.automl.model import LGBMEstimator
# Convert FLAML config to original model parameters
flaml_estimator = LGBMEstimator(task="classification", **automl.best_config)
original_params = flaml_estimator.params
print(original_params)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'max_bin': 255, ...}
# Note: 'log_max_bin': 8 is converted to 'max_bin': 255 (2^8 - 1)
# Now you can use original LightGBM directly
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(**original_params)
lgbm_model.fit(X_train, y_train)
```
We can also find the best configuration per estimator.
```python
@@ -673,6 +694,40 @@ print(automl.best_config_per_estimator)
The `None` value corresponds to the estimators which have not been tried.
**Converting configs for all estimators to original model parameters:**
```python
from flaml.automl.model import LGBMEstimator, XGBoostEstimator
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
configs = automl.best_config_per_estimator
# Convert and use LightGBM config
if configs.get("lgbm"):
lgbm_config = configs["lgbm"].copy()
lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config)
lgbm_model = LGBMClassifier(**flaml_lgbm.params)
lgbm_model.fit(X_train, y_train)
# Convert and use XGBoost config
if configs.get("xgboost"):
xgb_config = configs["xgboost"].copy()
xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present
flaml_xgb = XGBoostEstimator(task="classification", **xgb_config)
xgb_model = XGBClassifier(**flaml_xgb.params)
xgb_model.fit(X_train, y_train)
```
**Note**: When subsampling is used during the search (e.g., with large datasets), the configurations may also include `FLAML_sample_size` to indicate the sample size used. For example:
```python
# {'lgbm': {'n_estimators': 729, 'num_leaves': 21, ..., 'FLAML_sample_size': 45000}, ...}
```
This information is preserved in `best_config_per_estimator` and is important for warm-starting subsequent runs with the correct sample sizes.
Other useful information:
```python