# Best Practices
This page collects practical guidance for using FLAML effectively across common tasks.
## General tips
- Start simple: set `task`, `time_budget`, and keep `metric="auto"` unless you have a strong reason to override.
- Prefer correct splits: ensure your evaluation strategy matches your data (time series vs i.i.d., grouped data, etc.).
- Keep estimator lists explicit when debugging: start with a small `estimator_list` and expand.
- Use built-in discovery helpers to avoid stale hardcoded lists:
```python
from flaml import AutoML
from flaml.automl.task.factory import task_factory
automl = AutoML()
print("Built-in sklearn metrics:", sorted(automl.supported_metrics[0]))
print(
    "classification estimators:",
    sorted(task_factory("classification").estimators.keys()),
)
```
## Classification
- **Metric**: for binary classification, `metric="roc_auc"` is common; for multiclass, `metric="log_loss"` is often robust.
- **Imbalanced data**:
  - pass `sample_weight` to `AutoML.fit()` (see the sketch at the end of this section);
  - consider setting class weights via `custom_hp` / `fit_kwargs_by_estimator` for specific estimators (see [FAQ](FAQ)).
- **Probability vs label metrics**: use `roc_auc` / `log_loss` when you care about calibrated probabilities.
- **Label overlap control** (holdout evaluation only):
  - By default, FLAML uses a fast strategy (`allow_label_overlap=True`) that ensures every label appears in both the training and validation sets by adding the first instance of each missing label to the set that lacks it. This is efficient but may create minor overlap between the two sets.
  - For strict no-overlap validation, use `allow_label_overlap=False`. This slower but more precise strategy re-splits multi-instance classes between the two sets to avoid overlap while maintaining label completeness.
```python
from flaml import AutoML

# Fast version (default): allows overlap for efficiency
automl_fast = AutoML()
automl_fast.fit(
    X_train,
    y_train,
    task="classification",
    eval_method="holdout",
    allow_label_overlap=True,  # default
)

# Precise version: avoids overlap when possible
automl_precise = AutoML()
automl_precise.fit(
    X_train,
    y_train,
    task="classification",
    eval_method="holdout",
    allow_label_overlap=False,  # slower but more precise
)
```
Note: This only affects holdout evaluation. CV and custom validation sets are unaffected.
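As referenced in the imbalanced-data tips above, here is a minimal sketch of passing per-sample weights to `AutoML.fit()`; `X_train` / `y_train` are assumed to already exist, and `compute_sample_weight` from scikit-learn is just one convenient way to derive balanced weights:
```python
from sklearn.utils.class_weight import compute_sample_weight
from flaml import AutoML

# Upweight minority-class rows so the tuning metric reflects class balance.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    metric="roc_auc",
    sample_weight=weights,
    time_budget=60,
)
```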
## Regression
- **Default metric**: `metric="r2"` (minimizes `1 - r2`).
- If your target scale matters (e.g., dollar error), consider `mae`/`rmse`.
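For instance, a minimal sketch that switches the tuning metric to mean absolute error (assuming `X_train` / `y_train` are already defined):
```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="regression",
    metric="mae",  # optimize absolute error instead of the default r2
    time_budget=60,
)
```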
## Learning to rank
- Use `task="rank"` with group information (`groups` / `groups_val`) so metrics like `ndcg` and `ndcg@k` are meaningful.
- If you pass `metric="ndcg@10"`, also pass `groups` so FLAML can compute group-aware NDCG.
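A minimal sketch of a ranking run with group information; `groups` is assumed to be an array aligned with the training rows that identifies (or counts) the query each row belongs to:
```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="rank",
    groups=groups,     # query group labels or counts for the training rows
    metric="ndcg@10",  # group-aware NDCG at cutoff 10
    time_budget=60,
)
```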
## Time series forecasting
- Use time-aware splitting. For holdout validation, set `eval_method="holdout"` and use a time-ordered dataset.
- Prefer supplying a DataFrame with a clear time column when possible.
- Optional time-series estimators depend on optional dependencies. To list what is available in your environment:
```python
from flaml.automl.task.factory import task_factory
print("forecast:", sorted(task_factory("forecast").estimators.keys()))
```
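Complementing the tips above, a sketch of a holdout forecasting run; the task name `ts_forecast`, the `period` horizon argument, and the `train_df` / `"y"` names follow common FLAML forecasting examples and are illustrative here:
```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    dataframe=train_df,  # time-ordered DataFrame with a time column and the target "y"
    label="y",
    task="ts_forecast",
    period=12,           # forecast horizon
    eval_method="holdout",
    time_budget=60,
)
```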
## NLP (Transformers)
- Install the optional dependency: `pip install "flaml[hf]"`.
- When you provide a custom metric, ensure it returns `(metric_to_minimize, metrics_to_log)` with stable keys.
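A minimal sketch of that return contract, written with a plain scikit-learn loss for brevity; the leading argument names follow the custom-metric pattern documented in [Task-Oriented AutoML](Use-Cases/Task-Oriented-AutoML), and any extra arguments are absorbed via `*args` / `**kwargs`:
```python
from sklearn.metrics import log_loss


def custom_metric(X_val, y_val, estimator, labels, X_train, y_train, *args, **kwargs):
    # First element: the scalar FLAML minimizes.
    # Second element: a dict of extra metrics to log, with stable keys.
    val_loss = log_loss(y_val, estimator.predict_proba(X_val), labels=labels)
    return val_loss, {"val_log_loss": val_loss}
```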
## Speed, stability, and tricky settings
- **Time budget vs convergence**: if you see warnings about not all estimators converging, increase `time_budget` or reduce `estimator_list`.
- **Memory pressure / OOM**:
  - set `free_mem_ratio` (e.g., `0.2`) to keep free memory above a threshold;
  - set `model_history=False` to reduce stored artifacts.
- **Reproducibility**: set `seed` and keep `n_jobs` fixed; expect some runtime variance.
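Putting these knobs together, a hedged sketch of a resource-conscious, reproducible run (the parameter values are illustrative, and `X_train` / `y_train` are assumed to exist):
```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    time_budget=120,
    estimator_list=["lgbm", "xgboost"],  # small, explicit list while debugging
    free_mem_ratio=0.2,                  # keep ~20% of memory free
    model_history=False,                 # don't retain every trained model
    seed=42,
    n_jobs=4,
)
```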
## Persisting models
FLAML supports **both** MLflow logging and pickle-based persistence. For production deployment, MLflow logging is typically the most important option because it plugs into the MLflow ecosystem (tracking, model registry, serving, governance). For quick local reuse, persisting the whole `AutoML` object via pickle is often the most convenient.
### Option 1: MLflow logging (recommended for production)
When you run `AutoML.fit()` inside an MLflow run, FLAML can log metrics/params automatically (disable via `mlflow_logging=False` if needed). To persist the trained `AutoML` object as a model artifact and reuse MLflow tooling end-to-end:
```python
import mlflow
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
automl = AutoML()
mlflow.set_experiment("flaml")
with mlflow.start_run(run_name="flaml_run") as run:
    automl.fit(X_train, y_train, task="classification", time_budget=3)
    run_id = run.info.run_id
# Later (or in a different process)
automl2 = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
assert np.array_equal(automl2.predict(X_test), automl.predict(X_test))
```
### Option 2: Pickle the full `AutoML` instance (convenient)
Pickling stores the *entire* `AutoML` instance (not just the best estimator). This is useful when you prefer not to rely on MLflow or when you want to reuse additional attributes of the AutoML object without retraining.
In Microsoft Fabric scenarios, retaining these additional attributes is particularly important for re-plotting visualization figures without retraining the model.
```python
import mlflow
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
automl = AutoML()
mlflow.set_experiment("flaml")
with mlflow.start_run(run_name="flaml_run") as run:
    automl.fit(X_train, y_train, task="classification", time_budget=3)
automl.pickle("automl.pkl")
automl2 = AutoML.load_pickle("automl.pkl")
assert np.array_equal(automl2.predict(X_test), automl.predict(X_test))
assert automl.best_config == automl2.best_config
assert automl.best_loss == automl2.best_loss
assert automl.mlflow_integration.infos == automl2.mlflow_integration.infos
```
See also: [Task-Oriented AutoML](Use-Cases/Task-Oriented-AutoML) and [FAQ](FAQ).