Compare commits


141 Commits
v2.0.3 ... main

Author SHA1 Message Date
dependabot[bot]
bc1e4dc5ea Bump webpack from 5.94.0 to 5.105.0 in /website (#1515) 2026-02-08 16:29:18 +08:00
Copilot
158ff7d99e Fix transformers API compatibility: support v4.26+ and v5.0+ with version-aware parameter selection (#1514)
* Initial plan

* Fix transformers API compatibility issues

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add backward compatibility for transformers v4.26+ by version check

Support both the tokenizer (v4.26-4.43) and processing_class (v4.44+) parameters based on the installed transformers version. Fall back to tokenizer if the version check fails.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Improve exception handling specificity

Use specific exception types (ImportError, AttributeError, ValueError) instead of broad Exception catch for better error handling.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Run pre-commit formatting on all files

Applied black formatting to fix code style across the repository.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
2026-01-28 09:00:21 +08:00
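The version-aware parameter selection described in #1514 can be sketched with a small stdlib-only helper. This is illustrative only — `pick_trainer_kwarg` is a hypothetical name, and `installed_version` stands in for `transformers.__version__`; the cutoffs (v4.44+ for `processing_class`, fallback to `tokenizer` on a failed check) come from the commit message above.

```python
def pick_trainer_kwarg(installed_version: str) -> str:
    """Return the Trainer keyword to use for the tokenizer/processor object.

    Per the commit above: transformers v4.44+ accepts `processing_class`,
    v4.26-4.43 only accepts `tokenizer`, and we fall back to `tokenizer`
    whenever the version string cannot be parsed.
    """
    try:
        major, minor = (int(p) for p in installed_version.split(".")[:2])
    except (ValueError, AttributeError, TypeError):
        return "tokenizer"  # fallback when the version check fails
    if (major, minor) >= (4, 44):
        return "processing_class"
    return "tokenizer"
```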
Li Jiang
a5021152d2 ci: skip pre-commit workflow on main (#1513)
* ci: skip pre-commit workflow on main

* ci: run pre-commit only on pull requests
2026-01-25 21:10:05 +08:00
Copilot
fc4efe3510 Fix sklearn 1.7+ compatibility: BaseEstimator type detection for ensemble (#1512)
* Initial plan

* Fix ExtraTreesEstimator regression ensemble error with sklearn 1.7+

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback: improve __sklearn_tags__ implementation

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix format error

* Emphasize pre-commit

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-23 10:20:59 +08:00
Li Jiang
cd0e9fb0d2 Only run save dependencies on main branch (#1510) 2026-01-22 11:07:40 +08:00
dependabot[bot]
a9c0a9e30a Bump lodash from 4.17.21 to 4.17.23 in /website (#1509)
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](https://github.com/lodash/lodash/compare/4.17.21...4.17.23)

---
updated-dependencies:
- dependency-name: lodash
  dependency-version: 4.17.23
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-22 08:47:33 +08:00
Li Jiang
a05b669de3 Update Python version support and pre-commit in documentation (#1505) 2026-01-21 16:39:54 +08:00
Copilot
6e59103e86 Add hierarchical search space documentation (#1496)
* Initial plan

* Add hierarchical search space documentation to Tune-User-Defined-Function.md

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add clarifying comments to hierarchical search space examples

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix formatting issues with pre-commit

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-21 14:40:56 +08:00
Copilot
d9e74031e0 Expose task-level and estimator-level preprocessors as public API (#1497)
* Initial plan

* Add public preprocess() API methods for AutoML and estimators

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add documentation for preprocess() API methods

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add example script demonstrating preprocess() API usage

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback - fix type hints and simplify test logic

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix formatting issues with pre-commit hooks

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Remove example.py, make tests faster

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-21 14:38:25 +08:00
Copilot
7ec1414e9b Clarify period parameter and automatic label lagging in time series forecasting (#1495)
* Initial plan

* Add comprehensive documentation for period parameter and automatic label lagging

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback on docstring clarity

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Clarify period vs prediction output length per @thinkall's feedback

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Refine terminology per code review feedback

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Run pre-commit formatting fixes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-21 14:19:23 +08:00
Copilot
9233a52736 Add configurable label overlap handling for classification holdout strategy (#1491)
* Initial plan

* Fix training/test set overlap in holdout classification by only adding missing labels when needed

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback: add bounds checking and fix edge cases

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix bounds checking: use correct comparison operator for array indexing

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix potential ValueError with max() on empty lists and simplify test assertions

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add extra bounds checking for label_matches indices

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix pandas_on_spark compatibility by using iloc_pandas_on_spark util method

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Run pre-commit to fix formatting issues

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Improve missing label handling to avoid overlap when possible

For classes with multiple instances that end up in one set, properly
re-split them between train and val instead of duplicating. Only add
to both sets when the class has exactly 1 instance (unavoidable overlap).

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix: ensure at least 1 instance remains in original set when re-splitting

Addresses comments on lines 580 and 724 - prevents moving ALL instances
from validation to training or vice versa by using min(num_instances - 1, ...).

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Refactor: Extract missing label handling into two methods (fast/slow)

- Created _handle_missing_labels_fast(): Fast version that adds first instance to both sets (may overlap)
- Created _handle_missing_labels_no_overlap(): Precise version that avoids overlap when possible
- Added allow_label_overlap parameter to AutoML.fit() (default=True for fast version)
- Updated documentation with new parameter
- Both versions maintain label completeness while offering speed/precision tradeoff

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add usage example for allow_label_overlap to Best-Practices.md

- Added comprehensive documentation in Classification section
- Included code examples showing both fast and precise versions
- Explained tradeoffs between speed and precision
- Noted that parameter only affects holdout evaluation

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback: update documentation and tests

- Updated docstrings to clarify fast version only adds instances to missing sets
- Fixed documentation to reflect actual behavior (not "both sets" but "set with missing label")
- Completely rewrote test_no_overlap.py to test both allow_label_overlap modes
- Added tests with sample_weights for better code coverage
- Added test for single-instance class handling
- All 5 tests passing

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix AttributeError: initialize _allow_label_overlap in settings and retrain_from_log

- Added allow_label_overlap to settings initialization with default=True
- Added parameter defaulting in fit() method to use settings value if not provided
- Added _allow_label_overlap initialization in retrain_from_log method
- Fixes test failures in test_multiclass, test_regression, and spark tests

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add docstring to fit()

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-21 14:03:48 +08:00
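The "fast" missing-label handling that #1491 describes — copy the first instance of a label into whichever split is missing it, accepting overlap for single-instance classes — can be sketched on plain `(x, y)` pairs. The function name and list-based representation are illustrative; FLAML's real implementation operates on dataframes and sample weights.

```python
def add_missing_labels(train, val):
    """Ensure every label appears in both splits (fast version, may overlap).

    For each label missing from one split, copy its first occurrence from
    the other split. A single-instance class therefore ends up in both
    splits -- the unavoidable-overlap case noted in the commit message.
    """
    train, val = list(train), list(val)
    train_labels = {y for _, y in train}
    val_labels = {y for _, y in val}
    for x, y in list(train):
        if y not in val_labels:
            val.append((x, y))
            val_labels.add(y)
    for x, y in list(val):
        if y not in train_labels:
            train.append((x, y))
            train_labels.add(y)
    return train, val
```

The precise no-overlap variant would instead move instances of multi-instance classes between splits, only duplicating when a class has exactly one instance.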
Copilot
7ac076d544 Use scientific notation for best error in logger output (#1498)
* Initial plan

* Change best error format from .4f to .4e for scientific notation

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-21 09:06:19 +08:00
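The one-character format change in #1498 matters for small losses: a fixed-point `.4f` rounds very small best errors to `0.0000`, while `.4e` preserves four significant digits at any magnitude. A minimal illustration:

```python
best_error = 3.2e-07

fixed = f"best error: {best_error:.4f}"       # old style, loses the value
scientific = f"best error: {best_error:.4e}"  # new style, keeps 4 sig. digits
```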
Copilot
3d489f1aaa Add validation and clear error messages for custom_metric parameter (#1500)
* Initial plan

* Add validation and documentation for custom_metric parameter

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Refactor validation into reusable method and improve error handling

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Apply pre-commit formatting fixes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-21 08:58:11 +08:00
Copilot
c64eeb5e8d Document that final_estimator parameters in ensemble are not auto-tuned (#1499)
* Initial plan

* Document final_estimator parameter behavior in ensemble configuration

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Address code review feedback: fix syntax in examples and use float comparison

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Run pre-commit to fix formatting issues

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 21:59:31 +08:00
Copilot
bf35f98a24 Document missing value handling behavior for AutoML estimators (#1473)
* Initial plan

* Add comprehensive documentation on missing value handling in FAQ

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Apply mdformat to FAQ.md

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Correct FAQ: FLAML does preprocess missing values with SimpleImputer

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 21:53:10 +08:00
Copilot
1687ca9a94 Fix eval_set preprocessing for XGBoost estimators with categorical features (#1470)
* Initial plan

* Initial analysis - reproduced eval_set preprocessing bug

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix eval_set preprocessing for XGBoost estimators with categorical features

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add eval_set tests to test_xgboost function

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix linting issues with ruff and black

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 20:41:21 +08:00
Copilot
7a597adcc9 Add GitHub Copilot instructions for FLAML repository (#1502)
* Initial plan

* Add comprehensive Copilot instructions for FLAML repository

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Update forecast dependencies list to be complete

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Clarify Python version support details

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
2026-01-20 18:06:47 +08:00
Copilot
4ea9650f99 Fix nested dictionary merge in SearchThread losing sampled hyperparameters (#1494)
* Initial plan

* Add recursive dict update to fix nested config merge

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 15:50:18 +08:00
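The bug fixed in #1494 is the classic shallow-merge pitfall: `dict.update()` replaces a nested sub-dict wholesale, dropping sampled hyperparameters inside it. A recursive merge, sketched below with an illustrative name (`deep_update` is not FLAML's actual identifier), merges nested dicts key by key instead:

```python
def deep_update(base: dict, overrides: dict) -> dict:
    """Merge `overrides` into `base` recursively, in place.

    Nested dicts are merged key by key rather than replaced, so keys in
    `base` that `overrides` does not mention survive the merge.
    """
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base
```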
Li Jiang
fa1a32afb6 Fix indents (#1493) 2026-01-20 11:18:58 +08:00
Copilot
5eb7d623b0 Expand docs to include all flamlized estimators (#1472)
* Initial plan

* Add documentation for all flamlized estimators (RandomForest, ExtraTrees, LGBMClassifier, XGBRegressor)

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix markdown formatting per pre-commit

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 10:59:48 +08:00
Copilot
22dcfcd3c0 Add comprehensive metric documentation and URL reference to AutoML docstrings (#1471)
* Initial plan

* Update AutoML metric documentation with full list and documentation link

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Apply black and mdformat formatting to code and documentation

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Apply pre-commit formatting fixes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2026-01-20 10:34:54 +08:00
Li Jiang
d7208b32d0 Bump version to 2.5.0 (#1492) 2026-01-20 10:30:39 +08:00
Copilot
5f1aa2dda8 Fix: Preserve FLAML_sample_size in best_config_per_estimator (#1475)
* Initial plan

* Fix: Preserve FLAML_sample_size in best_config_per_estimator

Modified best_config_per_estimator property to keep FLAML_sample_size when returning best configurations. Previously, AutoMLState.sanitize() was removing this key, which caused the sample size information to be lost when using starting_points from a previous run.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add a test to verify the improvement of starting_points

* Update documentation to reflect FLAML_sample_size preservation

Updated Task-Oriented-AutoML.md to document that best_config_per_estimator now preserves FLAML_sample_size:
- Added note in "Warm start" section explaining that FLAML_sample_size is preserved for effective warm-starting
- Added note in "Get best configuration" section with example showing FLAML_sample_size in output
- Explains importance of sample size preservation for continuing optimization with correct sample sizes

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix unintended code change

* Improve docstrings and docs

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-20 07:42:31 +08:00
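The fix in #1475 amounts to exempting one key from config sanitization. A hedged sketch of the idea — `sanitize` and `_INTERNAL_KEYS` are hypothetical names, not FLAML's actual `AutoMLState.sanitize()` internals — keeps `FLAML_sample_size` so warm-started runs retain the sample size:

```python
# Illustrative internal keys; only FLAML_sample_size is named in the commit.
_INTERNAL_KEYS = {"FLAML_sample_size", "learner"}

def sanitize(config: dict, preserve_sample_size: bool = True) -> dict:
    """Strip internal keys from a best config, optionally keeping
    FLAML_sample_size so starting_points can warm-start correctly."""
    keep = {"FLAML_sample_size"} if preserve_sample_size else set()
    return {k: v for k, v in config.items()
            if k not in _INTERNAL_KEYS or k in keep}
```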
Copilot
67bdcde4d5 Fix BlendSearch OptunaSearch warning for non-hierarchical spaces with Ray Tune domains (#1477)
* Initial plan

* Fix BlendSearch OptunaSearch warning for non-hierarchical spaces

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Clean up test file

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add regression test for BlendSearch UDF mode warning fix

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Improve the fix and tests

* Fix the "Define-by-run function passed in argument is not yet supported when using" warning

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-20 00:01:41 +08:00
Copilot
46a406edd4 Add objective parameter to LGBMEstimator search space (#1474)
* Initial plan

* Add objective parameter to LGBMEstimator search_space

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Add test for LGBMEstimator objective parameter

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

* Fix format error

* Remove changes, just add a test to verify the current supported usage

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-19 21:10:21 +08:00
Li Jiang
f1817ea7b1 Add support to python 3.13 (#1486) 2026-01-19 18:31:43 +08:00
Li Jiang
f6a5163e6a Fix isinstance usage issues (#1488)
* Fix isinstance usage issues

* Pin python version to 3.12 for pre-commit

* Update mdformat to 0.7.22
2026-01-19 15:19:05 +08:00
Li Jiang
e64b486528 Fix Best Practices not shown (#1483)
* Simplify automl.fit calls in Best Practices

Removed 'retrain_full' and 'eval_method' parameters from automl.fit calls.

* Fix best practices not shown
2026-01-13 14:25:28 +08:00
Li Jiang
a74354f7a9 Update documents, Bump version to 2.4.1, Sync Fabric till 088cfb98 (#1482)
* Add best practices

* Update docs to reflect on the recent changes

* Improve model persisting best practices

* Bump version to 2.4.1

* List all estimators

* Remove autogen

* Update dependencies
2026-01-13 12:49:36 +08:00
Li Jiang
ced1d6f331 Support pickling the whole AutoML instance, Sync Fabric till 0d4ab16f (#1481) 2026-01-12 23:04:38 +08:00
Li Jiang
bb213e7ebd Add timeout for tests and remove macos test envs (#1479) 2026-01-10 22:48:54 +08:00
Li Jiang
d241e8de90 Update readme, enable all python versions for macos tests (#1478)
* Fix macOS hang with running coverage

* Run coverage only in ubuntu

* Fix syntax error

* Fix run tests logic

* Update readme

* Don't test python 3.10 on macos as it's stuck

* Enable all python versions for macos
2026-01-10 20:03:24 +08:00
Copilot
0b138d9193 Fix log_training_metric causing IndexError for time series models (#1469)
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2026-01-10 18:07:17 +08:00
Li Jiang
1c9835dc0a Add support to Python 3.12, Sync Fabric till dc382961 (#1467)
* Merged PR 1686010: Bump version to 2.3.5.post2, Distribute source and wheel, Fix license-file, Only log better models

- Fix license-file
- Bump version to 2.3.5.post2
- Distribute source and wheel
- Log better models only
- Add artifact_path to register_automl_pipeline
- Improve logging of _automl_user_configurations

----
This pull request fixes the project’s configuration by updating the license metadata for compliance with FLAML OSS 2.3.5.

The changes in `/pyproject.toml` update the project’s license and readme metadata by replacing deprecated keys with the new structured fields.
- `/pyproject.toml`: Replaced `license_file` with `license = { text = "MIT" }`.
- `/pyproject.toml`: Replaced `description-file` with `readme = "README.md"`.

Related work items: #4252053

* Merged PR 1688479: Handle feature_importances_ is None, Catch RuntimeError and wait for spark cluster to recover

- Add warning message when feature_importances_ is None (#3982120)
- Catch RuntimeError and wait for spark cluster to recover (#3982133)

----
Bug fix.

This pull request prevents an AttributeError in the feature importance plotting function by adding a check for a `None` value with an informative warning message.
- `flaml/fabric/visualization.py`: Checks if `result.feature_importances_` is `None`, logs a warning with possible reasons, and returns early.
- `flaml/fabric/visualization.py`: Imports `logger` from `flaml.automl.logger` to support the warning message.

Related work items: #3982120, #3982133

* Removed deprecated metadata section

* Fix log_params, log_artifact doesn't support run_id in mlflow 2.6.0

* Remove autogen

* Remove autogen

* Remove autogen

* Merged PR 1776547: Fix flaky test test_automl

Don't throw error when time budget is not enough

----
#### AI description (iteration 1)
#### PR Classification
Bug fix addressing a failing test in the AutoML notebook example.

#### PR Summary
This PR fixes a flaky test by adding a conditional check in the AutoML test that prints a message and exits early if no best estimator is set, thereby preventing unpredictable test failures.
- `test/automl/test_notebook_example.py`: Introduced a check to print "Training budget is not sufficient" and return if `automl.best_estimator` is not found.

Related work items: #4573514

* Merged PR 1777952: Fix unrecognized or malformed field 'license-file' when uploading wheel to feed

Try to fix InvalidDistribution: Invalid distribution metadata: unrecognized or malformed field 'license-file'

----
Bug fix addressing package metadata configuration.

This pull request fixes the error with unrecognized or malformed license file fields during wheel uploads by updating the setup configuration.
- In `setup.py`, added `license="MIT"` and `license_files=["LICENSE"]` to provide proper license metadata.

Related work items: #4560034

* Cherry-pick Merged PR 1879296: Add support to python 3.12 and spark 4.0

* Cherry-pick Merged PR 1890869: Improve time_budget estimation for mlflow logging

* Cherry-pick Merged PR 1879296: Add support to python 3.12 and spark 4.0

* Disable openai workflow

* Add python 3.12 to test envs

* Manually trigger openai

* Support markdown files with underscore-prefixed file names

* Improve save dependencies

* SynapseML is not installed

* Fix syntax error: Module !flaml/autogen was never imported

* macos 3.12 also hangs

* fix syntax error

* Update python version in actions

* Install setuptools for using pkg_resources

* Fix test_automl_performance in Github actions

* Fix test_nested_run
2026-01-10 12:17:21 +08:00
Li Jiang
1285700d7a Update readme, bump version to 2.4.0, fix CI errors (#1466)
* Update gitignore

* Bump version to 2.4.0

* Update readme

* Pre-download california housing data

* Use pre-downloaded california housing data

* Pin lightning<=2.5.6

* Fix typo in find and replace

* Fix estimators has no attribute __sklearn_tags__

* Pin torch to 2.2.2 in tests

* Fix conflict

* Update pytorch-forecasting

* Update pytorch-forecasting

* Update pytorch-forecasting

* Use numpy<2 for testing

* Update scikit-learn

* Run Build and UT every other day

* Pin pip<24.1

* Pin pip<24.1 in pipeline

* Loosen pip, install pytorch_forecasting only in py311

* Add support for new versions of nlp dependencies

* Fix formats

* Remove redefinition

* Update mlflow versions

* Fix mlflow version syntax

* Update gitignore

* Clean up cache to free space

* Remove clean up action cache

* Fix blendsearch

* Update test workflow

* Update setup.py

* Fix catboost version

* Update workflow

* Prepare for python 3.14

* Support no catboost

* Fix tests

* Fix python_requires

* Update test workflow

* Fix vw tests

* Remove python 3.9

* Fix nlp tests

* Fix prophet

* Print pip freeze for better debugging

* Fix Optuna search does not support parameters of type Float with samplers of type Quantized

* Save dependencies for later inspection

* Fix coverage.xml not exists

* Fix github action permission

* Handle python 3.13

* Address openml is not installed

* Check dependencies before run tests

* Update dependencies

* Fix syntax error

* Use bash

* Update dependencies

* Fix git error

* Loose mlflow constraints

* Add rerun, use mlflow-skinny

* Fix git error

* Remove ray tests

* Update xgboost versions

* Fix automl pickle error

* Don't test python 3.10 on macos as it's stuck

* Rebase before push

* Reduce number of branches
2026-01-09 13:40:52 +08:00
dependabot[bot]
7f42bece89 Bump algoliasearch-helper from 3.11.1 to 3.26.0 in /website (#1461)
* Bump algoliasearch-helper from 3.11.1 to 3.26.0 in /website

Bumps [algoliasearch-helper](https://github.com/algolia/instantsearch) from 3.11.1 to 3.26.0.
- [Release notes](https://github.com/algolia/instantsearch/releases)
- [Commits](https://github.com/algolia/instantsearch/commits/algoliasearch-helper@3.26.0)

---
updated-dependencies:
- dependency-name: algoliasearch-helper
  dependency-version: 3.26.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix format error

* Fix format error

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2025-10-09 14:37:31 +08:00
Keita Onabuta
e19107407b update loc second args - column (#1458)
Set the second argument of the loc function to time_col instead of the dataframe X.
2025-08-30 11:07:19 +08:00
Li Jiang
f5d6693253 Bump version to 2.3.7 (#1457) 2025-08-26 14:59:32 +08:00
Azamatkhan Arifkhanov
d4e43c50a2 Fix OSError: [Errno 24] Too many open files: 'nul' (#1455)
* Update model.py

Added closing of save_fds.

* Updated model.py for pre-commit requirements
2025-08-26 12:50:22 +08:00
dependabot[bot]
13aec414ea Bump brace-expansion from 1.1.11 to 1.1.12 in /website (#1453)
Bumps [brace-expansion](https://github.com/juliangruber/brace-expansion) from 1.1.11 to 1.1.12.
- [Release notes](https://github.com/juliangruber/brace-expansion/releases)
- [Commits](https://github.com/juliangruber/brace-expansion/compare/1.1.11...v1.1.12)

---
updated-dependencies:
- dependency-name: brace-expansion
  dependency-version: 1.1.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-08-14 10:50:51 +08:00
Li Jiang
bb16dcde93 Bump version to 2.3.6 (#1451) 2025-08-05 14:29:36 +08:00
Li Jiang
be81a76da9 Fix TypeError of customized kfold method which needs 'y' (#1450) 2025-08-02 08:05:50 +08:00
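One way to support both `split(X)` and `split(X, y)` signatures for a user-supplied kfold object, as #1450 addresses, is to inspect the method's signature before calling it. This is a stdlib-only sketch under that assumption; the helper and toy splitter classes are illustrative, not FLAML's code:

```python
import inspect

def call_split(kfold, X, y=None):
    """Call kfold.split, passing y only if the method accepts it."""
    params = inspect.signature(kfold.split).parameters
    if "y" in params and y is not None:
        return kfold.split(X, y)
    return kfold.split(X)

class NeedsY:
    def split(self, X, y):
        # toy splitter: pair each row with its label
        return list(zip(X, y))

class NoY:
    def split(self, X):
        return list(X)
```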
Li Jiang
2d16089529 Improve FAQ docs (#1448)
* Fix settings usage error

* Add new code example
2025-07-09 18:33:10 +08:00
Li Jiang
01c3c83653 Install wheel and setuptools (#1443) 2025-05-28 12:56:48 +08:00
Li Jiang
9b66103f7c Fix typo, add quotes to python-version (#1442) 2025-05-28 12:24:00 +08:00
Li Jiang
48dfd72e64 Fix CD actions (#1441)
* Fix CD actions

* Skip Build if no relevant changes
2025-05-28 10:45:27 +08:00
Li Jiang
dec92e5b02 Upgrade python 3.8 to 3.10 in github actions (#1440) 2025-05-27 21:34:21 +08:00
Li Jiang
22911ea1ef Merged PR 1685054: Add more logs and function wait_futures for easier post analysis (#1438)
- Add function wait_futures for easier post analysis
- Use logger instead of print

----
#### AI description (iteration 1)
#### PR Classification
A code enhancement for debugging asynchronous mlflow logging and improving post-run analysis.

#### PR Summary
This PR adds detailed debug logging to the mlflow integration and introduces a new `wait_futures` function to streamline the collection of asynchronous task results for improved analysis.
- `flaml/fabric/mlflow.py`: Added debug log statements around starting and ending mlflow runs to trace run IDs and execution flow.
- `flaml/automl/automl.py`: Implemented the `wait_futures` function to handle asynchronous task results and replaced a print call with `logger.info` for consistent logging.

Related work items: #4029592
2025-05-27 15:32:56 +08:00
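A `wait_futures` helper like the one described in #1438 can be sketched with `concurrent.futures` alone. The exact signature of FLAML's function is not shown in the commit, so this is a minimal stdlib version that collects results (and exceptions, rather than raising) for post-run analysis:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def wait_futures(futures):
    """Block until all futures finish; return results in completion order.

    Exceptions are collected instead of raised so one failed logging task
    does not hide the others during post analysis.
    """
    results = []
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception as exc:
            results.append(exc)
    return results

with ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(lambda v=v: v * v) for v in range(4)]
    squares = sorted(wait_futures(futs))
```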
murunlin
12183e5f73 Add the detailed info for parameter 'verbose' (#1435)
* explain-verbose-parameter

* concise-verbose-docstring

* explain-verbose-parameter

* explain-verbose-parameter

* test-ignore

* test-ignore

* sklearn-version-califonia

* submit-0526

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-27 10:01:01 +08:00
Li Jiang
c2b25310fc Sync Fabric till 2cd1c3da (#1433)
* Sync Fabric till 2cd1c3da

* Remove synapseml from tag names

* Fix 'NoneType' object has no attribute 'DataFrame'

* Deprecated 3.8 support

* Fix 'NoneType' object has no attribute 'DataFrame'

* Still use python 3.8 for pydoc

* Don't run tests in parallel

* Remove autofe and lowcode
2025-05-23 10:19:31 +08:00
murunlin
0f9420590d fix: best_model_for_estimator returns inconsistent feature_importances_ compared to automl.model (#1429)
* mrl-issue1422-0513

* fix version dependency

* fix datasets version

* test completion

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-15 09:37:34 +08:00
hexiang-x
5107c506b4 fix: When use_spark = True and mlflow_logging = True are set, an error is reported when logging the best model: 'NoneType' object has no attribute 'save' (#1432) 2025-05-14 19:34:06 +08:00
dependabot[bot]
9e219ef8dc Bump http-proxy-middleware from 2.0.7 to 2.0.9 in /website (#1425)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.7 to 2.0.9.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.9/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.7...v2.0.9)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-version: 2.0.9
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-04-23 14:22:12 +08:00
Li Jiang
6e4083743b Revert "Numpy 2.x is not supported yet. (#1424)" (#1426)
This reverts commit 17e95edd9e.
2025-04-22 21:31:44 +08:00
Li Jiang
17e95edd9e Numpy 2.x is not supported yet. (#1424) 2025-04-22 12:11:27 +08:00
Stickic-cyber
468bc62d27 Fix issue with "list index out of range" when max_iter=1 (#1419) 2025-04-09 21:54:17 +08:00
dependabot[bot]
437c239c11 Bump @babel/helpers from 7.20.1 to 7.26.10 in /website (#1413)
Bumps [@babel/helpers](https://github.com/babel/babel/tree/HEAD/packages/babel-helpers) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-helpers)

---
updated-dependencies:
- dependency-name: "@babel/helpers"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-14 15:51:06 +08:00
dependabot[bot]
8e753f1092 Bump @babel/runtime from 7.20.1 to 7.26.10 in /website (#1414)
Bumps [@babel/runtime](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime)

---
updated-dependencies:
- dependency-name: "@babel/runtime"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 21:34:02 +08:00
dependabot[bot]
a3b57e11d4 Bump prismjs from 1.29.0 to 1.30.0 in /website (#1411)
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.29.0 to 1.30.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.29.0...v1.30.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 14:06:41 +08:00
dependabot[bot]
a80dcf9925 Bump @babel/runtime-corejs3 from 7.20.1 to 7.26.10 in /website (#1412)
Bumps [@babel/runtime-corejs3](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime-corejs3) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime-corejs3)

---
updated-dependencies:
- dependency-name: "@babel/runtime-corejs3"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-13 10:04:03 +08:00
SkBlaz
7157af44e0 Improved error handling in case no scikit present (#1402)
* Improved error handling in case no scikit present

Currently there is no description for when this error is thrown. Being explicit seems of value.
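A minimal sketch of the kind of explicit guard this commit describes (the helper name and message are illustrative, not FLAML's actual API in histgb.py):

```python
import importlib


def import_or_explain(module_name, hint):
    """Import a module, raising an explicit error with an install hint if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a description instead of surfacing a bare ImportError
        raise ImportError(f"{module_name} is required for this estimator: {hint}") from exc
```

For example, an estimator wrapper could call `import_or_explain("sklearn.ensemble", "pip install scikit-learn")` at construction time so the user sees what to install rather than an unexplained traceback.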

* Update histgb.py

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-03 15:39:43 +08:00
Li Jiang
1798c4591e Upgrade setuptools (#1410) 2025-03-01 08:05:51 +08:00
Li Jiang
dd26263330 Bump version to 2.3.5 (#1409) 2025-02-17 22:26:59 +08:00
Li Jiang
2ba5f8bed1 Fix params pop error (#1408) 2025-02-17 15:06:05 +08:00
Daniel Grindrod
d0a11958a5 fix: Fixed bug where group folds and sample weights couldn't be used in the same automl instance (#1405) 2025-02-15 10:41:27 +08:00
dependabot[bot]
0ef9b00a75 Bump serialize-javascript from 6.0.0 to 6.0.2 in /website (#1407)
Bumps [serialize-javascript](https://github.com/yahoo/serialize-javascript) from 6.0.0 to 6.0.2.
- [Release notes](https://github.com/yahoo/serialize-javascript/releases)
- [Commits](https://github.com/yahoo/serialize-javascript/compare/v6.0.0...v6.0.2)

---
updated-dependencies:
- dependency-name: serialize-javascript
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-02-14 12:36:49 +08:00
Will Charles
840f76e5e5 Changed tune.report import for ray>=2 (#1392)
* Changed tune.report import for ray>=2

* env: Changed pydantic restriction in env

* Reverted Pydantic install conditions

* Reverted Pydantic install conditions

* test: Check if GPU is available

* tests: uncommented a line

* tests: Better fix for Ray GPU checking

* tests: Added timeout to dataset loading

* tests: Deleted _test_hf_data()

* test: Reduce lrl2 dataset size

* bug: timeout error

* bug: timeout error

* fix: Added threading check for timeout issue

* Undo old commits

* Timeout fix from #1406

---------

Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
2025-02-14 09:38:33 +08:00
Li Jiang
d8b7d25b80 Fix test hang issue (#1406)
* Add try except to resource.setrlimit

* Set time limit only in main thread

* Check only test model

* Pytest debug

* Test separately

* Move test_model.py to automl folder
2025-02-13 19:50:35 +08:00
Li Jiang
6d53929803 Bump version to 2.3.4 (#1389) 2024-12-18 12:49:59 +08:00
Daniel Grindrod
c038fbca07 fix: KeyError no longer occurs when using groupfolds for regression tasks. (#1385)
* fix: Now resetting indexes for regression datasets when using group folds

* refactor: Simplified if statement to include all fold types

* docs: Updated docs to make it clear that group folds can be used for regression tasks

---------

Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-18 10:06:58 +08:00
dependabot[bot]
6a99202492 Bump nanoid from 3.3.6 to 3.3.8 in /website (#1387)
Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.3.6...3.3.8)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-17 19:26:34 +08:00
Daniel Grindrod
42d1dcfa0e fix: Fixed bug with catboost and groups (#1383)
Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
2024-12-17 13:54:49 +08:00
EgorKraevTransferwise
b83c8a7d3b Pass cost_attr and cost_budget from flaml.tune.run() to the search algo (#1382) 2024-12-04 20:50:15 +08:00
dependabot[bot]
b9194cdcf2 Bump cross-spawn from 7.0.3 to 7.0.6 in /website (#1379)
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md)
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.6)

---
updated-dependencies:
- dependency-name: cross-spawn
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-20 15:48:39 +08:00
Li Jiang
9a1f6b0291 Bump version to 2.3.3 (#1378) 2024-11-13 11:44:34 +08:00
kernelmethod
07f4413aae Fix logging nuisances that can arise when importing flaml (#1377) 2024-11-13 07:49:55 +08:00
Daniel Grindrod
5a74227bc3 Flaml: fix lgbm reproducibility (#1369)
* fix: Fixed bug where every underlying LGBMRegressor or LGBMClassifier had n_estimators = 1

* test: Added test showing case where FLAMLised CatBoostModel result isn't reproducible

* fix: Fixing issue where callbacks cause LGBM results to not be reproducible

* Update test/automl/test_regression.py

Co-authored-by: Li Jiang <bnujli@gmail.com>

* fix: Adding back the LGBM EarlyStopping

* refactor: Fix tweaked to ensure other models aren't likely to be affected

* test: Fixed test to allow reproduced results to be better than the FLAML results, when LGBM earlystopping is involved

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-11-01 10:06:15 +08:00
Ranuga
7644958e21 Add documentation for automl.model.estimator usage (#1311)
* Added documentation for automl.model.estimator usage

Updated documentation across various examples and model.py to describe automl.model.estimator, giving users clear guidance on how to use this attribute in their AutoML workflows.

* fix: Ran pre-commit hook on docs

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 20:53:54 +08:00
Daniel Grindrod
a316f84fe1 fix: LinearSVC results now reproducible (#1376)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 14:02:16 +08:00
Daniel Grindrod
72881d3a2b fix: Fixing the random state of ElasticNetClassifier by default, to ensure reproducibility. Also included elasticnet in reproducibility tests (#1374)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-29 14:21:43 +08:00
Li Jiang
69da685d1e Fix data transform issue, spark log_loss metric compute error and json dumps TypeError (Sync Fabric till 3c545e67) (#1371)
* Merged PR 1444697: Fix json dumps TypeError

Fix json dumps TypeError

----
Bug fix to address a `TypeError` in `json.dumps`.

This pull request fixes a `TypeError` encountered when using `json.dumps` on `automl._automl_user_configurations` by introducing a safe JSON serialization function.
- Added `safe_json_dumps` function in `flaml/fabric/mlflow.py` to handle non-serializable objects.
- Updated `MLflowIntegration` class in `flaml/fabric/mlflow.py` to use `safe_json_dumps` for JSON serialization.
- Modified `test/automl/test_multiclass.py` to test the new `safe_json_dumps` function.
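The `safe_json_dumps` helper described above can be sketched roughly as follows (a hypothetical minimal version for illustration, not the exact function in `flaml/fabric/mlflow.py`):

```python
import json


def safe_json_dumps(obj, **kwargs):
    """json.dumps that never raises TypeError on non-serializable values.

    Values the default encoder cannot handle (estimators, numpy scalars,
    etc.) are rendered via str() instead of aborting the whole dump.
    """
    return json.dumps(obj, default=str, **kwargs)
```

With this, serializing a config dict that happens to contain a model object degrades the model to its string repr instead of raising `TypeError: Object of type ... is not JSON serializable`.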

Related work items: #3439408

* Fix data transform issue and spark log_loss metric compute error
2024-10-29 11:58:40 +08:00
Li Jiang
c01c3910eb Update version.py (#1372) 2024-10-29 09:33:23 +08:00
dependabot[bot]
98d3fd2f48 Bump http-proxy-middleware from 2.0.6 to 2.0.7 in /website (#1370)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.6 to 2.0.7.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.7/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.6...v2.0.7)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 10:43:28 +08:00
Li Jiang
9724c626cc Remove outdated comment (#1366) 2024-10-24 12:17:21 +08:00
smty2018
0d92400200 Documented that retrain_full = True does not include the user-provided validation data. #1228 (#1245)
* Update Task-Oriented-AutoML.md

* Update Task-Oriented-AutoML.md

* Update marker

* Fix format

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-23 16:48:45 +08:00
Daniel Grindrod
d224218ecf fix: FLAML catboost metrics aren't reproducible (#1364)
* fix: CatBoostRegressors metrics are now reproducible

* test: Made tests live, which ensure the reproducibility of catboost models

* fix: Added defunct line of code as a comment

* fix: Re-adding removed if statement, and test to show one issue that if statement can cause

* fix: Stopped ending CatBoost training early when time budget is running out

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-23 13:51:23 +08:00
Daniel Grindrod
a2a5e1abb9 test: Adding tests to verify model reproducibility (#1362) 2024-10-12 09:53:16 +08:00
Daniel Grindrod
5c0f18b7bc fix: Cross validation process isn't always run to completion (#1360) 2024-10-01 08:24:53 +08:00
dependabot[bot]
e5d95f5674 Bump express from 4.19.2 to 4.21.0 in /website (#1357) 2024-09-22 11:01:00 +08:00
Li Jiang
49ba962d47 Support logger_formatter without automl dependencies (#1356) 2024-09-21 20:04:46 +08:00
Li Jiang
8e171bc402 Remove temporary pickle files (#1354)
* Remove temporary pickle files

* Update version to 2.3.1

* Use TemporaryDirectory for pickle and log_artifact

* Fix 'CatBoostClassifier' object has no attribute '_get_param_names'
2024-09-21 15:46:32 +08:00
dependabot[bot]
c90946f303 Bump webpack from 5.76.1 to 5.94.0 in /website (#1342)
Bumps [webpack](https://github.com/webpack/webpack) from 5.76.1 to 5.94.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.76.1...v5.94.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-06 11:56:42 +08:00
dependabot[bot]
64f30af603 Bump micromatch from 4.0.5 to 4.0.8 in /website (#1343)
Bumps [micromatch](https://github.com/micromatch/micromatch) from 4.0.5 to 4.0.8.
- [Release notes](https://github.com/micromatch/micromatch/releases)
- [Changelog](https://github.com/micromatch/micromatch/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/micromatch/compare/4.0.5...4.0.8)

---
updated-dependencies:
- dependency-name: micromatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-09-05 15:18:26 +08:00
Li Jiang
f45582d3c7 Add info of tutorial automl 2024 (#1344)
* Add info of tutorial automl 2024

* Add notebooks

* Fix links

* Update usage of built-in LLMs
2024-09-04 19:35:09 +08:00
Li Jiang
bf4bca2195 Add contributors wall (#1341)
* Add contributors wall

* code format
2024-08-30 22:33:44 +08:00
Li Jiang
efaba26d2e Update version and readme (#1338)
* Update version and readme

* Update pr template
2024-08-22 22:33:23 +00:00
Li Jiang
62194f321d Update issue templates (#1337) 2024-08-21 10:00:48 +00:00
Li Jiang
5bfa0b1cd3 Improve mlflow integration and add more models (#1331)
* Add more spark models and improved mlflow integration

* Update test_extra_models, setup and gitignore

* Remove autofe

* Remove autofe

* Remove autofe

* Sync changes in internal

* Fix test for env without pyspark

* Fix import errors

* Fix tests

* Fix typos

* Fix pytorch-forecasting version

* Remove internal funcs, rename _mlflow.py

* Fix import error

* Fix dependency

* Fix experiment name setting

* Fix dependency

* Update pandas version

* Update pytorch-forecasting version

* Add warning message for not has_automl

* Fix test errors with nltk 3.8.2

* Don't enable mlflow logging w/o an active run

* Fix pytorch-forecasting can't be pickled issue

* Update pyspark tests condition

* Update synapseml

* Update synapseml

* No parent run, no logging for OSS

* Log when autolog is enabled

* upgrade code

* Enable autolog for tune

* Increase time budget for test

* End run before start a new run

* Update parent run

* Fix import error

* clean up

* skip macos and win

* Update notes

* Update default value of model_history
2024-08-13 07:53:47 +00:00
dependabot[bot]
bd34b4e75a Bump express from 4.18.2 to 4.19.2 in /website (#1293)
Bumps [express](https://github.com/expressjs/express) from 4.18.2 to 4.19.2.
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.2...4.19.2)

---
updated-dependencies:
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:55:25 +00:00
dependabot[bot]
7670945298 Bump follow-redirects from 1.15.4 to 1.15.6 in /website (#1291)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.4 to 1.15.6.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.4...v1.15.6)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:52:11 +00:00
dependabot[bot]
43537cb539 Bump webpack-dev-middleware from 5.3.3 to 5.3.4 in /website (#1292)
Bumps [webpack-dev-middleware](https://github.com/webpack/webpack-dev-middleware) from 5.3.3 to 5.3.4.
- [Release notes](https://github.com/webpack/webpack-dev-middleware/releases)
- [Changelog](https://github.com/webpack/webpack-dev-middleware/blob/v5.3.4/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-dev-middleware/compare/v5.3.3...v5.3.4)

---
updated-dependencies:
- dependency-name: webpack-dev-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:50:17 +00:00
Gökhan Geyik
f913b79225 Fix(doc): Page Not Found (#1296)
- Fix the redirect link that received a page not found error.

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-08-12 12:01:46 +00:00
dependabot[bot]
a092a39b5e Bump braces from 3.0.2 to 3.0.3 in /website (#1336)
Bumps [braces](https://github.com/micromatch/braces) from 3.0.2 to 3.0.3.
- [Changelog](https://github.com/micromatch/braces/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/braces/compare/3.0.2...3.0.3)

---
updated-dependencies:
- dependency-name: braces
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 08:37:56 +00:00
Jirka Borovec
04bf1b8741 update py versions, sourced from PyPI (#1332)
* update py versions, sourced from PyPI

* lint

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 04:53:48 +00:00
Jirka Borovec
b348cb1136 configure & apply pyupgrade with py3.8+ (#1333)
* configure pyupgrade with `py3.8+`

* apply update

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:54:18 +00:00
Jirka Borovec
cd0e88e383 fix missing req. arg for new datasets package (#1334)
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:19:11 +00:00
Li Jiang
a17c6e392e Fix test errors of nltk and numpy (#1335)
* Fix test errors with nltk 3.8.2

* Fix test errors with numpy large

* Fix test errors with numpy large
2024-08-12 00:14:21 +00:00
Li Jiang
52627ff14b Add 3.11 icon (#1330) 2024-08-08 06:18:49 +00:00
Li Jiang
7729855f49 Bump version to 2.2.0 (#1329) 2024-08-08 01:05:53 +00:00
Noël Barron
0fe284b21f Doc and comment typos improvements (#1319)
* typographical corrections in the descriptions, comment improvements, general formatting for consistency

* consistent indentation for better readability, improved comments, typographical corrections

* updated docstrings for better clarity, added type hint for **kwargs, typographical corrections (no functionality changes)

* Fix format

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-06 15:29:37 +00:00
Yang, Bo
853c9501bc Keep searching hyperparameters when r2_score raises an error (#1325)
* Keep searching hyperparameters when `r2_score` raises an error

* Add log info

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-06 15:01:10 +00:00
Yang, Bo
8e63dd417b Don't pass callbacks=None to XGBoostSklearnEstimator._fit (#1322)
* Don't pass `callbacks=None` to `XGBoostSklearnEstimator._fit`

The original implementation would pass `callbacks=None` to `XGBoostSklearnEstimator._fit` and eventually lead to a `TypeError` of `XGBModel.fit() got an unexpected keyword argument 'callbacks'`. This PR instead does not pass the `callbacks=None` parameter to avoid the error.
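A sketch of the workaround (the helper name is hypothetical; in FLAML the change lives inside `XGBoostSklearnEstimator._fit`):

```python
def fit_with_optional_callbacks(model, X, y, callbacks=None, **kwargs):
    """Forward `callbacks` to model.fit only when it is actually set.

    Passing callbacks=None explicitly can raise
    "TypeError: fit() got an unexpected keyword argument 'callbacks'"
    on xgboost builds whose fit() signature has no such parameter,
    so the key is omitted from the kwargs entirely when it is None.
    """
    if callbacks is not None:
        kwargs["callbacks"] = callbacks
    return model.fit(X, y, **kwargs)
```

The design point is that omitting a keyword is always safe, while forwarding an explicit `None` requires the callee to declare that parameter.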

* Update setup.py to allow for xgboost 2.x

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-06 09:24:11 +00:00
Li Jiang
f27f98c6d7 Fix test mac os python 3.11 (#1328)
* add test

* Skip test_autohf_classificationhead.py for MacOS py311

* Skip test/nlp/test_default.py for MacOS py311

* Check test_tune

* Check test_lexiflow

* Check test_tune

* Remove checks

* Skip test_nested_run for macos py311

* Skip test_nested_space for macos py311

* Test tune on MacOS Python 3.11 w/o pytest

* Split tests by folder

* Skip test lexiflow for MacOS py311

* Enable test_tune for MacOS py311

* Clean up
2024-08-06 05:50:44 +00:00
Li Jiang
a68d073ccf Add support to python 3.11 (#1326)
* Add support to python 3.11

* Fix workflow python version comparison

* Ray is not supported in python 3.11

* Fix test_numpy
2024-07-31 00:18:41 +00:00
Li Jiang
15fda2206b Add example of how to get best config and convert it to parameters (#1323) 2024-07-24 08:20:36 +00:00
leafy-lee
a9d7b7f971 Handle IntLogUniformDistribution Deprecation before Optuna<=v4.0.0 (#1324)
Co-authored-by: Yifei Li <v-liyifei@microsoft.com>
2024-07-24 07:02:06 +00:00
Li Jiang
d24d2e0088 Upgrade Optuna (#1321) 2024-07-23 01:21:20 +00:00
Ranuga
67f4048667 Update ts_model.py (#1312)
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-07-22 05:32:51 +00:00
Li Jiang
d8129b9211 Fix typos, upgrade yarn packages, add some improvements (#1290)
* Fix typos, upgrade yarn packages, add some improvements

* Fix joblib 1.4.0 breaks joblib-spark

* Fix xgboost test error

* Pin xgboost<2.0.0

* Try update prophet to 1.5.1

* Update github workflow

* Revert prophet version

* Update github workflow

* Update install libomp

* Fix test errors

* Fix test errors

* Add retry to test and coverage

* Revert "Add retry to test and coverage"

This reverts commit ce13097cd5.

* Increase test budget

* Add more data to test_models, try fixing ValueError: Found array with 0 sample(s) (shape=(0, 252)) while a minimum of 1 is required.
2024-07-19 13:40:04 +00:00
Jirka Borovec
165d7467f9 precommit: introduce mdformat (#1276)
* precommit: introduce `mdformat`

* precommit: apply
2024-03-19 22:46:56 +00:00
Gleb Levitski
3de0dc667e Add ruff sort to pre-commit and sort imports in the library (#1259)
* lint

* bump ver

* bump ver

* fixed circular import

---------

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-03-12 21:28:57 +00:00
dependabot[bot]
6840dc2b09 Bump follow-redirects from 1.15.2 to 1.15.4 in /website (#1266)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.2 to 1.15.4.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-03-12 16:50:01 +00:00
Chi Wang
1a9fa3ac23 Np.inf (#1289)
* np.Inf -> np.inf

* bump version to 2.1.2
2024-03-12 16:27:05 +00:00
Jack Gerrits
325baa40a5 Don't specify a pre-release in the numpy dependency (#1286) 2024-03-12 14:43:49 +00:00
Dhruv Thakur
550d1cfe9b Update AutoML-NLP.md (#1239)
* Update AutoML-NLP.md

#834

* more space

---------

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Chi Wang <wang.chi@microsoft.com>
2024-02-10 07:32:57 +00:00
Jirka Borovec
249f0f1708 docs: fix link to reference (#1263)
* docs: fix link to reference

* Apply suggestions from code review

Co-authored-by: Li Jiang <bnujli@gmail.com>

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-02-09 16:48:51 +00:00
Li Jiang
b645da3ea7 Fix spark errors (#1274)
* Fix mlflow not found error

* Fix joblib>1.2.0 force cancel error

* Remove joblib version constraint

* Update log

* Improve joblib exception catch

* Added permissions
2024-02-09 01:08:24 +00:00
ScottzCodez
0415638dd1 Update Installation.md (#1258)
Typo Fixed.
2023-11-29 01:39:20 +00:00
Gleb Levitski
6b93c2e394 [ENH] Add support for sklearn HistGradientBoostingEstimator (#1230)
* Update model.py

HistGradientBoosting support

* Create __init__.py

* Update model.py

* Create histgb.py

* Update __init__.py

* Update test_model.py

* added histgb to estimator list

* Update Task-Oriented-AutoML.md

added docs

* lint

* fixed bugs

---------

Co-authored-by: Gleb <gleb@Glebs-MacBook-Pro.local>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2023-10-31 14:45:23 +00:00
dependabot[bot]
a93bf39720 Bump @babel/traverse from 7.20.1 to 7.23.2 in /website (#1248)
Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.20.1 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse)

---
updated-dependencies:
- dependency-name: "@babel/traverse"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-21 14:48:46 +00:00
dependabot[bot]
dc8060a21b Bump postcss from 8.4.18 to 8.4.31 in /website (#1238)
Bumps [postcss](https://github.com/postcss/postcss) from 8.4.18 to 8.4.31.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/8.4.18...8.4.31)

---
updated-dependencies:
- dependency-name: postcss
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
2023-10-12 07:56:29 +00:00
Aindree Chatterjee
30db685cee Update README.md with autogen links (#1235)
* Update README.md

Added links to the Discord, website, and GitHub repo for AutoGen in README.md's first news item.
In relation to issue #1231

* Update README.md
2023-10-09 15:32:39 +00:00
Chi Wang
fda9fa0103 improve docstr of preprocessors (#1227)
* improve docstr of preprocessors

* Update SynapseML version

* Fix test

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2023-09-29 03:07:21 +00:00
Qingyun Wu
830ec4541c Update autogen links (#1214)
* update links

* update autogen doc link

* wording

---------

Co-authored-by: Chi Wang <wang.chi@microsoft.com>
2023-09-23 16:55:30 +00:00
Dominik Moritz
46162578f8 Fix typo Whetehr -> Whether (#1220)
Co-authored-by: Chi Wang <wang.chi@microsoft.com>
2023-09-22 15:27:02 +00:00
Dominik Moritz
8658e51182 fix ref to research (#1218)
Co-authored-by: Chi Wang <wang.chi@microsoft.com>
2023-09-22 15:26:21 +00:00
Chi Wang
868e7dd1ca support xgboost 2.0 (#1219)
* support xgboost 2.0

* try classes_

* test version

* quote

* use_label_encoder

* Fix xgboost test error

* remove deprecated files

* remove deprecated files

* remove deprecated import

* replace deprecated import in integrate_spark.ipynb

* replace deprecated import in automl_lightgbm.ipynb

* formatted integrate_spark.ipynb

* replace deprecated import

* try fix driver python path

* Update python-package.yml

* replace deprecated reference

* move spark python env var to other section

* Update setup.py, install xgb<2 for MacOS

* Fix typo

* assert

* Try assert xgboost version

* Fail fast

* Keep all test/spark to try fail fast

* No need to skip spark test in Mac or Win

* Remove assert xgb version

* Remove fail fast

* Found root cause, fix test_sparse_matrix_xgboost

* Revert "No need to skip spark test in Mac or Win"

This reverts commit a09034817f.

* remove assertion

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: levscaut <57213911+levscaut@users.noreply.github.com>
Co-authored-by: levscaut <lwd2010530@qq.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>
2023-09-22 06:55:00 +00:00
Chi Wang
4886cb5689 Rename Responsive -> Conversable (#1202)
* responsive -> conversable

* preview

* rename

* register reply

* rename and version

* bump version to 2.1.0

* notebook

* bug fix
2023-09-12 00:07:35 +00:00
Chi Wang
599731cb22 rename human to user_proxy (#1215)
* rename human to user_proxy

* notebook update and bug fix
2023-09-11 14:33:47 +00:00
Chi Wang
0cb79dfdff group chat for visualization (#1213)
* group chat for visualization

* show figure

* webpage update

* link update

* example 2

* example 2

---------

Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
2023-09-10 23:20:45 +00:00
Qingyun Wu
f70df312f4 Migration headsup (#1204)
* add readme

* migration headsup

* remove move date

* Update README.md

Co-authored-by: Chi Wang <wang.chi@microsoft.com>

---------

Co-authored-by: Chi Wang <wang.chi@microsoft.com>
2023-09-09 00:08:24 +00:00
275 changed files with 23150 additions and 4062 deletions


@@ -1,5 +1,7 @@
 [run]
 branch = True
-source = flaml
+source =
+    flaml
 omit =
-    *test*
+    */test/*
+    */flaml/autogen/*

.github/ISSUE_TEMPLATE.md vendored Normal file

@@ -0,0 +1,73 @@
### Description
<!-- A clear and concise description of the issue or feature request. -->
### Environment
- FLAML version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Python version: <!-- Specify the Python version (e.g., 3.8) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
### Steps to Reproduce (for bugs)
<!-- Provide detailed steps to reproduce the issue. Include code snippets, configuration files, or any other relevant information. -->
1. Step 1
1. Step 2
1. ...
### Expected Behavior
<!-- Describe what you expected to happen. -->
### Actual Behavior
<!-- Describe what actually happened. Include any error messages, stack traces, or unexpected behavior. -->
### Screenshots / Logs (if applicable)
<!-- If relevant, include screenshots or logs that help illustrate the issue. -->
### Additional Information
<!-- Include any additional information that might be helpful, such as specific configurations, data samples, or context about the environment. -->
### Possible Solution (if you have one)
<!-- If you have suggestions on how to address the issue, provide them here. -->
### Is this a Bug or Feature Request?
<!-- Choose one: Bug | Feature Request -->
### Priority
<!-- Choose one: High | Medium | Low -->
### Difficulty
<!-- Choose one: Easy | Moderate | Hard -->
### Any related issues?
<!-- If this is related to another issue, reference it here. -->
### Any relevant discussions?
<!-- If there are any discussions or forum threads related to this issue, provide links. -->
### Checklist
<!-- Please check the items that you have completed -->
- [ ] I have searched for similar issues and didn't find any duplicates.
- [ ] I have provided a clear and concise description of the issue.
- [ ] I have included the necessary environment details.
- [ ] I have outlined the steps to reproduce the issue.
- [ ] I have included any relevant logs or screenshots.
- [ ] I have indicated whether this is a bug or a feature request.
- [ ] I have set the priority and difficulty levels.
### Additional Comments
<!-- Any additional comments or context that you think would be helpful. -->

53
.github/ISSUE_TEMPLATE/bug_report.yml vendored Normal file

@@ -0,0 +1,53 @@
name: Bug Report
description: File a bug report
title: "[Bug]: "
labels: ["bug"]
body:
- type: textarea
id: description
attributes:
label: Describe the bug
description: A clear and concise description of what the bug is.
placeholder: What went wrong?
- type: textarea
id: reproduce
attributes:
label: Steps to reproduce
description: |
Steps to reproduce the behavior:
1. Step 1
2. Step 2
3. ...
4. See error
placeholder: How can we replicate the issue?
- type: textarea
id: modelused
attributes:
label: Model Used
description: A description of the model that was used when the error was encountered
placeholder: e.g., gpt-4, mistral-7B
- type: textarea
id: expected_behavior
attributes:
label: Expected Behavior
description: A clear and concise description of what you expected to happen.
placeholder: What should have happened?
- type: textarea
id: screenshots
attributes:
label: Screenshots and logs
description: If applicable, add screenshots and logs to help explain your problem.
placeholder: Add screenshots here
- type: textarea
id: additional_information
attributes:
label: Additional Information
description: |
- FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
- Python Version: <!-- Specify the Python version (e.g., 3.8) -->
- Related Issues: <!-- Link to any related issues here (e.g., #1) -->
- Any other relevant information.
placeholder: Any additional details

1
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@@ -0,0 +1 @@
blank_issues_enabled: true


@@ -0,0 +1,26 @@
name: Feature Request
description: File a feature request
labels: ["enhancement"]
title: "[Feature Request]: "
body:
- type: textarea
id: problem_description
attributes:
label: Is your feature request related to a problem? Please describe.
description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
placeholder: What problem are you trying to solve?
- type: textarea
id: solution_description
attributes:
label: Describe the solution you'd like
description: A clear and concise description of what you want to happen.
placeholder: How do you envision the solution?
- type: textarea
id: additional_context
attributes:
label: Additional context
description: Add any other context or screenshots about the feature request here.
placeholder: Any additional information


@@ -0,0 +1,41 @@
name: General Issue
description: File a general issue
title: "[Issue]: "
labels: []
body:
- type: textarea
id: description
attributes:
label: Describe the issue
description: A clear and concise description of what the issue is.
placeholder: What went wrong?
- type: textarea
id: reproduce
attributes:
label: Steps to reproduce
description: |
Steps to reproduce the behavior:
1. Step 1
2. Step 2
3. ...
4. See error
placeholder: How can we replicate the issue?
- type: textarea
id: screenshots
attributes:
label: Screenshots and logs
description: If applicable, add screenshots and logs to help explain your problem.
placeholder: Add screenshots here
- type: textarea
id: additional_information
attributes:
label: Additional Information
description: |
- FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
- Python Version: <!-- Specify the Python version (e.g., 3.8) -->
- Related Issues: <!-- Link to any related issues here (e.g., #1) -->
- Any other relevant information.
placeholder: Any additional details


@@ -12,7 +12,7 @@
## Checks
<!-- - I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same is integrated in our CI checks). -->
- [ ] I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same is integrated in our CI checks).
- [ ] I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.

243
.github/copilot-instructions.md vendored Normal file

@@ -0,0 +1,243 @@
# GitHub Copilot Instructions for FLAML
## Project Overview
FLAML (Fast Library for Automated Machine Learning & Tuning) is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflows built on large language models, machine learning models, etc., and optimizes their performance.
**Key Components:**
- `flaml/automl/`: AutoML functionality for classification and regression
- `flaml/tune/`: Generic hyperparameter tuning
- `flaml/default/`: Zero-shot AutoML with default configurations
- `flaml/autogen/`: Legacy autogen code (note: AutoGen has moved to a separate repository)
- `flaml/fabric/`: Microsoft Fabric integration
- `test/`: Comprehensive test suite
## Build and Test Commands
### Installation
```bash
# Basic installation
pip install -e .
# Install with test dependencies
pip install -e .[test]
# Install with automl dependencies
pip install -e .[automl]
# Install with forecast dependencies (Linux only)
pip install -e .[forecast]
```
### Running Tests
```bash
# Run all tests (excluding autogen)
pytest test/ --ignore=test/autogen --reruns 2 --reruns-delay 10
# Run tests with coverage
coverage run -a -m pytest test --ignore=test/autogen --reruns 2 --reruns-delay 10
coverage xml
# Check dependencies
python test/check_dependency.py
```
### Linting and Formatting
```bash
# Run pre-commit hooks
pre-commit run --all-files
# Format with black (line length: 120)
black . --line-length 120
# Run ruff for linting and auto-fix
ruff check . --fix
```
## Code Style and Formatting
### Python Style
- **Line length:** 120 characters (configured in both Black and Ruff)
- **Formatter:** Black (v23.3.0+)
- **Linter:** Ruff with Pyflakes and pycodestyle rules
- **Import sorting:** Use isort (via Ruff)
- **Python version:** Supports Python >= 3.10 (full support for 3.10, 3.11, 3.12 and 3.13)
### Code Quality Rules
- Follow Black formatting conventions
- Keep imports sorted and organized
- Avoid unused imports (F401) - these are flagged but not auto-fixed
- Avoid wildcard imports (F403) where possible
- Complexity: Max McCabe complexity of 10
- Use type hints where appropriate
- Write clear docstrings for public APIs
### Pre-commit Hooks
The repository uses pre-commit hooks for:
- Checking for large files, AST syntax, YAML/TOML/JSON validity
- Detecting merge conflicts and private keys
- Trailing whitespace and end-of-file fixes
- pyupgrade for Python 3.8+ syntax
- Black formatting
- Markdown formatting (mdformat with GFM and frontmatter support)
- Ruff linting with auto-fix
## Testing Strategy
### Test Organization
- Tests are in the `test/` directory, organized by module
- `test/automl/`: AutoML feature tests
- `test/tune/`: Hyperparameter tuning tests
- `test/default/`: Zero-shot AutoML tests
- `test/nlp/`: NLP-related tests
- `test/spark/`: Spark integration tests
### Test Requirements
- Write tests for new functionality
- Ensure tests pass on multiple Python versions (3.10, 3.11, 3.12 and 3.13)
- Tests should work on both Ubuntu and Windows
- Use pytest markers for platform-specific tests (e.g., `@pytest.mark.spark`)
- Tests should be idempotent and not depend on external state
- Use `--reruns 2 --reruns-delay 10` for flaky tests
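The marker conventions above can be sketched as a minimal test module (the test names and bodies here are hypothetical, for illustration only):

```python
import sys

import pytest


# Hypothetical example: mark a test as Spark-specific so it can be selected
# with `pytest -m spark` or excluded with `pytest -m "not spark"`.
@pytest.mark.spark
def test_spark_feature():
    assert 1 + 1 == 2


# Platform-gated variant: skip outside Linux, as with the prophet-based
# forecast tests mentioned elsewhere in this guide.
@pytest.mark.skipif(sys.platform != "linux", reason="Linux-only dependency")
def test_linux_only_feature():
    assert True
```

Custom markers such as `spark` should also be registered (for example under `[tool.pytest.ini_options]` in `pyproject.toml`) to avoid `PytestUnknownMarkWarning`.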
### Coverage
- Aim for good test coverage on new code
- Coverage reports are generated for Python 3.11 builds
- Coverage reports are uploaded to Codecov
## Git Workflow and Best Practices
### Branching
- Main branch: `main`
- Create feature branches from `main`
- PR reviews are required before merging
### Commit Messages
- Use clear, descriptive commit messages
- Reference issue numbers when applicable
- ALWAYS run `pre-commit run --all-files` before each commit to avoid formatting issues
### Pull Requests
- Ensure all tests pass before requesting review
- Update documentation if adding new features
- Follow the PR template in `.github/PULL_REQUEST_TEMPLATE.md`
- ALWAYS run `pre-commit run --all-files` before each commit to avoid formatting issues
## Project Structure
```
flaml/
├── automl/ # AutoML functionality
├── tune/ # Hyperparameter tuning
├── default/ # Zero-shot AutoML
├── autogen/ # Legacy autogen (deprecated, moved to separate repo)
├── fabric/ # Microsoft Fabric integration
├── onlineml/ # Online learning
└── version.py # Version information
test/ # Test suite
├── automl/
├── tune/
├── default/
├── nlp/
└── spark/
notebook/ # Example notebooks
website/ # Documentation website
```
## Dependencies and Package Management
### Core Dependencies
- NumPy >= 1.17
- Python >= 3.10 (officially supported: 3.10, 3.11, 3.12 and 3.13)
### Optional Dependencies
- `[automl]`: lightgbm, xgboost, scipy, pandas, scikit-learn
- `[test]`: Full test suite dependencies
- `[spark]`: PySpark and joblib dependencies
- `[forecast]`: holidays, prophet, statsmodels, hcrystalball, pytorch-forecasting, pytorch-lightning, tensorboardX
- `[hf]`: Hugging Face transformers and datasets
- See `setup.py` for complete list
### Version Constraints
- Be mindful of Python version-specific dependencies (check setup.py)
- XGBoost versions differ based on Python version
- NumPy 2.0+ only for Python >= 3.13
- Some features (like vowpalwabbit) only work with older Python versions
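As a minimal stdlib sketch (not FLAML's actual `setup.py` logic — `numpy_pin` and its exact bounds are assumptions for illustration), version-gated pins like those above can be selected from the running interpreter version:

```python
import sys


def numpy_pin(version_info=sys.version_info):
    """Pick a NumPy requirement string based on the Python version.

    Hypothetical helper mirroring the constraint above: NumPy 2.0+ is
    used only for Python >= 3.13; older interpreters stay on the 1.x line.
    """
    if version_info >= (3, 13):
        return "numpy>=2.0"
    return "numpy>=1.17,<2.0"


print(numpy_pin((3, 13)))  # numpy>=2.0
print(numpy_pin((3, 11)))  # numpy>=1.17,<2.0
```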
## Boundaries and Restrictions
### Do NOT Modify
- `.git/` directory and Git configuration
- `LICENSE` file
- Version information in `flaml/version.py` (unless explicitly updating version)
- GitHub Actions workflows without careful consideration
- Existing test files unless fixing bugs or adding coverage
### Be Cautious With
- `setup.py`: Changes to dependencies should be carefully reviewed
- `pyproject.toml`: Linting and testing configuration
- `.pre-commit-config.yaml`: Pre-commit hook configuration
- Backward compatibility: FLAML is a library with external users
### Security Considerations
- Never commit secrets or API keys
- Be careful with external data sources in tests
- Validate user inputs in public APIs
- Follow secure coding practices for ML operations
## Special Notes
### AutoGen Migration
- AutoGen has moved to a separate repository: https://github.com/microsoft/autogen
- The `flaml/autogen/` directory contains legacy code
- Tests in `test/autogen/` are ignored in the main test suite
- Direct users to the new AutoGen repository for AutoGen-related issues
### Platform-Specific Considerations
- Some tests only run on Linux (e.g., forecast tests with prophet)
- Windows and Ubuntu are the primary supported platforms
- macOS support exists but requires special libomp setup for lgbm/xgboost
### Performance
- FLAML focuses on efficient automation and tuning
- Consider computational cost when adding new features
- Optimize for low resource usage where possible
## Documentation
- Main documentation: https://microsoft.github.io/FLAML/
- Update documentation when adding new features
- Provide clear examples in docstrings
- Add notebook examples for significant new features
## Contributing
- Follow the contributing guide: https://microsoft.github.io/FLAML/docs/Contribute
- Sign the Microsoft CLA when making your first contribution
- Be respectful and follow the Microsoft Open Source Code of Conduct
- Join the Discord community for discussions: https://discord.gg/Cppx2vSPVP


@@ -12,26 +12,17 @@ jobs:
deploy:
strategy:
matrix:
os: ['ubuntu-latest']
python-version: [3.8]
os: ["ubuntu-latest"]
python-version: ["3.12"]
runs-on: ${{ matrix.os }}
environment: package
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Cache conda
uses: actions/cache@v3
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
path: ~/conda_pkgs_dir
key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
- name: Setup Miniconda
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
auto-activate-base: false
activate-environment: hcrystalball
python-version: ${{ matrix.python-version }}
use-only-tar-bz2: true
- name: Install from source
# This is required for the pre-commit tests
shell: pwsh
@@ -42,7 +33,7 @@ jobs:
- name: Build
shell: pwsh
run: |
pip install twine
pip install twine wheel setuptools
python setup.py sdist bdist_wheel
- name: Publish to PyPI
env:


@@ -17,6 +17,9 @@ on:
merge_group:
types: [checks_requested]
permissions:
contents: write
jobs:
checks:
if: github.event_name != 'push'
@@ -34,11 +37,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.12"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0 setuptools
- name: pydoc-markdown run
run: |
pydoc-markdown
@@ -70,11 +73,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.12"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0 setuptools
- name: pydoc-markdown run
run: |
pydoc-markdown


@@ -4,14 +4,17 @@
name: OpenAI
on:
pull_request:
branches: ['main']
paths:
- 'flaml/autogen/**'
- 'test/autogen/**'
- 'notebook/autogen_openai_completion.ipynb'
- 'notebook/autogen_chatgpt_gpt4.ipynb'
- '.github/workflows/openai.yml'
workflow_dispatch:
# pull_request:
# branches: ['main']
# paths:
# - 'flaml/autogen/**'
# - 'test/autogen/**'
# - 'notebook/autogen_openai_completion.ipynb'
# - 'notebook/autogen_chatgpt_gpt4.ipynb'
# - '.github/workflows/openai.yml'
permissions: {}
jobs:
test:


@@ -1,15 +1,14 @@
name: Code formatting
# see: https://help.github.com/en/actions/reference/events-that-trigger-workflows
on: # Trigger the workflow on push or pull request, but only for the main branch
push:
branches: [main]
on:
pull_request: {}
defaults:
run:
shell: bash
permissions: {}
jobs:
pre-commit-check:


@@ -14,9 +14,20 @@ on:
- 'setup.py'
pull_request:
branches: ['main']
paths:
- 'flaml/**'
- 'test/**'
- 'notebook/**'
- '.github/workflows/python-package.yml'
- 'setup.py'
merge_group:
types: [checks_requested]
schedule:
# Every other day at 02:00 UTC
- cron: '0 2 */2 * *'
permissions:
contents: write
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
@@ -28,20 +39,18 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-2019]
python-version: ["3.8", "3.9", "3.10"]
os: [ubuntu-latest, windows-latest]
python-version: ["3.10", "3.11", "3.12", "3.13"]
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: On mac + python 3.10, install libomp to facilitate lgbm and xgboost install
if: matrix.os == 'macOS-latest' && matrix.python-version == '3.10'
- name: On mac, install libomp to facilitate lgbm and xgboost install
if: matrix.os == 'macos-latest'
run: |
# remove libomp version constraint after xgboost works with libomp>11.1.0 on python 3.10
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew unlink libomp
brew update
brew install libomp
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
@@ -51,74 +60,82 @@ jobs:
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
- name: Install packages and dependencies
run: |
python -m pip install --upgrade pip wheel
python -m pip install --upgrade pip wheel setuptools
pip install -e .
python -c "import flaml"
pip install -e .[test]
- name: On Ubuntu python 3.8, install pyspark 3.2.3
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-latest'
- name: On Ubuntu python 3.11, install pyspark 3.5.1
if: matrix.python-version == '3.11' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.2.3
pip install pyspark==3.5.1
pip list | grep "pyspark"
- name: If linux, install ray 2
- name: On Ubuntu python 3.12, install pyspark 4.0.1
if: matrix.python-version == '3.12' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==4.0.1
pip list | grep "pyspark"
- name: On Ubuntu python 3.13, install pyspark 4.1.0
if: matrix.python-version == '3.13' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==4.1.0
pip list | grep "pyspark"
# # TODO: support ray
# - name: If linux and python<3.11, install ray 2
# if: matrix.os == 'ubuntu-latest' && matrix.python-version < '3.11'
# run: |
# pip install "ray[tune]<2.5.0"
- name: Install prophet when on linux
if: matrix.os == 'ubuntu-latest'
run: |
pip install "ray[tune]<2.5.0"
- name: If mac, install ray
if: matrix.os == 'macOS-latest'
run: |
pip install -e .[ray]
- name: If linux or mac, install prophet on python < 3.9
if: (matrix.os == 'macOS-latest' || matrix.os == 'ubuntu-latest') && matrix.python-version != '3.9' && matrix.python-version != '3.10'
run: |
pip install -e .[forecast]
- name: Install vw on python < 3.10
if: matrix.python-version != '3.10'
# TODO: support vw for python 3.10+
- name: If linux and python<3.10, install vw
if: matrix.os == 'ubuntu-latest' && matrix.python-version < '3.10'
run: |
pip install -e .[vw]
- name: Uninstall pyspark on (python 3.9) or (python 3.8 + windows)
if: matrix.python-version == '3.9' || (matrix.python-version == '3.8' && matrix.os == 'windows-2019')
- name: Pip freeze
run: |
# Uninstall pyspark to test env without pyspark
pip uninstall -y pyspark
pip freeze
- name: Check dependencies
run: |
python test/check_dependency.py
- name: Clear pip cache
run: |
pip cache purge
- name: Test with pytest
if: matrix.python-version != '3.10'
timeout-minutes: 120
if: matrix.python-version != '3.11'
run: |
pytest test
pytest test/ --ignore=test/autogen --reruns 2 --reruns-delay 10
- name: Coverage
if: matrix.python-version == '3.10'
timeout-minutes: 120
if: matrix.python-version == '3.11'
run: |
pip install coverage
coverage run -a -m pytest test
coverage run -a -m pytest test --ignore=test/autogen --reruns 2 --reruns-delay 10
coverage xml
- name: Upload coverage to Codecov
if: matrix.python-version == '3.10'
if: matrix.python-version == '3.11'
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests
- name: Save dependencies
if: github.ref == 'refs/heads/main'
shell: bash
run: |
git config --global user.name 'github-actions[bot]'
git config --global user.email 'github-actions[bot]@users.noreply.github.com'
git config advice.addIgnoredFile false
# docs:
BRANCH=unit-tests-installed-dependencies
git fetch origin
git checkout -B "$BRANCH" "origin/$BRANCH"
# runs-on: ubuntu-latest
# steps:
# - uses: actions/checkout@v3
# - name: Setup Python
# uses: actions/setup-python@v4
# with:
# python-version: '3.8'
# - name: Compile documentation
# run: |
# pip install -e .
# python -m pip install sphinx sphinx_rtd_theme
# cd docs
# make html
# - name: Deploy to GitHub pages
# if: ${{ github.ref == 'refs/heads/main' }}
# uses: JamesIves/github-pages-deploy-action@3.6.2
# with:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# BRANCH: gh-pages
# FOLDER: docs/_build/html
# CLEAN: true
pip freeze > installed_all_dependencies_${{ matrix.python-version }}_${{ matrix.os }}.txt
python test/check_dependency.py > installed_first_tier_dependencies_${{ matrix.python-version }}_${{ matrix.os }}.txt
git add installed_*dependencies*.txt
mv coverage.xml ./coverage_${{ matrix.python-version }}_${{ matrix.os }}.xml || true
git add -f ./coverage_${{ matrix.python-version }}_${{ matrix.os }}.xml || true
git commit -m "Update installed dependencies for Python ${{ matrix.python-version }} on ${{ matrix.os }}" || exit 0
git push origin "$BRANCH" --force

24
.gitignore vendored

@@ -60,6 +60,7 @@ coverage.xml
.hypothesis/
.pytest_cache/
cover/
junit
# Translations
*.mo
@@ -163,5 +164,28 @@ output/
flaml/tune/spark/mylearner.py
*.pkl
data/
benchmark/pmlb/csv_datasets
benchmark/*.csv
checkpoints/
test/default
test/housing.json
test/nlp/default/transformer_ms/seq-classification.json
flaml/fabric/fanova/*fanova.c
# local config files
*.config.local
local_debug/
patch.diff
# Test things
notebook/lightning_logs/
lightning_logs/
flaml/autogen/extensions/tmp/
test/autogen/my_tmp/
catboost_*
# Internal configs
.pypirc


@@ -22,10 +22,28 @@ repos:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: no-commit-to-branch
- repo: https://github.com/asottile/pyupgrade
rev: v2.31.1
hooks:
- id: pyupgrade
args: [--py38-plus]
name: Upgrade code
- repo: https://github.com/psf/black
rev: 23.3.0
hooks:
- id: black
- repo: https://github.com/executablebooks/mdformat
rev: 0.7.22
hooks:
- id: mdformat
additional_dependencies:
- mdformat-gfm
- mdformat-black
- mdformat_frontmatter
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.0.261
hooks:


@@ -1,5 +1,5 @@
# basic setup
FROM python:3.7
FROM mcr.microsoft.com/devcontainers/python:3.10
RUN apt-get update && apt-get -y update
RUN apt-get install -y sudo git npm

371
NOTICE.md

@@ -1,221 +1,222 @@
NOTICES
# NOTICES
This repository incorporates material as listed below or described in the code.
#
## Component. Ray.
Code in tune/[analysis.py, sample.py, trial.py, result.py],
searcher/[suggestion.py, variant_generator.py], and scheduler/trial_scheduler.py is adapted from
https://github.com/ray-project/ray/blob/master/python/ray/tune/
## Open Source License/Copyright Notice.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
1. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
1. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
1. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
Code in python/ray/rllib/{evolution_strategies, dqn} adapted from
https://github.com/openai (MIT License)
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------
Code in python/ray/rllib/impala/vtrace.py from
https://github.com/deepmind/scalable_agent
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
Code in python/ray/rllib/ars is adapted from https://github.com/modestyachts/ARS
Copyright (c) 2018, ARS contributors (Horia Mania, Aurelia Guy, Benjamin Recht)
Redistribution and use of ARS in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
------------------
Code in python/ray/_private/prometheus_exporter.py is adapted from https://github.com/census-instrumentation/opencensus-python/blob/master/contrib/opencensus-ext-prometheus/opencensus/ext/prometheus/stats_exporter/__init__.py

--------------------------------------------------------------------------------

[![PyPI version](https://badge.fury.io/py/FLAML.svg)](https://badge.fury.io/py/FLAML)
![Conda version](https://img.shields.io/conda/vn/conda-forge/flaml)
[![Build](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml/badge.svg)](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/FLAML)](https://pypi.org/project/FLAML/)
[![Downloads](https://pepy.tech/badge/flaml)](https://pepy.tech/project/flaml)
[![](https://img.shields.io/discord/1025786666260111483?logo=discord&style=flat)](https://discord.gg/Cppx2vSPVP)
<!-- [![Join the chat at https://gitter.im/FLAMLer/community](https://badges.gitter.im/FLAMLer/community.svg)](https://gitter.im/FLAMLer/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) -->
# A Fast Library for Automated Machine Learning & Tuning
<br>
</p>
:fire: FLAML is highlighted in OpenAI's [cookbook](https://github.com/openai/openai-cookbook#related-resources-from-around-the-web).
:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 and 3.12 support, along with a range of new estimators, and comprehensive integration with MLflow—thanks to contributions from the Microsoft Fabric product team.
:fire: Heads-up: [AutoGen](https://microsoft.github.io/autogen/) has moved to a dedicated [GitHub repository](https://github.com/microsoft/autogen). FLAML no longer includes the `autogen` module—please use AutoGen directly.
## What is FLAML
FLAML is a lightweight Python library for efficient automation of machine
learning and AI operations. It automates workflows based on large language models,
machine learning models, and more, and optimizes their performance.
- FLAML enables economical automation and tuning for ML/AI workflows, including model selection and hyperparameter optimization under resource constraints.
- For common machine learning tasks like classification and regression, it quickly finds quality models for user-provided data with low computational resources. It is easy to customize or extend, offering a smooth range of customization levels.
- It supports fast and economical automatic tuning (e.g., inference hyperparameters for foundation models, configurations in MLOps/LMOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations), capable of handling large search space with heterogeneous evaluation cost and complex constraints/guidance/early stopping.
FLAML is powered by a series of [research studies](https://microsoft.github.io/FLAML/docs/Research/) from Microsoft Research and collaborators such as Penn State University, Stevens Institute of Technology, University of Washington, and University of Waterloo.
FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source, cross-platform machine learning framework for .NET.
## Installation
The latest version of FLAML requires **Python >= 3.10 and < 3.14**. While other Python versions may work for core components, full model support is not guaranteed. FLAML can be installed via `pip`:
```bash
pip install flaml
```
Minimal dependencies are installed without extra options. You can install extra options based on the feature you need. For example, use the following to install the dependencies needed by the [`automl`](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML) module.
```bash
pip install "flaml[automl]"
```
Find more options in [Installation](https://microsoft.github.io/FLAML/docs/Installation).
## Quickstart
- With three lines of code, you can start using this economical and fast
AutoML engine as a [scikit-learn style estimator](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML).
```python
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
```
- You can restrict the learners and use FLAML as a fast hyperparameter tuning
tool for XGBoost, LightGBM, Random Forest etc. or a [customized learner](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#estimator-and-search-space).
```python
automl.fit(X_train, y_train, task="classification", estimator_list=["lgbm"])
```
- You can also run generic hyperparameter tuning for a [custom function](https://microsoft.github.io/FLAML/docs/Use-Cases/Tune-User-Defined-Function).
```python
from flaml import tune
tune.run(
evaluation_function, config={}, low_cost_partial_config={}, time_budget_s=3600
)
```
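As a rough illustration of what `tune.run` automates, here is a hypothetical `evaluation_function` (the function body and metric key are assumptions for illustration, not FLAML's API contract) optimized by a naive random search in plain Python:

```python
import random

# Hypothetical objective in the shape a tuner expects:
# it takes a config dict and returns a metric dict.
def evaluation_function(config):
    x = config["x"]
    # A simple quadratic with its maximum at x = 3.
    return {"score": -((x - 3) ** 2)}

# Naive random search over x in [0, 10] -- a stand-in for the
# cost-aware search strategies that tune.run applies.
random.seed(0)
best = max(
    (evaluation_function({"x": random.uniform(0, 10)}) for _ in range(200)),
    key=lambda r: r["score"],
)
assert best["score"] > -0.25  # some sample landed near x = 3
```

FLAML's searchers (e.g. CFO, BlendSearch) replace this blind loop with economical search that accounts for evaluation cost and budgets.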
- [Zero-shot AutoML](https://microsoft.github.io/FLAML/docs/Use-Cases/Zero-Shot-AutoML) allows using the existing training API from lightgbm, xgboost etc. while getting the benefit of AutoML in choosing high-performance hyperparameter configurations per task.
```python
from flaml.default import LGBMRegressor
```
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Contributors Wall
<a href="https://github.com/microsoft/flaml/graphs/contributors">
<img src="https://contrib.rocks/image?repo=microsoft/flaml&max=204" />
</a>

--------------------------------------------------------------------------------

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](<https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)>), please report it to us as described below.
## Reporting Security Issues
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
- Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
- Full paths of source file(s) related to the manifestation of the issue
- The location of the affected source code (tag/branch/commit or direct URL)
- Any special configuration required to reproduce the issue
- Step-by-step instructions to reproduce the issue
- Proof-of-concept or exploit code (if possible)
- Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.

--------------------------------------------------------------------------------

import logging
import warnings

try:
    from flaml.automl import AutoML, logger_formatter

    has_automl = True
except ImportError:
    has_automl = False

from flaml.onlineml.autovw import AutoVW
from flaml.tune.searcher import CFO, FLOW2, BlendSearch, BlendSearchTuner, RandomSearch
from flaml.version import __version__

# Set the root logger.
logger = logging.getLogger(__name__)
if logger.level == logging.NOTSET:
    logger.setLevel(logging.INFO)

if not has_automl:
    warnings.warn("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")
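The guarded import above is a standard optional-dependency pattern; a generic, self-contained sketch of the same idea (the module name below is deliberately fake, standing in for `flaml.automl`):

```python
import warnings

# Probe for an optional dependency; fall back gracefully when the
# corresponding extra is not installed (fake module name for illustration).
try:
    import flaml_fake_optional_extra  # noqa: F401 -- stand-in for flaml.automl
    has_extra = True
except ImportError:
    has_extra = False

if not has_extra:
    warnings.warn("optional extra is not available; install it to enable the feature.")

print(has_extra)  # False when the extra is missing
```

This lets the top-level package import cleanly while features backed by heavy extras stay disabled until installed.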

--------------------------------------------------------------------------------

import warnings

from .agentchat import *
from .code_utils import DEFAULT_MODEL, FAST_MODEL
from .oai import *

warnings.warn(
    "The `flaml.autogen` module is deprecated and will be removed in a future release. "
    "Please refer to `https://github.com/microsoft/autogen` for latest usage.",
    DeprecationWarning,
    stacklevel=2,
)
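Note that `DeprecationWarning` is filtered out by default in many contexts; a small sketch of how a caller can surface a warning like the one above (the helper function name is hypothetical):

```python
import warnings

def emit_deprecation_notice():
    # Same pattern as the module-level warning above.
    warnings.warn(
        "this module is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )

# Record warnings explicitly so the (normally hidden) notice is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    emit_deprecation_notice()

print(caught[0].category is DeprecationWarning)  # True
```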

--------------------------------------------------------------------------------

from .agent import Agent
from .responsive_agent import ResponsiveAgent
from .assistant_agent import AssistantAgent
from .conversable_agent import ConversableAgent
from .groupchat import GroupChat, GroupChatManager
from .user_proxy_agent import UserProxyAgent
__all__ = [
"Agent",
"ResponsiveAgent",
"ConversableAgent",
"AssistantAgent",
"UserProxyAgent",
"GroupChat",

--------------------------------------------------------------------------------

class Agent:
return self._name
def send(self, message: Union[Dict, str], recipient: "Agent", request_reply: Optional[bool] = None):
"""(Abstract method) Send a message to another agent."""
async def a_send(self, message: Union[Dict, str], recipient: "Agent", request_reply: Optional[bool] = None):
"""(Abstract async method) Send a message to another agent."""
def receive(self, message: Union[Dict, str], sender: "Agent", request_reply: Optional[bool] = None):
"""(Abstract method) Receive a message from another agent."""
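The abstract `send`/`receive` surface excerpted above can be sketched with `abc`; the `EchoAgent` subclass is purely hypothetical, added only to show how the two methods pair up:

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Union

class Agent(ABC):
    """Minimal sketch of the abstract agent interface shown above."""

    def __init__(self, name: str):
        self._name = name

    @property
    def name(self) -> str:
        return self._name

    @abstractmethod
    def send(self, message: Union[Dict, str], recipient: "Agent", request_reply: Optional[bool] = None):
        """(Abstract method) Send a message to another agent."""

    @abstractmethod
    def receive(self, message: Union[Dict, str], sender: "Agent", request_reply: Optional[bool] = None):
        """(Abstract method) Receive a message from another agent."""

# Hypothetical concrete subclass for illustration only.
class EchoAgent(Agent):
    def __init__(self, name):
        super().__init__(name)
        self.inbox = []

    def send(self, message, recipient, request_reply=None):
        recipient.receive(message, self, request_reply)

    def receive(self, message, sender, request_reply=None):
        self.inbox.append((sender.name, message))

a, b = EchoAgent("a"), EchoAgent("b")
a.send("hello", b)
print(b.inbox)  # [('a', 'hello')]
```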

--------------------------------------------------------------------------------

from typing import Callable, Dict, Optional, Union
from .conversable_agent import ConversableAgent
class AssistantAgent(ConversableAgent):
"""(In preview) Assistant agent, designed to solve tasks with LLM.
AssistantAgent is a subclass of ConversableAgent configured with a default system message.
The default system message is designed to solve tasks with LLM,
including suggesting Python code blocks and debugging.
`human_input_mode` defaults to "NEVER"
and `code_execution_config` defaults to False.
This agent doesn't execute code by default and expects the user to execute the code.
"""
DEFAULT_SYSTEM_MESSAGE = """You are a helpful AI assistant.
Solve tasks using your coding and language skills.
In the following cases, suggest Python code (in a Python coding block) or shell script (in an sh coding block) for the user to execute.
1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Solve the task step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
When using code, you must indicate the script type in the code block. The user cannot provide any other feedback or perform any other action beyond executing the code you suggest. The user can't modify your code. So do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user.
If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. Don't include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use 'print' function for the output when relevant. Check the execution result returned by the user.
If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. Don't include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use the 'print' function for the output when relevant. Check the execution result returned by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible.
Reply "TERMINATE" in the end when everything is done.
@@ -35,24 +36,24 @@ Reply "TERMINATE" in the end when everything is done.
max_consecutive_auto_reply: Optional[int] = None,
human_input_mode: Optional[str] = "NEVER",
code_execution_config: Optional[Union[Dict, bool]] = False,
**kwargs,
**kwargs: Dict,
):
"""
Args:
name (str): agent name.
system_message (str): system message for the ChatCompletion inference.
Please override this attribute if you want to reprogram the agent.
llm_config (dict): llm inference configuration.
Please refer to [autogen.Completion.create](/docs/reference/autogen/oai/completion#create)
name (str): Agent name.
system_message (Optional[str]): System message for the ChatCompletion inference.
Override this attribute if you want to reprogram the agent.
llm_config (Optional[Union[Dict, bool]]): LLM inference configuration.
Refer to [autogen.Completion.create](/docs/reference/autogen/oai/completion#create)
for available options.
is_termination_msg (function): a function that takes a message in the form of a dictionary
is_termination_msg (Optional[Callable[[Dict], bool]]): A function that takes a message in the form of a dictionary
and returns a boolean value indicating if this received message is a termination message.
The dict can contain the following keys: "content", "role", "name", "function_call".
max_consecutive_auto_reply (int): the maximum number of consecutive auto replies.
default to None (no limit provided, class attribute MAX_CONSECUTIVE_AUTO_REPLY will be used as the limit in this case).
max_consecutive_auto_reply (Optional[int]): The maximum number of consecutive auto replies.
Defaults to None (no limit provided, class attribute MAX_CONSECUTIVE_AUTO_REPLY will be used as the limit in this case).
The limit only plays a role when human_input_mode is not "ALWAYS".
**kwargs (dict): Please refer to other kwargs in
[ResponsiveAgent](responsive_agent#__init__).
**kwargs (Dict): Additional keyword arguments. Refer to other kwargs in
[ConversableAgent](conversable_agent#__init__).
"""
super().__init__(
name,
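The assistant_agent diff above keeps the pattern of a subclass that only overrides defaults (`human_input_mode="NEVER"`, `code_execution_config=False`) and forwards everything else to the parent. A minimal sketch of that pattern — class names here are illustrative stand-ins, not the real `flaml.autogen` classes:

```python
class BaseAgentSketch:
    """Stand-in for a generic conversable agent with permissive defaults."""

    def __init__(self, name, human_input_mode="ALWAYS", code_execution_config=True):
        self.name = name
        self.human_input_mode = human_input_mode
        self.code_execution_config = code_execution_config


class AssistantSketch(BaseAgentSketch):
    """Mirrors the diff: flip the defaults, pass extra kwargs through."""

    def __init__(self, name, human_input_mode="NEVER", code_execution_config=False, **kwargs):
        super().__init__(
            name,
            human_input_mode=human_input_mode,
            code_execution_config=code_execution_config,
            **kwargs,
        )


assistant = AssistantSketch("helper")
```

The subclass carries no behavior of its own; callers can still override either default explicitly.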


@@ -1,14 +1,14 @@
import re
import os
from pydantic import BaseModel, Extra, root_validator
from typing import Any, Callable, Dict, List, Optional, Union
import re
from time import sleep
from typing import Any, Callable, Dict, List, Optional, Union
from pydantic import BaseModel, Extra, root_validator
from flaml.autogen.agentchat import Agent, UserProxyAgent
from flaml.autogen.code_utils import UNKNOWN, extract_code, execute_code, infer_lang
from flaml.autogen.code_utils import UNKNOWN, execute_code, extract_code, infer_lang
from flaml.autogen.math_utils import get_answer
PROMPTS = {
# default
"default": """Let's use Python to solve a math problem.
@@ -156,7 +156,7 @@ class MathUserProxyAgent(UserProxyAgent):
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,
@@ -165,7 +165,7 @@ class MathUserProxyAgent(UserProxyAgent):
default_auto_reply=default_auto_reply,
**kwargs,
)
self.register_auto_reply([Agent, None], MathUserProxyAgent._generate_math_reply, 1)
self.register_reply([Agent, None], MathUserProxyAgent._generate_math_reply, 1)
# fixed var
self._max_invalid_q_per_step = max_invalid_q_per_step


@@ -1,6 +1,7 @@
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from flaml.autogen.agentchat.agent import Agent
from flaml.autogen.agentchat.assistant_agent import AssistantAgent
from typing import Callable, Dict, Optional, Union, List, Tuple, Any
class RetrieveAssistantAgent(AssistantAgent):
@@ -16,7 +17,7 @@ class RetrieveAssistantAgent(AssistantAgent):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.register_auto_reply(Agent, RetrieveAssistantAgent._generate_retrieve_assistant_reply)
self.register_reply(Agent, RetrieveAssistantAgent._generate_retrieve_assistant_reply)
def _generate_retrieve_assistant_reply(
self,


@@ -1,12 +1,13 @@
import chromadb
from flaml.autogen.agentchat.agent import Agent
from flaml.autogen.agentchat import UserProxyAgent
from flaml.autogen.retrieve_utils import create_vector_db_from_dir, query_vector_db, num_tokens_from_text
from flaml.autogen.code_utils import extract_code
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from typing import Callable, Dict, Optional, Union, List, Tuple, Any
import chromadb
from IPython import get_ipython
from flaml.autogen.agentchat import UserProxyAgent
from flaml.autogen.agentchat.agent import Agent
from flaml.autogen.code_utils import extract_code
from flaml.autogen.retrieve_utils import create_vector_db_from_dir, num_tokens_from_text, query_vector_db
try:
from termcolor import colored
except ImportError:
@@ -122,7 +123,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
- customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,
@@ -148,7 +149,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
self._ipython = get_ipython()
self._doc_idx = -1 # the index of the current used doc
self._results = {} # the results of the current query
self.register_auto_reply(Agent, RetrieveUserProxyAgent._generate_retrieve_user_reply)
self.register_reply(Agent, RetrieveUserProxyAgent._generate_retrieve_user_reply)
@staticmethod
def get_max_tokens(model="gpt-3.5-turbo"):


@@ -1,10 +1,10 @@
import asyncio
from collections import defaultdict
import copy
import json
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union
from flaml.autogen import oai
from .agent import Agent
from flaml.autogen.code_utils import (
DEFAULT_MODEL,
UNKNOWN,
@@ -13,6 +13,8 @@ from flaml.autogen.code_utils import (
infer_lang,
)
from .agent import Agent
try:
from termcolor import colored
except ImportError:
@@ -21,11 +23,11 @@ except ImportError:
return x
class ResponsiveAgent(Agent):
"""(Experimental) A class for generic responsive agents which can be configured as assistant or user proxy.
class ConversableAgent(Agent):
"""(In preview) A class for generic conversable agents which can be configured as assistant or user proxy.
After receiving each message, the agent will send a reply to the sender unless the msg is a termination msg.
For example, AssistantAgent and UserProxyAgent are subclasses of ResponsiveAgent,
For example, AssistantAgent and UserProxyAgent are subclasses of this class,
configured with different default settings.
To modify auto reply, override `generate_reply` method.
@@ -119,12 +121,12 @@ class ResponsiveAgent(Agent):
self._default_auto_reply = default_auto_reply
self._reply_func_list = []
self.reply_at_receive = defaultdict(bool)
self.register_auto_reply([Agent, None], ResponsiveAgent.generate_oai_reply)
self.register_auto_reply([Agent, None], ResponsiveAgent.generate_code_execution_reply)
self.register_auto_reply([Agent, None], ResponsiveAgent.generate_function_call_reply)
self.register_auto_reply([Agent, None], ResponsiveAgent.check_termination_and_human_reply)
self.register_reply([Agent, None], ConversableAgent.generate_oai_reply)
self.register_reply([Agent, None], ConversableAgent.generate_code_execution_reply)
self.register_reply([Agent, None], ConversableAgent.generate_function_call_reply)
self.register_reply([Agent, None], ConversableAgent.check_termination_and_human_reply)
def register_auto_reply(
def register_reply(
self,
trigger: Union[Type[Agent], str, Agent, Callable[[Agent], bool], List],
reply_func: Callable,
@@ -151,7 +153,7 @@ class ResponsiveAgent(Agent):
The function takes a recipient agent, a list of messages, a sender agent and a config as input and returns a reply message.
```python
def reply_func(
recipient: ResponsiveAgent,
recipient: ConversableAgent,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
@@ -499,7 +501,7 @@ class ResponsiveAgent(Agent):
def initiate_chat(
self,
recipient: "ResponsiveAgent",
recipient: "ConversableAgent",
clear_history: Optional[bool] = True,
silent: Optional[bool] = False,
**context,
@@ -522,7 +524,7 @@ class ResponsiveAgent(Agent):
async def a_initiate_chat(
self,
recipient: "ResponsiveAgent",
recipient: "ConversableAgent",
clear_history: Optional[bool] = True,
silent: Optional[bool] = False,
**context,
@@ -611,7 +613,7 @@ class ResponsiveAgent(Agent):
if messages is None:
messages = self._oai_messages[sender]
last_n_messages = code_execution_config.pop("last_n_messages", 1)
for i in range(last_n_messages):
for i in range(min(len(messages), last_n_messages)):
message = messages[-(i + 1)]
code_blocks = extract_code(message["content"])
if len(code_blocks) == 1 and code_blocks[0][0] == UNKNOWN:
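The conversable_agent hunks above rename `register_auto_reply` to `register_reply` and bound the code-execution scan with `min(len(messages), last_n_messages)`. A stdlib-only sketch of both ideas, assuming (as the registration order in the diff suggests) that reply functions registered later are tried first; `MiniAgent` and its methods are hypothetical stand-ins, not the real `ConversableAgent` API:

```python
class MiniAgent:
    """Hypothetical stand-in for the reply-registration pattern."""

    def __init__(self):
        self._reply_func_list = []

    def register_reply(self, reply_func):
        # Later registrations take precedence: insert at the front.
        self._reply_func_list.insert(0, reply_func)

    def generate_reply(self, messages):
        # Each func returns (final, reply); the first final reply wins.
        for func in self._reply_func_list:
            final, reply = func(messages)
            if final:
                return reply
        return None


def recent_messages(messages, last_n_messages=1):
    # Guard against scanning more messages than exist, as in the diff's fix.
    return [messages[-(i + 1)] for i in range(min(len(messages), last_n_messages))]


agent = MiniAgent()
agent.register_reply(lambda msgs: (True, "oai reply"))
agent.register_reply(lambda msgs: (False, None))  # tried first, passes through
```

Without the `min(...)` guard, `messages[-(i + 1)]` would raise `IndexError` whenever `last_n_messages` exceeds the history length.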


@@ -1,8 +1,9 @@
from dataclasses import dataclass
import sys
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
from .agent import Agent
from .responsive_agent import ResponsiveAgent
from .conversable_agent import ConversableAgent
@dataclass
@@ -39,7 +40,7 @@ class GroupChat:
Read the following conversation.
Then select the next role from {self.agent_names} to play. Only return the role."""
def select_speaker(self, last_speaker: Agent, selector: ResponsiveAgent):
def select_speaker(self, last_speaker: Agent, selector: ConversableAgent):
"""Select the next speaker."""
selector.update_system_message(self.select_speaker_msg())
final, name = selector.generate_oai_reply(
@@ -63,7 +64,7 @@ Then select the next role from {self.agent_names} to play. Only return the role.
return "\n".join([f"{agent.name}: {agent.system_message}" for agent in self.agents])
class GroupChatManager(ResponsiveAgent):
class GroupChatManager(ConversableAgent):
"""(In preview) A chat manager agent that can manage a group chat of multiple agents."""
def __init__(
@@ -84,7 +85,7 @@ class GroupChatManager(ResponsiveAgent):
system_message=system_message,
**kwargs,
)
self.register_auto_reply(Agent, GroupChatManager.run_chat, config=groupchat, reset_config=GroupChat.reset)
self.register_reply(Agent, GroupChatManager.run_chat, config=groupchat, reset_config=GroupChat.reset)
# self._random = random.Random(seed)
def run_chat(
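`select_speaker` above builds a selection prompt from the agent names and asks the selector LLM to return a role. A sketch of the surrounding plumbing, where `resolve_speaker` is a hypothetical name-matching helper (the real class delegates the choice to `generate_oai_reply`):

```python
def select_speaker_msg(agent_names):
    # Mirrors the prompt shown in the diff.
    return (
        "Read the following conversation.\n"
        f"Then select the next role from {agent_names} to play. Only return the role."
    )


def resolve_speaker(reply_text, agent_names):
    # Hypothetical helper: map the model's free-text reply to a known role name.
    for name in agent_names:
        if name in reply_text:
            return name
    return None


names = ["Engineer", "Critic"]
prompt = select_speaker_msg(names)
```

The name-matching fallback matters because an LLM reply may wrap the role in extra words ("I pick Critic") rather than returning it verbatim.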


@@ -1,14 +1,15 @@
from .responsive_agent import ResponsiveAgent
from typing import Callable, Dict, Optional, Union
from .conversable_agent import ConversableAgent
class UserProxyAgent(ResponsiveAgent):
class UserProxyAgent(ConversableAgent):
"""(In preview) A proxy agent for the user, that can execute code and provide feedback to the other agents.
UserProxyAgent is a subclass of ResponsiveAgent configured with `human_input_mode` to ALWAYS
UserProxyAgent is a subclass of ConversableAgent configured with `human_input_mode` to ALWAYS
and `llm_config` to False. By default, the agent will prompt for human input every time a message is received.
Code execution is enabled by default. LLM-based auto reply is disabled by default.
To modify auto reply, register a method with (`register_auto_reply`)[responsive_agent#register_auto_reply].
To modify auto reply, register a method with (`register_reply`)[conversable_agent#register_reply].
To modify the way to get human input, override `get_human_input` method.
To modify the way to execute code blocks, single code block, or function call, override `execute_code_blocks`,
`run_code`, and `execute_function` methods respectively.


@@ -1,13 +1,14 @@
import logging
import os
import pathlib
import re
import signal
import subprocess
import sys
import os
import pathlib
from typing import List, Dict, Tuple, Optional, Union, Callable
import re
import time
from hashlib import md5
import logging
from typing import Callable, Dict, List, Optional, Tuple, Union
from flaml.autogen import oai
try:
@@ -124,7 +125,7 @@ def improve_function(file_name, func_name, objective, **config):
"""(work in progress) Improve the function to achieve the objective."""
params = {**_IMPROVE_FUNCTION_CONFIG, **config}
# read the entire file into a str
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
response = oai.Completion.create(
{"func_name": func_name, "objective": objective, "file_string": file_string}, **params
@@ -157,7 +158,7 @@ def improve_code(files, objective, suggest_only=True, **config):
code = ""
for file_name in files:
# read the entire file into a string
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
code += f"""{file_name}:
{file_string}


@@ -1,5 +1,6 @@
from typing import Optional
from flaml.autogen import oai, DEFAULT_MODEL
from flaml.autogen import DEFAULT_MODEL, oai
_MATH_PROMPT = "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."
_MATH_CONFIG = {
@@ -129,7 +130,7 @@ def _fix_a_slash_b(string: str) -> str:
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
assert string == f"{a}/{b}"
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:


@@ -1,10 +1,10 @@
from flaml.autogen.oai.completion import Completion, ChatCompletion
from flaml.autogen.oai.completion import ChatCompletion, Completion
from flaml.autogen.oai.openai_utils import (
get_config_list,
config_list_from_json,
config_list_from_models,
config_list_gpt4_gpt35,
config_list_openai_aoai,
config_list_from_models,
config_list_from_json,
get_config_list,
)
__all__ = [


@@ -1,28 +1,31 @@
from time import sleep
import logging
import time
from typing import List, Optional, Dict, Callable, Union
import sys
import shutil
import sys
import time
from time import sleep
from typing import Callable, Dict, List, Optional, Union
import numpy as np
from flaml import tune, BlendSearch
from flaml.tune.space import is_constant
from flaml import BlendSearch, tune
from flaml.automl.logger import logger_formatter
from flaml.tune.space import is_constant
from .openai_utils import get_key
try:
import openai
from openai.error import (
ServiceUnavailableError,
RateLimitError,
APIError,
InvalidRequestError,
APIConnectionError,
Timeout,
AuthenticationError,
)
from openai import Completion as openai_Completion
import diskcache
import openai
from openai import Completion as openai_Completion
from openai.error import (
APIConnectionError,
APIError,
AuthenticationError,
InvalidRequestError,
RateLimitError,
ServiceUnavailableError,
Timeout,
)
ERROR = None
except ImportError:
@@ -697,7 +700,7 @@ class Completion(openai_Completion):
E.g., `prompt="Complete the following sentence: {prefix}, context={"prefix": "Today I feel"}`.
The actual prompt will be:
"Complete the following sentence: Today I feel".
More examples can be found at [templating](/docs/Use-Cases/Autogen#templating).
More examples can be found at [templating](https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference#templating).
use_cache (bool, Optional): Whether to use cached responses.
config_list (List, Optional): List of configurations for the completion to try.
The first one that does not raise an error will be used.
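The templating described in the `Completion.create` docstring above is plain `str.format` substitution of context fields into the prompt template. A minimal sketch (`apply_template` is an illustrative name, not the library API):

```python
def apply_template(prompt: str, context: dict) -> str:
    # Substitute context fields into the {field}-style prompt template.
    return prompt.format(**context)


filled = apply_template(
    "Complete the following sentence: {prefix}",
    {"prefix": "Today I feel"},
)
print(filled)  # Complete the following sentence: Today I feel
```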


@@ -1,7 +1,7 @@
import os
import json
from typing import List, Optional, Dict, Set, Union
import logging
import os
from typing import Dict, List, Optional, Set, Union
NON_CACHE_KEY = ["api_key", "api_base", "api_type", "api_version"]


@@ -1,13 +1,14 @@
from typing import List, Union, Dict, Tuple
import os
import requests
from urllib.parse import urlparse
import glob
import tiktoken
import chromadb
from chromadb.api import API
import chromadb.utils.embedding_functions as ef
import logging
import os
from typing import Dict, List, Tuple, Union
from urllib.parse import urlparse
import chromadb
import chromadb.utils.embedding_functions as ef
import requests
import tiktoken
from chromadb.api import API
logger = logging.getLogger(__name__)
TEXT_FORMATS = ["txt", "json", "csv", "tsv", "md", "html", "htm", "rtf", "rst", "jsonl", "log", "xml", "yaml", "yml"]
@@ -125,7 +126,7 @@ def split_files_to_chunks(
"""Split a list of files into chunks of max_tokens."""
chunks = []
for file in files:
with open(file, "r") as f:
with open(file) as f:
text = f.read()
chunks += split_text_to_chunks(text, max_tokens, chunk_mode, must_break_at_empty_line)
return chunks


@@ -1,5 +1,9 @@
from flaml.automl.automl import AutoML, size
from flaml.automl.logger import logger_formatter
from flaml.automl.state import SearchState, AutoMLState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
try:
from flaml.automl.automl import AutoML, size
from flaml.automl.state import AutoMLState, SearchState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
except ImportError:
__all__ = ["logger_formatter"]

File diff suppressed because it is too large.


@@ -0,0 +1 @@
from .histgb import HistGradientBoostingEstimator


@@ -0,0 +1,75 @@
try:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
except ImportError as e:
print(f"scikit-learn is required for HistGradientBoostingEstimator. Please install it; error: {e}")
from flaml import tune
from flaml.automl.model import SKLearnEstimator
from flaml.automl.task import Task
class HistGradientBoostingEstimator(SKLearnEstimator):
"""The class for tuning Histogram Gradient Boosting."""
ITER_HP = "max_iter"
HAS_CALLBACK = False
DEFAULT_ITER = 100
@classmethod
def search_space(cls, data_size: int, task, **params) -> dict:
upper = max(5, min(32768, int(data_size[0]))) # upper must be larger than lower
return {
"n_estimators": {
"domain": tune.lograndint(lower=4, upper=upper),
"init_value": 4,
"low_cost_init_value": 4,
},
"max_leaves": {
"domain": tune.lograndint(lower=4, upper=upper),
"init_value": 4,
"low_cost_init_value": 4,
},
"min_samples_leaf": {
"domain": tune.lograndint(lower=2, upper=2**7 + 1),
"init_value": 20,
},
"learning_rate": {
"domain": tune.loguniform(lower=1 / 1024, upper=1.0),
"init_value": 0.1,
},
"log_max_bin": { # log transformed with base 2, <= 256
"domain": tune.lograndint(lower=3, upper=9),
"init_value": 8,
},
"l2_regularization": {
"domain": tune.loguniform(lower=1 / 1024, upper=1024),
"init_value": 1.0,
},
}
def config2params(self, config: dict) -> dict:
params = super().config2params(config)
if "log_max_bin" in params:
params["max_bins"] = (1 << params.pop("log_max_bin")) - 1
if "max_leaves" in params:
params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))
if "n_estimators" in params:
params["max_iter"] = params.get("max_iter", params.pop("n_estimators"))
if "random_state" not in params:
params["random_state"] = 24092023
if "n_jobs" in params:
params.pop("n_jobs")
return params
def __init__(
self,
task: Task,
**config,
):
super().__init__(task, **config)
self.params["verbose"] = 0
if self._task.is_classification():
self.estimator_class = HistGradientBoostingClassifier
else:
self.estimator_class = HistGradientBoostingRegressor
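The `config2params` transform above can be exercised in isolation. This sketch replicates just the dictionary mapping (no estimator class), showing the `log_max_bin` → `max_bins` conversion and the alias renames the estimator needs for sklearn's `HistGradientBoosting*` API:

```python
def histgb_config2params(config: dict) -> dict:
    params = dict(config)
    if "log_max_bin" in params:
        # log-transformed with base 2, minus one: log_max_bin=8 -> max_bins=255.
        params["max_bins"] = (1 << params.pop("log_max_bin")) - 1
    if "max_leaves" in params:
        params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))
    if "n_estimators" in params:
        params["max_iter"] = params.get("max_iter", params.pop("n_estimators"))
    if "random_state" not in params:
        params["random_state"] = 24092023
    # HistGradientBoosting estimators do not accept n_jobs.
    params.pop("n_jobs", None)
    return params


out = histgb_config2params(
    {"log_max_bin": 8, "max_leaves": 10, "n_estimators": 100, "n_jobs": 4}
)
```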


@@ -2,21 +2,29 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import numpy as np
from datetime import datetime
from typing import TYPE_CHECKING, Union
import json
import os
import random
import re
import uuid
from datetime import datetime, timedelta
from decimal import ROUND_HALF_UP, Decimal
from typing import TYPE_CHECKING, Union
import numpy as np
from flaml.automl.spark import DataFrame, F, Series, T, pd, ps, psDataFrame, psSeries
from flaml.automl.training_log import training_log_reader
from flaml.automl.spark import ps, psDataFrame, psSeries, DataFrame, Series, pd
try:
from scipy.sparse import vstack, issparse
from scipy.sparse import issparse, vstack
except ImportError:
pass
if TYPE_CHECKING:
from flaml.automl.task import Task
TS_TIMESTAMP_COL = "ds"
TS_VALUE_COL = "y"
@@ -41,8 +49,12 @@ def load_openml_dataset(dataset_id, data_dir=None, random_state=0, dataset_forma
y_train: A series or array of labels for training data.
y_test: A series or array of labels for test data.
"""
import openml
import pickle
try:
import openml
except ImportError:
openml = None
from sklearn.model_selection import train_test_split
filename = "openml_ds" + str(dataset_id) + ".pkl"
@@ -53,15 +65,15 @@ def load_openml_dataset(dataset_id, data_dir=None, random_state=0, dataset_forma
dataset = pickle.load(f)
else:
print("download dataset from openml")
dataset = openml.datasets.get_dataset(dataset_id)
dataset = openml.datasets.get_dataset(dataset_id) if openml else None
if not os.path.exists(data_dir):
os.makedirs(data_dir)
with open(filepath, "wb") as f:
pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
print("Dataset name:", dataset.name)
print("Dataset name:", dataset.name) if dataset else None
try:
X, y, *__ = dataset.get_data(target=dataset.default_target_attribute, dataset_format=dataset_format)
except ValueError:
except (ValueError, AttributeError, TypeError):
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=dataset_id, return_X_y=True)
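`load_openml_dataset` above wraps the download in a pickle-file cache: load from disk if present, otherwise compute and persist. The pattern in isolation, as a generic sketch (`cached_pickle` is an illustrative name, not a FLAML helper):

```python
import os
import pickle


def cached_pickle(filepath, compute):
    """Return the pickled value at filepath if present; else compute, cache, return."""
    if os.path.exists(filepath):
        with open(filepath, "rb") as f:
            return pickle.load(f)
    value = compute()
    os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
    with open(filepath, "wb") as f:
        pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
    return value
```

On the second call with the same path, `compute` is never invoked — which is exactly why the diff's `dataset = openml.datasets.get_dataset(...)` only runs on a cache miss.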
@@ -93,9 +105,10 @@ def load_openml_task(task_id, data_dir):
y_train: A series of labels for training data.
y_test: A series of labels for test data.
"""
import openml
import pickle
import openml
task = openml.tasks.get_task(task_id)
filename = "openml_task" + str(task_id) + ".pkl"
filepath = os.path.join(data_dir, filename)
@@ -289,7 +302,7 @@ class DataTransformer:
y = y.rename(TS_VALUE_COL)
for column in X.columns:
# sklearn\utils\validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if X[column].nunique() == 1 or X[column].nunique(dropna=True) == n - X[column].isnull().sum():
X.drop(columns=column, inplace=True)
drop = True
@@ -341,8 +354,8 @@ class DataTransformer:
drop = True
else:
drop = False
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
self.transformer = ColumnTransformer(
[
@@ -441,3 +454,343 @@ class DataTransformer:
def group_counts(groups):
_, i, c = np.unique(groups, return_counts=True, return_index=True)
return c[np.argsort(i)]
def get_random_dataframe(n_rows: int = 200, ratio_none: float = 0.1, seed: int = 42) -> DataFrame:
"""Generate a random pandas DataFrame with various data types for testing.
This function creates a DataFrame with multiple column types including:
- Timestamps
- Integers
- Floats
- Categorical values
- Booleans
- Lists (tags)
- Decimal strings
- UUIDs
- Binary data (as hex strings)
- JSON blobs
- Nullable text fields
Parameters
----------
n_rows : int, default=200
Number of rows in the generated DataFrame
ratio_none : float, default=0.1
Probability of generating None values in applicable columns
seed : int, default=42
Random seed for reproducibility
Returns
-------
pd.DataFrame
A DataFrame with 14 columns of various data types
Examples
--------
>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp datetime64[ns]
id int64
score float64
status object
flag object
count object
value object
tags object
rating object
uuid object
binary object
json_blob object
category category
nullable_text object
dtype: object
"""
np.random.seed(seed)
random.seed(seed)
def random_tags():
tags = ["AI", "ML", "data", "robotics", "vision"]
return random.sample(tags, k=random.randint(1, 3)) if random.random() > ratio_none else None
def random_decimal():
return (
str(Decimal(random.uniform(1, 5)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
if random.random() > ratio_none
else None
)
def random_json_blob():
blob = {"a": random.randint(1, 10), "b": random.random()}
return json.dumps(blob) if random.random() > ratio_none else None
def random_binary():
return bytes(random.randint(0, 255) for _ in range(4)).hex() if random.random() > ratio_none else None
data = {
"timestamp": [
datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1000)) if np.random.rand() > ratio_none else None
for _ in range(n_rows)
],
"id": range(1, n_rows + 1),
"score": np.random.uniform(0, 100, n_rows),
"status": np.random.choice(
["active", "inactive", "pending", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
),
"flag": np.random.choice(
[True, False, None], size=n_rows, p=[(1 - ratio_none) / 2, (1 - ratio_none) / 2, ratio_none]
),
"count": [np.random.randint(0, 100) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"value": [round(np.random.normal(50, 15), 2) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"tags": [random_tags() for _ in range(n_rows)],
"rating": [random_decimal() for _ in range(n_rows)],
"uuid": [str(uuid.uuid4()) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"binary": [random_binary() for _ in range(n_rows)],
"json_blob": [random_json_blob() for _ in range(n_rows)],
"category": pd.Categorical(
np.random.choice(
["A", "B", "C", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
)
),
"nullable_text": [random.choice(["Good", "Bad", "Average", None]) for _ in range(n_rows)],
}
return pd.DataFrame(data)
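Two details of `get_random_dataframe` worth calling out: `ROUND_HALF_UP` gives conventional decimal rounding (Python's `round()` uses banker's rounding instead), and `ratio_none` is the per-cell probability of a missing value. A stdlib-only sketch of both helpers:

```python
import random
from decimal import ROUND_HALF_UP, Decimal


def quantize_rating(value: str) -> str:
    # ROUND_HALF_UP: "2.675" -> "2.68" (banker's rounding could give 2.67).
    return str(Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def maybe_none(value, ratio_none: float):
    # Each cell is independently replaced by None with probability ratio_none.
    return value if random.random() > ratio_none else None


random.seed(42)
nones = sum(maybe_none(1, 0.1) is None for _ in range(10_000))
```

With 10,000 draws the observed None fraction lands close to the requested 0.1.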
def auto_convert_dtypes_spark(
df: psDataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 0.1,
) -> tuple[psDataFrame, dict]:
"""Automatically convert data types in a PySpark DataFrame using heuristics.
This function analyzes a sample of the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, numeric values, booleans,
and categorical fields.
Args:
df: A PySpark DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 0.1.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
Note:
- 'category' in the schema dict is conceptual as PySpark doesn't have a true
category type like pandas
- The function uses sampling for efficiency with large datasets
"""
n_rows = df.count()
if na_values is None:
na_values = ["NA", "na", "NULL", "null", ""]
# Normalize NA-like values
for colname, coltype in df.dtypes:
if coltype == "string":
df = df.withColumn(
colname,
F.when(F.trim(F.lower(F.col(colname))).isin([v.lower() for v in na_values]), None).otherwise(
F.col(colname)
),
)
schema = {}
for colname in df.columns:
# Sample once at an appropriate ratio
sample_ratio_to_use = min(1.0, sample_ratio if n_rows * sample_ratio > 100 else 100 / n_rows)
col_sample = df.select(colname).sample(withReplacement=False, fraction=sample_ratio_to_use).dropna()
sample_count = col_sample.count()
inferred_type = "string" # Default
if col_sample.dtypes[0][1] != "string":
schema[colname] = col_sample.dtypes[0][1]
continue
if sample_count == 0:
schema[colname] = "string"
continue
# Check if timestamp
ts_col = col_sample.withColumn("parsed", F.to_timestamp(F.col(colname)))
# Check numeric
if (
col_sample.withColumn("n", F.col(colname).cast("double")).filter("n is not null").count()
>= sample_count * convert_threshold
):
# All whole numbers?
all_whole = (
col_sample.withColumn("n", F.col(colname).cast("double"))
.filter("n is not null")
.withColumn("frac", F.abs(F.col("n") % 1))
.filter("frac > 0.000001")
.count()
== 0
)
inferred_type = "int" if all_whole else "double"
# Check low-cardinality (category-like)
elif (
sample_count > 0
and col_sample.select(F.countDistinct(F.col(colname))).collect()[0][0] / sample_count <= category_threshold
):
inferred_type = "category" # Will just be string, but marked as such
# Check if timestamp
elif ts_col.filter(F.col("parsed").isNotNull()).count() >= sample_count * convert_threshold:
inferred_type = "timestamp"
schema[colname] = inferred_type
# Apply inferred schema
for colname, inferred_type in schema.items():
if inferred_type == "int":
df = df.withColumn(colname, F.col(colname).cast(T.IntegerType()))
elif inferred_type == "double":
df = df.withColumn(colname, F.col(colname).cast(T.DoubleType()))
elif inferred_type == "boolean":
df = df.withColumn(
colname,
F.when(F.lower(F.col(colname)).isin("true", "yes", "1"), True)
.when(F.lower(F.col(colname)).isin("false", "no", "0"), False)
.otherwise(None),
)
elif inferred_type == "timestamp":
df = df.withColumn(colname, F.to_timestamp(F.col(colname)))
elif inferred_type == "category":
df = df.withColumn(colname, F.col(colname).cast(T.StringType())) # Marked conceptually
# otherwise keep as string (or original type)
return df, schema
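The same heuristic can be illustrated without Spark. This stdlib-only sketch applies the inference step to one column of strings, keeping the thresholds and the check order from the function above (numeric first, then cardinality, then timestamp); `infer_column_type` is an illustrative helper, not part of FLAML, and `datetime.fromisoformat` stands in for Spark's `to_timestamp`:

```python
from datetime import datetime


def infer_column_type(values, category_threshold=0.3, convert_threshold=0.6):
    sample = [v for v in values if v is not None]
    if not sample:
        return "string"
    # Numeric check: enough values cast to float?
    numeric = []
    for v in sample:
        try:
            numeric.append(float(v))
        except ValueError:
            pass
    if len(numeric) >= len(sample) * convert_threshold:
        all_whole = all(abs(n % 1) <= 1e-6 for n in numeric)
        return "int" if all_whole else "double"
    # Low-cardinality check: few distinct values relative to sample size?
    if len(set(sample)) / len(sample) <= category_threshold:
        return "category"
    # Timestamp check: enough values parse as ISO dates?
    parsed = 0
    for v in sample:
        try:
            datetime.fromisoformat(v)
            parsed += 1
        except ValueError:
            pass
    if parsed >= len(sample) * convert_threshold:
        return "timestamp"
    return "string"
```

Note the ordering matters: a column of `"1"`/`"0"` strings is claimed by the numeric check before the cardinality check can label it a category.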
def auto_convert_dtypes_pandas(
df: DataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 1.0,
) -> tuple[DataFrame, dict]:
"""Automatically convert data types in a pandas DataFrame using heuristics.
This function analyzes the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, timedeltas, numeric values,
and categorical fields.
Args:
df: A pandas DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 1.0.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
"""
if na_values is None:
na_values = {"NA", "na", "NULL", "null", ""}
# Remove the empty string separately (handled by the regex `^\s*$`)
vals = [re.escape(v) for v in na_values if v != ""]
# Build inner alternation group
inner = "|".join(vals) if vals else ""
if inner:
pattern = re.compile(rf"^\s*(?:{inner})?\s*$")
else:
pattern = re.compile(r"^\s*$")
df_converted = df.convert_dtypes()
schema = {}
# Sample if needed (for API compatibility)
if sample_ratio < 1.0:
df = df.sample(frac=sample_ratio)
n_rows = len(df)
for col in df.columns:
series = df[col]
# Replace NA-like values if string
if series.dtype == object:
mask = series.astype(str).str.match(pattern)
series_cleaned = series.where(~mask, np.nan)
else:
series_cleaned = series
# Skip conversion for non-object dtypes; nullable boolean/string columns may still be converted
if (
not isinstance(series_cleaned.dtype, pd.BooleanDtype)
and not isinstance(series_cleaned.dtype, pd.StringDtype)
and series_cleaned.dtype != "object"
):
# Keep the original data type for non-object dtypes
df_converted[col] = series
schema[col] = str(series_cleaned.dtype)
continue
# print(f"type: {series_cleaned.dtype}, column: {series_cleaned.name}")
if not isinstance(series_cleaned.dtype, pd.BooleanDtype):
# Try numeric (int or float)
numeric = pd.to_numeric(series_cleaned, errors="coerce")
if numeric.notna().sum() >= n_rows * convert_threshold:
if (numeric.dropna() % 1 == 0).all():
try:
df_converted[col] = numeric.astype("int")  # raises if NA values remain; caught below
schema[col] = "int"
continue
except Exception:
pass
df_converted[col] = numeric.astype("double")
schema[col] = "double"
continue
# Try datetime
datetime_converted = pd.to_datetime(series_cleaned, errors="coerce")
if datetime_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = datetime_converted
schema[col] = "timestamp"
continue
# Try timedelta
try:
timedelta_converted = pd.to_timedelta(series_cleaned, errors="coerce")
if timedelta_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = timedelta_converted
schema[col] = "timedelta"
continue
except TypeError:
pass
# Try category
try:
unique_ratio = series_cleaned.nunique(dropna=True) / n_rows if n_rows > 0 else 1.0
if unique_ratio <= category_threshold:
df_converted[col] = series_cleaned.astype("category")
schema[col] = "category"
continue
except Exception:
pass
df_converted[col] = series_cleaned.astype("string")
schema[col] = "string"
return df_converted, schema
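The numeric branch of the heuristic above can be illustrated in isolation. This is a stand-alone sketch (not the library function itself): a column is treated as numeric only when at least `convert_threshold` of its values parse, and as `int` only when every parsed value is a whole number.

```python
import pandas as pd

def infer_numeric(series: pd.Series, convert_threshold: float = 0.6) -> str:
    # Coerce unparseable values to NaN, then check the success ratio.
    numeric = pd.to_numeric(series, errors="coerce")
    if numeric.notna().sum() >= len(series) * convert_threshold:
        # All whole numbers -> "int", otherwise "double".
        return "int" if (numeric.dropna() % 1 == 0).all() else "double"
    return "string"

print(infer_numeric(pd.Series(["1", "2", "3", "x"])))    # 3/4 parse -> int
print(infer_numeric(pd.Series(["1.5", "2.5", "oops"])))  # 2/3 parse -> double
print(infer_numeric(pd.Series(["a", "b", "1"])))         # 1/3 parse -> string
```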

View File

@@ -1,7 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
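The new formatter's behavior can be checked without touching FLAML's logger. Below is a minimal stand-alone copy of the same idea: wrap the base formatter's output in an ANSI code chosen by level, and pass it through unchanged when color is disabled.

```python
import logging

COLORS = {logging.WARNING: "\033[33m", logging.ERROR: "\033[31m"}  # yellow, red
RESET = "\033[0m"

class TinyColoredFormatter(logging.Formatter):
    def __init__(self, fmt, use_color=True):
        super().__init__(fmt)
        self.use_color = use_color

    def format(self, record):
        formatted = super().format(record)
        color = COLORS.get(record.levelno, "") if self.use_color else ""
        return f"{color}{formatted}{RESET}" if color else formatted

record = logging.LogRecord("demo", logging.WARNING, __file__, 1, "disk almost full", None, None)
print(TinyColoredFormatter("%(levelname)s - %(message)s").format(record))
# -> the message wrapped in yellow; with use_color=False, the plain text
```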

View File

@@ -2,30 +2,31 @@
# * Copyright (c) FLAML authors. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import time
from typing import Union, Callable, TypeVar, Optional, Tuple
import logging
import time
from typing import Callable, Optional, Tuple, TypeVar, Union
import numpy as np
from flaml.automl.data import group_counts
from flaml.automl.task.task import Task
from flaml.automl.model import BaseEstimator, TransformersEstimator
from flaml.automl.spark import psDataFrame, psSeries, ERROR as SPARK_ERROR, Series, DataFrame
from flaml.automl.spark import ERROR as SPARK_ERROR
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.task.task import Task
from flaml.automl.time_series import TimeSeriesDataset
try:
from sklearn.metrics import (
mean_squared_error,
r2_score,
roc_auc_score,
accuracy_score,
mean_absolute_error,
log_loss,
average_precision_score,
f1_score,
log_loss,
mean_absolute_error,
mean_absolute_percentage_error,
mean_squared_error,
ndcg_score,
r2_score,
roc_auc_score,
)
except ImportError:
pass
@@ -33,7 +34,6 @@ except ImportError:
if SPARK_ERROR is None:
from flaml.automl.spark.metrics import spark_metric_loss_score
from flaml.automl.time_series import TimeSeriesDataset
logger = logging.getLogger(__name__)
@@ -89,6 +89,11 @@ huggingface_metric_to_mode = {
"wer": "min",
}
huggingface_submetric_to_metric = {"rouge1": "rouge", "rouge2": "rouge"}
spark_metric_name_dict = {
"Regression": ["r2", "rmse", "mse", "mae", "var"],
"Binary Classification": ["pr_auc", "roc_auc"],
"Multi-class Classification": ["accuracy", "log_loss", "f1", "micro_f1", "macro_f1"],
}
def metric_loss_score(
@@ -122,9 +127,21 @@ def metric_loss_score(
import datasets
datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
metric = datasets.load_metric(datasets_metric_name)
metric_mode = huggingface_metric_to_mode[datasets_metric_name]
# datasets>=3 removed load_metric; prefer evaluate if available
try:
import evaluate
metric = evaluate.load(datasets_metric_name, trust_remote_code=True)
except Exception:
if hasattr(datasets, "load_metric"):
metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
else:
from datasets import load_metric as _load_metric # older datasets
metric = _load_metric(datasets_metric_name, trust_remote_code=True)
if metric_name.startswith("seqeval"):
y_processed_true = [[labels[tr] for tr in each_list] for each_list in y_processed_true]
elif metric in ("pearsonr", "spearmanr"):
@@ -294,14 +311,14 @@ def get_y_pred(estimator, X, eval_metric, task: Task):
else:
y_pred = estimator.predict(X)
if isinstance(y_pred, Series) or isinstance(y_pred, DataFrame):
if isinstance(y_pred, (Series, DataFrame)):
y_pred = y_pred.values
return y_pred
def to_numpy(x):
if isinstance(x, Series or isinstance(x, DataFrame)):
if isinstance(x, (Series, DataFrame)):
x = x.values
else:
x = np.ndarray(x)
@@ -323,7 +340,7 @@ def compute_estimator(
estimator_name: str,
eval_method: str,
eval_metric: Union[str, Callable],
best_val_loss=np.Inf,
best_val_loss=np.inf,
n_jobs: Optional[int] = 1, # some estimators of EstimatorSubclass don't accept n_jobs. Should be None in that case.
estimator_class: Optional[EstimatorSubclass] = None,
cv_score_agg_func: Optional[callable] = None,
@@ -334,6 +351,14 @@ def compute_estimator(
if fit_kwargs is None:
fit_kwargs = {}
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
@@ -401,12 +426,21 @@ def train_estimator(
free_mem_ratio=0,
) -> Tuple[EstimatorSubclass, float]:
start_time = time.time()
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
task=task,
n_jobs=n_jobs,
)
if fit_kwargs is None:
fit_kwargs = {}
@@ -552,7 +586,7 @@ def _eval_estimator(
# TODO: why are integer labels being cast to str in the first place?
if isinstance(val_pred_y, Series) or isinstance(val_pred_y, DataFrame) or isinstance(val_pred_y, np.ndarray):
if isinstance(val_pred_y, (Series, DataFrame, np.ndarray)):
test = val_pred_y if isinstance(val_pred_y, np.ndarray) else val_pred_y.values
if not np.issubdtype(test.dtype, np.number):
# some NLP models return a list
@@ -567,17 +601,27 @@ def _eval_estimator(
pred_time = (time.time() - pred_start) / num_val_rows
val_loss = metric_loss_score(
eval_metric,
y_processed_predict=val_pred_y,
y_processed_true=y_val,
labels=labels,
sample_weight=weight_val,
groups=groups_val,
)
try:
val_loss = metric_loss_score(
eval_metric,
y_processed_predict=val_pred_y,
y_processed_true=y_val,
labels=labels,
sample_weight=weight_val,
groups=groups_val,
)
except ValueError as e:
# `r2_score` and other metrics may raise a `ValueError` when a model returns `inf` or `nan` values. In this case, we set the val_loss to infinity.
val_loss = np.inf
logger.warning(f"ValueError {e} happened in `metric_loss_score`, set `val_loss` to `np.inf`")
metric_for_logging = {"pred_time": pred_time}
if log_training_metric:
train_pred_y = get_y_pred(estimator, X_train, eval_metric, task)
# For time series forecasting, X_train may be a sampled dataset whose
# test partition can be empty. Use the training partition from X_val
# (which is the dataset used to define y_train above) to keep shapes
# aligned and avoid empty prediction inputs.
X_train_for_metric = X_val.X_train if isinstance(X_val, TimeSeriesDataset) else X_train
train_pred_y = get_y_pred(estimator, X_train_for_metric, eval_metric, task)
metric_for_logging["train_loss"] = metric_loss_score(
eval_metric,
train_pred_y,

File diff suppressed because it is too large

View File

@@ -4,16 +4,15 @@ This directory contains utility functions used by AutoNLP. Currently we support
Please refer to this [link](https://microsoft.github.io/FLAML/docs/Examples/AutoML-NLP) for examples.
# Troubleshooting fine-tuning HPO for pre-trained language models
Frequent updates to transformers may cause tuning results to fluctuate. To help users quickly troubleshoot AutoNLP when a tuning failure occurs (e.g., failing to reproduce previous results), we provide the following Jupyter notebook:
* [Troubleshooting HPO for fine-tuning pre-trained language models](https://github.com/microsoft/FLAML/blob/main/notebook/research/acl2021.ipynb)
- [Troubleshooting HPO for fine-tuning pre-trained language models](https://github.com/microsoft/FLAML/blob/main/notebook/research/acl2021.ipynb)
Our findings on troubleshooting fine-tuning the Electra and RoBERTa model for the GLUE dataset can be seen in the following paper published in ACL 2021:
* [An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models](https://arxiv.org/abs/2106.09204). Xueqing Liu, Chi Wang. ACL-IJCNLP 2021.
- [An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models](https://arxiv.org/abs/2106.09204). Xueqing Liu, Chi Wang. ACL-IJCNLP 2021.
```bibtex
@inproceedings{liu2021hpo,

View File

@@ -1,17 +1,18 @@
from dataclasses import dataclass
from transformers.data.data_collator import (
DataCollatorWithPadding,
DataCollatorForTokenClassification,
DataCollatorForSeq2Seq,
)
from collections import OrderedDict
from dataclasses import dataclass
from transformers.data.data_collator import (
DataCollatorForSeq2Seq,
DataCollatorForTokenClassification,
DataCollatorWithPadding,
)
from flaml.automl.task.task import (
TOKENCLASSIFICATION,
MULTICHOICECLASSIFICATION,
SUMMARIZATION,
SEQCLASSIFICATION,
SEQREGRESSION,
SUMMARIZATION,
TOKENCLASSIFICATION,
)
@@ -19,6 +20,7 @@ from flaml.automl.task.task import (
class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
def __call__(self, features):
from itertools import chain
import torch
label_name = "label" if "label" in features[0].keys() else "labels"
@@ -30,7 +32,7 @@ class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = list(chain(*flattened_features))
batch = super(DataCollatorForMultipleChoiceClassification, self).__call__(flattened_features)
batch = super().__call__(flattened_features)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels
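The flatten/un-flatten step in the collator above can be shown with a toy array in place of real tensors: (batch, num_choices) examples are flattened into batch*num_choices rows for padding, then reshaped back, mirroring `v.view(batch_size, num_choices, -1)`.

```python
import numpy as np

batch_size, num_choices, seq_len = 2, 3, 4
# Pretend this is the padded, flattened batch of tokenized choices.
flat = np.arange(batch_size * num_choices * seq_len).reshape(-1, seq_len)
# Un-flatten back to (batch, choices, seq_len), as the collator does.
unflattened = flat.reshape(batch_size, num_choices, -1)
print(unflattened.shape)  # (2, 3, 4)
```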

View File

@@ -1,10 +1,11 @@
import argparse
from dataclasses import dataclass, field
from typing import Optional, List
from typing import List, Optional
from flaml.automl.task.task import NLG_TASKS
try:
from transformers import TrainingArguments
from transformers import Seq2SeqTrainingArguments as TrainingArguments
except ImportError:
TrainingArguments = object
@@ -76,6 +77,14 @@ class TrainingArgumentsForAuto(TrainingArguments):
logging_steps: int = field(default=500, metadata={"help": "Log every X updates steps."})
# Newer versions of HuggingFace Transformers may access `TrainingArguments.generation_config`
# (e.g., in generation-aware trainers/callbacks). Keep this attribute to remain compatible
# while defaulting to None for non-generation tasks.
generation_config: Optional[object] = field(
default=None,
metadata={"help": "Optional generation config (or path) used by generation-aware trainers."},
)
@staticmethod
def load_args_from_console():
from dataclasses import fields

View File

@@ -1,14 +1,16 @@
from itertools import chain
import numpy as np
from flaml.automl.task.task import (
SUMMARIZATION,
SEQREGRESSION,
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
TOKENCLASSIFICATION,
NLG_TASKS,
)
from flaml.automl.data import pd
from flaml.automl.task.task import (
MULTICHOICECLASSIFICATION,
NLG_TASKS,
SEQCLASSIFICATION,
SEQREGRESSION,
SUMMARIZATION,
TOKENCLASSIFICATION,
)
def todf(X, Y, column_name):
@@ -209,29 +211,28 @@ def tokenize_onedataframe(
hf_args=None,
prefix_str=None,
):
with tokenizer.as_target_tokenizer():
_, tokenized_column_names = tokenize_row(
dict(X.iloc[0]),
_, tokenized_column_names = tokenize_row(
dict(X.iloc[0]),
tokenizer,
prefix=(prefix_str,) if task is SUMMARIZATION else None,
task=task,
hf_args=hf_args,
return_column_name=True,
)
d = X.apply(
lambda x: tokenize_row(
x,
tokenizer,
prefix=(prefix_str,) if task is SUMMARIZATION else None,
task=task,
hf_args=hf_args,
return_column_name=True,
)
d = X.apply(
lambda x: tokenize_row(
x,
tokenizer,
prefix=(prefix_str,) if task is SUMMARIZATION else None,
task=task,
hf_args=hf_args,
),
axis=1,
result_type="expand",
)
X_tokenized = pd.DataFrame(columns=tokenized_column_names)
X_tokenized[tokenized_column_names] = d
return X_tokenized
),
axis=1,
result_type="expand",
)
X_tokenized = pd.DataFrame(columns=tokenized_column_names)
X_tokenized[tokenized_column_names] = d
return X_tokenized
def tokenize_row(
@@ -243,7 +244,7 @@ def tokenize_row(
return_column_name=False,
):
if prefix:
this_row = tuple(["".join(x) for x in zip(prefix, this_row)])
this_row = tuple("".join(x) for x in zip(prefix, this_row))
# tokenizer.pad_token = tokenizer.eos_token
tokenized_example = tokenizer(
@@ -377,6 +378,7 @@ def load_model(checkpoint_path, task, num_labels=None):
transformers.logging.set_verbosity_error()
from transformers import AutoConfig
from flaml.automl.task.task import (
SEQCLASSIFICATION,
SEQREGRESSION,
@@ -384,14 +386,16 @@ def load_model(checkpoint_path, task, num_labels=None):
)
def get_this_model(checkpoint_path, task, model_config):
from transformers import AutoModelForSequenceClassification
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoModelForMultipleChoice
from transformers import AutoModelForTokenClassification
from transformers import (
AutoModelForMultipleChoice,
AutoModelForSeq2SeqLM,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
)
if task in (SEQCLASSIFICATION, SEQREGRESSION):
return AutoModelForSequenceClassification.from_pretrained(
checkpoint_path, config=model_config, ignore_mismatched_sizes=True
checkpoint_path, config=model_config, ignore_mismatched_sizes=True, trust_remote_code=True
)
elif task == TOKENCLASSIFICATION:
return AutoModelForTokenClassification.from_pretrained(checkpoint_path, config=model_config)
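The `get_this_model` helper above branches on task to pick an AutoModel class. A table-driven sketch of that dispatch; the task keys and the mapping here are illustrative (class names kept as strings so this runs without transformers installed):

```python
TASK_TO_MODEL = {
    "seq-classification": "AutoModelForSequenceClassification",
    "seq-regression": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "multichoice-classification": "AutoModelForMultipleChoice",
    "summarization": "AutoModelForSeq2SeqLM",
}

def model_class_for(task: str) -> str:
    # Raises KeyError for unknown tasks, like an unhandled branch would.
    return TASK_TO_MODEL[task]

print(model_class_for("summarization"))  # AutoModelForSeq2SeqLM
```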

View File

@@ -1,11 +1,12 @@
from typing import Dict, Any
from typing import Any, Dict
import numpy as np
from flaml.automl.task.task import (
SUMMARIZATION,
SEQREGRESSION,
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
SEQCLASSIFICATION,
SEQREGRESSION,
SUMMARIZATION,
TOKENCLASSIFICATION,
)
@@ -24,14 +25,12 @@ def load_default_huggingface_metric_for_task(task):
def is_a_list_of_str(this_obj):
return (isinstance(this_obj, list) or isinstance(this_obj, np.ndarray)) and all(
isinstance(x, str) for x in this_obj
)
return isinstance(this_obj, (list, np.ndarray)) and all(isinstance(x, str) for x in this_obj)
def _clean_value(value: Any) -> str:
if isinstance(value, float):
return "{:.5}".format(value)
return f"{value:.5}"
else:
return str(value).replace("/", "_")
@@ -85,7 +84,7 @@ class Counter:
@staticmethod
def get_trial_fold_name(local_dir, trial_config, trial_id):
Counter.counter += 1
experiment_tag = "{0}_{1}".format(str(Counter.counter), format_vars(trial_config))
experiment_tag = f"{str(Counter.counter)}_{format_vars(trial_config)}"
logdir = get_logdir_name(_generate_dirname(experiment_tag, trial_id=trial_id), local_dir)
return logdir
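The f-string rewrite of `_clean_value` above keeps the same behavior: floats render with 5 significant digits, and everything else is made path-safe by replacing `/`. A stand-alone copy:

```python
def clean_value(value) -> str:
    if isinstance(value, float):
        return f"{value:.5}"  # 5 significant digits, general format
    return str(value).replace("/", "_")  # avoid path separators in dir names

print(clean_value(0.123456789))  # 0.12346
print(clean_value("a/b/c"))      # a_b_c
```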

View File

@@ -1,3 +1,5 @@
import atexit
import logging
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
@@ -6,15 +8,18 @@ try:
import pyspark.pandas as ps
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.pandas import DataFrame as psDataFrame
from pyspark.pandas import Series as psSeries
from pyspark.pandas import set_option
from pyspark.sql import DataFrame as sparkDataFrame
from pyspark.pandas import DataFrame as psDataFrame, Series as psSeries, set_option
from pyspark.sql import SparkSession
from pyspark.util import VersionUtils
except ImportError:
class psDataFrame:
pass
F = T = ps = sparkDataFrame = psSeries = psDataFrame
F = T = ps = sparkDataFrame = SparkSession = psSeries = psDataFrame
_spark_major_minor_version = set_option = None
ERROR = ImportError(
"""Please run pip install flaml[spark]
@@ -30,3 +35,60 @@ try:
from pandas import DataFrame, Series
except ImportError:
DataFrame = Series = pd = None
logger = logging.getLogger(__name__)
def disable_spark_ansi_mode():
"""Disable Spark ANSI mode if it is enabled."""
spark = SparkSession.getActiveSession() if hasattr(SparkSession, "getActiveSession") else None
adjusted = False
try:
ps_conf = ps.get_option("compute.fail_on_ansi_mode")
except Exception:
ps_conf = None
ansi_conf = [None, ps_conf] # ansi_conf and ps_conf original values
# Spark may store the config as string 'true'/'false' (or boolean in some contexts)
if spark is not None:
ansi_conf[0] = spark.conf.get("spark.sql.ansi.enabled")
ansi_enabled = (
(isinstance(ansi_conf[0], str) and ansi_conf[0].lower() == "true")
or (isinstance(ansi_conf[0], bool) and ansi_conf[0] is True)
or ansi_conf[0] is None
)
try:
if ansi_enabled:
logger.debug("Adjusting spark.sql.ansi.enabled to false")
spark.conf.set("spark.sql.ansi.enabled", "false")
adjusted = True
except Exception:
# If reading or setting the option fails for some reason, keep going and let
# pandas-on-Spark raise a meaningful error later.
logger.exception("Failed to set spark.sql.ansi.enabled")
if ansi_conf[1]:
logger.debug("Adjusting pandas-on-Spark compute.fail_on_ansi_mode to False")
ps.set_option("compute.fail_on_ansi_mode", False)
adjusted = True
return spark, ansi_conf, adjusted
def restore_spark_ansi_mode(spark, ansi_conf, adjusted):
"""Restore Spark ANSI mode to its original setting."""
# Restore the original spark.sql.ansi.enabled to avoid persistent side-effects.
if adjusted and spark and ansi_conf[0] is not None:
try:
logger.debug(f"Restoring spark.sql.ansi.enabled to {ansi_conf[0]}")
spark.conf.set("spark.sql.ansi.enabled", ansi_conf[0])
except Exception:
logger.exception("Failed to restore spark.sql.ansi.enabled")
if adjusted and ansi_conf[1]:
logger.debug(f"Restoring pandas-on-Spark compute.fail_on_ansi_mode to {ansi_conf[1]}")
ps.set_option("compute.fail_on_ansi_mode", ansi_conf[1])
spark, ansi_conf, adjusted = disable_spark_ansi_mode()
atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
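The two helpers above follow a save/adjust/restore pattern for Spark's ANSI config. The same shape with a plain dict standing in for the Spark session config (note that, as above, an unset flag is treated as enabled):

```python
def disable_flag(conf):
    original = conf.get("ansi.enabled")
    adjusted = False
    if original in ("true", True, None):  # string, bool, or unset
        conf["ansi.enabled"] = "false"
        adjusted = True
    return original, adjusted

def restore_flag(conf, original, adjusted):
    # Only restore what we actually changed, and only if there was a value.
    if adjusted and original is not None:
        conf["ansi.enabled"] = original

conf = {"ansi.enabled": "true"}
original, adjusted = disable_flag(conf)
print(conf["ansi.enabled"])  # false
restore_flag(conf, original, adjusted)
print(conf["ansi.enabled"])  # true
```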

View File

@@ -1,97 +0,0 @@
ParamList_LightGBM_Base = [
"baggingFraction",
"baggingFreq",
"baggingSeed",
"binSampleCount",
"boostFromAverage",
"boostingType",
"catSmooth",
"categoricalSlotIndexes",
"categoricalSlotNames",
"catl2",
"chunkSize",
"dataRandomSeed",
"defaultListenPort",
"deterministic",
"driverListenPort",
"dropRate",
"dropSeed",
"earlyStoppingRound",
"executionMode",
"extraSeed" "featureFraction",
"featureFractionByNode",
"featureFractionSeed",
"featuresCol",
"featuresShapCol",
"fobj" "improvementTolerance",
"initScoreCol",
"isEnableSparse",
"isProvideTrainingMetric",
"labelCol",
"lambdaL1",
"lambdaL2",
"leafPredictionCol",
"learningRate",
"matrixType",
"maxBin",
"maxBinByFeature",
"maxCatThreshold",
"maxCatToOnehot",
"maxDeltaStep",
"maxDepth",
"maxDrop",
"metric",
"microBatchSize",
"minDataInLeaf",
"minDataPerBin",
"minDataPerGroup",
"minGainToSplit",
"minSumHessianInLeaf",
"modelString",
"monotoneConstraints",
"monotoneConstraintsMethod",
"monotonePenalty",
"negBaggingFraction",
"numBatches",
"numIterations",
"numLeaves",
"numTasks",
"numThreads",
"objectiveSeed",
"otherRate",
"parallelism",
"passThroughArgs",
"posBaggingFraction",
"predictDisableShapeCheck",
"predictionCol",
"repartitionByGroupingColumn",
"seed",
"skipDrop",
"slotNames",
"timeout",
"topK",
"topRate",
"uniformDrop",
"useBarrierExecutionMode",
"useMissing",
"useSingleDatasetMode",
"validationIndicatorCol",
"verbosity",
"weightCol",
"xGBoostDartMode",
"zeroAsMissing",
"objective",
]
ParamList_LightGBM_Classifier = ParamList_LightGBM_Base + [
"isUnbalance",
"probabilityCol",
"rawPredictionCol",
"thresholds",
]
ParamList_LightGBM_Regressor = ParamList_LightGBM_Base + ["tweedieVariancePower"]
ParamList_LightGBM_Ranker = ParamList_LightGBM_Base + [
"groupCol",
"evalAt",
"labelGain",
"maxPosition",
]
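Worth noting: the removed list above contains entries like `"extraSeed" "featureFraction"` and `"fobj" "improvementTolerance"` with no comma between them. Python silently concatenates adjacent string literals, so each such pair became one merged parameter name instead of two:

```python
params = [
    "dropSeed",
    "extraSeed" "featureFraction",  # missing comma: one element, not two
    "featureFractionByNode",
]
print(len(params))  # 3, not 4
print(params[1])    # extraSeedfeatureFraction
```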

View File

@@ -1,14 +1,17 @@
import numpy as np
import json
from typing import Union
from flaml.automl.spark import psSeries, F
import numpy as np
from pyspark.ml.evaluation import (
BinaryClassificationEvaluator,
RegressionEvaluator,
MulticlassClassificationEvaluator,
MultilabelClassificationEvaluator,
RankingEvaluator,
RegressionEvaluator,
)
from flaml.automl.spark import F, T, psDataFrame, psSeries, sparkDataFrame
def ps_group_counts(groups: Union[psSeries, np.ndarray]) -> np.ndarray:
if isinstance(groups, np.ndarray):
@@ -34,6 +37,16 @@ def _compute_label_from_probability(df, probability_col, prediction_col):
return df
def string_to_array(s):
try:
return json.loads(s)
except json.JSONDecodeError:
return []
string_to_array_udf = F.udf(string_to_array, T.ArrayType(T.DoubleType()))
def spark_metric_loss_score(
metric_name: str,
y_predict: psSeries,
@@ -133,6 +146,11 @@ def spark_metric_loss_score(
)
elif metric_name == "log_loss":
# For log_loss, prediction_col should be probability, and we need to convert it to label
# handle data like "{'type': '1', 'values': '[1, 2, 3]'}"
# Fix cannot resolve "array_max(prediction)" due to data type mismatch: Parameter 1 requires the "ARRAY" type,
# however "prediction" has the type "STRUCT<type: TINYINT, size: INT, indices: ARRAY<INT>, values: ARRAY<DOUBLE>>"
df = df.withColumn(prediction_col, df[prediction_col].cast(T.StringType()))
df = df.withColumn(prediction_col, string_to_array_udf(df[prediction_col]))
df = _compute_label_from_probability(df, prediction_col, prediction_col + "_label")
evaluator = MulticlassClassificationEvaluator(
metricName="logLoss",
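The `string_to_array` helper added above is small enough to test on its own: parse a JSON array from a stringified probability vector, returning `[]` on malformed input.

```python
import json

def string_to_array(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return []  # malformed input yields an empty array

print(string_to_array("[0.1, 0.9]"))  # [0.1, 0.9]
print(string_to_array("not json"))    # []
```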

View File

@@ -1,17 +1,19 @@
import logging
from typing import Union, List, Optional, Tuple
from typing import List, Optional, Tuple, Union
import numpy as np
from flaml.automl.spark import (
sparkDataFrame,
ps,
DataFrame,
F,
Series,
T,
_spark_major_minor_version,
ps,
psDataFrame,
psSeries,
_spark_major_minor_version,
DataFrame,
Series,
set_option,
sparkDataFrame,
)
logger = logging.getLogger(__name__)
@@ -57,17 +59,29 @@ def to_pandas_on_spark(
```
"""
set_option("compute.default_index_type", default_index_type)
if isinstance(df, (DataFrame, Series)):
return ps.from_pandas(df)
elif isinstance(df, sparkDataFrame):
if _spark_major_minor_version[0] == 3 and _spark_major_minor_version[1] < 3:
return df.to_pandas_on_spark(index_col=index_col)
try:
orig_ps_conf = ps.get_option("compute.fail_on_ansi_mode")
except Exception:
orig_ps_conf = None
if orig_ps_conf:
ps.set_option("compute.fail_on_ansi_mode", False)
try:
if isinstance(df, (DataFrame, Series)):
return ps.from_pandas(df)
elif isinstance(df, sparkDataFrame):
if _spark_major_minor_version[0] == 3 and _spark_major_minor_version[1] < 3:
return df.to_pandas_on_spark(index_col=index_col)
else:
return df.pandas_api(index_col=index_col)
elif isinstance(df, (psDataFrame, psSeries)):
return df
else:
return df.pandas_api(index_col=index_col)
elif isinstance(df, (psDataFrame, psSeries)):
return df
else:
raise TypeError(f"{type(df)} is not one of pandas.DataFrame, pandas.Series and pyspark.sql.DataFrame")
raise TypeError(f"{type(df)} is not one of pandas.DataFrame, pandas.Series and pyspark.sql.DataFrame")
finally:
# Restore original config
if orig_ps_conf:
ps.set_option("compute.fail_on_ansi_mode", orig_ps_conf)
def train_test_split_pyspark(

View File

@@ -1,13 +1,15 @@
import inspect
import copy
import inspect
import time
from typing import Any, Optional
import numpy as np
from flaml import tune
from flaml.automl.logger import logger
from flaml.automl.ml import compute_estimator, train_estimator
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.spark import psDataFrame, psSeries, DataFrame, Series
class SearchState:
@@ -35,10 +37,9 @@ class SearchState:
if isinstance(domain_one_dim, sample.Domain):
renamed_type = list(inspect.signature(domain_one_dim.is_valid).parameters.values())[0].annotation
type_match = (
renamed_type == Any
renamed_type is Any
or isinstance(value_one_dim, renamed_type)
or isinstance(value_one_dim, int)
and renamed_type is float
or (renamed_type is float and isinstance(value_one_dim, int))
)
if not (type_match and domain_one_dim.is_valid(value_one_dim)):
return False
@@ -63,6 +64,7 @@ class SearchState:
custom_hp=None,
max_iter=None,
budget=None,
featurization="auto",
):
self.init_eci = learner_class.cost_relative2lgbm() if budget >= 0 else 1
self._search_space_domain = {}
@@ -80,6 +82,7 @@ class SearchState:
else:
data_size = data.shape
search_space = learner_class.search_space(data_size=data_size, task=task)
self.data_size = data_size
if custom_hp is not None:
@@ -89,9 +92,7 @@ class SearchState:
starting_point = AutoMLState.sanitize(starting_point)
if max_iter > 1 and not self.valid_starting_point(starting_point, search_space):
# If the number of iterations is larger than 1, remove invalid point
logger.warning(
"Starting point {} removed because it is outside of the search space".format(starting_point)
)
logger.warning(f"Starting point {starting_point} removed because it is outside of the search space")
starting_point = None
elif isinstance(starting_point, list):
starting_point = [AutoMLState.sanitize(x) for x in starting_point]
@@ -206,7 +207,7 @@ class SearchState:
self.val_loss, self.config = obj, config
def get_hist_config_sig(self, sample_size, config):
config_values = tuple([config[k] for k in self._hp_names if k in config])
config_values = tuple(config[k] for k in self._hp_names if k in config)
config_sig = str(sample_size) + "_" + str(config_values)
return config_sig
@@ -288,9 +289,11 @@ class AutoMLState:
budget = (
None
if state.time_budget < 0
else state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
else (
state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
)
)
(
@@ -351,6 +354,7 @@ class AutoMLState:
estimator: str,
config_w_resource: dict,
sample_size: Optional[int] = None,
is_retrain: bool = False,
):
if not sample_size:
sample_size = config_w_resource.get("FLAML_sample_size", len(self.y_train_all))
@@ -376,9 +380,8 @@ class AutoMLState:
this_estimator_kwargs[
"groups"
] = groups # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
this_estimator_kwargs.update({"is_retrain": is_retrain})
budget = None if self.time_budget < 0 else self.time_budget - self.time_from_start
estimator, train_time = train_estimator(
X_train=sampled_X_train,
y_train=sampled_y_train,

View File

@@ -1,8 +1,9 @@
from typing import Optional, Union
import numpy as np
from flaml.automl.data import DataFrame, Series
from flaml.automl.task.task import Task, TS_FORECAST
from flaml.automl.task.task import TS_FORECAST, Task
def task_factory(

View File

@@ -1,43 +1,39 @@
import logging
import time
from typing import List, Optional
import numpy as np
from flaml.automl.data import TS_TIMESTAMP_COL, concat
from flaml.automl.ml import EstimatorSubclass, get_val_loss, default_cv_score_agg_func
from flaml.automl.task.task import (
Task,
get_classification_objective,
TS_FORECAST,
TS_FORECASTPANEL,
)
from flaml.config import RANDOM_SEED
from flaml.automl.spark import ps, psDataFrame, psSeries, pd
import numpy as np
from flaml.automl.data import TS_TIMESTAMP_COL, concat
from flaml.automl.ml import EstimatorSubclass, default_cv_score_agg_func, get_val_loss
from flaml.automl.spark import pd, ps, psDataFrame, psSeries
from flaml.automl.spark.utils import (
iloc_pandas_on_spark,
len_labels,
set_option,
spark_kFold,
train_test_split_pyspark,
unique_pandas_on_spark,
unique_value_first_index,
len_labels,
set_option,
)
from flaml.automl.task.task import TS_FORECAST, TS_FORECASTPANEL, Task, get_classification_objective
from flaml.config import RANDOM_SEED
try:
from scipy.sparse import issparse
except ImportError:
pass
try:
from sklearn.utils import shuffle
from sklearn.model_selection import (
train_test_split,
RepeatedStratifiedKFold,
RepeatedKFold,
GroupKFold,
TimeSeriesSplit,
GroupShuffleSplit,
RepeatedKFold,
RepeatedStratifiedKFold,
StratifiedGroupKFold,
TimeSeriesSplit,
train_test_split,
)
from sklearn.utils import shuffle
except ImportError:
pass
@@ -49,19 +45,31 @@ class GenericTask(Task):
def estimators(self):
if self._estimators is None:
# put this into a function to avoid circular dependency
from flaml.automl.contrib.histgb import HistGradientBoostingEstimator
from flaml.automl.model import (
CatBoostEstimator,
ElasticNetEstimator,
ExtraTreesEstimator,
KNeighborsEstimator,
LassoLarsEstimator,
LGBMEstimator,
LRL1Classifier,
LRL2Classifier,
RandomForestEstimator,
SGDEstimator,
SparkAFTSurvivalRegressionEstimator,
SparkGBTEstimator,
SparkGLREstimator,
SparkLGBMEstimator,
SparkLinearRegressionEstimator,
SparkLinearSVCEstimator,
SparkNaiveBayesEstimator,
SparkRandomForestEstimator,
SVCEstimator,
TransformersEstimator,
TransformersEstimatorModelSelection,
XGBoostLimitDepthEstimator,
XGBoostSklearnEstimator,
)
self._estimators = {
@@ -70,6 +78,7 @@ class GenericTask(Task):
"rf": RandomForestEstimator,
"lgbm": LGBMEstimator,
"lgbm_spark": SparkLGBMEstimator,
"rf_spark": SparkRandomForestEstimator,
"lrl1": LRL1Classifier,
"lrl2": LRL2Classifier,
"catboost": CatBoostEstimator,
@@ -77,6 +86,17 @@ class GenericTask(Task):
"kneighbor": KNeighborsEstimator,
"transformer": TransformersEstimator,
"transformer_ms": TransformersEstimatorModelSelection,
"histgb": HistGradientBoostingEstimator,
"svc": SVCEstimator,
"sgd": SGDEstimator,
"nb_spark": SparkNaiveBayesEstimator,
"enet": ElasticNetEstimator,
"lassolars": LassoLarsEstimator,
"glr_spark": SparkGLREstimator,
"lr_spark": SparkLinearRegressionEstimator,
"svc_spark": SparkLinearSVCEstimator,
"gbt_spark": SparkGBTEstimator,
"aft_spark": SparkAFTSurvivalRegressionEstimator,
}
return self._estimators
@@ -268,8 +288,8 @@ class GenericTask(Task):
seed=RANDOM_SEED,
)
columns_to_drop = [c for c in df_all_train.columns if c in [stratify_column, "sample_weight"]]
X_train = df_all_train.drop(columns=columns_to_drop)
X_val = df_all_val.drop(columns=columns_to_drop)
y_train = df_all_train[stratify_column]
y_val = df_all_val[stratify_column]
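The hunk above switches `DataFrame.drop` to the explicit `columns=` keyword: the positional first argument is interpreted as index labels, which raises a `KeyError` when column names are passed. A minimal sketch of the difference, using a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "stratify": [0, 1]})

# Positional argument targets the index, so column names fail:
try:
    df.drop(["stratify"])
except KeyError:
    pass  # "['stratify'] not found in axis"

# The keyword form drops the intended columns:
trimmed = df.drop(columns=["stratify"])
assert list(trimmed.columns) == ["a", "b"]
```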
@@ -345,6 +365,465 @@ class GenericTask(Task):
X_train, X_val, y_train, y_val = GenericTask._split_pyspark(state, X, y, split_ratio, stratify)
return X_train, X_val, y_train, y_val
def _handle_missing_labels_fast(
self,
state,
X_train,
X_val,
y_train,
y_val,
X_train_all,
y_train_all,
is_spark_dataframe,
data_is_df,
):
"""Handle missing labels by adding first instance to the set with missing label.
This is the faster version that may create some overlap but ensures all labels
are present in both sets. If a label is missing from train, it adds the first
instance to train. If a label is missing from val, it adds the first instance to val.
If no labels are missing, no instances are duplicated.
Args:
state: The state object containing fit parameters
X_train, X_val: Training and validation features
y_train, y_val: Training and validation labels
X_train_all, y_train_all: Complete dataset
is_spark_dataframe: Whether data is pandas_on_spark
data_is_df: Whether data is DataFrame/Series
Returns:
Tuple of (X_train, X_val, y_train, y_val) with missing labels added
"""
# Check which labels are present in train and val sets
if is_spark_dataframe:
label_set_train, _ = unique_pandas_on_spark(y_train)
label_set_val, _ = unique_pandas_on_spark(y_val)
label_set_all, first = unique_value_first_index(y_train_all)
else:
label_set_all, first = unique_value_first_index(y_train_all)
label_set_train = np.unique(y_train)
label_set_val = np.unique(y_val)
# Find missing labels
missing_in_train = np.setdiff1d(label_set_all, label_set_train)
missing_in_val = np.setdiff1d(label_set_all, label_set_val)
# Add first instance of missing labels to train set
if len(missing_in_train) > 0:
missing_train_indices = []
for label in missing_in_train:
label_matches = np.where(label_set_all == label)[0]
if len(label_matches) > 0 and label_matches[0] < len(first):
missing_train_indices.append(first[label_matches[0]])
if len(missing_train_indices) > 0:
X_missing_train = (
iloc_pandas_on_spark(X_train_all, missing_train_indices)
if is_spark_dataframe
else X_train_all.iloc[missing_train_indices]
if data_is_df
else X_train_all[missing_train_indices]
)
y_missing_train = (
iloc_pandas_on_spark(y_train_all, missing_train_indices)
if is_spark_dataframe
else y_train_all.iloc[missing_train_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[missing_train_indices]
)
X_train = concat(X_missing_train, X_train)
y_train = concat(y_missing_train, y_train) if data_is_df else np.concatenate([y_missing_train, y_train])
# Handle sample_weight if present
if "sample_weight" in state.fit_kwargs:
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and max(missing_train_indices) < len(sample_weight_source):
missing_weights = (
sample_weight_source[missing_train_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[missing_train_indices]
)
state.fit_kwargs["sample_weight"] = concat(missing_weights, state.fit_kwargs["sample_weight"])
# Add first instance of missing labels to val set
if len(missing_in_val) > 0:
missing_val_indices = []
for label in missing_in_val:
label_matches = np.where(label_set_all == label)[0]
if len(label_matches) > 0 and label_matches[0] < len(first):
missing_val_indices.append(first[label_matches[0]])
if len(missing_val_indices) > 0:
X_missing_val = (
iloc_pandas_on_spark(X_train_all, missing_val_indices)
if is_spark_dataframe
else X_train_all.iloc[missing_val_indices]
if data_is_df
else X_train_all[missing_val_indices]
)
y_missing_val = (
iloc_pandas_on_spark(y_train_all, missing_val_indices)
if is_spark_dataframe
else y_train_all.iloc[missing_val_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[missing_val_indices]
)
X_val = concat(X_missing_val, X_val)
y_val = concat(y_missing_val, y_val) if data_is_df else np.concatenate([y_missing_val, y_val])
# Handle sample_weight if present
if (
"sample_weight" in state.fit_kwargs
and hasattr(state, "weight_val")
and state.weight_val is not None
):
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and max(missing_val_indices) < len(sample_weight_source):
missing_weights = (
sample_weight_source[missing_val_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[missing_val_indices]
)
state.weight_val = concat(missing_weights, state.weight_val)
return X_train, X_val, y_train, y_val
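The fast strategy above can be sketched in isolation: find the labels absent from a split with `np.setdiff1d` and prepend the first occurrence of each from the full label array. This is a simplified numpy-only sketch with toy data, not the FLAML code path (which also handles pandas-on-Spark and sample weights):

```python
import numpy as np

def add_missing_labels(y_all, y_split):
    """Prepend the first instance of each label missing from y_split."""
    labels_all, first_idx = np.unique(y_all, return_index=True)
    missing = np.setdiff1d(labels_all, np.unique(y_split))
    # indices of the first occurrence of each missing label in y_all
    take = [first_idx[np.where(labels_all == m)[0][0]] for m in missing]
    return np.concatenate([y_all[take], y_split])

y_all = np.array([0, 1, 2, 2, 1, 0])
y_train = np.array([0, 1, 1, 0])  # label 2 is missing from the split
fixed = add_missing_labels(y_all, y_train)
assert set(fixed) == {0, 1, 2}
```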
def _handle_missing_labels_no_overlap(
self,
state,
X_train,
X_val,
y_train,
y_val,
X_train_all,
y_train_all,
is_spark_dataframe,
data_is_df,
split_ratio,
):
"""Handle missing labels intelligently to avoid overlap when possible.
This is the slower but more precise version that:
- For single-instance classes: Adds to both sets (unavoidable overlap)
- For multi-instance classes: Re-splits them properly to avoid overlap
Args:
state: The state object containing fit parameters
X_train, X_val: Training and validation features
y_train, y_val: Training and validation labels
X_train_all, y_train_all: Complete dataset
is_spark_dataframe: Whether data is pandas_on_spark
data_is_df: Whether data is DataFrame/Series
split_ratio: The ratio for splitting
Returns:
Tuple of (X_train, X_val, y_train, y_val) with missing labels handled
"""
# Check which labels are present in train and val sets
if is_spark_dataframe:
label_set_train, _ = unique_pandas_on_spark(y_train)
label_set_val, _ = unique_pandas_on_spark(y_val)
label_set_all, first = unique_value_first_index(y_train_all)
else:
label_set_all, first = unique_value_first_index(y_train_all)
label_set_train = np.unique(y_train)
label_set_val = np.unique(y_val)
# Find missing labels
missing_in_train = np.setdiff1d(label_set_all, label_set_train)
missing_in_val = np.setdiff1d(label_set_all, label_set_val)
# Handle missing labels intelligently
# For classes with only 1 instance: add to both sets (unavoidable overlap)
# For classes with multiple instances: move/split them properly to avoid overlap
if len(missing_in_train) > 0:
# Process missing labels in training set
for label in missing_in_train:
# Find all indices for this label in the original data
if is_spark_dataframe:
label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist()
else:
label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist()
num_instances = len(label_indices)
if num_instances == 1:
# Single instance: must add to both train and val (unavoidable overlap)
X_single = (
iloc_pandas_on_spark(X_train_all, label_indices)
if is_spark_dataframe
else X_train_all.iloc[label_indices]
if data_is_df
else X_train_all[label_indices]
)
y_single = (
iloc_pandas_on_spark(y_train_all, label_indices)
if is_spark_dataframe
else y_train_all.iloc[label_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[label_indices]
)
X_train = concat(X_single, X_train)
y_train = concat(y_single, y_train) if data_is_df else np.concatenate([y_single, y_train])
# Handle sample_weight
if "sample_weight" in state.fit_kwargs:
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and label_indices[0] < len(sample_weight_source):
single_weight = (
sample_weight_source[label_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[label_indices]
)
state.fit_kwargs["sample_weight"] = concat(single_weight, state.fit_kwargs["sample_weight"])
else:
# Multiple instances: move some from val to train (no overlap needed)
# Calculate how many to move to train (leave at least 1 in val)
num_to_train = max(1, min(num_instances - 1, int(num_instances * (1 - split_ratio))))
indices_to_move = label_indices[:num_to_train]
X_to_move = (
iloc_pandas_on_spark(X_train_all, indices_to_move)
if is_spark_dataframe
else X_train_all.iloc[indices_to_move]
if data_is_df
else X_train_all[indices_to_move]
)
y_to_move = (
iloc_pandas_on_spark(y_train_all, indices_to_move)
if is_spark_dataframe
else y_train_all.iloc[indices_to_move]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[indices_to_move]
)
# Add to train
X_train = concat(X_to_move, X_train)
y_train = concat(y_to_move, y_train) if data_is_df else np.concatenate([y_to_move, y_train])
# Remove from val (they are currently all in val)
if is_spark_dataframe:
val_mask = ~y_val.isin([label])
X_val = X_val[val_mask]
y_val = y_val[val_mask]
else:
val_mask = np.asarray(y_val) != label
if data_is_df:
X_val = X_val[val_mask]
y_val = y_val[val_mask]
else:
X_val = X_val[val_mask]
y_val = y_val[val_mask]
# Add remaining instances back to val
remaining_indices = label_indices[num_to_train:]
if len(remaining_indices) > 0:
X_remaining = (
iloc_pandas_on_spark(X_train_all, remaining_indices)
if is_spark_dataframe
else X_train_all.iloc[remaining_indices]
if data_is_df
else X_train_all[remaining_indices]
)
y_remaining = (
iloc_pandas_on_spark(y_train_all, remaining_indices)
if is_spark_dataframe
else y_train_all.iloc[remaining_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[remaining_indices]
)
X_val = concat(X_remaining, X_val)
y_val = concat(y_remaining, y_val) if data_is_df else np.concatenate([y_remaining, y_val])
# Handle sample_weight
if "sample_weight" in state.fit_kwargs:
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source):
weights_to_move = (
sample_weight_source[indices_to_move]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[indices_to_move]
)
state.fit_kwargs["sample_weight"] = concat(
weights_to_move, state.fit_kwargs["sample_weight"]
)
if (
len(remaining_indices) > 0
and hasattr(state, "weight_val")
and state.weight_val is not None
):
# Remove and re-add weights for val
if isinstance(state.weight_val, np.ndarray):
state.weight_val = state.weight_val[val_mask]
else:
state.weight_val = state.weight_val[val_mask]
if max(remaining_indices) < len(sample_weight_source):
remaining_weights = (
sample_weight_source[remaining_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[remaining_indices]
)
state.weight_val = concat(remaining_weights, state.weight_val)
if len(missing_in_val) > 0:
# Process missing labels in validation set
for label in missing_in_val:
# Find all indices for this label in the original data
if is_spark_dataframe:
label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist()
else:
label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist()
num_instances = len(label_indices)
if num_instances == 1:
# Single instance: must add to both train and val (unavoidable overlap)
X_single = (
iloc_pandas_on_spark(X_train_all, label_indices)
if is_spark_dataframe
else X_train_all.iloc[label_indices]
if data_is_df
else X_train_all[label_indices]
)
y_single = (
iloc_pandas_on_spark(y_train_all, label_indices)
if is_spark_dataframe
else y_train_all.iloc[label_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[label_indices]
)
X_val = concat(X_single, X_val)
y_val = concat(y_single, y_val) if data_is_df else np.concatenate([y_single, y_val])
# Handle sample_weight
if "sample_weight" in state.fit_kwargs and hasattr(state, "weight_val"):
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and label_indices[0] < len(sample_weight_source):
single_weight = (
sample_weight_source[label_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[label_indices]
)
if state.weight_val is not None:
state.weight_val = concat(single_weight, state.weight_val)
else:
# Multiple instances: move some from train to val (no overlap needed)
# Calculate how many to move to val (leave at least 1 in train)
num_to_val = max(1, min(num_instances - 1, int(num_instances * split_ratio)))
indices_to_move = label_indices[:num_to_val]
X_to_move = (
iloc_pandas_on_spark(X_train_all, indices_to_move)
if is_spark_dataframe
else X_train_all.iloc[indices_to_move]
if data_is_df
else X_train_all[indices_to_move]
)
y_to_move = (
iloc_pandas_on_spark(y_train_all, indices_to_move)
if is_spark_dataframe
else y_train_all.iloc[indices_to_move]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[indices_to_move]
)
# Add to val
X_val = concat(X_to_move, X_val)
y_val = concat(y_to_move, y_val) if data_is_df else np.concatenate([y_to_move, y_val])
# Remove from train (they are currently all in train)
if is_spark_dataframe:
train_mask = ~y_train.isin([label])
X_train = X_train[train_mask]
y_train = y_train[train_mask]
else:
train_mask = np.asarray(y_train) != label
if data_is_df:
X_train = X_train[train_mask]
y_train = y_train[train_mask]
else:
X_train = X_train[train_mask]
y_train = y_train[train_mask]
# Add remaining instances back to train
remaining_indices = label_indices[num_to_val:]
if len(remaining_indices) > 0:
X_remaining = (
iloc_pandas_on_spark(X_train_all, remaining_indices)
if is_spark_dataframe
else X_train_all.iloc[remaining_indices]
if data_is_df
else X_train_all[remaining_indices]
)
y_remaining = (
iloc_pandas_on_spark(y_train_all, remaining_indices)
if is_spark_dataframe
else y_train_all.iloc[remaining_indices]
if isinstance(y_train_all, (pd.Series, psSeries))
else y_train_all[remaining_indices]
)
X_train = concat(X_remaining, X_train)
y_train = concat(y_remaining, y_train) if data_is_df else np.concatenate([y_remaining, y_train])
# Handle sample_weight
if "sample_weight" in state.fit_kwargs:
sample_weight_source = (
state.sample_weight_all
if hasattr(state, "sample_weight_all")
else state.fit_kwargs.get("sample_weight")
)
if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source):
weights_to_move = (
sample_weight_source[indices_to_move]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[indices_to_move]
)
if hasattr(state, "weight_val") and state.weight_val is not None:
state.weight_val = concat(weights_to_move, state.weight_val)
if len(remaining_indices) > 0:
# Remove and re-add weights for train
if isinstance(state.fit_kwargs["sample_weight"], np.ndarray):
state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask]
else:
state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask]
if max(remaining_indices) < len(sample_weight_source):
remaining_weights = (
sample_weight_source[remaining_indices]
if isinstance(sample_weight_source, np.ndarray)
else sample_weight_source.iloc[remaining_indices]
)
state.fit_kwargs["sample_weight"] = concat(
remaining_weights, state.fit_kwargs["sample_weight"]
)
return X_train, X_val, y_train, y_val
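The no-overlap variant moves instances of a multi-instance class between splits instead of duplicating them. The sizing rule used above, `max(1, min(n - 1, int(n * (1 - split_ratio))))`, guarantees at least one instance lands on each side; a quick check of that arithmetic:

```python
def split_count(n_instances, split_ratio):
    # number of instances of a missing class moved to train,
    # always leaving at least one in the validation split
    return max(1, min(n_instances - 1, int(n_instances * (1 - split_ratio))))

# With a 0.2 validation ratio, 10 instances send 8 to train, keep 2 in val.
assert split_count(10, 0.2) == 8
# Even tiny classes keep one instance on each side.
assert split_count(2, 0.2) == 1
assert split_count(3, 0.9) == 1
```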
def prepare_data(
self,
state,
@@ -357,6 +836,7 @@ class GenericTask(Task):
n_splits,
data_is_df,
sample_weight_full,
allow_label_overlap=True,
) -> int:
X_val, y_val = state.X_val, state.y_val
if issparse(X_val):
@@ -422,8 +902,8 @@ class GenericTask(Task):
X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
if data_is_df:
X_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
X_train, y_train = X_train_all, y_train_all
state.groups_all = state.groups
@@ -485,31 +965,47 @@ class GenericTask(Task):
elif self.is_classification():
# for classification, make sure the labels are complete in both
# training and validation data
stratify = y_train_all if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_train_all, y_train_all, split_ratio=split_ratio, stratify=stratify
)
# Handle missing labels using the appropriate strategy
if allow_label_overlap:
# Fast version: adds first instance to set with missing label (may create overlap)
X_train, X_val, y_train, y_val = self._handle_missing_labels_fast(
state,
X_train,
X_val,
y_train,
y_val,
X_train_all,
y_train_all,
is_spark_dataframe,
data_is_df,
)
else:
# Precise version: avoids overlap when possible (slower)
X_train, X_val, y_train, y_val = self._handle_missing_labels_no_overlap(
state,
X_train,
X_val,
y_train,
y_val,
X_train_all,
y_train_all,
is_spark_dataframe,
data_is_df,
split_ratio,
)
if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
y_train = y_train[y_train.columns[0]]
y_val = y_val[y_val.columns[0]]
# Only set name if y_train_all is a Series (not a DataFrame)
if isinstance(y_train_all, (pd.Series, psSeries)):
y_train.name = y_val.name = y_train_all.name
elif self.is_regression():
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_train_all, y_train_all, split_ratio=split_ratio
@@ -656,7 +1152,6 @@ class GenericTask(Task):
fit_kwargs = {}
if cv_score_agg_func is None:
cv_score_agg_func = default_cv_score_agg_func
val_loss_folds = []
log_metric_folds = []
metric = None
@@ -698,7 +1193,10 @@ class GenericTask(Task):
elif isinstance(kf, TimeSeriesSplit):
kf = kf.split(X_train_split, y_train_split)
else:
try:
kf = kf.split(X_train_split)
except TypeError:
kf = kf.split(X_train_split, y_train_split)
for train_index, val_index in kf:
if shuffle:
@@ -721,10 +1219,10 @@ class GenericTask(Task):
if not is_spark_dataframe:
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
if weight is not None:
fit_kwargs["sample_weight"], weight_val = (
weight[train_index],
weight[val_index],
fit_kwargs["sample_weight"] = (
weight[train_index] if isinstance(weight, np.ndarray) else weight.iloc[train_index]
)
weight_val = weight[val_index] if isinstance(weight, np.ndarray) else weight.iloc[val_index]
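The weight-indexing fix above selects rows positionally regardless of container type: plain `[]` for numpy arrays, `.iloc` for pandas Series, where `[]` would do label-based lookup and break under a non-default index. A small sketch of the distinction (toy data, not the FLAML state objects):

```python
import numpy as np
import pandas as pd

idx = [0, 2]
arr = np.array([0.5, 1.0, 2.0])
ser = pd.Series([0.5, 1.0, 2.0], index=[10, 11, 12])  # non-default labels

# Same pattern as in the diff: [] for ndarray, .iloc for Series.
picked = arr[idx] if isinstance(arr, np.ndarray) else arr.iloc[idx]
assert picked.tolist() == [0.5, 2.0]

# For the Series, .iloc gives positions; plain ser[idx] would look up
# labels 0 and 2, which do not exist here.
assert ser.iloc[idx].tolist() == [0.5, 2.0]
```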
if groups is not None:
fit_kwargs["groups"] = (
groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]
@@ -763,8 +1261,6 @@ class GenericTask(Task):
if is_spark_dataframe:
X_train.spark.unpersist() # uncache data to free memory
X_val.spark.unpersist() # uncache data to free memory
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
@@ -807,27 +1303,23 @@ class GenericTask(Task):
elif self.is_ts_forecastpanel():
estimator_list = ["tft"]
else:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
"rf_spark",
"sgd",
]
try:
import catboost
estimator_list += ["catboost"]
except ImportError:
pass
# if self.is_ts_forecast():
# # catboost is removed because it has a `name` parameter, making it incompatible with hcrystalball
# if "catboost" in estimator_list:
@@ -859,9 +1351,7 @@ class GenericTask(Task):
return metric
if self.is_nlp():
from flaml.automl.nlp.utils import load_default_huggingface_metric_for_task
return load_default_huggingface_metric_for_task(self.name)
elif self.is_binary():

View File

@@ -1,6 +1,8 @@
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
import numpy as np
from flaml.automl.data import DataFrame, Series, psDataFrame, psSeries
if TYPE_CHECKING:
@@ -190,7 +192,7 @@ class Task(ABC):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.

View File

@@ -2,26 +2,25 @@ import logging
import time
from typing import List
import numpy as np
import pandas as pd
from scipy.sparse import issparse
from sklearn.model_selection import (
GroupKFold,
TimeSeriesSplit,
)
from flaml.automl.ml import default_cv_score_agg_func, get_val_loss
from flaml.automl.task.task import (
TS_FORECAST,
TS_FORECASTPANEL,
Task,
get_classification_objective,
)
from flaml.automl.time_series.ts_data import (
DataTransformerTS,
TimeSeriesDataset,
normalize_ts_data,
)
logger = logging.getLogger(__name__)
@@ -33,18 +32,24 @@ class TimeSeriesTask(Task):
if self._estimators is None:
# put this into a function to avoid circular dependency
from flaml.automl.time_series import (
ARIMA,
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TCNEstimator,
TemporalFusionTransformerEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
)
self._estimators = {
@@ -58,8 +63,19 @@ class TimeSeriesTask(Task):
"holt-winters": HoltWinters,
"catboost": CatBoost_TS,
"tft": TemporalFusionTransformerEstimator,
"lassolars": LassoLars_TS,
"tcn": TCNEstimator,
"snaive": SeasonalNaive,
"naive": Naive,
"savg": SeasonalAverage,
"avg": Average,
}
if self._estimators["tcn"] is None:
# remove TCN if import failed
del self._estimators["tcn"]
logger.info("Couldn't import pytorch_lightning, skipping TCN estimator")
try:
from prophet import Prophet as foo
@@ -72,7 +88,7 @@ class TimeSeriesTask(Task):
self._estimators["orbit"] = Orbit
except ImportError:
logger.info("Couldn't import orbit, skipping")
return self._estimators
@@ -135,7 +151,7 @@ class TimeSeriesTask(Task):
raise ValueError("Must supply either X_train_all and y_train_all, or dataframe and label")
try:
dataframe.loc[:, self.time_col] = pd.to_datetime(dataframe[self.time_col])
except Exception:
raise ValueError(
f"For '{TS_FORECAST}' task, time column {self.time_col} must contain timestamp values."
@@ -370,9 +386,8 @@ class TimeSeriesTask(Task):
return X
def preprocess(self, X, transformer=None):
if isinstance(X, (pd.DataFrame, np.ndarray, pd.Series)):
X = normalize_ts_data(X.copy(), self.target_names, self.time_col)
return self._preprocess(X, transformer)
elif isinstance(X, int):
return X
@@ -513,7 +528,7 @@ def remove_ts_duplicates(
duplicates = X.duplicated()
if any(duplicates):
logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, time_col]}")
X = X.drop_duplicates()
logger.warning("Removed duplicate rows based on all columns")
assert (

View File

@@ -1,17 +1,27 @@
from .tft import TemporalFusionTransformerEstimator
from .ts_model import (
ARIMA,
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TimeSeriesEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
)
try:
from .tcn import TCNEstimator
except ImportError:
TCNEstimator = None
from .ts_data import TimeSeriesDataset

View File

@@ -1,5 +1,5 @@
import datetime
import math
from functools import lru_cache
import pandas as pd

View File

@@ -12,29 +12,35 @@ except ImportError:
DataFrame = Series = None
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int):
"""Transform input data X, y into autoregressive form - shift
them appropriately based on horizon and create `lags` columns.
"""Transform input data X, y into autoregressive form by creating `lags` columns.
This function is called automatically by FLAML during the training process
to convert time series data into a format suitable for sklearn-based regression
models (e.g., lgbm, rf, xgboost). Users do NOT need to manually call this function
or create lagged features themselves.
Parameters
----------
X : pandas.DataFrame
Input feature DataFrame, which may contain temporal features and/or exogenous variables.
y : array_like, (1d)
Target vector (time series values to forecast).
lags : int
Number of lagged time steps to use as features.
Returns
-------
pandas.DataFrame
Shifted dataframe with `lags` columns for each original feature.
The target variable y is also lagged to prevent data leakage
(i.e., we use y(t-1), y(t-2), ..., y(t-lags) to predict y(t)).
"""
lag_features = []
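As the docstring explains, lagged copies of the features (and of y itself) turn forecasting into tabular regression. A minimal pandas sketch of the idea, independent of the FLAML helper and using `Series.shift` on toy data:

```python
import pandas as pd

def simple_lag_features(y: pd.Series, lags: int) -> pd.DataFrame:
    # column "lag_k" holds y shifted forward by k steps, i.e. y(t-k)
    frame = pd.DataFrame({f"lag_{k}": y.shift(k) for k in range(1, lags + 1)})
    return frame.dropna()  # the first `lags` rows have no full history

y = pd.Series([10, 20, 30, 40, 50])
feats = simple_lag_features(y, lags=2)
assert list(feats.columns) == ["lag_1", "lag_2"]
assert feats.iloc[0].tolist() == [20.0, 10.0]  # at t=2: y(1)=20, y(0)=10
```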
@@ -55,6 +61,17 @@ def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int):
class SklearnWrapper:
"""Wrapper class for using sklearn-based models for time series forecasting.
This wrapper automatically handles the transformation of time series data into
a supervised learning format by creating lagged features. It trains separate
models for each step in the forecast horizon.
Users typically don't interact with this class directly - it's used internally
by FLAML when sklearn-based estimators (lgbm, rf, xgboost, etc.) are selected
for time series forecasting tasks.
"""
def __init__(
self,
model_class: type,
@@ -76,6 +93,8 @@ class SklearnWrapper:
self.pca = None
def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
if "is_retrain" in kwargs:
kwargs.pop("is_retrain")
self._X = X
self._y = y
@@ -92,7 +111,14 @@ class SklearnWrapper:
for i, model in enumerate(self.models):
offset = i + self.lags
if len(X) - offset > 2:
# series of length 2 trigger sklearn's "All features are either constant or ignored" error
# TODO: see why the non-constant features are ignored. Selector?
model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
elif len(X) > offset and "catboost" not in str(model).lower():
model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
else:
print("[INFO]: Data length should be longer than period + lags.")
return self
def predict(self, X, X_train=None, y_train=None):

View File

@@ -0,0 +1,286 @@
# This file is adapted from
# https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
# https://github.com/locuslab/TCN/blob/master/TCN/adding_problem/add_test.py
import datetime
import logging
import time
import pandas as pd
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.nn.utils import weight_norm
from torch.utils.data import DataLoader, TensorDataset
from flaml import tune
from flaml.automl.data import add_time_idx_col
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.time_series.ts_model import TimeSeriesEstimator
class Chomp1d(nn.Module):
def __init__(self, chomp_size):
super().__init__()
self.chomp_size = chomp_size
def forward(self, x):
return x[:, :, : -self.chomp_size].contiguous()
class TemporalBlock(nn.Module):
def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
super().__init__()
self.conv1 = weight_norm(
nn.Conv1d(n_inputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout(dropout)
self.conv2 = weight_norm(
nn.Conv1d(n_outputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp2 = Chomp1d(padding)
self.relu2 = nn.ReLU()
self.dropout2 = nn.Dropout(dropout)
self.net = nn.Sequential(
self.conv1, self.chomp1, self.relu1, self.dropout1, self.conv2, self.chomp2, self.relu2, self.dropout2
)
self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
self.relu = nn.ReLU()
self.init_weights()
def init_weights(self):
self.conv1.weight.data.normal_(0, 0.01)
self.conv2.weight.data.normal_(0, 0.01)
if self.downsample is not None:
self.downsample.weight.data.normal_(0, 0.01)
def forward(self, x):
out = self.net(x)
res = x if self.downsample is None else self.downsample(x)
return self.relu(out + res)
class TCNForecaster(nn.Module):
def __init__(
self,
input_feature_num,
num_outputs,
num_channels,
kernel_size=2,
dropout=0.2,
):
super().__init__()
layers = []
num_levels = len(num_channels)
for i in range(num_levels):
dilation_size = 2**i
in_channels = input_feature_num if i == 0 else num_channels[i - 1]
out_channels = num_channels[i]
layers += [
TemporalBlock(
in_channels,
out_channels,
kernel_size,
stride=1,
dilation=dilation_size,
padding=(kernel_size - 1) * dilation_size,
dropout=dropout,
)
]
self.network = nn.Sequential(*layers)
self.linear = nn.Linear(num_channels[-1], num_outputs)
def forward(self, x):
y1 = self.network(x)
return self.linear(y1[:, :, -1])
class TCNForecasterLightningModule(pl.LightningModule):
def __init__(self, model: TCNForecaster, learning_rate: float = 1e-3):
super().__init__()
self.model = model
self.learning_rate = learning_rate
self.loss_fn = nn.MSELoss()
def forward(self, x):
return self.model(x)
def step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = self.loss_fn(y_hat, y)
return loss
def training_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("train_loss", loss)
return loss
def validation_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("val_loss", loss)
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
class DataframeDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, target_column, features_columns, sequence_length, train=True):
self.data = torch.tensor(dataframe[features_columns].to_numpy(), dtype=torch.float)
self.sequence_length = sequence_length
if train:
self.labels = torch.tensor(dataframe[target_column].to_numpy(), dtype=torch.float)
self.is_train = train
def __len__(self):
return len(self.data) - self.sequence_length + 1
def __getitem__(self, idx):
data = self.data[idx : idx + self.sequence_length]
data = data.permute(1, 0)
if self.is_train:
label = self.labels[idx : idx + self.sequence_length]
return data, label
else:
return data
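`DataframeDataset` slices rolling windows out of the feature matrix: with `N` rows and window length `L` it yields `N - L + 1` samples, window `i` covering rows `i .. i+L-1` (then permuted to channels-first). The indexing scheme on a plain list:

```python
def windows(rows, sequence_length):
    n_samples = len(rows) - sequence_length + 1  # __len__
    return [rows[i : i + sequence_length] for i in range(n_samples)]  # __getitem__

print(windows([10, 11, 12, 13, 14], 3))  # [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
```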
class TCNEstimator(TimeSeriesEstimator):
"""The class for tuning TCN Forecaster"""
@classmethod
def search_space(cls, data, task, pred_horizon, **params):
space = {
"num_levels": {
"domain": tune.randint(lower=4, upper=20), # hidden = 2^num_hidden
"init_value": 4,
},
"num_hidden": {
"domain": tune.randint(lower=4, upper=8), # hidden = 2^num_hidden
"init_value": 5,
},
"kernel_size": {
"domain": tune.choice([2, 3, 5, 7]), # common choices for kernel size
"init_value": 3,
},
"dropout": {
"domain": tune.uniform(lower=0.0, upper=0.5), # standard range for dropout
"init_value": 0.1,
},
"learning_rate": {
"domain": tune.loguniform(lower=1e-4, upper=1e-1), # typical range for learning rate
"init_value": 1e-3,
},
}
return space
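`fit()` below expands the two sampled integers into the channel layout passed to `TCNForecaster`: every level gets `2**num_hidden` channels, so the tuned width stays a power of two:

```python
def channels_from_config(num_hidden, num_levels):
    # mirrors [2 ** self.params["num_hidden"]] * self.params["num_levels"]
    return [2 ** num_hidden] * num_levels

print(channels_from_config(num_hidden=5, num_levels=4))  # [32, 32, 32, 32]
```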
def __init__(self, task="ts_forecast", n_jobs=1, **params):
super().__init__(task, **params)
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
start_time = time.time()
if budget is not None:
deltabudget = datetime.timedelta(seconds=budget)
else:
deltabudget = None
X_train = self.enrich(X_train)
super().fit(X_train, y_train, budget, **kwargs)
self.batch_size = kwargs.get("batch_size", 64)
self.horizon = kwargs.get("period", 1)
self.feature_cols = X_train.time_varying_known_reals
self.target_col = X_train.target_names[0]
train_dataset = DataframeDataset(
X_train.train_data,
self.target_col,
self.feature_cols,
self.horizon,
)
train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=False)
if not X_train.test_data.empty:
val_dataset = DataframeDataset(
X_train.test_data,
self.target_col,
self.feature_cols,
self.horizon,
)
else:
val_dataset = DataframeDataset(
X_train.train_data.sample(frac=0.2, random_state=kwargs.get("random_state", 0)),
self.target_col,
self.feature_cols,
self.horizon,
)
val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)
model = TCNForecaster(
len(self.feature_cols),
self.horizon,
[2 ** self.params["num_hidden"]] * self.params["num_levels"],
self.params["kernel_size"],
self.params["dropout"],
)
pl_module = TCNForecasterLightningModule(model, self.params["learning_rate"])
        # Training loop.
        # `gpus` was deprecated in Lightning v1.7 and removed in v2.0;
        # accelerator="auto" handles all hardware configurations.
trainer = pl.Trainer(
max_epochs=kwargs.get("max_epochs", 10),
accelerator="auto",
callbacks=[
EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"),
LearningRateMonitor(),
],
logger=TensorBoardLogger(kwargs.get("log_dir", "logs/lightning_logs")), # logging results to a tensorboard
max_time=deltabudget,
enable_model_summary=False,
enable_progress_bar=False,
)
trainer.fit(
pl_module,
train_dataloaders=train_loader,
val_dataloaders=val_loader,
)
best_model = trainer.model
self._model = best_model
train_time = time.time() - start_time
return train_time
def predict(self, X):
X = self.enrich(X)
if isinstance(X, TimeSeriesDataset):
# Use X_train if X_val is empty (e.g., when computing training metrics)
df = X.X_val if len(X.test_data) > 0 else X.X_train
else:
df = X
dataset = DataframeDataset(
df,
self.target_col,
self.feature_cols,
self.horizon,
train=False,
)
data_loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)
self._model.eval()
raw_preds = []
for batch_x in data_loader:
raw_pred = self._model(batch_x)
raw_preds.append(raw_pred)
raw_preds = torch.cat(raw_preds, dim=0)
preds = pd.Series(raw_preds.detach().numpy().ravel())
return preds


@@ -1,3 +1,4 @@
import inspect
import time
try:
@@ -105,12 +106,18 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
def fit(self, X_train, y_train, budget=None, **kwargs):
import warnings
import pytorch_lightning as pl
try:
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
except ImportError:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
import torch
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
    # a bit of monkey patching to fix the macOS test;
    # all the log_prediction method appears to do is plot, which seems to break GitHub tests
@@ -131,12 +138,26 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
lr_logger = LearningRateMonitor() # log the learning rate
logger = TensorBoardLogger(kwargs.get("log_dir", "lightning_logs")) # logging results to a tensorboard
default_trainer_kwargs = dict(
gpus=self._kwargs.get("gpu_per_trial", [0]) if torch.cuda.is_available() else None,
max_epochs=max_epochs,
gradient_clip_val=gradient_clip_val,
callbacks=[lr_logger, early_stop_callback],
logger=logger,
)
# PyTorch Lightning >=2.0 replaced `gpus` with `accelerator`/`devices`.
# Also, passing `gpus=None` is not accepted on newer versions.
trainer_sig_params = inspect.signature(pl.Trainer.__init__).parameters
if torch.cuda.is_available() and "gpus" in trainer_sig_params:
gpus = self._kwargs.get("gpu_per_trial", None)
if gpus is not None:
default_trainer_kwargs["gpus"] = gpus
elif torch.cuda.is_available() and "devices" in trainer_sig_params:
devices = self._kwargs.get("gpu_per_trial", None)
if devices == -1:
devices = "auto"
if devices is not None:
default_trainer_kwargs["accelerator"] = "gpu"
default_trainer_kwargs["devices"] = devices
trainer = pl.Trainer(
**default_trainer_kwargs,
)
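The signature probe generalizes: any constructor argument that exists only on some library versions can be gated on `inspect.signature`. A self-contained sketch with two hypothetical trainer classes standing in for Lightning before and after 2.0 (the real code inspects `pl.Trainer.__init__` the same way):

```python
import inspect

class OldTrainer:  # hypothetical stand-in for Lightning < 2.0
    def __init__(self, max_epochs=10, gpus=None):
        self.kwargs = dict(max_epochs=max_epochs, gpus=gpus)

class NewTrainer:  # hypothetical stand-in for Lightning >= 2.0
    def __init__(self, max_epochs=10, accelerator="auto", devices="auto"):
        self.kwargs = dict(max_epochs=max_epochs, accelerator=accelerator, devices=devices)

def gpu_kwargs(trainer_cls, gpu_per_trial):
    # pick version-appropriate GPU arguments by probing the constructor signature
    params = inspect.signature(trainer_cls.__init__).parameters
    if "gpus" in params:
        return {"gpus": gpu_per_trial}
    if "devices" in params:
        devices = "auto" if gpu_per_trial == -1 else gpu_per_trial
        return {"accelerator": "gpu", "devices": devices}
    return {}

print(gpu_kwargs(OldTrainer, 1))   # {'gpus': 1}
print(gpu_kwargs(NewTrainer, -1))  # {'accelerator': 'gpu', 'devices': 'auto'}
```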
@@ -156,7 +177,14 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
val_dataloaders=val_dataloader,
)
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
# PyTorch 2.6 changed `torch.load` default `weights_only` from False -> True.
# Some Lightning checkpoints (including those produced here) can require full unpickling.
# This path is generated locally during training, so it's trusted.
load_sig_params = inspect.signature(TemporalFusionTransformer.load_from_checkpoint).parameters
if "weights_only" in load_sig_params:
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path, weights_only=False)
else:
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
train_time = time.time() - current_time
self._model = best_tft
return train_time
@@ -169,7 +197,11 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
last_data_cols = self.group_ids.copy()
last_data_cols.append(self.target_names[0])
last_data = self.data[lambda x: x.time_idx == x.time_idx.max()][last_data_cols]
decoder_data = X.X_val if isinstance(X, TimeSeriesDataset) else X
# Use X_train if test_data is empty (e.g., when computing training metrics)
if isinstance(X, TimeSeriesDataset):
decoder_data = X.X_val if len(X.test_data) > 0 else X.X_train
else:
decoder_data = X
if "time_idx" not in decoder_data:
decoder_data = add_time_idx_col(decoder_data)
decoder_data["time_idx"] += encoder_data["time_idx"].max() + 1 - decoder_data["time_idx"].min()


@@ -2,17 +2,18 @@ import copy
import datetime
import math
from dataclasses import dataclass, field
from typing import List, Optional, Callable, Dict, Generator, Union
from typing import Callable, Dict, Generator, List, Optional, Union
import numpy as np
try:
import pandas as pd
from pandas import DataFrame, Series, to_datetime
from pandas.api.types import is_datetime64_any_dtype
from scipy.sparse import issparse
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from .feature import monthly_fourier_features
except ImportError:
@@ -26,6 +27,8 @@ except ImportError:
DataFrame = Series = None
# dataclass would drop an empty default value even with field(default_factory=lambda: []),
# so use default=None to keep the attribute in place.
@dataclass
class TimeSeriesDataset:
train_data: pd.DataFrame
@@ -34,10 +37,10 @@ class TimeSeriesDataset:
target_names: List[str]
frequency: str
test_data: pd.DataFrame
time_varying_known_categoricals: List[str] = field(default_factory=lambda: [])
time_varying_known_reals: List[str] = field(default_factory=lambda: [])
time_varying_unknown_categoricals: List[str] = field(default_factory=lambda: [])
time_varying_unknown_reals: List[str] = field(default_factory=lambda: [])
time_varying_known_categoricals: List[str] = field(default=None)
time_varying_known_reals: List[str] = field(default=None)
time_varying_unknown_categoricals: List[str] = field(default=None)
time_varying_unknown_reals: List[str] = field(default=None)
def __init__(
self,
@@ -118,7 +121,12 @@ class TimeSeriesDataset:
@property
def X_all(self) -> pd.DataFrame:
return pd.concat([self.X_train, self.X_val], axis=0)
# Remove empty or all-NA columns before concatenation
X_train_filtered = self.X_train.dropna(axis=1, how="all")
X_val_filtered = self.X_val.dropna(axis=1, how="all")
# Concatenate the filtered DataFrames
return pd.concat([X_train_filtered, X_val_filtered], axis=0)
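The all-NA filter mirrors `DataFrame.dropna(axis=1, how="all")`; newer pandas has deprecated silently including empty or all-NA columns in `concat`, hence the pre-filter. The same filter on a dict-of-lists stand-in for a frame:

```python
def drop_all_na_columns(frame):
    # keep a column only if it has at least one non-missing value
    return {col: vals for col, vals in frame.items() if any(v is not None for v in vals)}

frame = {"a": [1, 2], "b": [None, None], "c": [None, 3]}
print(drop_all_na_columns(frame))  # {'a': [1, 2], 'c': [None, 3]}
```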
@property
def y_train(self) -> pd.DataFrame:
@@ -390,8 +398,17 @@ class DataTransformerTS:
assert len(self.num_columns) == 0, "Trying to call fit() twice, something is wrong"
for column in X.columns:
# Never treat the time column as a feature for sklearn preprocessing
if column == self.time_col:
continue
# Robust datetime detection (covers datetime64[ms/us/ns], tz-aware, etc.)
if is_datetime64_any_dtype(X[column]):
self.datetime_columns.append(column)
continue
# sklearn/utils/validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if (
# drop columns where all values are the same
X[column].nunique() == 1
@@ -403,7 +420,7 @@ class DataTransformerTS:
self.cat_columns.append(column)
elif X[column].nunique(dropna=True) < 2:
self.drop_columns.append(column)
elif X[column].dtype.name == "datetime64[ns]":
elif X[column].dtype.name in ["datetime64[ns]", "datetime64[s]"]:
pass # these will be processed at model level,
# so they can also be done in the predict method
else:
@@ -460,7 +477,7 @@ class DataTransformerTS:
if "__NAN__" not in X[col].cat.categories:
X[col] = X[col].cat.add_categories("__NAN__").fillna("__NAN__")
else:
X[col] = X[col].fillna("__NAN__")
X[col] = X[col].fillna("__NAN__").infer_objects(copy=False)
X[col] = X[col].astype("category")
for column in self.num_columns:
@@ -529,14 +546,12 @@ def normalize_ts_data(X_train_all, target_names, time_col, y_train_all=None):
def validate_data_basic(X_train_all, y_train_all):
assert isinstance(X_train_all, np.ndarray) or issparse(X_train_all) or isinstance(X_train_all, pd.DataFrame), (
"X_train_all must be a numpy array, a pandas dataframe, " "or Scipy sparse matrix."
)
assert isinstance(X_train_all, (np.ndarray, DataFrame)) or issparse(
X_train_all
), "X_train_all must be a numpy array, a pandas dataframe, or Scipy sparse matrix."
assert (
isinstance(y_train_all, np.ndarray)
or isinstance(y_train_all, pd.Series)
or isinstance(y_train_all, pd.DataFrame)
assert isinstance(
y_train_all, (np.ndarray, pd.Series, pd.DataFrame)
), "y_train_all must be a numpy array or a pandas series or DataFrame."
assert X_train_all.size != 0 and y_train_all.size != 0, "Input data must not be empty, use None if no data"


@@ -1,8 +1,8 @@
import time
import logging
import os
from datetime import datetime
import math
import os
import time
from datetime import datetime
from typing import List, Optional, Union
try:
@@ -22,26 +22,27 @@ except ImportError:
import numpy as np
from flaml import tune
from flaml.model import (
suppress_stdout_stderr,
SKLearnEstimator,
logger,
LGBMEstimator,
XGBoostSklearnEstimator,
RandomForestEstimator,
ExtraTreesEstimator,
XGBoostLimitDepthEstimator,
from flaml.automl.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.model import (
CatBoostEstimator,
)
from flaml.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.time_series.ts_data import (
TimeSeriesDataset,
enrich_dataset,
enrich_dataframe,
normalize_ts_data,
create_forward_frame,
ExtraTreesEstimator,
LassoLarsEstimator,
LGBMEstimator,
RandomForestEstimator,
SKLearnEstimator,
XGBoostLimitDepthEstimator,
XGBoostSklearnEstimator,
logger,
suppress_stdout_stderr,
)
from flaml.automl.task import Task
from flaml.automl.time_series.ts_data import (
TimeSeriesDataset,
create_forward_frame,
enrich_dataframe,
enrich_dataset,
normalize_ts_data,
)
class TimeSeriesEstimator(SKLearnEstimator):
@@ -143,6 +144,7 @@ class TimeSeriesEstimator(SKLearnEstimator):
def score(self, X_val: DataFrame, y_val: Series, **kwargs):
from sklearn.metrics import r2_score
from ..ml import metric_loss_score
y_pred = self.predict(X_val, **kwargs)
@@ -192,7 +194,13 @@ class Orbit(TimeSeriesEstimator):
elif isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[[self.time_col] + X.regressors]
# By default we predict on the dataset's test partition.
# Some internal call paths (e.g., training-metric logging) may pass a
# dataset whose test partition is empty; fall back to train partition.
if data.test_data is not None and len(data.test_data):
X = data.test_data[data.regressors + [data.time_col]]
else:
X = data.train_data[data.regressors + [data.time_col]]
if self._model is not None:
forecast = self._model.predict(X, **kwargs)
@@ -299,7 +307,13 @@ class Prophet(TimeSeriesEstimator):
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[data.regressors + [data.time_col]]
# By default we predict on the dataset's test partition.
# Some internal call paths (e.g., training-metric logging) may pass a
# dataset whose test partition is empty; fall back to train partition.
if data.test_data is not None and len(data.test_data):
X = data.test_data[data.regressors + [data.time_col]]
else:
X = data.train_data[data.regressors + [data.time_col]]
X = X.rename(columns={self.time_col: "ds"})
if self._model is not None:
@@ -325,11 +339,19 @@ class StatsModelsEstimator(TimeSeriesEstimator):
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[data.regressors + [data.time_col]]
# By default we predict on the dataset's test partition.
# Some internal call paths (e.g., training-metric logging) may pass a
# dataset whose test partition is empty; fall back to train partition.
if data.test_data is not None and len(data.test_data):
X = data.test_data[data.regressors + [data.time_col]]
else:
X = data.train_data[data.regressors + [data.time_col]]
else:
X = X[self.regressors + [self.time_col]]
if isinstance(X, DataFrame):
if X.shape[0] == 0:
return pd.Series([], name=self.target_names[0], dtype=float)
start = X[self.time_col].iloc[0]
end = X[self.time_col].iloc[-1]
if len(self.regressors):
@@ -610,15 +632,13 @@ class HoltWinters(StatsModelsEstimator):
): # this would prevent heuristic initialization to work properly
self.params["seasonal"] = None
if (
self.params["seasonal"] == "mul" and (train_df.y == 0).sum() > 0
self.params["seasonal"] == "mul" and (train_df[target_col] == 0).sum() > 0
): # cannot have multiplicative seasonality in this case
self.params["seasonal"] = "add"
if self.params["trend"] == "mul" and (train_df.y == 0).sum() > 0:
if self.params["trend"] == "mul" and (train_df[target_col] == 0).sum() > 0:
self.params["trend"] = "add"
if not self.params["seasonal"] or self.params["trend"] not in ["mul", "add"]:
self.params["damped_trend"] = False
model = HWExponentialSmoothing(
train_df[[target_col]],
damped_trend=self.params["damped_trend"],
@@ -632,6 +652,125 @@ class HoltWinters(StatsModelsEstimator):
return train_time
class SimpleForecaster(StatsModelsEstimator):
"""Base class for Naive Forecaster like Seasonal Naive, Naive, Seasonal Average, Average"""
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {
"season": {
"domain": tune.randint(1, pred_horizon),
"init_value": pred_horizon,
}
}
def joint_preprocess(self, X_train, y_train=None):
X_train = self.enrich(X_train)
self.regressors = []
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
# this class only supports univariate regression
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
        else:
            target_col = TS_VALUE_COL
            train_df = self._join(X_train, y_train)
        if isinstance(X_train, TimeSeriesDataset):
            # time_col/target_names are only available on TimeSeriesDataset inputs
            self.time_col = X_train.time_col
            self.target_names = X_train.target_names
train_df = self._preprocess(train_df)
return train_df, target_col
def fit(self, X_train, y_train=None, budget=None, **kwargs):
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
self.season = self.params.get("season", 1)
current_time = time.time()
super().fit(X_train, y_train, budget=budget, **kwargs)
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = SimpleExpSmoothing(
train_df[[target_col]],
)
with suppress_stdout_stderr():
model = model.fit(smoothing_level=self.smoothing_level)
train_time = time.time() - current_time
self._model = model
return train_time
class SeasonalNaive(SimpleForecaster):
smoothing_level = 1.0
def predict(self, X, **kwargs):
if isinstance(X, int):
forecasts = []
for i in range(X):
forecast = self._model.forecast(steps=self.season)[0]
forecasts.append(forecast)
return pd.Series(forecasts)
else:
return super().predict(X, **kwargs)
class Naive(SimpleForecaster):
smoothing_level = 0.0
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def predict(self, X, **kwargs):
if isinstance(X, int):
last_observation = self._model.params["initial_level"]
return pd.Series([last_observation] * X)
else:
return super().predict(X, **kwargs)
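The two `smoothing_level` settings above encode the two naive strategies: with `alpha = 1.0` the fitted level of simple exponential smoothing tracks the last observation (seasonal naive), and with `alpha = 0.0` it never moves off the initial level (naive). A pure-Python sketch of the recursion `level = alpha * y + (1 - alpha) * level`:

```python
def exp_smooth_level(series, alpha, initial_level=None):
    # simple exponential smoothing: the final level is the one-step forecast
    level = series[0] if initial_level is None else initial_level
    for y in series:
        level = alpha * y + (1 - alpha) * level
    return level

series = [5.0, 7.0, 9.0]
print(exp_smooth_level(series, alpha=1.0))  # 9.0 -> forecast = last value
print(exp_smooth_level(series, alpha=0.0))  # 5.0 -> forecast = initial level
```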
class SeasonalAverage(SimpleForecaster):
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
start_time = time.time()
self.season = kwargs.get("season", 1) # seasonality period
train_df, target_col = self.joint_preprocess(X_train, y_train)
selection_res = ar_select_order(train_df[target_col], maxlag=self.season)
# Fit autoregressive model with optimal order
model = AutoReg(train_df[target_col], lags=selection_res.ar_lags)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
class Average(SimpleForecaster):
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg
start_time = time.time()
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = AutoReg(train_df[target_col], lags=0)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
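`AutoReg` with `lags=0` fits only an intercept, so the `Average` forecaster predicts the series mean at every horizon. A pure-Python sketch of that behavior:

```python
def average_forecast(series, steps):
    # lags=0 autoregression reduces to forecasting the historical mean
    mean = sum(series) / len(series)
    return [mean] * steps

print(average_forecast([2.0, 4.0, 6.0], steps=2))  # [4.0, 4.0]
```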
class TS_SKLearn(TimeSeriesEstimator):
"""The class for tuning SKLearn Regressors for time-series forecasting"""
@@ -710,6 +849,13 @@ class TS_SKLearn(TimeSeriesEstimator):
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data
# By default we predict on the dataset's test partition.
# Some internal call paths (e.g., training-metric logging) may pass a
# dataset whose test partition is empty; fall back to train partition.
if data.test_data is not None and len(data.test_data):
X = data.test_data
else:
X = data.train_data
if self._model is not None:
X = X[self.regressors]
@@ -758,3 +904,7 @@ class XGBoostLimitDepth_TS(TS_SKLearn):
# catboost regressor is invalid because it has a `name` parameter, making it incompatible with hcrystalball
class CatBoost_TS(TS_SKLearn):
base_class = CatBoostEstimator
class LassoLars_TS(TS_SKLearn):
base_class = LassoLarsEstimator


@@ -4,14 +4,14 @@
"""
import json
from typing import IO
from contextlib import contextmanager
import logging
from contextlib import contextmanager
from typing import IO
logger = logging.getLogger("flaml.automl")
class TrainingLogRecord(object):
class TrainingLogRecord:
def __init__(
self,
record_id: int,
@@ -52,7 +52,7 @@ class TrainingLogCheckPoint(TrainingLogRecord):
self.curr_best_record_id = curr_best_record_id
class TrainingLogWriter(object):
class TrainingLogWriter:
def __init__(self, output_filename: str):
self.output_filename = output_filename
self.file = None
@@ -79,7 +79,7 @@ class TrainingLogWriter(object):
sample_size,
):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if validation_loss is None:
raise ValueError("TEST LOSS NONE ERROR!!!")
record = TrainingLogRecord(
@@ -109,7 +109,7 @@ class TrainingLogWriter(object):
def checkpoint(self):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if self.current_best_loss_record_id is None:
logger.warning("flaml.training_log: checkpoint() called before any record is written, skipped.")
return
@@ -124,7 +124,7 @@ class TrainingLogWriter(object):
self.file = None # for pickle
class TrainingLogReader(object):
class TrainingLogReader:
def __init__(self, filename: str):
self.filename = filename
self.file = None
@@ -134,7 +134,7 @@ class TrainingLogReader(object):
def records(self):
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for line in self.file:
data = json.loads(line)
if len(data) == 1:
@@ -149,7 +149,7 @@ class TrainingLogReader(object):
def get_record(self, record_id) -> TrainingLogRecord:
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for rec in self.records():
if rec.record_id == record_id:
return rec
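The reader consumes a JSON-lines log: one JSON object per line, where a single-key line marks a checkpoint rather than a record (the `len(data) == 1` branch above). A minimal sketch of that format and filter:

```python
import io
import json

log = io.StringIO(
    '{"record_id": 0, "validation_loss": 0.5}\n'
    '{"curr_best_record_id": 0}\n'  # checkpoint line: single key, skipped
    '{"record_id": 1, "validation_loss": 0.4}\n'
)
records = [d for d in map(json.loads, log) if len(d) > 1]
print([r["record_id"] for r in records])  # [0, 1]
```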


@@ -1,9 +0,0 @@
import warnings
from flaml.automl.data import *
warnings.warn(
"Importing from `flaml.data` is deprecated. Please use `flaml.automl.data`.",
DeprecationWarning,
)


@@ -14,7 +14,6 @@ estimator.fit(X_train, y_train)
estimator.predict(X_test, y_test)
```
1. Use `AutoML.fit()`: set `starting_points="data"` and `max_iter=0`.
```python
@@ -36,10 +35,17 @@ automl.fit(X_train, y_train, **automl_settings)
from flaml.default import preprocess_and_suggest_hyperparams
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
"classification", X_train, y_train, "lgbm"
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42
)
(
hyperparams,
estimator_class,
X_transformed,
y_transformed,
feature_transformer,
label_transformer,
) = preprocess_and_suggest_hyperparams("classification", X_train, y_train, "lgbm")
model = estimator_class(**hyperparams) # estimator_class is LGBMClassifier
model.fit(X_transformed, y_train) # LGBMClassifier can handle raw labels
X_test = feature_transformer.transform(X_test) # preprocess test data
@@ -172,7 +178,7 @@ Change "binary" into "multiclass" or "regression" for the other tasks.
For more technical details, please check our research paper.
* [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. arXiv preprint arXiv:2202.09927 (2022).
- [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. arXiv preprint arXiv:2202.09927 (2022).
```bibtex
@article{Kayali2022default,

View File

@@ -1,18 +1,18 @@
from .suggest import (
suggest_config,
suggest_learner,
suggest_hyperparams,
preprocess_and_suggest_hyperparams,
meta_feature,
)
from .estimator import (
flamlize_estimator,
LGBMClassifier,
LGBMRegressor,
XGBClassifier,
XGBRegressor,
RandomForestClassifier,
RandomForestRegressor,
ExtraTreesClassifier,
ExtraTreesRegressor,
LGBMClassifier,
LGBMRegressor,
RandomForestClassifier,
RandomForestRegressor,
XGBClassifier,
XGBRegressor,
flamlize_estimator,
)
from .suggest import (
meta_feature,
preprocess_and_suggest_hyperparams,
suggest_config,
suggest_hyperparams,
suggest_learner,
)


@@ -1,5 +1,7 @@
from functools import wraps
from flaml.automl.task.task import CLASSIFICATION
from .suggest import preprocess_and_suggest_hyperparams
DEFAULT_LOCATION = "default_location"
@@ -93,6 +95,27 @@ def flamlize_estimator(super_class, name: str, task: str, alternatives=None):
def fit(self, X, y, *args, **params):
hyperparams, estimator_name, X, y_transformed = self.suggest_hyperparams(X, y)
self.set_params(**hyperparams)
# Transform eval_set if present
if "eval_set" in params and params["eval_set"] is not None:
transformed_eval_set = []
for eval_X, eval_y in params["eval_set"]:
# Transform features
eval_X_transformed = self._feature_transformer.transform(eval_X)
# Transform labels if applicable
if self._label_transformer and estimator_name in [
"rf",
"extra_tree",
"xgboost",
"xgb_limitdepth",
"choose_xgb",
]:
eval_y_transformed = self._label_transformer.transform(eval_y)
transformed_eval_set.append((eval_X_transformed, eval_y_transformed))
else:
transformed_eval_set.append((eval_X_transformed, eval_y))
params["eval_set"] = transformed_eval_set
if self._label_transformer and estimator_name in [
"rf",
"extra_tree",
@@ -105,7 +128,12 @@ def flamlize_estimator(super_class, name: str, task: str, alternatives=None):
# if hasattr(self, "_classes"):
# self._classes = self._label_transformer.classes_
# else:
self.classes_ = self._label_transformer.classes_
try:
self.classes_ = self._label_transformer.classes_
except AttributeError:
# xgboost 2: AttributeError: can't set attribute
if "xgb" not in estimator_name:
raise
if "xgb" not in estimator_name:
# rf and et would do inverse transform automatically; xgb doesn't
self._label_transformer = None
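The `try/except AttributeError` above guards against estimators where `classes_` is a read-only property (xgboost 2). The pattern, sketched with a hypothetical stand-in class:

```python
class ReadOnlyClasses:  # hypothetical stand-in for an xgboost >= 2 estimator
    @property
    def classes_(self):
        return ["a", "b"]

def set_classes(est, classes):
    try:
        est.classes_ = classes
    except AttributeError:
        pass  # read-only property; the estimator manages classes_ itself

est = ReadOnlyClasses()
set_classes(est, ["x"])
print(est.classes_)  # ['a', 'b'] -- assignment was safely ignored
```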


@@ -1,7 +1,7 @@
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import RobustScaler
def _augment(row):
@@ -12,7 +12,7 @@ def _augment(row):
def construct_portfolio(regret_matrix, meta_features, regret_bound):
"""The portfolio construction algorithm.
(Reference)[https://arxiv.org/abs/2202.09927].
Reference: [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927).
Args:
regret_matrix: A dataframe of regret matrix.
@@ -32,6 +32,7 @@ def construct_portfolio(regret_matrix, meta_features, regret_bound):
if meta_features is not None:
scaler = RobustScaler()
meta_features = meta_features.loc[tasks]
meta_features = meta_features.astype(float)
meta_features.loc[:, :] = scaler.fit_transform(meta_features)
nearest_task = {}
for t in tasks:


@@ -1,11 +1,13 @@
import pandas as pd
import numpy as np
import argparse
from pathlib import Path
import json
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from flaml.default import greedy
from flaml.default.regret import load_result, build_regret
from flaml.default.regret import build_regret, load_result
from flaml.version import __version__
regret_bound = 0.01
@@ -24,6 +26,7 @@ def config_predictor_tuple(tasks, configs, meta_features, regret_matrix):
# pre-processing
scaler = RobustScaler()
meta_features_norm = meta_features.loc[tasks] # this makes a copy
meta_features_norm = meta_features_norm.astype(float)
meta_features_norm.loc[:, :] = scaler.fit_transform(meta_features_norm)
proc = {
@@ -67,7 +70,7 @@ def build_portfolio(meta_features, regret, strategy):
def load_json(filename):
"""Returns the contents of json file filename."""
with open(filename, "r") as f:
with open(filename) as f:
return json.load(f)


@@ -1,5 +1,6 @@
import argparse
from os import path
import pandas as pd


@@ -1,11 +1,13 @@
import numpy as np
import json
import logging
import pathlib
import json
import numpy as np
from flaml.automl.data import DataTransformer
from flaml.automl.task.task import CLASSIFICATION, get_classification_objective
from flaml.automl.task.generic_task import len_labels
from flaml.automl.task.factory import task_factory
from flaml.automl.task.generic_task import len_labels
from flaml.automl.task.task import CLASSIFICATION, get_classification_objective
from flaml.version import __version__
try:
@@ -41,7 +43,7 @@ def meta_feature(task, X_train, y_train, meta_feature_names):
# 'numpy.ndarray' object has no attribute 'select_dtypes'
this_feature.append(1) # all features are numeric
else:
raise ValueError("Feature {} not implemented. ".format(each_feature_name))
raise ValueError(f"Feature {each_feature_name} not implemented. ")
return this_feature
@@ -55,7 +57,7 @@ def load_config_predictor(estimator_name, task, location=None):
task = "multiclass" if task == "multi" else task # TODO: multi -> multiclass?
try:
location = location or LOCATION
with open(f"{location}/{estimator_name}/{task}.json", "r") as f:
with open(f"{location}/{estimator_name}/{task}.json") as f:
CONFIG_PREDICTORS[key] = predictor = json.load(f)
except FileNotFoundError:
raise FileNotFoundError(f"Portfolio has not been built for {estimator_name} on {task} task.")

flaml/fabric/__init__.py (new file, 0 lines)

flaml/fabric/mlflow.py (new file, 1039 lines; diff suppressed because it is too large)


@@ -2,7 +2,6 @@ import warnings
from flaml.automl.ml import *
warnings.warn(
"Importing from `flaml.ml` is deprecated. Please use `flaml.automl.ml`.",
DeprecationWarning,


@@ -1,9 +0,0 @@
import warnings
from flaml.automl.model import *
warnings.warn(
"Importing from `flaml.model` is deprecated. Please use `flaml.automl.model`.",
DeprecationWarning,
)


@@ -1,10 +1,11 @@
# ChaCha for Online AutoML
FLAML includes *ChaCha* which is an automatic hyperparameter tuning solution for online machine learning. Online machine learning has the following properties: (1) data comes in sequential order; and (2) the performance of the machine learning model is evaluated online, i.e., at every iteration. *ChaCha* performs online AutoML respecting the aforementioned properties of online learning, and at the same time respecting the following constraints: (1) only a small constant number of 'live' models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a 'live' model with a new one, the replaced model can no longer be retrieved.
For more technical details about *ChaCha*, please check our paper.
* [ChaCha for Online AutoML](https://www.microsoft.com/en-us/research/publication/chacha-for-online-automl/). Qingyun Wu, Chi Wang, John Langford, Paul Mineiro and Marco Rossi. ICML 2021.
- [ChaCha for Online AutoML](https://www.microsoft.com/en-us/research/publication/chacha-for-online-automl/). Qingyun Wu, Chi Wang, John Langford, Paul Mineiro and Marco Rossi. ICML 2021.
```
@inproceedings{wu2021chacha,
title={ChaCha for online AutoML},
@@ -23,8 +24,9 @@ An example of online namespace interactions tuning in VW:
```python
# require: pip install flaml[vw]
from flaml import AutoVW
'''create an AutoVW instance for tuning namespace interactions'''
autovw = AutoVW(max_live_model_num=5, search_space={'interactions': AutoVW.AUTOMATIC})
"""create an AutoVW instance for tuning namespace interactions"""
autovw = AutoVW(max_live_model_num=5, search_space={"interactions": AutoVW.AUTOMATIC})
```
An example of online tuning of both namespace interactions and learning rate in VW:
@@ -33,12 +35,18 @@ An example of online tuning of both namespace interactions and learning rate in
# require: pip install flaml[vw]
from flaml import AutoVW
from flaml.tune import loguniform
''' create an AutoVW instance for tuning namespace interactions and learning rate'''
""" create an AutoVW instance for tuning namespace interactions and learning rate"""
# set up the search space and init config
search_space_nilr = {'interactions': AutoVW.AUTOMATIC, 'learning_rate': loguniform(lower=2e-10, upper=1.0)}
init_config_nilr = {'interactions': set(), 'learning_rate': 0.5}
search_space_nilr = {
"interactions": AutoVW.AUTOMATIC,
"learning_rate": loguniform(lower=2e-10, upper=1.0),
}
init_config_nilr = {"interactions": set(), "learning_rate": 0.5}
# create an AutoVW instance
autovw = AutoVW(max_live_model_num=5, search_space=search_space_nilr, init_config=init_config_nilr)
autovw = AutoVW(
max_live_model_num=5, search_space=search_space_nilr, init_config=init_config_nilr
)
```
A user can use the resulting AutoVW instances `autovw` in a similar way to a vanilla Vowpal Wabbit instance, i.e., `pyvw.vw`, to perform online learning by iteratively calling its `predict(data_example)` and `learn(data_example)` functions at each data example.
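The predict/learn loop described above can be sketched as follows. AutoVW itself requires `pip install flaml[vw]`, so this sketch substitutes a hypothetical stand-in model exposing the same `predict`/`learn` interface; the stand-in class, its update rule, and the data stream are illustrative assumptions, not part of FLAML.

```python
class _StandInModel:
    """Hypothetical stand-in exposing AutoVW's predict/learn interface."""

    def __init__(self):
        self._weight = 0.0

    def predict(self, data_example):
        # return a prediction for one incoming example
        x, _ = data_example
        return self._weight * x

    def learn(self, data_example):
        # update the model using the revealed label of the same example
        x, y = data_example
        self._weight += 0.1 * (y - self.predict(data_example)) * x


model = _StandInModel()  # with flaml[vw] installed, this would be `autovw`
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (feature, label) stream
predictions = []
for data_example in stream:
    predictions.append(model.predict(data_example))  # predict first ...
    model.learn(data_example)  # ... then learn, one example at a time
```

The key property of the online setting is visible in the loop order: each example is predicted before its label is used for learning, so performance is evaluated at every iteration.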


@@ -1,16 +1,17 @@
from typing import Optional, Union
import logging
from typing import Optional, Union
from flaml.onlineml import OnlineTrialRunner
from flaml.onlineml.trial import get_ns_feature_dim_from_vw_example
from flaml.tune import (
Trial,
Categorical,
Float,
PolynomialExpansionSet,
Trial,
polynomial_expansion_set,
)
from flaml.onlineml import OnlineTrialRunner
from flaml.tune.scheduler import ChaChaScheduler
from flaml.tune.searcher import ChampionFrontierSearcher
from flaml.onlineml.trial import get_ns_feature_dim_from_vw_example
logger = logging.getLogger(__name__)
@@ -140,7 +141,7 @@ class AutoVW:
max_live_model_num=self._max_live_model_num,
searcher=searcher,
scheduler=scheduler,
**self._automl_runner_args
**self._automl_runner_args,
)
def predict(self, data_sample):


@@ -1,14 +1,16 @@
import numpy as np
import logging
import time
import math
import copy
import collections
import copy
import logging
import math
import time
from typing import Optional, Union
import numpy as np
from flaml.tune import Trial
try:
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import mean_absolute_error, mean_squared_error
except ImportError:
pass


@@ -1,10 +1,11 @@
import numpy as np
import logging
import math
import numpy as np
from flaml.tune import Trial
from flaml.tune.scheduler import TrialScheduler
import logging
logger = logging.getLogger(__name__)


@@ -5,45 +5,47 @@ It can be used standalone, or together with ray tune or nni. Please find detaile
Below are some quick examples.
* Example for sequential tuning (recommended when compute resource is limited and each trial can consume all the resources):
- Example for sequential tuning (recommended when compute resource is limited and each trial can consume all the resources):
```python
# require: pip install flaml[blendsearch]
from flaml import tune
import time
def evaluate_config(config):
'''evaluate a hyperparameter configuration'''
"""evaluate a hyperparameter configuration"""
# we use a toy example with 2 hyperparameters
metric = (round(config['x'])-85000)**2 - config['x']/config['y']
metric = (round(config["x"]) - 85000) ** 2 - config["x"] / config["y"]
# usually the evaluation takes a non-negligible cost
# and the cost could be related to certain hyperparameters
# in this example, we assume it's proportional to x
time.sleep(config['x']/100000)
time.sleep(config["x"] / 100000)
# use tune.report to report the metric to optimize
tune.report(metric=metric)
analysis = tune.run(
evaluate_config, # the function to evaluate a config
evaluate_config, # the function to evaluate a config
config={
'x': tune.lograndint(lower=1, upper=100000),
'y': tune.randint(lower=1, upper=100000)
}, # the search space
low_cost_partial_config={'x':1}, # a initial (partial) config with low cost
metric='metric', # the name of the metric used for optimization
mode='min', # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=60, # the time budget in seconds
local_dir='logs/', # the local directory to store logs
"x": tune.lograndint(lower=1, upper=100000),
"y": tune.randint(lower=1, upper=100000),
}, # the search space
low_cost_partial_config={"x": 1},  # an initial (partial) config with low cost
metric="metric", # the name of the metric used for optimization
mode="min", # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=60, # the time budget in seconds
local_dir="logs/", # the local directory to store logs
# verbose=0, # verbosity
# use_ray=True, # uncomment when performing parallel tuning using ray
)
)
print(analysis.best_trial.last_result) # the best trial's result
print(analysis.best_config) # the best config
print(analysis.best_config) # the best config
```
* Example for using ray tune's API:
- Example for using ray tune's API:
```python
# require: pip install flaml[blendsearch,ray]
@@ -51,36 +53,39 @@ from ray import tune as raytune
from flaml import CFO, BlendSearch
import time
def evaluate_config(config):
'''evaluate a hyperparameter configuration'''
"""evaluate a hyperparameter configuration"""
# we use a toy example with 2 hyperparameters
metric = (round(config['x'])-85000)**2 - config['x']/config['y']
metric = (round(config["x"]) - 85000) ** 2 - config["x"] / config["y"]
# usually the evaluation takes a non-negligible cost
# and the cost could be related to certain hyperparameters
# in this example, we assume it's proportional to x
time.sleep(config['x']/100000)
time.sleep(config["x"] / 100000)
# use tune.report to report the metric to optimize
tune.report(metric=metric)
# provide a time budget (in seconds) for the tuning process
time_budget_s = 60
# provide the search space
config_search_space = {
'x': tune.lograndint(lower=1, upper=100000),
'y': tune.randint(lower=1, upper=100000)
}
"x": tune.lograndint(lower=1, upper=100000),
"y": tune.randint(lower=1, upper=100000),
}
# provide the low cost partial config
low_cost_partial_config={'x':1}
low_cost_partial_config = {"x": 1}
# set up CFO
cfo = CFO(low_cost_partial_config=low_cost_partial_config)
# set up BlendSearch
blendsearch = BlendSearch(
metric="metric", mode="min",
metric="metric",
mode="min",
space=config_search_space,
low_cost_partial_config=low_cost_partial_config,
time_budget_s=time_budget_s
time_budget_s=time_budget_s,
)
# NOTE: when using BlendSearch as a search_alg in ray tune, you need to
# configure the 'time_budget_s' for BlendSearch accordingly such that
@@ -89,28 +94,28 @@ blendsearch = BlendSearch(
# automatically in flaml.
analysis = raytune.run(
evaluate_config, # the function to evaluate a config
evaluate_config, # the function to evaluate a config
config=config_search_space,
metric='metric', # the name of the metric used for optimization
mode='min', # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=time_budget_s, # the time budget in seconds
local_dir='logs/', # the local directory to store logs
search_alg=blendsearch # or cfo
metric="metric", # the name of the metric used for optimization
mode="min", # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=time_budget_s, # the time budget in seconds
local_dir="logs/", # the local directory to store logs
search_alg=blendsearch, # or cfo
)
print(analysis.best_trial.last_result) # the best trial's result
print(analysis.best_config) # the best config
```
* Example for using NNI: An example of using BlendSearch with NNI can be seen in [test](https://github.com/microsoft/FLAML/tree/main/test/nni). CFO can be used as well in a similar manner. To run the example, first make sure you have [NNI](https://nni.readthedocs.io/en/stable/) installed, then run:
- Example for using NNI: An example of using BlendSearch with NNI can be seen in [test](https://github.com/microsoft/FLAML/tree/main/test/nni). CFO can be used as well in a similar manner. To run the example, first make sure you have [NNI](https://nni.readthedocs.io/en/stable/) installed, then run:
```shell
$ nnictl create --config ./config.yml
```
* For more examples, please check out
[notebooks](https://github.com/microsoft/FLAML/tree/main/notebook/).
- For more examples, please check out
[notebooks](https://github.com/microsoft/FLAML/tree/main/notebook/).
`flaml` offers two HPO methods: CFO and BlendSearch.
`flaml.tune` uses BlendSearch by default.
@@ -185,16 +190,16 @@ tune.run(...
)
```
* Recommended scenario: cost-related hyperparameters exist, a low-cost
initial point is known, and the search space is complex such that local search
is prone to be stuck at local optima.
- Recommended scenario: cost-related hyperparameters exist, a low-cost
initial point is known, and the search space is complex such that local search
is prone to be stuck at local optima.
* Suggestion about using larger search space in BlendSearch:
In hyperparameter optimization, a larger search space is desirable because it is more likely to include the optimal configuration (or one of the optimal configurations) in hindsight. However the performance (especially anytime performance) of most existing HPO methods is undesirable if the cost of the configurations in the search space has a large variation. Thus hand-crafted small search spaces (with relatively homogeneous cost) are often used in practice for these methods, which is subject to idiosyncrasy. BlendSearch combines the benefits of local search and global search, which enables a smart (economical) way of deciding where to explore in the search space even though it is larger than necessary. This allows users to specify a larger search space in BlendSearch, which is often easier and a better practice than narrowing down the search space by hand.
- Suggestion about using larger search space in BlendSearch:
In hyperparameter optimization, a larger search space is desirable because it is more likely to include the optimal configuration (or one of the optimal configurations) in hindsight. However the performance (especially anytime performance) of most existing HPO methods is undesirable if the cost of the configurations in the search space has a large variation. Thus hand-crafted small search spaces (with relatively homogeneous cost) are often used in practice for these methods, which is subject to idiosyncrasy. BlendSearch combines the benefits of local search and global search, which enables a smart (economical) way of deciding where to explore in the search space even though it is larger than necessary. This allows users to specify a larger search space in BlendSearch, which is often easier and a better practice than narrowing down the search space by hand.
For more technical details, please check our papers.
* [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571). Qingyun Wu, Chi Wang, Silu Huang. AAAI 2021.
- [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571). Qingyun Wu, Chi Wang, Silu Huang. AAAI 2021.
```bibtex
@inproceedings{wu2021cfo,
@@ -205,7 +210,7 @@ For more technical details, please check our papers.
}
```
* [Economical Hyperparameter Optimization With Blended Search Strategy](https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/). Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021.
- [Economical Hyperparameter Optimization With Blended Search Strategy](https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/). Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021.
```bibtex
@inproceedings{wang2021blendsearch,


@@ -3,16 +3,16 @@ try:
assert ray_version >= "1.10.0"
from ray.tune import (
uniform,
lograndint,
loguniform,
qlograndint,
qloguniform,
qrandint,
qrandn,
quniform,
randint,
qrandint,
randn,
qrandn,
loguniform,
qloguniform,
lograndint,
qlograndint,
uniform,
)
if ray_version.startswith("1."):
@@ -20,21 +20,20 @@ try:
else:
from ray.tune.search import sample
except (ImportError, AssertionError):
from . import sample
from .sample import (
uniform,
lograndint,
loguniform,
qlograndint,
qloguniform,
qrandint,
qrandn,
quniform,
randint,
qrandint,
randn,
qrandn,
loguniform,
qloguniform,
lograndint,
qlograndint,
uniform,
)
from . import sample
from .tune import run, report, INCUMBENT_RESULT
from .sample import polynomial_expansion_set
from .sample import PolynomialExpansionSet, Categorical, Float
from .sample import Categorical, Float, PolynomialExpansionSet, polynomial_expansion_set
from .trial import Trial
from .tune import INCUMBENT_RESULT, report, run
from .utils import choice


@@ -15,10 +15,12 @@
# This source file is adapted here because ray does not fully support Windows.
# Copyright (c) Microsoft Corporation.
from typing import Dict, Optional
import numpy as np
from .trial import Trial
import logging
from typing import Dict, Optional
import numpy as np
from .trial import Trial
logger = logging.getLogger(__name__)

flaml/tune/logger.py (new file)

@@ -0,0 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
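To see how a formatter like the one above is wired into a handler, here is a self-contained sketch. It inlines a minimal single-color version of the formatter so it runs standalone; the logger name and message are illustrative, and the real class lives in flaml/tune/logger.py.

```python
import logging

YELLOW, RESET = "\033[33m", "\033[0m"  # ANSI color codes, as in COLORS above


class _MiniColoredFormatter(logging.Formatter):
    """Minimal sketch of the ColoredFormatter idea: wrap warnings in yellow."""

    def format(self, record):
        formatted = super().format(record)
        if record.levelno >= logging.WARNING:
            return f"{YELLOW}{formatted}{RESET}"
        return formatted


handler = logging.StreamHandler()
handler.setFormatter(_MiniColoredFormatter("%(levelname)s - %(message)s"))
demo_logger = logging.getLogger("flaml.demo")
demo_logger.addHandler(handler)
demo_logger.propagate = False  # mirrors the module above

# format one record directly to inspect the colored output
record = demo_logger.makeRecord(
    "flaml.demo", logging.WARNING, __file__, 1, "low budget", (), None
)
colored = handler.format(record)  # yellow-wrapped "WARNING - low budget"
```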


@@ -19,6 +19,7 @@ import logging
from copy import copy
from math import isclose
from typing import Any, Dict, List, Optional, Sequence, Union
import numpy as np
# Backwards compatibility


@@ -1,6 +1,6 @@
from .trial_scheduler import TrialScheduler
from .online_scheduler import (
ChaChaScheduler,
OnlineScheduler,
OnlineSuccessiveDoublingScheduler,
ChaChaScheduler,
)
from .trial_scheduler import TrialScheduler


@@ -1,9 +1,12 @@
import numpy as np
import logging
from typing import Dict
from flaml.tune.scheduler import TrialScheduler
import numpy as np
from flaml.tune import Trial
from .trial_scheduler import TrialScheduler
logger = logging.getLogger(__name__)


@@ -2,10 +2,11 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
from typing import Dict, Optional, List, Tuple, Callable, Union
import numpy as np
import time
import pickle
import time
from typing import Callable, Dict, List, Optional, Tuple, Union
import numpy as np
try:
from ray import __version__ as ray_version
@@ -18,17 +19,17 @@ try:
from ray.tune.search import Searcher
from ray.tune.search.optuna import OptunaSearch as GlobalSearch
except (ImportError, AssertionError):
from .suggestion import Searcher
from .suggestion import OptunaSearch as GlobalSearch
from ..trial import unflatten_dict, flatten_dict
from .. import INCUMBENT_RESULT
from .search_thread import SearchThread
from .flow2 import FLOW2
from ..space import add_cost_to_space, indexof, normalize, define_by_run_func
from ..result import TIME_TOTAL_S
from .suggestion import Searcher
import logging
from .. import INCUMBENT_RESULT
from ..result import TIME_TOTAL_S
from ..space import add_cost_to_space, define_by_run_func, indexof, normalize
from ..trial import flatten_dict, unflatten_dict
from .flow2 import FLOW2
from .search_thread import SearchThread
SEARCH_THREAD_EPS = 1.0
PENALTY = 1e10 # penalty term for constraints
logger = logging.getLogger(__name__)
@@ -216,7 +217,24 @@ class BlendSearch(Searcher):
if global_search_alg is not None:
self._gs = global_search_alg
elif getattr(self, "__name__", None) != "CFO":
if space and self._ls.hierarchical:
# Use define-by-run for OptunaSearch when needed:
# - Hierarchical/conditional spaces are best supported via define-by-run.
# - Ray Tune domain/grid specs can trigger an "unresolved search space" warning
# unless we switch to define-by-run.
use_define_by_run = bool(getattr(self._ls, "hierarchical", False))
if (not use_define_by_run) and isinstance(space, dict) and space:
try:
from .variant_generator import parse_spec_vars
_, domain_vars, grid_vars = parse_spec_vars(space)
use_define_by_run = bool(domain_vars or grid_vars)
except Exception:
# Be conservative: if we can't determine whether the space is
# unresolved, fall back to the original behavior.
use_define_by_run = False
self._use_define_by_run = use_define_by_run
if use_define_by_run:
from functools import partial
gs_space = partial(define_by_run_func, space=space)
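For context on the define-by-run path chosen above: instead of a static dict, Optuna is handed a function that samples parameters imperatively from a trial object, which is how conditional/hierarchical spaces are expressed. The sketch below shows the general shape; the parameter names are illustrative assumptions, not FLAML's actual space or its `define_by_run_func`.

```python
def define_by_run_space(trial):
    """Illustrative define-by-run space with one conditional branch."""
    booster = trial.suggest_categorical("booster", ["gbtree", "dart"])
    config = {
        "booster": booster,
        "eta": trial.suggest_float("eta", 1e-3, 1.0, log=True),
    }
    if booster == "dart":
        # conditional parameter: only sampled when the branch is taken
        config["rate_drop"] = trial.suggest_float("rate_drop", 0.0, 0.5)
    return config
```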
@@ -243,13 +261,32 @@ class BlendSearch(Searcher):
evaluated_rewards=evaluated_rewards,
)
except (AssertionError, ValueError):
self._gs = GlobalSearch(
space=gs_space,
metric=metric,
mode=mode,
seed=gs_seed,
sampler=sampler,
)
try:
self._gs = GlobalSearch(
space=gs_space,
metric=metric,
mode=mode,
seed=gs_seed,
sampler=sampler,
)
except ValueError:
# Ray Tune's OptunaSearch converts Tune domains into Optuna
# distributions. Optuna disallows integer log distributions
# with step != 1 (e.g., qlograndint with q>1), which can
# raise here. Fall back to FLAML's OptunaSearch wrapper,
# which handles these spaces more permissively.
if getattr(GlobalSearch, "__module__", "").startswith("ray.tune"):
from .suggestion import OptunaSearch as _FallbackOptunaSearch
self._gs = _FallbackOptunaSearch(
space=gs_space,
metric=metric,
mode=mode,
seed=gs_seed,
sampler=sampler,
)
else:
raise
self._gs.space = space
else:
self._gs = None
@@ -467,7 +504,7 @@ class BlendSearch(Searcher):
self._ls_bound_max,
self._subspace.get(trial_id, self._ls.space),
)
if self._gs is not None and self._experimental and (not self._ls.hierarchical):
if self._gs is not None and self._experimental and (not getattr(self, "_use_define_by_run", False)):
self._gs.add_evaluated_point(flatten_dict(config), objective)
# TODO: recover when supported
# converted = convert_key(config, self._gs.space)
@@ -931,27 +968,27 @@ try:
assert ray_version >= "1.10.0"
from ray.tune import (
uniform,
quniform,
choice,
randint,
qrandint,
randn,
qrandn,
loguniform,
qloguniform,
qrandint,
qrandn,
quniform,
randint,
randn,
uniform,
)
except (ImportError, AssertionError):
from ..sample import (
uniform,
quniform,
choice,
randint,
qrandint,
randn,
qrandn,
loguniform,
qloguniform,
qrandint,
qrandn,
quniform,
randint,
randn,
uniform,
)
try:
@@ -978,7 +1015,7 @@ class BlendSearchTuner(BlendSearch, NNITuner):
result = {
"config": parameters,
self._metric: extract_scalar_reward(value),
self.cost_attr: 1 if isinstance(value, float) else value.get(self.cost_attr, value.get("sequence", 1))
self.cost_attr: 1 if isinstance(value, float) else value.get(self.cost_attr, value.get("sequence", 1)),
# if nni does not report training cost,
# using sequence as an approximation.
# if no sequence, using a constant 1


@@ -2,8 +2,8 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
from .flow2 import FLOW2
from .blendsearch import CFO
from .flow2 import FLOW2
class FLOW2Cat(FLOW2):


@@ -2,31 +2,34 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
from typing import Dict, Optional, Tuple
import numpy as np
import logging
from collections import defaultdict
from typing import Dict, Optional, Tuple
import numpy as np
try:
from ray import __version__ as ray_version
assert ray_version >= "1.0.0"
if ray_version.startswith("1."):
from ray.tune.suggest import Searcher
from ray.tune import sample
from ray.tune.suggest import Searcher
else:
from ray.tune.search import Searcher, sample
from ray.tune.utils.util import flatten_dict, unflatten_dict
except (ImportError, AssertionError):
from .suggestion import Searcher
from flaml.tune import sample
from ..trial import flatten_dict, unflatten_dict
from .suggestion import Searcher
from flaml.config import SAMPLE_MULTIPLY_FACTOR
from ..space import (
complete_config,
denormalize,
normalize,
generate_variants_compatible,
normalize,
)
logger = logging.getLogger(__name__)
@@ -106,7 +109,7 @@ class FLOW2(Searcher):
else:
mode = "min"
super(FLOW2, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
# internally minimizes, so "max" => -1
if mode == "max":
self.metric_op = -1.0
@@ -135,7 +138,7 @@ class FLOW2(Searcher):
self.max_resource = max_resource
self._resource = None
self._f_best = None  # only used for lexico_compare. It represents the best value achieved by lexico_flow.
self._step_lb = np.Inf
self._step_lb = np.inf
self._histories = None  # only used for lexico_compare. It records the results of historical configurations.
if space is not None:
self._init_search()
@@ -347,7 +350,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
@@ -382,7 +385,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
@@ -638,8 +641,10 @@ class FLOW2(Searcher):
else:
# key must be in space
domain = space[key]
if self.hierarchical and not (
domain is None or type(domain) in (str, int, float) or isinstance(domain, sample.Domain)
if (
self.hierarchical
and domain is not None
and not isinstance(domain, (str, int, float, sample.Domain))
):
# not domain or hashable
# get rid of list type for hierarchical search space.


@@ -1,9 +1,11 @@
import numpy as np
import logging
import itertools
from typing import Dict, Optional, List
from flaml.tune import Categorical, Float, PolynomialExpansionSet, Trial
import logging
from typing import Dict, List, Optional
import numpy as np
from flaml.onlineml import VowpalWabbitTrial
from flaml.tune import Categorical, Float, PolynomialExpansionSet, Trial
from flaml.tune.searcher import CFO
logger = logging.getLogger(__name__)
@@ -64,7 +66,7 @@ class ChampionFrontierSearcher(BaseSearcher):
POLY_EXPANSION_ADDITION_NUM = 1
# the order of polynomial expansions to add based on the given seed interactions
EXPANSION_ORDER = 2
# the number of new challengers with new numerical hyperparamter configs
# the number of new challengers with new numerical hyperparameter configs
NUMERICAL_NUM = 2
# In order to use CFO, a loss name and loss values of configs are needed
@@ -78,7 +80,7 @@ class ChampionFrontierSearcher(BaseSearcher):
CFO_SEARCHER_METRIC_NAME = "pseudo_loss"
CFO_SEARCHER_LARGE_LOSS = 1e6
# the random seed used in generating numerical hyperparamter configs (when CFO is not used)
# the random seed used in generating numerical hyperparameter configs (when CFO is not used)
NUM_RANDOM_SEED = 111
CHAMPION_TRIAL_NAME = "champion_trial"
@@ -205,7 +207,7 @@ class ChampionFrontierSearcher(BaseSearcher):
hyperparameter_config_groups.append(partial_new_configs)
# does not have searcher_trial_ids
searcher_trial_ids_groups.append([])
elif isinstance(config_domain, Float) or isinstance(config_domain, Categorical):
elif isinstance(config_domain, (Float, Categorical)):
# otherwise we need to deal with them in group
nonpoly_config[k] = v
if k not in self._space_of_nonpoly_hp:
@@ -317,7 +319,7 @@ class ChampionFrontierSearcher(BaseSearcher):
candidate_configs = [set(seed_interactions) | set(item) for item in space]
final_candidate_configs = []
for c in candidate_configs:
new_c = set([e for e in c if len(e) > 1])
new_c = {e for e in c if len(e) > 1}
final_candidate_configs.append(new_c)
return final_candidate_configs


@@ -3,6 +3,7 @@
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
from typing import Dict, Optional
import numpy as np
try:
@@ -15,14 +16,40 @@ try:
from ray.tune.search import Searcher
except (ImportError, AssertionError):
from .suggestion import Searcher
from .flow2 import FLOW2
from ..space import add_cost_to_space, unflatten_hierarchical
from ..result import TIME_TOTAL_S
import logging
from ..result import TIME_TOTAL_S
from ..space import add_cost_to_space, unflatten_hierarchical
from .flow2 import FLOW2
logger = logging.getLogger(__name__)
def _recursive_dict_update(target: Dict, source: Dict) -> None:
"""Recursively update target dictionary with source dictionary.
Unlike dict.update(), this function merges nested dictionaries instead of
replacing them entirely. This is crucial for configurations with nested
structures (e.g., XGBoost params).
Args:
target: The dictionary to be updated (modified in place).
source: The dictionary containing values to merge into target.
Example:
>>> target = {'params': {'eta': 0.1, 'max_depth': 3}}
>>> source = {'params': {'verbosity': 0}}
>>> _recursive_dict_update(target, source)
>>> target
{'params': {'eta': 0.1, 'max_depth': 3, 'verbosity': 0}}
"""
for key, value in source.items():
if isinstance(value, dict) and key in target and isinstance(target[key], dict):
_recursive_dict_update(target[key], value)
else:
target[key] = value
class SearchThread:
"""Class of global or local search thread."""
@@ -63,7 +90,7 @@ class SearchThread:
try:
config = self._search_alg.suggest(trial_id)
if isinstance(self._search_alg._space, dict):
config.update(self._const)
_recursive_dict_update(config, self._const)
else:
# define by run
config, self.space = unflatten_hierarchical(config, self._space)
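The merge behavior that motivates swapping `config.update(self._const)` for `_recursive_dict_update` can be demonstrated on a nested config. The function body below is copied from the diff above; the XGBoost-style keys are illustrative.

```python
def _recursive_dict_update(target, source):
    """Merge source into target, recursing into nested dicts."""
    for key, value in source.items():
        if isinstance(value, dict) and key in target and isinstance(target[key], dict):
            _recursive_dict_update(target[key], value)
        else:
            target[key] = value


config = {"params": {"eta": 0.1, "max_depth": 3}}
const = {"params": {"verbosity": 0}, "n_jobs": 1}
# dict.update() would replace "params" wholesale, dropping eta/max_depth;
# the recursive merge keeps them and adds verbosity.
_recursive_dict_update(config, const)
```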


@@ -15,15 +15,17 @@
# This source file is adapted here because ray does not fully support Windows.
# Copyright (c) Microsoft Corporation.
import time
import functools
import warnings
import copy
import numpy as np
import functools
import logging
from typing import Any, Dict, Optional, Union, List, Tuple, Callable
import pickle
from .variant_generator import parse_spec_vars
import time
import warnings
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import numpy as np
from ..sample import (
Categorical,
Domain,
@@ -33,8 +35,75 @@ from ..sample import (
Quantized,
Uniform,
)
# If Ray is installed, flaml.tune may re-export Ray Tune sampling functions.
# In that case, the search space contains Ray Tune Domain/Sampler objects,
# which should be accepted by our Optuna search-space conversion.
try:
from ray import __version__ as _ray_version # type: ignore
if str(_ray_version).startswith("1."):
from ray.tune.sample import ( # type: ignore
Categorical as _RayCategorical,
)
from ray.tune.sample import (
Domain as _RayDomain,
)
from ray.tune.sample import (
Float as _RayFloat,
)
from ray.tune.sample import (
Integer as _RayInteger,
)
from ray.tune.sample import (
LogUniform as _RayLogUniform,
)
from ray.tune.sample import (
Quantized as _RayQuantized,
)
from ray.tune.sample import (
Uniform as _RayUniform,
)
else:
from ray.tune.search.sample import ( # type: ignore
Categorical as _RayCategorical,
)
from ray.tune.search.sample import (
Domain as _RayDomain,
)
from ray.tune.search.sample import (
Float as _RayFloat,
)
from ray.tune.search.sample import (
Integer as _RayInteger,
)
from ray.tune.search.sample import (
LogUniform as _RayLogUniform,
)
from ray.tune.search.sample import (
Quantized as _RayQuantized,
)
from ray.tune.search.sample import (
Uniform as _RayUniform,
)
_FLOAT_TYPES = (Float, _RayFloat)
_INTEGER_TYPES = (Integer, _RayInteger)
_CATEGORICAL_TYPES = (Categorical, _RayCategorical)
_DOMAIN_TYPES = (Domain, _RayDomain)
_QUANTIZED_TYPES = (Quantized, _RayQuantized)
_UNIFORM_TYPES = (Uniform, _RayUniform)
_LOGUNIFORM_TYPES = (LogUniform, _RayLogUniform)
except Exception: # pragma: no cover
_FLOAT_TYPES = (Float,)
_INTEGER_TYPES = (Integer,)
_CATEGORICAL_TYPES = (Categorical,)
_DOMAIN_TYPES = (Domain,)
_QUANTIZED_TYPES = (Quantized,)
_UNIFORM_TYPES = (Uniform,)
_LOGUNIFORM_TYPES = (LogUniform,)
from ..trial import flatten_dict, unflatten_dict
from collections import defaultdict
from .variant_generator import parse_spec_vars
logger = logging.getLogger(__name__)
@@ -183,13 +252,13 @@ class ConcurrencyLimiter(Searcher):
"""
def __init__(self, searcher: Searcher, max_concurrent: int, batch: bool = False):
assert type(max_concurrent) is int and max_concurrent > 0
assert isinstance(max_concurrent, int) and max_concurrent > 0
self.searcher = searcher
self.max_concurrent = max_concurrent
self.batch = batch
self.live_trials = set()
self.cached_results = {}
super(ConcurrencyLimiter, self).__init__(metric=self.searcher.metric, mode=self.searcher.mode)
super().__init__(metric=self.searcher.metric, mode=self.searcher.mode)
def suggest(self, trial_id: str) -> Optional[Dict]:
assert trial_id not in self.live_trials, f"Trial ID {trial_id} must be unique: already found in set."
@@ -252,8 +321,8 @@ try:
import optuna as ot
from optuna.distributions import BaseDistribution as OptunaDistribution
from optuna.samplers import BaseSampler
from optuna.trial import TrialState as OptunaTrialState
from optuna.trial import Trial as OptunaTrial
from optuna.trial import TrialState as OptunaTrialState
except ImportError:
ot = None
OptunaDistribution = None
@@ -283,25 +352,21 @@ def validate_warmstart(
"""
if points_to_evaluate:
if not isinstance(points_to_evaluate, list):
raise TypeError("points_to_evaluate expected to be a list, got {}.".format(type(points_to_evaluate)))
raise TypeError(f"points_to_evaluate expected to be a list, got {type(points_to_evaluate)}.")
for point in points_to_evaluate:
if not isinstance(point, (dict, list)):
raise TypeError(f"points_to_evaluate expected to include list or dict, got {point}.")
if validate_point_name_lengths and (not len(point) == len(parameter_names)):
raise ValueError(
"Dim of point {}".format(point)
+ " and parameter_names {}".format(parameter_names)
+ " do not match."
)
raise ValueError(f"Dim of point {point} and parameter_names {parameter_names} do not match.")
if points_to_evaluate and evaluated_rewards:
if not isinstance(evaluated_rewards, list):
raise TypeError("evaluated_rewards expected to be a list, got {}.".format(type(evaluated_rewards)))
raise TypeError(f"evaluated_rewards expected to be a list, got {type(evaluated_rewards)}.")
if not len(evaluated_rewards) == len(points_to_evaluate):
raise ValueError(
"Dim of evaluated_rewards {}".format(evaluated_rewards)
+ " and points_to_evaluate {}".format(points_to_evaluate)
f"Dim of evaluated_rewards {evaluated_rewards}"
+ f" and points_to_evaluate {points_to_evaluate}"
+ " do not match."
)
@@ -545,7 +610,7 @@ class OptunaSearch(Searcher):
evaluated_rewards: Optional[List] = None,
):
assert ot is not None, "Optuna must be installed! Run `pip install optuna`."
super(OptunaSearch, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
if isinstance(space, dict) and space:
resolved_vars, domain_vars, grid_vars = parse_spec_vars(space)
@@ -559,7 +624,15 @@ class OptunaSearch(Searcher):
self._space = space
self._points_to_evaluate = points_to_evaluate or []
self._evaluated_rewards = evaluated_rewards
# rewards should be a list of floats, not a dict
# After Optuna > 3.5.0, there is a check for NaN in the list "any(math.isnan(x) for x in self._values)"
# which will raise an error when encountering a dict
if evaluated_rewards is not None:
self._evaluated_rewards = [
list(item.values())[0] if isinstance(item, dict) else item for item in evaluated_rewards
]
else:
self._evaluated_rewards = evaluated_rewards
self._study_name = "optuna" # Fixed study name for in-memory storage
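The normalization above exists because Optuna > 3.5.0 runs `any(math.isnan(x) for x in self._values)` over the stored rewards, which raises when an element is a dict rather than a float. A standalone sketch of the same conversion (the function name here is illustrative, not part of FLAML's API):

```python
import math

def normalize_rewards(evaluated_rewards):
    """Convert a mixed list of floats and single-entry metric dicts to floats."""
    if evaluated_rewards is None:
        return None
    return [
        list(item.values())[0] if isinstance(item, dict) else item
        for item in evaluated_rewards
    ]

rewards = normalize_rewards([0.9, {"accuracy": 0.85}, 0.7])
print(rewards)  # [0.9, 0.85, 0.7]
# The NaN check Optuna performs now works on plain floats:
print(any(math.isnan(x) for x in rewards))  # False
```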
@@ -844,19 +917,22 @@ class OptunaSearch(Searcher):
def resolve_value(domain: Domain) -> ot.distributions.BaseDistribution:
quantize = None
sampler = domain.get_sampler()
if isinstance(sampler, Quantized):
# Ray Tune Domains and FLAML Domains both provide get_sampler(), but
# fall back to the .sampler attribute for robustness.
sampler = domain.get_sampler() if hasattr(domain, "get_sampler") else getattr(domain, "sampler", None)
if isinstance(sampler, _QUANTIZED_TYPES) or type(sampler).__name__ == "Quantized":
quantize = sampler.q
sampler = sampler.sampler
if isinstance(sampler, LogUniform):
sampler = getattr(sampler, "sampler", None) or sampler.get_sampler()
if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
logger.warning(
"Optuna does not handle quantization in loguniform "
"sampling. The parameter will be passed but it will "
"probably be ignored."
)
if isinstance(domain, Float):
if isinstance(sampler, LogUniform):
if isinstance(domain, _FLOAT_TYPES) or type(domain).__name__ == "Float":
if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
if quantize:
logger.warning(
"Optuna does not support both quantization and "
@@ -864,17 +940,17 @@ class OptunaSearch(Searcher):
)
return ot.distributions.LogUniformDistribution(domain.lower, domain.upper)
elif isinstance(sampler, Uniform):
elif isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
if quantize:
return ot.distributions.DiscreteUniformDistribution(domain.lower, domain.upper, quantize)
return ot.distributions.UniformDistribution(domain.lower, domain.upper)
elif isinstance(domain, Integer):
if isinstance(sampler, LogUniform):
return ot.distributions.IntLogUniformDistribution(
domain.lower, domain.upper - 1, step=quantize or 1
)
elif isinstance(sampler, Uniform):
elif isinstance(domain, _INTEGER_TYPES) or type(domain).__name__ == "Integer":
if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
# The ``step`` argument was deprecated in v2.0.0 and must be 1 for log
# distributions; its removal is currently scheduled for v4.0.0.
return ot.distributions.IntLogUniformDistribution(domain.lower, domain.upper - 1, step=1)
elif isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
# Upper bound should be inclusive for quantization and
# exclusive otherwise
return ot.distributions.IntUniformDistribution(
@@ -882,16 +958,16 @@ class OptunaSearch(Searcher):
domain.upper - int(bool(not quantize)),
step=quantize or 1,
)
elif isinstance(domain, Categorical):
if isinstance(sampler, Uniform):
elif isinstance(domain, _CATEGORICAL_TYPES) or type(domain).__name__ == "Categorical":
if isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
return ot.distributions.CategoricalDistribution(domain.categories)
raise ValueError(
"Optuna search does not support parameters of type "
"`{}` with samplers of type `{}`".format(type(domain).__name__, type(domain.sampler).__name__)
"`{}` with samplers of type `{}`".format(type(domain).__name__, type(sampler).__name__)
)
# Parameter name is e.g. "a/b/c" for nested dicts
values = {"/".join(path): resolve_value(domain) for path, domain in domain_vars}
return values
return values
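The pattern used throughout `resolve_value` — an `isinstance` check against a tuple of possible classes, backed by a `type(x).__name__` comparison — keeps the code working whether a domain object comes from FLAML's own `sample` module or from `ray.tune`, which defines distinct classes with the same names. A distilled sketch with hypothetical stand-in classes:

```python
class Float:
    """Stand-in for flaml.tune.sample.Float."""

class RayFloat:
    """Stand-in for ray.tune's Float class."""

# Upstream, Ray's class is also literally named "Float"; simulate that here.
RayFloat.__name__ = "Float"

_FLOAT_TYPES = (Float, RayFloat)

def is_float_domain(domain):
    # isinstance covers the classes we managed to import; the __name__
    # fallback covers same-named classes reached via an import path we
    # did not anticipate (e.g. a future ray.tune module layout).
    return isinstance(domain, _FLOAT_TYPES) or type(domain).__name__ == "Float"

print(is_float_domain(Float()))     # True
print(is_float_domain(RayFloat()))  # True
print(is_float_domain(42))          # False
```

The name-based fallback is deliberately loose: it trades a little type safety for resilience against Ray reorganizing its module paths between releases.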


@@ -17,9 +17,11 @@
# Copyright (c) Microsoft Corporation.
import copy
import logging
from typing import Any, Dict, Generator, List, Tuple
import numpy
import random
from typing import Any, Dict, Generator, List, Tuple
import numpy
from ..sample import Categorical, Domain, RandomState
try:
@@ -250,7 +252,7 @@ def _try_resolve(v) -> Tuple[bool, Any]:
# Grid search values
grid_values = v["grid_search"]
if not isinstance(grid_values, list):
raise TuneError("Grid search expected list of values, got: {}".format(grid_values))
raise TuneError(f"Grid search expected list of values, got: {grid_values}")
return False, Categorical(grid_values).grid()
return True, v
@@ -300,13 +302,13 @@ def has_unresolved_values(spec: Dict) -> bool:
class _UnresolvedAccessGuard(dict):
def __init__(self, *args, **kwds):
super(_UnresolvedAccessGuard, self).__init__(*args, **kwds)
super().__init__(*args, **kwds)
self.__dict__ = self
def __getattribute__(self, item):
value = dict.__getattribute__(self, item)
if not _is_resolved(value):
raise RecursiveDependencyError("`{}` recursively depends on {}".format(item, value))
raise RecursiveDependencyError(f"`{item}` recursively depends on {value}")
elif isinstance(value, dict):
return _UnresolvedAccessGuard(value)
else:


@@ -11,9 +11,10 @@ try:
except (ImportError, AssertionError):
from . import sample
from .searcher.variant_generator import generate_variants
from typing import Dict, Optional, Any, Tuple, Generator, List, Union
import numpy as np
import logging
from typing import Any, Dict, Generator, List, Optional, Tuple, Union
import numpy as np
logger = logging.getLogger(__name__)
@@ -260,7 +261,7 @@ def add_cost_to_space(space: Dict, low_cost_point: Dict, choice_cost: Dict):
low_cost[i] = point
if len(low_cost) > len(domain.categories):
if domain.ordered:
low_cost[-1] = int(np.where(ind == low_cost[-1])[0])
low_cost[-1] = int(np.where(ind == low_cost[-1])[0].item())
domain.low_cost_point = low_cost[-1]
return
if low_cost:
@@ -489,7 +490,7 @@ def complete_config(
elif domain.bounded:
up, low, gauss_std = 1, 0, 1.0
else:
up, low, gauss_std = np.Inf, -np.Inf, 1.0
up, low, gauss_std = np.inf, -np.inf, 1.0
if domain.bounded:
if isinstance(up, list):
up[-1] = min(up[-1], 1)
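The `np.Inf` → `np.inf` change above matters because the capitalized aliases (`np.Inf`, `np.NaN`, `np.NINF`, ...) were removed in NumPy 2.0, while the lowercase `np.inf` works in every NumPy version. A quick check:

```python
import numpy as np

# np.inf is the canonical spelling and exists in both NumPy 1.x and 2.x;
# np.Inf was an alias removed in NumPy 2.0, so code using it raises
# AttributeError there.
up, low, gauss_std = np.inf, -np.inf, 1.0
print(np.isinf(up), np.isinf(low))  # True True
print(up > 1e308)                   # True: larger than any finite float
```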


@@ -1,8 +1,8 @@
from flaml.tune.spark.utils import (
broadcast_code,
check_spark,
get_n_cpus,
with_parameters,
broadcast_code,
)
__all__ = ["check_spark", "get_n_cpus", "with_parameters", "broadcast_code"]


@@ -5,7 +5,6 @@ import threading
import time
from functools import lru_cache, partial
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
@@ -13,10 +12,10 @@ logger_formatter = logging.Formatter(
logger.propagate = False
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
try:
import py4j
import pyspark
from pyspark.sql import SparkSession
from pyspark.util import VersionUtils
import py4j
except ImportError:
_have_spark = False
py4j = None
@@ -163,6 +162,10 @@ def broadcast_code(custom_code="", file_name="mylearner"):
assert isinstance(MyLargeLGBM(), LGBMEstimator)
```
"""
# Check if Spark is available
spark_available, _ = check_spark()
# Write to local driver file system
flaml_path = os.path.dirname(os.path.abspath(__file__))
custom_code = textwrap.dedent(custom_code)
custom_path = os.path.join(flaml_path, file_name + ".py")
@@ -170,6 +173,24 @@ def broadcast_code(custom_code="", file_name="mylearner"):
with open(custom_path, "w") as f:
f.write(custom_code)
# If using Spark, broadcast the code content to executors
if spark_available:
spark = SparkSession.builder.getOrCreate()
bc_code = spark.sparkContext.broadcast(custom_code)
# Execute a job to ensure the code is distributed to all executors
def _write_code(bc):
code = bc.value
import os
module_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), file_name + ".py")
os.makedirs(os.path.dirname(module_path), exist_ok=True)
with open(module_path, "w") as f:
f.write(code)
return True
spark.sparkContext.parallelize(range(1)).map(lambda _: _write_code(bc_code)).collect()
return custom_path
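Locally, `broadcast_code` dedents the code string and writes it out as an importable module file; the new Spark branch then ships the same text to every executor via a broadcast variable. The driver-side half of that behavior, in isolation (paths and names here are illustrative):

```python
import importlib.util
import os
import tempfile
import textwrap

custom_code = """
    def answer():
        return 41 + 1
"""

# Dedent so the indented triple-quoted snippet becomes valid top-level
# code, then write it out as a module file on the local filesystem.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "mylearner.py")
with open(module_path, "w") as f:
    f.write(textwrap.dedent(custom_code))

# Load the freshly written module and use it, as a consumer of
# broadcast_code would after importing the generated module.
spec = importlib.util.spec_from_file_location("mylearner", module_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(mod.answer())  # 42
```

On the cluster side, the diff triggers a trivial job (`parallelize(range(1)).map(...)`) purely to force each executor to materialize the broadcast value and write the same file locally.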
@@ -286,6 +307,7 @@ class PySparkOvertimeMonitor:
def __exit__(self, exc_type, exc_value, exc_traceback):
"""Exit the context manager.
This will wait for the monitor thread to nicely exit."""
logger.debug(f"monitor exited: {exc_type}, {exc_value}, {exc_traceback}")
if self._force_cancel and _have_spark:
self._finished_flag = True
self._monitor_daemon.join()
@@ -296,6 +318,11 @@ class PySparkOvertimeMonitor:
if not exc_type:
return True
elif exc_type == py4j.protocol.Py4JJavaError:
logger.debug("Py4JJavaError Exception: %s", exc_value)
return True
elif exc_type == TypeError:
# When force cancel, joblib>1.2.0 will raise joblib.externals.loky.process_executor._ExceptionWithTraceback
logger.debug("TypeError Exception: %s", exc_value)
return True
else:
return False
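The `__exit__` additions above lean on the context-manager protocol: returning a truthy value from `__exit__` suppresses the in-flight exception, which is how the monitor swallows the `Py4JJavaError`/`TypeError` raised when a Spark job is force-cancelled. The mechanism in miniature (the handled exception type here is a placeholder):

```python
class SwallowTypeError:
    """Suppress TypeError raised inside the with-block; let others propagate."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        if exc_type is None:
            return True   # no exception: nothing to suppress
        if exc_type is TypeError:
            return True   # truthy return value suppresses the exception
        return False      # falsy return value re-raises anything else

with SwallowTypeError():
    raise TypeError("raised by a cancelled worker")  # silently suppressed
print("continued past the suppressed TypeError")

try:
    with SwallowTypeError():
        raise ValueError("not handled")
except ValueError:
    print("ValueError propagated")
```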


@@ -15,10 +15,10 @@
# This source file is adapted here because ray does not fully support Windows.
# Copyright (c) Microsoft Corporation.
import uuid
import time
from numbers import Number
import uuid
from collections import deque
from numbers import Number
def flatten_dict(dt, delimiter="/", prevent_delimiter=False):
@@ -110,7 +110,7 @@ class Trial:
}
self.metric_n_steps[metric] = {}
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_analysis[metric][key] = value
# Store n as string for correct restore.
self.metric_n_steps[metric][str(n)] = deque([value], maxlen=n)
@@ -124,7 +124,7 @@ class Trial:
self.metric_analysis[metric]["last"] = value
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_n_steps[metric][str(n)].append(value)
self.metric_analysis[metric][key] = sum(self.metric_n_steps[metric][str(n)]) / len(
self.metric_n_steps[metric][str(n)]
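The `last-{n}-avg` bookkeeping above keeps a `deque(maxlen=n)` per metric, so appending a new value automatically evicts the oldest and the rolling mean of the last n reports stays cheap to recompute. Stripped of the `Trial` plumbing:

```python
from collections import deque

n = 3
window = deque(maxlen=n)  # values older than the last n fall off automatically

averages = []
for value in [10, 20, 30, 40]:
    window.append(value)
    key = f"last-{n:d}-avg"  # same key format as the diff's f-string rewrite
    averages.append((key, sum(window) / len(window)))

print(averages[-1])  # ('last-3-avg', 30.0) — mean of [20, 30, 40]
```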

Some files were not shown because too many files have changed in this diff.