Compare commits

...

69 Commits

Author SHA1 Message Date
dependabot[bot]
13aec414ea Bump brace-expansion from 1.1.11 to 1.1.12 in /website (#1453)
Bumps [brace-expansion](https://github.com/juliangruber/brace-expansion) from 1.1.11 to 1.1.12.
- [Release notes](https://github.com/juliangruber/brace-expansion/releases)
- [Commits](https://github.com/juliangruber/brace-expansion/compare/1.1.11...v1.1.12)

---
updated-dependencies:
- dependency-name: brace-expansion
  dependency-version: 1.1.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-08-14 10:50:51 +08:00
Li Jiang
bb16dcde93 Bump version to 2.3.6 (#1451) 2025-08-05 14:29:36 +08:00
Li Jiang
be81a76da9 Fix TypeError of customized kfold method which needs 'y' (#1450) 2025-08-02 08:05:50 +08:00
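For context on the fix above: FLAML accepts a custom cross-validation splitter via `split_type`, and #1450 addresses splitters whose `split()` needs the labels. A minimal sketch, assuming a scikit-learn-style splitter and an arbitrary dataset and budget:

```python
from flaml import AutoML
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

automl = AutoML()
automl.fit(
    X, y,
    task="classification",
    eval_method="cv",
    # a customized splitter whose split(X, y) needs the labels
    split_type=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    time_budget=30,
)
```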
Li Jiang
2d16089529 Improve FAQ docs (#1448)
* Fix settings usage error

* Add new code example
2025-07-09 18:33:10 +08:00
Li Jiang
01c3c83653 Install wheel and setuptools (#1443) 2025-05-28 12:56:48 +08:00
Li Jiang
9b66103f7c Fix typo, add quotes to python-version (#1442) 2025-05-28 12:24:00 +08:00
Li Jiang
48dfd72e64 Fix CD actions (#1441)
* Fix CD actions

* Skip Build if no relevant changes
2025-05-28 10:45:27 +08:00
Li Jiang
dec92e5b02 Upgrade python 3.8 to 3.10 in github actions (#1440) 2025-05-27 21:34:21 +08:00
Li Jiang
22911ea1ef Merged PR 1685054: Add more logs and function wait_futures for easier post analysis (#1438)
- Add function wait_futures for easier post analysis
- Use logger instead of print

----
#### AI description  (iteration 1)
#### PR Classification
A code enhancement for debugging asynchronous mlflow logging and improving post-run analysis.

#### PR Summary
This PR adds detailed debug logging to the mlflow integration and introduces a new `wait_futures` function to streamline the collection of asynchronous task results for improved analysis.
- `flaml/fabric/mlflow.py`: Added debug log statements around starting and ending mlflow runs to trace run IDs and execution flow.
- `flaml/automl/automl.py`: Implemented the `wait_futures` function to handle asynchronous task results and replaced a print call with `logger.info` for consistent logging.
<!-- GitOpsUserAgent=GitOps.Apps.Server.pullrequestcopilot -->

Related work items: #4029592
2025-05-27 15:32:56 +08:00
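`wait_futures` is internal to FLAML and its code isn't shown on this page; a minimal sketch of the idea the PR summary describes (gathering asynchronous results and logging instead of printing), assuming a thin wrapper over `concurrent.futures`, might look like:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def wait_futures(futures):
    """Gather results of submitted tasks as they finish (illustrative only)."""
    results = []
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception:
            logger.exception("a background task failed")
    return results


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(pow, i, 2) for i in range(8)]
    logger.info("collected results: %s", wait_futures(futures))
```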
murunlin
12183e5f73 Add the detailed info for parameter 'verbose' (#1435)
* explain-verbose-parameter

* concise-verbose-docstring

* explain-verbose-parameter

* explain-verbose-parameter

* test-ignore

* test-ignore

* sklearn-version-califonia

* submit-0526

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-27 10:01:01 +08:00
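The `verbose` levels documented in #1435 map to logger levels (3 = INFO by default, 2 = WARNING, and so on; see the automl.py docstring hunk later on this page). A small usage sketch with an arbitrary dataset and budget:

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
automl = AutoML()
# verbose=2 raises the logger threshold to WARNING, hiding per-iteration INFO messages
automl.fit(X, y, task="classification", time_budget=10, verbose=2)
print(automl.best_estimator, automl.best_loss)
```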
Li Jiang
c2b25310fc Sync Fabric till 2cd1c3da (#1433)
* Sync Fabric till 2cd1c3da

* Remove synapseml from tag names

* Fix 'NoneType' object has no attribute 'DataFrame'

* Deprecated 3.8 support

* Fix 'NoneType' object has no attribute 'DataFrame'

* Still use python 3.8 for pydoc

* Don't run tests in parallel

* Remove autofe and lowcode
2025-05-23 10:19:31 +08:00
murunlin
0f9420590d fix: best_model_for_estimator returns inconsistent feature_importances_ compared to automl.model (#1429)
* mrl-issue1422-0513

* fix version dependency

* fix datasets version

* test completion

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-15 09:37:34 +08:00
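#1429 makes `best_model_for_estimator` return the same fitted model as `automl.model` for the winning learner (see the automl.py hunk further down). A hedged consistency check, assuming a single-estimator run:

```python
from flaml import AutoML
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
automl = AutoML()
automl.fit(X, y, task="regression", estimator_list=["lgbm"], time_budget=20)

best = automl.best_model_for_estimator(automl.best_estimator)
# both handles should now expose identical feature importances
print(best.estimator.feature_importances_)
print(automl.model.estimator.feature_importances_)
```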
hexiang-x
5107c506b4 fix: When use_spark = True and mlflow_logging = True are set, logging the best model raises "'NoneType' object has no attribute 'save'" (#1432) 2025-05-14 19:34:06 +08:00
dependabot[bot]
9e219ef8dc Bump http-proxy-middleware from 2.0.7 to 2.0.9 in /website (#1425)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.7 to 2.0.9.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.9/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.7...v2.0.9)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-version: 2.0.9
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-04-23 14:22:12 +08:00
Li Jiang
6e4083743b Revert "Numpy 2.x is not supported yet. (#1424)" (#1426)
This reverts commit 17e95edd9e.
2025-04-22 21:31:44 +08:00
Li Jiang
17e95edd9e Numpy 2.x is not supported yet. (#1424) 2025-04-22 12:11:27 +08:00
Stickic-cyber
468bc62d27 Fix issue with "list index out of range" when max_iter=1 (#1419) 2025-04-09 21:54:17 +08:00
dependabot[bot]
437c239c11 Bump @babel/helpers from 7.20.1 to 7.26.10 in /website (#1413)
Bumps [@babel/helpers](https://github.com/babel/babel/tree/HEAD/packages/babel-helpers) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-helpers)

---
updated-dependencies:
- dependency-name: "@babel/helpers"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-14 15:51:06 +08:00
dependabot[bot]
8e753f1092 Bump @babel/runtime from 7.20.1 to 7.26.10 in /website (#1414)
Bumps [@babel/runtime](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime)

---
updated-dependencies:
- dependency-name: "@babel/runtime"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 21:34:02 +08:00
dependabot[bot]
a3b57e11d4 Bump prismjs from 1.29.0 to 1.30.0 in /website (#1411)
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.29.0 to 1.30.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.29.0...v1.30.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 14:06:41 +08:00
dependabot[bot]
a80dcf9925 Bump @babel/runtime-corejs3 from 7.20.1 to 7.26.10 in /website (#1412)
Bumps [@babel/runtime-corejs3](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime-corejs3) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime-corejs3)

---
updated-dependencies:
- dependency-name: "@babel/runtime-corejs3"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-13 10:04:03 +08:00
SkBlaz
7157af44e0 Improved error handling in case no scikit present (#1402)
* Improved error handling in case no scikit present

Currently there is no description for when this error is thrown. Being explicit seems of value.

* Update histgb.py

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-03 15:39:43 +08:00
Li Jiang
1798c4591e Upgrade setuptools (#1410) 2025-03-01 08:05:51 +08:00
Li Jiang
dd26263330 Bump version to 2.3.5 (#1409) 2025-02-17 22:26:59 +08:00
Li Jiang
2ba5f8bed1 Fix params pop error (#1408) 2025-02-17 15:06:05 +08:00
Daniel Grindrod
d0a11958a5 fix: Fixed bug where group folds and sample weights couldn't be used in the same automl instance (#1405) 2025-02-15 10:41:27 +08:00
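#1405 concerns supplying group folds and per-row sample weights to the same AutoML instance. A hedged sketch of such a call on synthetic data:

```python
import numpy as np
from flaml import AutoML

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + rng.normal(scale=0.1, size=400) > 0).astype(int)
groups = rng.integers(0, 20, size=400)           # group id per row
sample_weight = rng.uniform(0.5, 2.0, size=400)  # per-row weight

automl = AutoML()
automl.fit(
    X, y,
    task="classification",
    eval_method="cv",
    split_type="group",
    groups=groups,
    sample_weight=sample_weight,
    time_budget=30,
)
```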
dependabot[bot]
0ef9b00a75 Bump serialize-javascript from 6.0.0 to 6.0.2 in /website (#1407)
Bumps [serialize-javascript](https://github.com/yahoo/serialize-javascript) from 6.0.0 to 6.0.2.
- [Release notes](https://github.com/yahoo/serialize-javascript/releases)
- [Commits](https://github.com/yahoo/serialize-javascript/compare/v6.0.0...v6.0.2)

---
updated-dependencies:
- dependency-name: serialize-javascript
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-02-14 12:36:49 +08:00
Will Charles
840f76e5e5 Changed tune.report import for ray>=2 (#1392)
* Changed tune.report import for ray>=2

* env: Changed pydantic restriction in env

* Reverted Pydantic install conditions

* Reverted Pydantic install conditions

* test: Check if GPU is available

* tests: uncommented a line

* tests: Better fix for Ray GPU checking

* tests: Added timeout to dataset loading

* tests: Deleted _test_hf_data()

* test: Reduce lrl2 dataset size

* bug: timeout error

* bug: timeout error

* fix: Added threading check for timout issue

* Undo old commits

* Timeout fix from #1406

---------

Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
2025-02-14 09:38:33 +08:00
Li Jiang
d8b7d25b80 Fix test hang issue (#1406)
* Add try except to resource.setrlimit

* Set time limit only in main thread

* Check only test model

* Pytest debug

* Test separately

* Move test_model.py to automl folder
2025-02-13 19:50:35 +08:00
Li Jiang
6d53929803 Bump version to 2.3.4 (#1389) 2024-12-18 12:49:59 +08:00
Daniel Grindrod
c038fbca07 fix: KeyError no longer occurs when using group folds for regression tasks. (#1385)
* fix: Now resetting indexes for regression datasets when using group folds

* refactor: Simplified if statement to include all fold types

* docs: Updated docs to make it clear that group folds can be used for regression tasks

---------

Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-18 10:06:58 +08:00
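Per the docs change in #1385, 'group' is now listed as a valid split_type for regression tasks as well. A hedged sketch on synthetic data:

```python
import numpy as np
from flaml import AutoML

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=400)
groups = rng.integers(0, 25, size=400)

automl = AutoML()
automl.fit(X, y, task="regression", eval_method="cv",
           split_type="group", groups=groups, time_budget=30)
```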
dependabot[bot]
6a99202492 Bump nanoid from 3.3.6 to 3.3.8 in /website (#1387)
Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.3.6...3.3.8)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-17 19:26:34 +08:00
Daniel Grindrod
42d1dcfa0e fix: Fixed bug with catboost and groups (#1383)
Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
2024-12-17 13:54:49 +08:00
EgorKraevTransferwise
b83c8a7d3b Pass cost_attr and cost_budget from flaml.tune.run() to the search algo (#1382) 2024-12-04 20:50:15 +08:00
dependabot[bot]
b9194cdcf2 Bump cross-spawn from 7.0.3 to 7.0.6 in /website (#1379)
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md)
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.6)

---
updated-dependencies:
- dependency-name: cross-spawn
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-20 15:48:39 +08:00
Li Jiang
9a1f6b0291 Bump version to 2.3.3 (#1378) 2024-11-13 11:44:34 +08:00
kernelmethod
07f4413aae Fix logging nuisances that can arise when importing flaml (#1377) 2024-11-13 07:49:55 +08:00
Daniel Grindrod
5a74227bc3 Flaml: fix lgbm reproducibility (#1369)
* fix: Fixed bug where every underlying LGBMRegressor or LGBMClassifier had n_estimators = 1

* test: Added test showing case where FLAMLised CatBoostModel result isn't reproducible

* fix: Fixing issue where callbacks cause LGBM results to not be reproducible

* Update test/automl/test_regression.py

Co-authored-by: Li Jiang <bnujli@gmail.com>

* fix: Adding back the LGBM EarlyStopping

* refactor: Fix tweaked to ensure other models aren't likely to be affected

* test: Fixed test to allow reproduced results to be better than the FLAML results, when LGBM earlystopping is involved

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-11-01 10:06:15 +08:00
Ranuga
7644958e21 Add documentation for automl.model.estimator usage (#1311)
* Added documentation for automl.model.estimator usage

Updated documentation across various examples and the model.py file to cover automl.model.estimator, giving users clear guidance on how to access the underlying estimator in their AutoML workflows.

* fix: Ran pre-commit hook on docs

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 20:53:54 +08:00
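The documentation added in #1311 covers accessing the underlying estimator after a search. A short usage sketch (dataset, learner list, and budget are arbitrary):

```python
from flaml import AutoML
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
automl = AutoML()
automl.fit(X, y, task="regression", time_budget=20, estimator_list=["lgbm"])

print(automl.best_estimator)    # name of the winning learner, e.g. "lgbm"
print(automl.model)             # FLAML's wrapper around the trained model
print(automl.model.estimator)   # the underlying LGBMRegressor, usable directly
y_pred = automl.model.estimator.predict(X[:5])
```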
Daniel Grindrod
a316f84fe1 fix: LinearSVC results now reproducible (#1376)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 14:02:16 +08:00
Daniel Grindrod
72881d3a2b fix: Fixing the random state of ElasticNetClassifier by default, to ensure reproducibility. Also included elasticnet in reproducibility tests (#1374)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-29 14:21:43 +08:00
Li Jiang
69da685d1e Fix data transform issue, spark log_loss metric compute error and json dumps TypeError (Sync Fabric till 3c545e67) (#1371)
* Merged PR 1444697: Fix json dumps TypeError

Fix json dumps TypeError

----
Bug fix to address a `TypeError` in `json.dumps`.

This pull request fixes a `TypeError` encountered when using `json.dumps` on `automl._automl_user_configurations` by introducing a safe JSON serialization function.
- Added `safe_json_dumps` function in `flaml/fabric/mlflow.py` to handle non-serializable objects.
- Updated `MLflowIntegration` class in `flaml/fabric/mlflow.py` to use `safe_json_dumps` for JSON serialization.
- Modified `test/automl/test_multiclass.py` to test the new `safe_json_dumps` function.

Related work items: #3439408

* Fix data transform issue and spark log_loss metric compute error
2024-10-29 11:58:40 +08:00
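`safe_json_dumps` lives in flaml/fabric/mlflow.py and its exact implementation isn't shown here; the idea of falling back to a string representation for values `json.dumps` can't handle can be sketched as:

```python
import json

import numpy as np


def safe_json_dumps(obj, **kwargs):
    """Serialize obj, stringifying anything json.dumps cannot handle (illustrative only)."""
    return json.dumps(obj, default=str, **kwargs)


# user configurations holding numpy types or estimator classes no longer raise TypeError
print(safe_json_dumps({"time_budget": 60, "dtype": np.float64}))
```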
Li Jiang
c01c3910eb Update version.py (#1372) 2024-10-29 09:33:23 +08:00
dependabot[bot]
98d3fd2f48 Bump http-proxy-middleware from 2.0.6 to 2.0.7 in /website (#1370)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.6 to 2.0.7.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.7/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.6...v2.0.7)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 10:43:28 +08:00
Li Jiang
9724c626cc Remove outdated comment (#1366) 2024-10-24 12:17:21 +08:00
smty2018
0d92400200 Documented that retrain_full = True does not include the user-provided validation data. #1228 (#1245)
* Update Task-Oriented-AutoML.md

* Update Task-Oriented-AutoML.md

* Update marker

* Fix format

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-23 16:48:45 +08:00
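Per the docs change in #1245, with retrain_full=True the final refit uses only the training data, not the user-provided validation set. A hedged example of the holdout setup it refers to:

```python
from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

automl = AutoML()
automl.fit(
    X_train, y_train,
    X_val=X_val, y_val=y_val,   # used for model selection only
    eval_method="holdout",
    retrain_full=True,          # the final model is refit on X_train/y_train alone
    task="classification",
    time_budget=20,
)
```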
Daniel Grindrod
d224218ecf fix: FLAML catboost metrics aren't reproducible (#1364)
* fix: CatBoostRegressors metrics are now reproducible

* test: Made tests live, which ensure the reproducibility of catboost models

* fix: Added defunct line of code as a comment

* fix: Re-adding removed if statement, and test to show one issue that if statement can cause

* fix: Stopped ending CatBoost training early when time budget is running out

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-23 13:51:23 +08:00
Daniel Grindrod
a2a5e1abb9 test: Adding tests to verify model reproducibility (#1362) 2024-10-12 09:53:16 +08:00
Daniel Grindrod
5c0f18b7bc fix: Cross validation process isn't always run to completion (#1360) 2024-10-01 08:24:53 +08:00
dependabot[bot]
e5d95f5674 Bump express from 4.19.2 to 4.21.0 in /website (#1357) 2024-09-22 11:01:00 +08:00
Li Jiang
49ba962d47 Support logger_formatter without automl dependencies (#1356) 2024-09-21 20:04:46 +08:00
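#1356 makes `logger_formatter` usable without the automl extras (see the flaml/__init__.py hunk later on this page, which shows the `flaml.automl.logger` import path). A minimal sketch of attaching it to a handler:

```python
import logging

from flaml.automl.logger import logger_formatter  # import path shown in the diff on this page

handler = logging.StreamHandler()
handler.setFormatter(logger_formatter)
logging.getLogger("flaml").addHandler(handler)
logging.getLogger("flaml").setLevel(logging.INFO)
```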
Li Jiang
8e171bc402 Remove temporary pickle files (#1354)
* Remove temporary pickle files

* Update version to 2.3.1

* Use TemporaryDirectory for pickle and log_artifact

* Fix 'CatBoostClassifier' object has no attribute '_get_param_names'
2024-09-21 15:46:32 +08:00
dependabot[bot]
c90946f303 Bump webpack from 5.76.1 to 5.94.0 in /website (#1342)
Bumps [webpack](https://github.com/webpack/webpack) from 5.76.1 to 5.94.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.76.1...v5.94.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-06 11:56:42 +08:00
dependabot[bot]
64f30af603 Bump micromatch from 4.0.5 to 4.0.8 in /website (#1343)
Bumps [micromatch](https://github.com/micromatch/micromatch) from 4.0.5 to 4.0.8.
- [Release notes](https://github.com/micromatch/micromatch/releases)
- [Changelog](https://github.com/micromatch/micromatch/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/micromatch/compare/4.0.5...4.0.8)

---
updated-dependencies:
- dependency-name: micromatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-09-05 15:18:26 +08:00
Li Jiang
f45582d3c7 Add info of tutorial automl 2024 (#1344)
* Add info of tutorial automl 2024

* Add notebooks

* Fix links

* Update usage of built-in LLMs
2024-09-04 19:35:09 +08:00
Li Jiang
bf4bca2195 Add contributors wall (#1341)
* Add contributors wall

* code format
2024-08-30 22:33:44 +08:00
Li Jiang
efaba26d2e Update version and readme (#1338)
* Update version and readme

* Update pr template
2024-08-22 22:33:23 +00:00
Li Jiang
62194f321d Update issue templates (#1337) 2024-08-21 10:00:48 +00:00
Li Jiang
5bfa0b1cd3 Improve mlflow integration and add more models (#1331)
* Add more spark models and improved mlflow integration

* Update test_extra_models, setup and gitignore

* Remove autofe

* Remove autofe

* Remove autofe

* Sync changes in internal

* Fix test for env without pyspark

* Fix import errors

* Fix tests

* Fix typos

* Fix pytorch-forecasting version

* Remove internal funcs, rename _mlflow.py

* Fix import error

* Fix dependency

* Fix experiment name setting

* Fix dependency

* Update pandas version

* Update pytorch-forecasting version

* Add warning message for not has_automl

* Fix test errors with nltk 3.8.2

* Don't enable mlflow logging w/o an active run

* Fix pytorch-forecasting can't be pickled issue

* Update pyspark tests condition

* Update synapseml

* Update synapseml

* No parent run, no logging for OSS

* Log when autolog is enabled

* upgrade code

* Enable autolog for tune

* Increase time budget for test

* End run before start a new run

* Update parent run

* Fix import error

* clean up

* skip macos and win

* Update notes

* Update default value of model_history
2024-08-13 07:53:47 +00:00
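Several notes in #1331 concern when FLAML logs to MLflow (only with an active run or autologging enabled; trials go into nested runs). A hedged usage sketch, with a hypothetical experiment name:

```python
import mlflow
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("flaml-automl-demo")  # hypothetical experiment name
with mlflow.start_run():                    # FLAML logs only when a run is active (or autolog is on)
    automl = AutoML()
    automl.fit(X, y, task="classification", time_budget=20, mlflow_logging=True)
```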
dependabot[bot]
bd34b4e75a Bump express from 4.18.2 to 4.19.2 in /website (#1293)
Bumps [express](https://github.com/expressjs/express) from 4.18.2 to 4.19.2.
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.2...4.19.2)

---
updated-dependencies:
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:55:25 +00:00
dependabot[bot]
7670945298 Bump follow-redirects from 1.15.4 to 1.15.6 in /website (#1291)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.4 to 1.15.6.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.4...v1.15.6)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:52:11 +00:00
dependabot[bot]
43537cb539 Bump webpack-dev-middleware from 5.3.3 to 5.3.4 in /website (#1292)
Bumps [webpack-dev-middleware](https://github.com/webpack/webpack-dev-middleware) from 5.3.3 to 5.3.4.
- [Release notes](https://github.com/webpack/webpack-dev-middleware/releases)
- [Changelog](https://github.com/webpack/webpack-dev-middleware/blob/v5.3.4/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-dev-middleware/compare/v5.3.3...v5.3.4)

---
updated-dependencies:
- dependency-name: webpack-dev-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:50:17 +00:00
Gökhan Geyik
f913b79225 Fix(doc): Page Not Found (#1296)
- Fix the redirect link that received a page not found error.

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-08-12 12:01:46 +00:00
dependabot[bot]
a092a39b5e Bump braces from 3.0.2 to 3.0.3 in /website (#1336)
Bumps [braces](https://github.com/micromatch/braces) from 3.0.2 to 3.0.3.
- [Changelog](https://github.com/micromatch/braces/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/braces/compare/3.0.2...3.0.3)

---
updated-dependencies:
- dependency-name: braces
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 08:37:56 +00:00
Jirka Borovec
04bf1b8741 update py versions, sourced from PyPI (#1332)
* update py versions, sourced from PyPI

* lint

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 04:53:48 +00:00
Jirka Borovec
b348cb1136 configure & apply pyupgrade with py3.8+ (#1333)
* configure pyupgrade with `py3.8+`

* apply update

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:54:18 +00:00
Jirka Borovec
cd0e88e383 fix missing req. arg for new datasets package (#1334)
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:19:11 +00:00
Li Jiang
a17c6e392e Fix test errors of nltk and numpy (#1335)
* Fix test errors with nltk 3.8.2

* Fix test errors with numpy large

* Fix test errors with numpy large
2024-08-12 00:14:21 +00:00
Li Jiang
52627ff14b Add 3.11 icon (#1330) 2024-08-08 06:18:49 +00:00
113 changed files with 14447 additions and 889 deletions

.github/ISSUE_TEMPLATE.md

@@ -0,0 +1,73 @@
### Description
<!-- A clear and concise description of the issue or feature request. -->
### Environment
- FLAML version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Python version: <!-- Specify the Python version (e.g., 3.8) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
### Steps to Reproduce (for bugs)
<!-- Provide detailed steps to reproduce the issue. Include code snippets, configuration files, or any other relevant information. -->
1. Step 1
1. Step 2
1. ...
### Expected Behavior
<!-- Describe what you expected to happen. -->
### Actual Behavior
<!-- Describe what actually happened. Include any error messages, stack traces, or unexpected behavior. -->
### Screenshots / Logs (if applicable)
<!-- If relevant, include screenshots or logs that help illustrate the issue. -->
### Additional Information
<!-- Include any additional information that might be helpful, such as specific configurations, data samples, or context about the environment. -->
### Possible Solution (if you have one)
<!-- If you have suggestions on how to address the issue, provide them here. -->
### Is this a Bug or Feature Request?
<!-- Choose one: Bug | Feature Request -->
### Priority
<!-- Choose one: High | Medium | Low -->
### Difficulty
<!-- Choose one: Easy | Moderate | Hard -->
### Any related issues?
<!-- If this is related to another issue, reference it here. -->
### Any relevant discussions?
<!-- If there are any discussions or forum threads related to this issue, provide links. -->
### Checklist
<!-- Please check the items that you have completed -->
- [ ] I have searched for similar issues and didn't find any duplicates.
- [ ] I have provided a clear and concise description of the issue.
- [ ] I have included the necessary environment details.
- [ ] I have outlined the steps to reproduce the issue.
- [ ] I have included any relevant logs or screenshots.
- [ ] I have indicated whether this is a bug or a feature request.
- [ ] I have set the priority and difficulty levels.
### Additional Comments
<!-- Any additional comments or context that you think would be helpful. -->

.github/ISSUE_TEMPLATE/bug_report.yml

@@ -0,0 +1,53 @@
name: Bug Report
description: File a bug report
title: "[Bug]: "
labels: ["bug"]
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:
        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: modelused
    attributes:
      label: Model Used
      description: A description of the model that was used when the error was encountered
      placeholder: gpt-4, mistral-7B etc
  - type: textarea
    id: expected_behavior
    attributes:
      label: Expected Behavior
      description: A clear and concise description of what you expected to happen.
      placeholder: What should have happened?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details

.github/ISSUE_TEMPLATE/config.yml

@@ -0,0 +1 @@
blank_issues_enabled: true


@@ -0,0 +1,26 @@
name: Feature Request
description: File a feature request
labels: ["enhancement"]
title: "[Feature Request]: "
body:
  - type: textarea
    id: problem_description
    attributes:
      label: Is your feature request related to a problem? Please describe.
      description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
      placeholder: What problem are you trying to solve?
  - type: textarea
    id: solution_description
    attributes:
      label: Describe the solution you'd like
      description: A clear and concise description of what you want to happen.
      placeholder: How do you envision the solution?
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context or screenshots about the feature request here.
      placeholder: Any additional information


@@ -0,0 +1,41 @@
name: General Issue
description: File a general issue
title: "[Issue]: "
labels: []
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the issue
      description: A clear and concise description of what the issue is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:
        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details


@@ -12,8 +12,7 @@
## Checks
<!-- - I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same in integrated in our CI checks). -->
- [ ] I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same in integrated in our CI checks).
- [ ] I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.


@@ -12,26 +12,17 @@ jobs:
deploy:
strategy:
matrix:
os: ['ubuntu-latest']
python-version: [3.8]
os: ["ubuntu-latest"]
python-version: ["3.10"]
runs-on: ${{ matrix.os }}
environment: package
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Cache conda
uses: actions/cache@v3
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
path: ~/conda_pkgs_dir
key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
- name: Setup Miniconda
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
auto-activate-base: false
activate-environment: hcrystalball
python-version: ${{ matrix.python-version }}
use-only-tar-bz2: true
- name: Install from source
# This is required for the pre-commit tests
shell: pwsh
@@ -42,7 +33,7 @@ jobs:
- name: Build
shell: pwsh
run: |
pip install twine
pip install twine wheel setuptools
python setup.py sdist bdist_wheel
- name: Publish to PyPI
env:


@@ -37,11 +37,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0
- name: pydoc-markdown run
run: |
pydoc-markdown
@@ -73,11 +73,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0
- name: pydoc-markdown run
run: |
pydoc-markdown


@@ -14,6 +14,12 @@ on:
- 'setup.py'
pull_request:
branches: ['main']
paths:
- 'flaml/**'
- 'test/**'
- 'notebook/**'
- '.github/workflows/python-package.yml'
- 'setup.py'
merge_group:
types: [checks_requested]
@@ -29,8 +35,8 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-2019]
python-version: ["3.8", "3.9", "3.10", "3.11"]
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
@@ -50,14 +56,19 @@ jobs:
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
- name: Install packages and dependencies
run: |
python -m pip install --upgrade pip wheel
python -m pip install --upgrade pip wheel setuptools
pip install -e .
python -c "import flaml"
pip install -e .[test]
- name: On Ubuntu python 3.8, install pyspark 3.2.3
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-latest'
- name: On Ubuntu python 3.10, install pyspark 3.4.1
if: matrix.python-version == '3.10' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.2.3
pip install pyspark==3.4.1
pip list | grep "pyspark"
- name: On Ubuntu python 3.11, install pyspark 3.5.1
if: matrix.python-version == '3.11' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.5.1
pip list | grep "pyspark"
- name: If linux and python<3.11, install ray 2
if: matrix.os == 'ubuntu-latest' && matrix.python-version != '3.11'
@@ -77,20 +88,15 @@ jobs:
if: matrix.python-version == '3.8' || matrix.python-version == '3.9'
run: |
pip install -e .[vw]
- name: Uninstall pyspark on (python 3.9) or windows
if: matrix.python-version == '3.9' || matrix.os == 'windows-2019'
run: |
# Uninstall pyspark to test env without pyspark
pip uninstall -y pyspark
- name: Test with pytest
if: matrix.python-version != '3.10'
run: |
pytest test
pytest test/ --ignore=test/autogen
- name: Coverage
if: matrix.python-version == '3.10'
run: |
pip install coverage
coverage run -a -m pytest test
coverage run -a -m pytest test --ignore=test/autogen
coverage xml
- name: Upload coverage to Codecov
if: matrix.python-version == '3.10'

.gitignore

@@ -163,6 +163,24 @@ output/
flaml/tune/spark/mylearner.py
*.pkl
data/
benchmark/pmlb/csv_datasets
benchmark/*.csv
checkpoints/
test/default
test/housing.json
test/nlp/default/transformer_ms/seq-classification.json
flaml/fabric/fanova/_fanova.c
# local config files
*.config.local
local_debug/
patch.diff
# Test things
notebook/lightning_logs/
lightning_logs/
flaml/autogen/extensions/tmp/
test/autogen/my_tmp/


@@ -23,6 +23,13 @@ repos:
- id: end-of-file-fixer
- id: no-commit-to-branch
- repo: https://github.com/asottile/pyupgrade
rev: v2.31.1
hooks:
- id: pyupgrade
args: [--py38-plus]
name: Upgrade code
- repo: https://github.com/psf/black
rev: 23.3.0
hooks:


@@ -1,5 +1,5 @@
# basic setup
FROM mcr.microsoft.com/devcontainers/python:3.8
FROM mcr.microsoft.com/devcontainers/python:3.10
RUN apt-get update && apt-get -y update
RUN apt-get install -y sudo git npm


@@ -1,7 +1,7 @@
[![PyPI version](https://badge.fury.io/py/FLAML.svg)](https://badge.fury.io/py/FLAML)
![Conda version](https://img.shields.io/conda/vn/conda-forge/flaml)
[![Build](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml/badge.svg)](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml)
![Python Version](https://img.shields.io/badge/3.8%20%7C%203.9%20%7C%203.10-blue)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/FLAML)](https://pypi.org/project/FLAML/)
[![Downloads](https://pepy.tech/badge/flaml)](https://pepy.tech/project/flaml)
[![](https://img.shields.io/discord/1025786666260111483?logo=discord&style=flat)](https://discord.gg/Cppx2vSPVP)
@@ -14,6 +14,8 @@
<br>
</p>
:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 support, along with a range of new estimators, and comprehensive integration with MLflow—thanks to contributions from the Microsoft Fabric product team.
:fire: Heads-up: We have migrated [AutoGen](https://microsoft.github.io/autogen/) into a dedicated [github repository](https://github.com/microsoft/autogen). Alongside this move, we have also launched a dedicated [Discord](https://discord.gg/pAbnFJrkgZ) server and a [website](https://microsoft.github.io/autogen/) for comprehensive documentation.
:fire: The automated multi-agent chat framework in [AutoGen](https://microsoft.github.io/autogen/) is in preview from v2.0.0.
@@ -22,8 +24,6 @@
:fire: [autogen](https://microsoft.github.io/autogen/) is released with support for ChatGPT and GPT-4, based on [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673).
:fire: FLAML supports Code-First AutoML & Tuning Private Preview in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/).
## What is FLAML
FLAML is a lightweight Python library for efficient automation of machine
@@ -40,7 +40,7 @@ FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source,
## Installation
FLAML requires **Python version >= 3.8**. It can be installed from pip:
FLAML requires **Python version >= 3.9**. It can be installed from pip:
```bash
pip install flaml
@@ -154,3 +154,9 @@ provided by the bot. You will only need to do this once across all repos using o
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Contributors Wall
<a href="https://github.com/microsoft/flaml/graphs/contributors">
<img src="https://contrib.rocks/image?repo=microsoft/flaml&max=204" />
</a>


@@ -1,10 +1,20 @@
import logging
import warnings
from flaml.automl import AutoML, logger_formatter
try:
from flaml.automl import AutoML, logger_formatter
has_automl = True
except ImportError:
has_automl = False
from flaml.onlineml.autovw import AutoVW
from flaml.tune.searcher import CFO, FLOW2, BlendSearch, BlendSearchTuner, RandomSearch
from flaml.version import __version__
# Set the root logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
if logger.level == logging.NOTSET:
logger.setLevel(logging.INFO)
if not has_automl:
warnings.warn("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")


@@ -156,7 +156,7 @@ class MathUserProxyAgent(UserProxyAgent):
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,


@@ -123,7 +123,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
- customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,


@@ -125,7 +125,7 @@ def improve_function(file_name, func_name, objective, **config):
"""(work in progress) Improve the function to achieve the objective."""
params = {**_IMPROVE_FUNCTION_CONFIG, **config}
# read the entire file into a str
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
response = oai.Completion.create(
{"func_name": func_name, "objective": objective, "file_string": file_string}, **params
@@ -158,7 +158,7 @@ def improve_code(files, objective, suggest_only=True, **config):
code = ""
for file_name in files:
# read the entire file into a string
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
code += f"""{file_name}:
{file_string}


@@ -130,7 +130,7 @@ def _fix_a_slash_b(string: str) -> str:
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
assert string == f"{a}/{b}"
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:


@@ -126,7 +126,7 @@ def split_files_to_chunks(
"""Split a list of files into chunks of max_tokens."""
chunks = []
for file in files:
with open(file, "r") as f:
with open(file) as f:
text = f.read()
chunks += split_text_to_chunks(text, max_tokens, chunk_mode, must_break_at_empty_line)
return chunks


@@ -1,5 +1,9 @@
from flaml.automl.automl import AutoML, size
from flaml.automl.logger import logger_formatter
from flaml.automl.state import AutoMLState, SearchState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
try:
from flaml.automl.automl import AutoML, size
from flaml.automl.state import AutoMLState, SearchState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
except ImportError:
__all__ = ["logger_formatter"]


@@ -7,8 +7,10 @@ from __future__ import annotations
import json
import logging
import os
import random
import sys
import time
from concurrent.futures import as_completed
from functools import partial
from typing import Callable, List, Optional, Union
@@ -16,7 +18,7 @@ import numpy as np
from flaml import tune
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.ml import train_estimator
from flaml.automl.ml import huggingface_metric_to_mode, sklearn_metric_name_set, spark_metric_name_dict, train_estimator
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.state import AutoMLState, SearchState
from flaml.automl.task.factory import task_factory
@@ -45,6 +47,7 @@ ERROR = (
try:
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
except ImportError:
BaseEstimator = object
ERROR = ERROR or ImportError("please install flaml[automl] option to use the flaml.automl package.")
@@ -54,6 +57,14 @@ try:
except ImportError:
mlflow = None
try:
from flaml.fabric.mlflow import MLflowIntegration, get_mlflow_log_latency, infer_signature, is_autolog_enabled
internal_mlflow = True
except ImportError:
internal_mlflow = False
try:
from ray import __version__ as ray_version
@@ -171,15 +182,22 @@ class AutoML(BaseEstimator):
'better' only logs configs with better loss than previos iters
'all' logs all the tried configs.
model_history: A boolean of whether to keep the best
model per estimator. Make sure memory is large enough if setting to True.
model per estimator. Make sure memory is large enough if setting to True. Default False.
log_training_metric: A boolean of whether to log the training
metric for each model.
mem_thres: A float of the memory size constraint in bytes.
pred_time_limit: A float of the prediction latency constraint in seconds.
It refers to the average prediction time per row in validation data.
train_time_limit: A float of the training time constraint in seconds.
train_time_limit: None or a float of the training time constraint in seconds for each trial.
Only valid for sequential search.
verbose: int, default=3 | Controls the verbosity, higher means more
messages.
verbose=0: logger level = CRITICAL
verbose=1: logger level = ERROR
verbose=2: logger level = WARNING
verbose=3: logger level = INFO
verbose=4: logger level = DEBUG
verbose>5: logger level = NOTSET
retrain_full: bool or str, default=True | whether to retrain the
selected model on the full training data when using holdout.
True - retrain only after search finishes; False - no retraining;
@@ -193,7 +211,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -247,7 +265,10 @@ class AutoML(BaseEstimator):
search is considered to converge.
force_cancel: boolean, default=False | Whether to forcely cancel Spark jobs if the
search time exceeded the time budget.
append_log: boolean, default=False | Whether to directly append the log
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
enable mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
same name as the basename of main entry file.
append_log: boolean, default=False | Whetehr to directly append the log
records to the input log file if it exists.
auto_augment: boolean, default=True | Whether to automatically
augment rare classes.
@@ -320,9 +341,7 @@ class AutoML(BaseEstimator):
}
}
```
mlflow_logging: boolean, default=True | Whether to log the training results to mlflow.
This requires mlflow to be installed and to have an active mlflow run.
FLAML will create nested runs.
mlflow_logging: boolean, default=True | Whether to log the training results to mlflow. Not valid if mlflow is not installed.
"""
if ERROR:
@@ -331,6 +350,8 @@ class AutoML(BaseEstimator):
self._state = AutoMLState()
self._state.learner_classes = {}
self._settings = settings
self._automl_user_configurations = settings.copy()
self._settings.pop("automl_user_configurations", None)
# no budget by default
settings["time_budget"] = settings.get("time_budget", -1)
settings["task"] = settings.get("task", "classification")
@@ -362,6 +383,7 @@ class AutoML(BaseEstimator):
settings["preserve_checkpoint"] = settings.get("preserve_checkpoint", True)
settings["early_stop"] = settings.get("early_stop", False)
settings["force_cancel"] = settings.get("force_cancel", False)
settings["mlflow_exp_name"] = settings.get("mlflow_exp_name", None)
settings["append_log"] = settings.get("append_log", False)
settings["min_sample_size"] = settings.get("min_sample_size", MIN_SAMPLE_TRAIN)
settings["use_ray"] = settings.get("use_ray", False)
@@ -377,6 +399,7 @@ class AutoML(BaseEstimator):
settings["mlflow_logging"] = settings.get("mlflow_logging", True)
self._estimator_type = "classifier" if settings["task"] in CLASSIFICATION else "regressor"
self.best_run_id = None
def get_params(self, deep: bool = False) -> dict:
return self._settings.copy()
@@ -409,6 +432,8 @@ class AutoML(BaseEstimator):
If `model_history` was set to True, then the returned model is trained.
"""
state = self._search_states.get(estimator_name)
if state and estimator_name == self._best_estimator:
return self.model
return state and getattr(state, "trained_estimator", None)
@property
@@ -475,14 +500,29 @@ class AutoML(BaseEstimator):
with open(filename, "w") as f:
json.dump(best, f)
@property
def supported_metrics(self):
"""
Returns a tuple of supported metrics for the task.
Returns:
metrics (Tuple): sklearn metrics from sklearn package;
huggingface metrics from datasets package;
spark metrics from pyspark package
"""
return sklearn_metric_name_set, huggingface_metric_to_mode.keys(), spark_metric_name_dict
@property
def feature_transformer(self):
"""Returns feature transformer which is used to preprocess data before applying training or inference."""
return getattr(self, "_transformer", None)
"""Returns AutoML Transformer"""
data_precessor = getattr(self, "_transformer", None)
return data_precessor
@property
def label_transformer(self):
"""Returns label transformer which is used to preprocess labels before scoring, and inverse transform labels after inference."""
"""Returns AutoML label transformer"""
return getattr(self, "_label_transformer", None)
@property
@@ -521,8 +561,8 @@ class AutoML(BaseEstimator):
def score(
self,
X: Union[DataFrame, psDataFrame],
y: Union[Series, psSeries],
X: DataFrame | psDataFrame,
y: Series | psSeries,
**kwargs,
):
estimator = getattr(self, "_trained_estimator", None)
@@ -536,7 +576,7 @@ class AutoML(BaseEstimator):
def predict(
self,
X: Union[np.array, DataFrame, List[str], List[List[str]], psDataFrame],
X: np.array | DataFrame | list[str] | list[list[str]] | psDataFrame,
**pred_kwargs,
):
"""Predict label from features.
@@ -611,7 +651,7 @@ class AutoML(BaseEstimator):
"""
self._state.learner_classes[learner_name] = learner_class
def get_estimator_from_log(self, log_file_name: str, record_id: int, task: Union[str, Task]):
def get_estimator_from_log(self, log_file_name: str, record_id: int, task: str | Task):
"""Get the estimator from log file.
Args:
@@ -653,7 +693,7 @@ class AutoML(BaseEstimator):
dataframe=None,
label=None,
time_budget=np.inf,
task: Optional[Union[str, Task]] = None,
task: str | Task | None = None,
eval_method=None,
split_ratio=None,
n_splits=None,
@@ -709,7 +749,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -779,7 +819,7 @@ class AutoML(BaseEstimator):
max_epochs: int, default = 20 | Maximum number of epochs to run training,
only used by TemporalFusionTransformerEstimator.
batch_size: int, default = 64 | Batch size for training model, only
used by TemporalFusionTransformerEstimator.
used by TemporalFusionTransformerEstimator and TCNEstimator.
"""
task = task or self._settings.get("task")
if isinstance(task, str):
@@ -802,7 +842,7 @@ class AutoML(BaseEstimator):
)
task.validate_data(self, self._state, X_train, y_train, dataframe, label, groups=groups)
logger.info("log file name {}".format(log_file_name))
logger.info(f"log file name {log_file_name}")
best_config = None
best_val_loss = float("+inf")
@@ -855,9 +895,7 @@ class AutoML(BaseEstimator):
else:
self._state.fit_kwargs_by_estimator[best_estimator] = self._state.fit_kwargs
logger.info(
"estimator = {}, config = {}, #training instances = {}".format(best_estimator, best_config, sample_size)
)
logger.info(f"estimator = {best_estimator}, config = {best_config}, #training instances = {sample_size}")
# Partially copied from fit() function
# Initilize some attributes required for retrain_from_log
self._split_type = task.decide_split_type(
@@ -1028,7 +1066,7 @@ class AutoML(BaseEstimator):
return points
@property
def resource_attr(self) -> Optional[str]:
def resource_attr(self) -> str | None:
"""Attribute of the resource dimension.
Returns:
@@ -1038,7 +1076,7 @@ class AutoML(BaseEstimator):
return "FLAML_sample_size" if self._sample else None
@property
def min_resource(self) -> Optional[float]:
def min_resource(self) -> float | None:
"""Attribute for pruning.
Returns:
@@ -1047,7 +1085,7 @@ class AutoML(BaseEstimator):
return self._min_sample_size if self._sample else None
@property
def max_resource(self) -> Optional[float]:
def max_resource(self) -> float | None:
"""Attribute for pruning.
Returns:
@@ -1069,7 +1107,7 @@ class AutoML(BaseEstimator):
pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
@property
def trainable(self) -> Callable[[dict], Optional[float]]:
def trainable(self) -> Callable[[dict], float | None]:
"""Training function.
Returns:
A function that evaluates each config and returns the loss.
@@ -1155,7 +1193,7 @@ class AutoML(BaseEstimator):
dataframe=None,
label=None,
metric=None,
task: Optional[Union[str, Task]] = None,
task: str | Task | None = None,
n_jobs=None,
# gpu_per_trial=0,
log_file_name=None,
@@ -1203,6 +1241,7 @@ class AutoML(BaseEstimator):
skip_transform=None,
mlflow_logging=None,
fit_kwargs_by_estimator=None,
mlflow_exp_name=None,
**fit_kwargs,
):
"""Find a model for a given task.
@@ -1296,14 +1335,15 @@ class AutoML(BaseEstimator):
'all' logs all the tried configs.
model_history: A boolean of whether to keep the trained best
model per estimator. Make sure memory is large enough if setting to True.
Default value is False: best_model_for_estimator would return a
Default value is False. If False, best_model_for_estimator would return a
untrained model for non-best learner.
log_training_metric: A boolean of whether to log the training
metric for each model.
mem_thres: A float of the memory size constraint in bytes.
pred_time_limit: A float of the prediction latency constraint in seconds.
It refers to the average prediction time per row in validation data.
train_time_limit: None or a float of the training time constraint in seconds.
train_time_limit: None or a float of the training time constraint in seconds for each trial.
Only valid for sequential search.
X_val: None or a numpy array or a pandas dataframe of validation data.
y_val: None or a numpy array or a pandas series of validation labels.
sample_weight_val: None or a numpy array of the sample weight of
@@ -1316,6 +1356,12 @@ class AutoML(BaseEstimator):
for training data.
verbose: int, default=3 | Controls the verbosity, higher means more
messages.
verbose=0: logger level = CRITICAL
verbose=1: logger level = ERROR
verbose=2: logger level = WARNING
verbose=3: logger level = INFO
verbose=4: logger level = DEBUG
verbose>5: logger level = NOTSET
retrain_full: bool or str, default=True | whether to retrain the
selected model on the full training data when using holdout.
True - retrain only after search finishes; False - no retraining;
@@ -1329,7 +1375,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -1382,7 +1428,10 @@ class AutoML(BaseEstimator):
early_stop: boolean, default=False | Whether to stop early if the
search is considered to converge.
force_cancel: boolean, default=False | Whether to forcely cancel the PySpark job if overtime.
append_log: boolean, default=False | Whether to directly append the log
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
enable mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
same name as the basename of main entry file.
append_log: boolean, default=False | Whetehr to directly append the log
records to the input log file if it exists.
auto_augment: boolean, default=True | Whether to automatically
augment rare classes.
@@ -1467,9 +1516,7 @@ class AutoML(BaseEstimator):
skip_transform: boolean, default=False | Whether to pre-process data prior to modeling.
mlflow_logging: boolean, default=None | Whether to log the training results to mlflow.
Default value is None, which means the logging decision is made based on
AutoML.__init__'s mlflow_logging argument.
This requires mlflow to be installed and to have an active mlflow run.
FLAML will create nested runs.
AutoML.__init__'s mlflow_logging argument. Not valid if mlflow is not installed.
fit_kwargs_by_estimator: dict, default=None | The user specified keywords arguments, grouped by estimator name.
For TransformersEstimator, available fit_kwargs can be found from
[TrainingArgumentsForAuto](nlp/huggingface/training_args).
@@ -1519,7 +1566,7 @@ class AutoML(BaseEstimator):
max_epochs: int, default = 20 | Maximum number of epochs to run training,
only used by TemporalFusionTransformerEstimator.
batch_size: int, default = 64 | Batch size for training model, only
used by TemporalFusionTransformerEstimator.
used by TemporalFusionTransformerEstimator and TCNEstimator.
"""
self._state._start_time_flag = self._start_time_flag = time.time()
@@ -1570,6 +1617,7 @@ class AutoML(BaseEstimator):
)
early_stop = self._settings.get("early_stop") if early_stop is None else early_stop
force_cancel = self._settings.get("force_cancel") if force_cancel is None else force_cancel
mlflow_exp_name = self._settings.get("mlflow_exp_name") if mlflow_exp_name is None else mlflow_exp_name
# no search budget is provided?
no_budget = time_budget < 0 and max_iter is None and not early_stop
append_log = self._settings.get("append_log") if append_log is None else append_log
@@ -1592,6 +1640,13 @@ class AutoML(BaseEstimator):
_ch.setFormatter(logger_formatter)
logger.addHandler(_ch)
if model_history:
logger.warning(
"With `model_history` set to `True`, all intermediate models are retained in memory, "
"which may significantly increase memory usage and slow down training. "
"Consider setting `model_history=False` to optimize memory and accelerate the training process."
)
if not use_ray and not use_spark and n_concurrent_trials > 1:
if ray_available:
logger.warning(
@@ -1622,7 +1677,6 @@ class AutoML(BaseEstimator):
self._use_ray = use_ray
# use the following condition if we have an estimation of average_trial_time and average_trial_overhead
# self._use_ray = use_ray or n_concurrent_trials > ( average_trial_time + average_trial_overhead) / (average_trial_time)
if self._use_ray is not False:
import ray
@@ -1656,11 +1710,29 @@ class AutoML(BaseEstimator):
self._state.fit_kwargs = fit_kwargs
custom_hp = custom_hp or self._settings.get("custom_hp")
self._skip_transform = self._settings.get("skip_transform") if skip_transform is None else skip_transform
self._mlflow_logging = self._settings.get("mlflow_logging") if mlflow_logging is None else mlflow_logging
self._mlflow_logging = (
False
if mlflow is None
else self._settings.get("mlflow_logging")
if mlflow_logging is None
else mlflow_logging
)
fit_kwargs_by_estimator = fit_kwargs_by_estimator or self._settings.get("fit_kwargs_by_estimator")
self._state.fit_kwargs_by_estimator = fit_kwargs_by_estimator.copy() # shallow copy of fit_kwargs_by_estimator
self._state.weight_val = sample_weight_val
self._mlflow_exp_name = mlflow_exp_name
self.mlflow_integration = None
self.autolog_extra_tag = {
"extra_tag.sid": f"flaml_{flaml_version}_{int(time.time())}_{random.randint(1001, 9999)}"
}
if internal_mlflow and self._mlflow_logging and (mlflow.active_run() or is_autolog_enabled()):
try:
self.mlflow_integration = MLflowIntegration("automl", mlflow_exp_name, extra_tag=self.autolog_extra_tag)
self._mlflow_exp_name = self.mlflow_integration.experiment_name
if not (mlflow.active_run() is not None or is_autolog_enabled()):
self.mlflow_integration.only_history = True
except KeyError:
logger.info("Not running in Fabric; skipping MLflow integration")
task.validate_data(
self,
self._state,
@@ -1688,7 +1760,7 @@ class AutoML(BaseEstimator):
logger.info(f"Data split method: {self._split_type}")
eval_method = self._decide_eval_method(eval_method, time_budget)
self._state.eval_method = eval_method
logger.info("Evaluation method: {}".format(eval_method))
logger.info(f"Evaluation method: {eval_method}")
self._state.cv_score_agg_func = cv_score_agg_func or self._settings.get("cv_score_agg_func")
self._retrain_in_budget = retrain_full == "budget" and (eval_method == "holdout" and self._state.X_val is None)
@@ -1705,13 +1777,9 @@ class AutoML(BaseEstimator):
if sample_size:
_sample_size_from_starting_points[_estimator] = sample_size
elif _point_per_estimator and isinstance(_point_per_estimator, list):
_sample_size_set = set(
[
config["FLAML_sample_size"]
for config in _point_per_estimator
if "FLAML_sample_size" in config
]
)
_sample_size_set = {
config["FLAML_sample_size"] for config in _point_per_estimator if "FLAML_sample_size" in config
}
if _sample_size_set:
_sample_size_from_starting_points[_estimator] = min(_sample_size_set)
if len(_sample_size_set) > 1:
@@ -1729,6 +1797,11 @@ class AutoML(BaseEstimator):
self._min_sample_size_input = min_sample_size
self._prepare_data(eval_method, split_ratio, n_splits)
# infer the signature of the input/output data
if self.mlflow_integration is not None:
self.estimator_signature = infer_signature(self._state.X_train, self._state.y_train)
self.pipeline_signature = infer_signature(X_train, y_train, dataframe, label)
# TODO pull this to task as decide_sample_size
if isinstance(self._min_sample_size, dict):
self._sample = {
@@ -1827,6 +1900,11 @@ class AutoML(BaseEstimator):
and (max_iter > 0 or retrain_full is True)
or max_iter == 1
)
if self.mlflow_integration is not None and all(
[self.mlflow_integration.parent_run_id is None, not self.mlflow_integration.only_history]
):
# force retrain_final to False when there is no active parent run
self._state.retrain_final = False
# add custom learner
for estimator_name in estimator_list:
if estimator_name not in self._state.learner_classes:
@@ -1898,7 +1976,7 @@ class AutoML(BaseEstimator):
max_iter=max_iter / len(estimator_list) if self._learner_selector == "roundrobin" else max_iter,
budget=self._state.time_budget,
)
logger.info("List of ML learners in AutoML Run: {}".format(estimator_list))
logger.info(f"List of ML learners in AutoML Run: {estimator_list}")
self.estimator_list = estimator_list
self._active_estimators = estimator_list.copy()
self._ensemble = ensemble
@@ -1940,7 +2018,7 @@ class AutoML(BaseEstimator):
)
):
logger.warning(
"Time taken to find the best model is {0:.0f}% of the "
"Time taken to find the best model is {:.0f}% of the "
"provided time budget and not all estimators' hyperparameter "
"search converged. Consider increasing the time budget.".format(
self._time_taken_best_iter / self._state.time_budget * 100
@@ -1959,6 +2037,8 @@ class AutoML(BaseEstimator):
) # NOTE: this is after kwargs is updated to fit_kwargs_by_estimator
del self._state.groups, self._state.groups_all, self._state.groups_val
logger.setLevel(old_level)
if self.mlflow_integration is not None:
self.mlflow_integration.resume_mlflow()
def _search_parallel(self):
if self._use_ray is not False:
@@ -2055,6 +2135,14 @@ class AutoML(BaseEstimator):
if self._use_spark:
# use spark as parallel backend
mlflow_log_latency = (
get_mlflow_log_latency(model_history=self._state.model_history) if self.mlflow_integration else 0
)
(
logger.info(f"Estimated mlflow_log_latency: {mlflow_log_latency} seconds.")
if mlflow_log_latency > 0
else None
)
analysis = tune.run(
self.trainable,
search_alg=search_alg,
@@ -2067,6 +2155,9 @@ class AutoML(BaseEstimator):
use_ray=False,
use_spark=True,
force_cancel=self._force_cancel,
mlflow_exp_name=self._mlflow_exp_name,
automl_info=(mlflow_log_latency,), # pass automl info to tune.run
extra_tag=self.autolog_extra_tag,
# raise_on_failed_trial=False,
# keep_checkpoints_num=1,
# checkpoint_score_attr="min-val_loss",
@@ -2127,6 +2218,8 @@ class AutoML(BaseEstimator):
self._search_states[estimator].best_config = config
if better or self._log_type == "all":
self._log_trial(search_state, estimator)
if self.mlflow_integration:
self.mlflow_integration.record_state(self, search_state, estimator)
def _log_trial(self, search_state, estimator):
if self._training_log:
@@ -2140,36 +2233,6 @@ class AutoML(BaseEstimator):
estimator,
search_state.sample_size,
)
if self._mlflow_logging and mlflow is not None and mlflow.active_run():
with mlflow.start_run(nested=True):
mlflow.log_metric("iter_counter", self._track_iter)
if (search_state.metric_for_logging is not None) and (
"intermediate_results" in search_state.metric_for_logging
):
for each_entry in search_state.metric_for_logging["intermediate_results"]:
with mlflow.start_run(nested=True):
mlflow.log_metrics(each_entry)
mlflow.log_metric("iter_counter", self._iter_per_learner[estimator])
del search_state.metric_for_logging["intermediate_results"]
if search_state.metric_for_logging:
mlflow.log_metrics(search_state.metric_for_logging)
mlflow.log_metric("trial_time", search_state.trial_time)
mlflow.log_metric("wall_clock_time", self._state.time_from_start)
mlflow.log_metric("validation_loss", search_state.val_loss)
mlflow.log_params(search_state.config)
mlflow.log_param("learner", estimator)
mlflow.log_param("sample_size", search_state.sample_size)
mlflow.log_metric("best_validation_loss", search_state.best_loss)
mlflow.log_param("best_config", search_state.best_config)
mlflow.log_param("best_learner", self._best_estimator)
mlflow.log_metric(
self._state.metric if isinstance(self._state.metric, str) else self._state.error_metric,
1 - search_state.val_loss
if self._state.error_metric.startswith("1-")
else -search_state.val_loss
if self._state.error_metric.startswith("-")
else search_state.val_loss,
)
def _search_sequential(self):
try:
@@ -2323,9 +2386,18 @@ class AutoML(BaseEstimator):
verbose=max(self.verbose - 3, 0),
use_ray=False,
use_spark=False,
force_cancel=self._force_cancel,
mlflow_exp_name=self._mlflow_exp_name,
automl_info=(0,), # pass automl info to tune.run
extra_tag=self.autolog_extra_tag,
)
time_used = time.time() - start_run_time
better = False
(
logger.debug(f"result in automl: {analysis.trials}, {analysis.trials[-1].last_result}")
if analysis.trials
else logger.debug("result in automl: [], None")
)
if analysis.trials and analysis.trials[-1].last_result:
result = analysis.trials[-1].last_result
search_state.update(result, time_used=time_used)
@@ -2388,6 +2460,8 @@ class AutoML(BaseEstimator):
search_state.trained_estimator.cleanup()
if better or self._log_type == "all":
self._log_trial(search_state, estimator)
if self.mlflow_integration:
self.mlflow_integration.record_state(self, search_state, estimator)
logger.info(
" at {:.1f}s,\testimator {}'s best error={:.4f},\tbest estimator {}'s best error={:.4f}".format(
@@ -2440,7 +2514,7 @@ class AutoML(BaseEstimator):
state.best_config,
self.data_size_full,
)
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
self._retrained_config[best_config_sig] = state.best_config_train_time = retrain_time
est_retrain_time = 0
self._state.time_from_start = time.time() - self._start_time_flag
@@ -2462,8 +2536,8 @@ class AutoML(BaseEstimator):
self._time_taken_best_iter = 0
self._config_history = {}
self._max_iter_per_learner = 10000
self._iter_per_learner = dict([(e, 0) for e in self.estimator_list])
self._iter_per_learner_fullsize = dict([(e, 0) for e in self.estimator_list])
self._iter_per_learner = {e: 0 for e in self.estimator_list}
self._iter_per_learner_fullsize = {e: 0 for e in self.estimator_list}
self._fullsize_reached = False
self._trained_estimator = None
self._best_estimator = None
@@ -2479,6 +2553,21 @@ class AutoML(BaseEstimator):
self._selected = state = self._search_states[estimator]
state.best_config_sample_size = self._state.data_size[0]
state.best_config = state.init_config[0] if state.init_config else {}
self._track_iter = 0
self._config_history[self._track_iter] = (estimator, state.best_config, self._state.time_from_start)
self._best_iteration = self._track_iter
state.val_loss = getattr(state, "val_loss", float("inf"))
state.best_loss = getattr(state, "best_loss", float("inf"))
state.config = getattr(state, "config", state.best_config.copy())
state.metric_for_logging = getattr(state, "metric_for_logging", None)
state.sample_size = getattr(state, "sample_size", self._state.data_size[0])
state.learner_class = getattr(state, "learner_class", self._state.learner_classes.get(estimator))
if hasattr(self, "mlflow_integration") and self.mlflow_integration:
self.mlflow_integration.record_state(
automl=self,
search_state=state,
estimator=estimator,
)
elif self._use_ray is False and self._use_spark is False:
self._search_sequential()
else:
@@ -2488,6 +2577,12 @@ class AutoML(BaseEstimator):
self._training_log.checkpoint()
self._state.time_from_start = time.time() - self._start_time_flag
if self._best_estimator:
if self.mlflow_integration:
self.mlflow_integration.log_automl(self)
if mlflow.active_run() is None:
if self.mlflow_integration.parent_run_id is not None and self.mlflow_integration.autolog:
# ensure result of retrain autolog to parent run
mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
self._selected = self._search_states[self._best_estimator]
self.modelcount = sum(search_state.total_iter for search_state in self._search_states.values())
if self._trained_estimator:
@@ -2624,13 +2719,67 @@ class AutoML(BaseEstimator):
self._best_estimator,
state.best_config,
self.data_size_full,
is_retrain=True,
)
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
state.best_config_train_time = retrain_time
if self._trained_estimator:
logger.info(f"retrained model: {self._trained_estimator.model}")
if self.best_run_id is not None:
logger.info(f"Best MLflow run name: {self.best_run_name}")
logger.info(f"Best MLflow run id: {self.best_run_id}")
if self.mlflow_integration is not None:
# try to log the retrained model
if all(
[
self.mlflow_integration.manual_log,
not self.mlflow_integration.has_model,
self.mlflow_integration.parent_run_id is not None,
]
):
if mlflow.active_run() is None:
mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
if self.best_estimator.endswith("_spark"):
self.mlflow_integration.log_model(
self._trained_estimator.model,
self.best_estimator,
signature=self.estimator_signature,
run_id=self.mlflow_integration.parent_run_id,
)
else:
self.mlflow_integration.pickle_and_log_automl_artifacts(
self,
self.model,
self.best_estimator,
signature=self.pipeline_signature,
run_id=self.mlflow_integration.parent_run_id,
)
else:
logger.info("not retraining because the time budget is too small.")
logger.warning("not retraining because the time budget is too small.")
self.wait_futures()
def wait_futures(self):
if self.mlflow_integration is not None:
logger.debug("Collecting results from submitted record_state tasks")
t1 = time.perf_counter()
for future in as_completed(self.mlflow_integration.futures):
_task = self.mlflow_integration.futures[future]
try:
result = future.result()
logger.debug(f"Result for record_state task {_task}: {result}")
except Exception as e:
logger.warning(f"Exception for record_state task {_task}: {e}")
for future in as_completed(self.mlflow_integration.futures_log_model):
_task = self.mlflow_integration.futures_log_model[future]
try:
result = future.result()
logger.debug(f"Result for log_model task {_task}: {result}")
except Exception as e:
logger.warning(f"Exception for log_model task {_task}: {e}")
t2 = time.perf_counter()
logger.debug(f"Collecting results from tasks submitted to executors costs {t2-t1} seconds.")
else:
logger.debug("No futures to wait for.")
def __del__(self):
if (
@@ -2702,3 +2851,7 @@ class AutoML(BaseEstimator):
q += inv[i] / s
if p < q:
return estimator_list[i]
@property
def automl_pipeline(self):
return None
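The `wait_futures` method above drains two dictionaries of futures (`futures` and `futures_log_model`) with `concurrent.futures.as_completed`, logging each task's result or exception. A minimal standalone sketch of that pattern, with hypothetical task names rather than FLAML's exact code:

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("wait_futures_demo")


def record_state(i):
    # stand-in for the asynchronous mlflow work submitted during the search
    time.sleep(0.1)
    return f"logged trial {i}"


with ThreadPoolExecutor(max_workers=4) as executor:
    # mirrors the futures dict: future -> human-readable task description
    futures = {executor.submit(record_state, i): f"record_state-{i}" for i in range(5)}
    t1 = time.perf_counter()
    for future in as_completed(futures):
        task = futures[future]
        try:
            log.debug("Result for %s: %s", task, future.result())
        except Exception as e:
            # surface the failure without aborting collection, as wait_futures does
            log.warning("Exception for %s: %s", task, e)
    log.debug("Collecting results took %.3f seconds", time.perf_counter() - t1)
```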

View File

@@ -1,7 +1,7 @@
try:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
except ImportError:
pass
except ImportError as e:
print(f"scikit-learn is required for HistGradientBoostingEstimator. Please install it; error: {e}")
from flaml import tune
from flaml.automl.model import SKLearnEstimator

View File

@@ -2,13 +2,17 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import json
import os
from datetime import datetime
import random
import uuid
from datetime import datetime, timedelta
from decimal import ROUND_HALF_UP, Decimal
from typing import TYPE_CHECKING, Union
import numpy as np
from flaml.automl.spark import DataFrame, Series, pd, ps, psDataFrame, psSeries
from flaml.automl.spark import DataFrame, F, Series, T, pd, ps, psDataFrame, psSeries
from flaml.automl.training_log import training_log_reader
try:
@@ -19,6 +23,7 @@ except ImportError:
if TYPE_CHECKING:
from flaml.automl.task import Task
TS_TIMESTAMP_COL = "ds"
TS_VALUE_COL = "y"
@@ -293,7 +298,7 @@ class DataTransformer:
y = y.rename(TS_VALUE_COL)
for column in X.columns:
# sklearn\utils\validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if X[column].nunique() == 1 or X[column].nunique(dropna=True) == n - X[column].isnull().sum():
X.drop(columns=column, inplace=True)
drop = True
@@ -445,3 +450,331 @@ class DataTransformer:
def group_counts(groups):
_, i, c = np.unique(groups, return_counts=True, return_index=True)
return c[np.argsort(i)]
def get_random_dataframe(n_rows: int = 200, ratio_none: float = 0.1, seed: int = 42) -> DataFrame:
"""Generate a random pandas DataFrame with various data types for testing.
This function creates a DataFrame with multiple column types including:
- Timestamps
- Integers
- Floats
- Categorical values
- Booleans
- Lists (tags)
- Decimal strings
- UUIDs
- Binary data (as hex strings)
- JSON blobs
- Nullable text fields
Parameters
----------
n_rows : int, default=200
Number of rows in the generated DataFrame
ratio_none : float, default=0.1
Probability of generating None values in applicable columns
seed : int, default=42
Random seed for reproducibility
Returns
-------
pd.DataFrame
A DataFrame with 14 columns of various data types
Examples
--------
>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp datetime64[ns]
id int64
score float64
status object
flag object
count object
value object
tags object
rating object
uuid object
binary object
json_blob object
category category
nullable_text object
dtype: object
"""
np.random.seed(seed)
random.seed(seed)
def random_tags():
tags = ["AI", "ML", "data", "robotics", "vision"]
return random.sample(tags, k=random.randint(1, 3)) if random.random() > ratio_none else None
def random_decimal():
return (
str(Decimal(random.uniform(1, 5)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
if random.random() > ratio_none
else None
)
def random_json_blob():
blob = {"a": random.randint(1, 10), "b": random.random()}
return json.dumps(blob) if random.random() > ratio_none else None
def random_binary():
return bytes(random.randint(0, 255) for _ in range(4)).hex() if random.random() > ratio_none else None
data = {
"timestamp": [
datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1000)) if np.random.rand() > ratio_none else None
for _ in range(n_rows)
],
"id": range(1, n_rows + 1),
"score": np.random.uniform(0, 100, n_rows),
"status": np.random.choice(
["active", "inactive", "pending", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
),
"flag": np.random.choice(
[True, False, None], size=n_rows, p=[(1 - ratio_none) / 2, (1 - ratio_none) / 2, ratio_none]
),
"count": [np.random.randint(0, 100) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"value": [round(np.random.normal(50, 15), 2) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"tags": [random_tags() for _ in range(n_rows)],
"rating": [random_decimal() for _ in range(n_rows)],
"uuid": [str(uuid.uuid4()) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"binary": [random_binary() for _ in range(n_rows)],
"json_blob": [random_json_blob() for _ in range(n_rows)],
"category": pd.Categorical(
np.random.choice(
["A", "B", "C", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
)
),
"nullable_text": [random.choice(["Good", "Bad", "Average", None]) for _ in range(n_rows)],
}
return pd.DataFrame(data)
def auto_convert_dtypes_spark(
df: psDataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 0.1,
) -> tuple[psDataFrame, dict]:
"""Automatically convert data types in a PySpark DataFrame using heuristics.
This function analyzes a sample of the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, numeric values, booleans,
and categorical fields.
Args:
df: A PySpark DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 0.1.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
Note:
- 'category' in the schema dict is conceptual as PySpark doesn't have a true
category type like pandas
- The function uses sampling for efficiency with large datasets
"""
n_rows = df.count()
if na_values is None:
na_values = ["NA", "na", "NULL", "null", ""]
# Normalize NA-like values
for colname, coltype in df.dtypes:
if coltype == "string":
df = df.withColumn(
colname,
F.when(F.trim(F.lower(F.col(colname))).isin([v.lower() for v in na_values]), None).otherwise(
F.col(colname)
),
)
schema = {}
for colname in df.columns:
# Sample once at an appropriate ratio
sample_ratio_to_use = min(1.0, sample_ratio if n_rows * sample_ratio > 100 else 100 / n_rows)
col_sample = df.select(colname).sample(withReplacement=False, fraction=sample_ratio_to_use).dropna()
sample_count = col_sample.count()
inferred_type = "string" # Default
if col_sample.dtypes[0][1] != "string":
schema[colname] = col_sample.dtypes[0][1]
continue
if sample_count == 0:
schema[colname] = "string"
continue
# Check if timestamp
ts_col = col_sample.withColumn("parsed", F.to_timestamp(F.col(colname)))
# Check numeric
if (
col_sample.withColumn("n", F.col(colname).cast("double")).filter("n is not null").count()
>= sample_count * convert_threshold
):
# All whole numbers?
all_whole = (
col_sample.withColumn("n", F.col(colname).cast("double"))
.filter("n is not null")
.withColumn("frac", F.abs(F.col("n") % 1))
.filter("frac > 0.000001")
.count()
== 0
)
inferred_type = "int" if all_whole else "double"
# Check low-cardinality (category-like)
elif (
sample_count > 0
and col_sample.select(F.countDistinct(F.col(colname))).collect()[0][0] / sample_count <= category_threshold
):
inferred_type = "category" # Will just be string, but marked as such
# Check if timestamp
elif ts_col.filter(F.col("parsed").isNotNull()).count() >= sample_count * convert_threshold:
inferred_type = "timestamp"
schema[colname] = inferred_type
# Apply inferred schema
for colname, inferred_type in schema.items():
if inferred_type == "int":
df = df.withColumn(colname, F.col(colname).cast(T.IntegerType()))
elif inferred_type == "double":
df = df.withColumn(colname, F.col(colname).cast(T.DoubleType()))
elif inferred_type == "boolean":
df = df.withColumn(
colname,
F.when(F.lower(F.col(colname)).isin("true", "yes", "1"), True)
.when(F.lower(F.col(colname)).isin("false", "no", "0"), False)
.otherwise(None),
)
elif inferred_type == "timestamp":
df = df.withColumn(colname, F.to_timestamp(F.col(colname)))
elif inferred_type == "category":
df = df.withColumn(colname, F.col(colname).cast(T.StringType())) # Marked conceptually
# otherwise keep as string (or original type)
return df, schema
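A hedged usage sketch for the Spark converter above. It assumes a local SparkSession is available and that the function is importable from flaml.automl.data (the module this diff appears to extend); the schema comment shows what the heuristics would typically infer for this toy frame, not a guarantee:

```python
from pyspark.sql import SparkSession

from flaml.automl.data import auto_convert_dtypes_spark  # assumed import path

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame(
    [("1", "2020-01-01", "NA"), ("2", "2020-01-02", "active"), ("3", "2020-01-03", "active")],
    ["id", "ts", "status"],
)
converted, schema = auto_convert_dtypes_spark(sdf, sample_ratio=1.0)
print(schema)            # e.g. {'id': 'int', 'ts': 'timestamp', 'status': 'string'}, depending on thresholds
converted.printSchema()
```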
def auto_convert_dtypes_pandas(
df: DataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 1.0,
) -> tuple[DataFrame, dict]:
"""Automatically convert data types in a pandas DataFrame using heuristics.
This function analyzes the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, timedeltas, numeric values,
and categorical fields.
Args:
df: A pandas DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 1.0 (use all rows);
included mainly for API compatibility with the Spark version.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
"""
if na_values is None:
na_values = {"NA", "na", "NULL", "null", ""}
df_converted = df.convert_dtypes()
schema = {}
# Sample if needed (for API compatibility)
if sample_ratio < 1.0:
df = df.sample(frac=sample_ratio)
n_rows = len(df)
for col in df.columns:
series = df[col]
# Replace NA-like values if string
series_cleaned = series.map(lambda x: np.nan if isinstance(x, str) and x.strip() in na_values else x)
# Skip conversion if the dtype is already non-object, except boolean/string extension dtypes, which may still be converted (e.g., to category)
if (
not isinstance(series_cleaned.dtype, pd.BooleanDtype)
and not isinstance(series_cleaned.dtype, pd.StringDtype)
and series_cleaned.dtype != "object"
):
# Keep the original data type for non-object dtypes
df_converted[col] = series
schema[col] = str(series_cleaned.dtype)
continue
# print(f"type: {series_cleaned.dtype}, column: {series_cleaned.name}")
if not isinstance(series_cleaned.dtype, pd.BooleanDtype):
# Try numeric (int or float)
numeric = pd.to_numeric(series_cleaned, errors="coerce")
if numeric.notna().sum() >= n_rows * convert_threshold:
if (numeric.dropna() % 1 == 0).all():
try:
df_converted[col] = numeric.astype("int")  # plain int; raises on NaN and falls back to the double cast below
schema[col] = "int"
continue
except Exception:
pass
df_converted[col] = numeric.astype("double")
schema[col] = "double"
continue
# Try datetime
datetime_converted = pd.to_datetime(series_cleaned, errors="coerce")
if datetime_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = datetime_converted
schema[col] = "timestamp"
continue
# Try timedelta
try:
timedelta_converted = pd.to_timedelta(series_cleaned, errors="coerce")
if timedelta_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = timedelta_converted
schema[col] = "timedelta"
continue
except TypeError:
pass
# Try category
try:
unique_ratio = series_cleaned.nunique(dropna=True) / n_rows if n_rows > 0 else 1.0
if unique_ratio <= category_threshold:
df_converted[col] = series_cleaned.astype("category")
schema[col] = "category"
continue
except Exception:
pass
df_converted[col] = series_cleaned.astype("string")
schema[col] = "string"
return df_converted, schema
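A quick end-to-end sketch combining the two pandas helpers above, assuming both are importable from flaml.automl.data (the module this diff appears to extend):

```python
from flaml.automl.data import auto_convert_dtypes_pandas, get_random_dataframe  # assumed import path

df = get_random_dataframe(n_rows=100, ratio_none=0.05, seed=7)
converted, schema = auto_convert_dtypes_pandas(df, category_threshold=0.3, convert_threshold=0.6)
print(schema)            # column name -> inferred type string ('int', 'double', 'timestamp', 'category', ...)
print(converted.dtypes)  # columns that already had a concrete dtype keep it
```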

View File

@@ -1,7 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
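A small sketch of attaching the colored formatter above to a handler. The import path is an assumption based on where this logger module appears to live; setting FLAML_LOG_NO_COLOR before flaml is imported disables the coloring:

```python
import logging

from flaml.automl.logger import ColoredFormatter  # assumed import path

handler = logging.StreamHandler()
handler.setFormatter(
    ColoredFormatter(
        "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
        "%m-%d %H:%M:%S",
        use_color=True,
    )
)
demo_logger = logging.getLogger("color_demo")
demo_logger.addHandler(handler)
demo_logger.warning("rendered in yellow on ANSI-capable terminals")
```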

View File

@@ -13,6 +13,7 @@ from flaml.automl.model import BaseEstimator, TransformersEstimator
from flaml.automl.spark import ERROR as SPARK_ERROR
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.task.task import Task
from flaml.automl.time_series import TimeSeriesDataset
try:
from sklearn.metrics import (
@@ -33,7 +34,6 @@ except ImportError:
if SPARK_ERROR is None:
from flaml.automl.spark.metrics import spark_metric_loss_score
from flaml.automl.time_series import TimeSeriesDataset
logger = logging.getLogger(__name__)
@@ -89,6 +89,11 @@ huggingface_metric_to_mode = {
"wer": "min",
}
huggingface_submetric_to_metric = {"rouge1": "rouge", "rouge2": "rouge"}
spark_metric_name_dict = {
"Regression": ["r2", "rmse", "mse", "mae", "var"],
"Binary Classification": ["pr_auc", "roc_auc"],
"Multi-class Classification": ["accuracy", "log_loss", "f1", "micro_f1", "macro_f1"],
}
def metric_loss_score(
@@ -122,7 +127,7 @@ def metric_loss_score(
import datasets
datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
metric = datasets.load_metric(datasets_metric_name)
metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
metric_mode = huggingface_metric_to_mode[datasets_metric_name]
if metric_name.startswith("seqeval"):
@@ -334,6 +339,14 @@ def compute_estimator(
if fit_kwargs is None:
fit_kwargs = {}
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
@@ -401,12 +414,21 @@ def train_estimator(
free_mem_ratio=0,
) -> Tuple[EstimatorSubclass, float]:
start_time = time.time()
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
task=task,
n_jobs=n_jobs,
)
if fit_kwargs is None:
fit_kwargs = {}

File diff suppressed because it is too large

View File

@@ -32,7 +32,7 @@ class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = list(chain(*flattened_features))
batch = super(DataCollatorForMultipleChoiceClassification, self).__call__(flattened_features)
batch = super().__call__(flattened_features)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels

View File

@@ -245,7 +245,7 @@ def tokenize_row(
return_column_name=False,
):
if prefix:
this_row = tuple(["".join(x) for x in zip(prefix, this_row)])
this_row = tuple("".join(x) for x in zip(prefix, this_row))
# tokenizer.pad_token = tokenizer.eos_token
tokenized_example = tokenizer(

View File

@@ -32,7 +32,7 @@ def is_a_list_of_str(this_obj):
def _clean_value(value: Any) -> str:
if isinstance(value, float):
return "{:.5}".format(value)
return f"{value:.5}"
else:
return str(value).replace("/", "_")
@@ -86,7 +86,7 @@ class Counter:
@staticmethod
def get_trial_fold_name(local_dir, trial_config, trial_id):
Counter.counter += 1
experiment_tag = "{0}_{1}".format(str(Counter.counter), format_vars(trial_config))
experiment_tag = f"{str(Counter.counter)}_{format_vars(trial_config)}"
logdir = get_logdir_name(_generate_dirname(experiment_tag, trial_id=trial_id), local_dir)
return logdir

View File

@@ -1,97 +0,0 @@
ParamList_LightGBM_Base = [
"baggingFraction",
"baggingFreq",
"baggingSeed",
"binSampleCount",
"boostFromAverage",
"boostingType",
"catSmooth",
"categoricalSlotIndexes",
"categoricalSlotNames",
"catl2",
"chunkSize",
"dataRandomSeed",
"defaultListenPort",
"deterministic",
"driverListenPort",
"dropRate",
"dropSeed",
"earlyStoppingRound",
"executionMode",
"extraSeed" "featureFraction",
"featureFractionByNode",
"featureFractionSeed",
"featuresCol",
"featuresShapCol",
"fobj" "improvementTolerance",
"initScoreCol",
"isEnableSparse",
"isProvideTrainingMetric",
"labelCol",
"lambdaL1",
"lambdaL2",
"leafPredictionCol",
"learningRate",
"matrixType",
"maxBin",
"maxBinByFeature",
"maxCatThreshold",
"maxCatToOnehot",
"maxDeltaStep",
"maxDepth",
"maxDrop",
"metric",
"microBatchSize",
"minDataInLeaf",
"minDataPerBin",
"minDataPerGroup",
"minGainToSplit",
"minSumHessianInLeaf",
"modelString",
"monotoneConstraints",
"monotoneConstraintsMethod",
"monotonePenalty",
"negBaggingFraction",
"numBatches",
"numIterations",
"numLeaves",
"numTasks",
"numThreads",
"objectiveSeed",
"otherRate",
"parallelism",
"passThroughArgs",
"posBaggingFraction",
"predictDisableShapeCheck",
"predictionCol",
"repartitionByGroupingColumn",
"seed",
"skipDrop",
"slotNames",
"timeout",
"topK",
"topRate",
"uniformDrop",
"useBarrierExecutionMode",
"useMissing",
"useSingleDatasetMode",
"validationIndicatorCol",
"verbosity",
"weightCol",
"xGBoostDartMode",
"zeroAsMissing",
"objective",
]
ParamList_LightGBM_Classifier = ParamList_LightGBM_Base + [
"isUnbalance",
"probabilityCol",
"rawPredictionCol",
"thresholds",
]
ParamList_LightGBM_Regressor = ParamList_LightGBM_Base + ["tweedieVariancePower"]
ParamList_LightGBM_Ranker = ParamList_LightGBM_Base + [
"groupCol",
"evalAt",
"labelGain",
"maxPosition",
]

View File

@@ -1,3 +1,4 @@
import json
from typing import Union
import numpy as np
@@ -9,7 +10,7 @@ from pyspark.ml.evaluation import (
RegressionEvaluator,
)
from flaml.automl.spark import F, psSeries
from flaml.automl.spark import F, T, psDataFrame, psSeries, sparkDataFrame
def ps_group_counts(groups: Union[psSeries, np.ndarray]) -> np.ndarray:
@@ -36,6 +37,16 @@ def _compute_label_from_probability(df, probability_col, prediction_col):
return df
def string_to_array(s):
try:
return json.loads(s)
except json.JSONDecodeError:
return []
string_to_array_udf = F.udf(string_to_array, T.ArrayType(T.DoubleType()))
def spark_metric_loss_score(
metric_name: str,
y_predict: psSeries,
@@ -135,6 +146,11 @@ def spark_metric_loss_score(
)
elif metric_name == "log_loss":
# For log_loss, prediction_col should be probability, and we need to convert it to label
# handle data like "{'type': '1', 'values': '[1, 2, 3]'}"
# Fix cannot resolve "array_max(prediction)" due to data type mismatch: Parameter 1 requires the "ARRAY" type,
# however "prediction" has the type "STRUCT<type: TINYINT, size: INT, indices: ARRAY<INT>, values: ARRAY<DOUBLE>>"
df = df.withColumn(prediction_col, df[prediction_col].cast(T.StringType()))
df = df.withColumn(prediction_col, string_to_array_udf(df[prediction_col]))
df = _compute_label_from_probability(df, prediction_col, prediction_col + "_label")
evaluator = MulticlassClassificationEvaluator(
metricName="logLoss",

View File

@@ -65,6 +65,7 @@ class SearchState:
custom_hp=None,
max_iter=None,
budget=None,
featurization="auto",
):
self.init_eci = learner_class.cost_relative2lgbm() if budget >= 0 else 1
self._search_space_domain = {}
@@ -82,6 +83,7 @@ class SearchState:
else:
data_size = data.shape
search_space = learner_class.search_space(data_size=data_size, task=task)
self.data_size = data_size
if custom_hp is not None:
@@ -91,9 +93,7 @@ class SearchState:
starting_point = AutoMLState.sanitize(starting_point)
if max_iter > 1 and not self.valid_starting_point(starting_point, search_space):
# If the number of iterations is larger than 1, remove invalid point
logger.warning(
"Starting point {} removed because it is outside of the search space".format(starting_point)
)
logger.warning(f"Starting point {starting_point} removed because it is outside of the search space")
starting_point = None
elif isinstance(starting_point, list):
starting_point = [AutoMLState.sanitize(x) for x in starting_point]
@@ -208,7 +208,7 @@ class SearchState:
self.val_loss, self.config = obj, config
def get_hist_config_sig(self, sample_size, config):
config_values = tuple([config[k] for k in self._hp_names if k in config])
config_values = tuple(config[k] for k in self._hp_names if k in config)
config_sig = str(sample_size) + "_" + str(config_values)
return config_sig
@@ -290,9 +290,11 @@ class AutoMLState:
budget = (
None
if state.time_budget < 0
else state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
else (
state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
)
)
(
@@ -353,6 +355,7 @@ class AutoMLState:
estimator: str,
config_w_resource: dict,
sample_size: Optional[int] = None,
is_retrain: bool = False,
):
if not sample_size:
sample_size = config_w_resource.get("FLAML_sample_size", len(self.y_train_all))
@@ -378,9 +381,8 @@ class AutoMLState:
this_estimator_kwargs[
"groups"
] = groups # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
this_estimator_kwargs.update({"is_retrain": is_retrain})
budget = None if self.time_budget < 0 else self.time_budget - self.time_from_start
estimator, train_time = train_estimator(
X_train=sampled_X_train,
y_train=sampled_y_train,

View File

@@ -16,12 +16,7 @@ from flaml.automl.spark.utils import (
unique_pandas_on_spark,
unique_value_first_index,
)
from flaml.automl.task.task import (
TS_FORECAST,
TS_FORECASTPANEL,
Task,
get_classification_objective,
)
from flaml.automl.task.task import TS_FORECAST, TS_FORECASTPANEL, Task, get_classification_objective
from flaml.config import RANDOM_SEED
try:
@@ -53,13 +48,24 @@ class GenericTask(Task):
from flaml.automl.contrib.histgb import HistGradientBoostingEstimator
from flaml.automl.model import (
CatBoostEstimator,
ElasticNetEstimator,
ExtraTreesEstimator,
KNeighborsEstimator,
LassoLarsEstimator,
LGBMEstimator,
LRL1Classifier,
LRL2Classifier,
RandomForestEstimator,
SGDEstimator,
SparkAFTSurvivalRegressionEstimator,
SparkGBTEstimator,
SparkGLREstimator,
SparkLGBMEstimator,
SparkLinearRegressionEstimator,
SparkLinearSVCEstimator,
SparkNaiveBayesEstimator,
SparkRandomForestEstimator,
SVCEstimator,
TransformersEstimator,
TransformersEstimatorModelSelection,
XGBoostLimitDepthEstimator,
@@ -72,6 +78,7 @@ class GenericTask(Task):
"rf": RandomForestEstimator,
"lgbm": LGBMEstimator,
"lgbm_spark": SparkLGBMEstimator,
"rf_spark": SparkRandomForestEstimator,
"lrl1": LRL1Classifier,
"lrl2": LRL2Classifier,
"catboost": CatBoostEstimator,
@@ -80,6 +87,16 @@ class GenericTask(Task):
"transformer": TransformersEstimator,
"transformer_ms": TransformersEstimatorModelSelection,
"histgb": HistGradientBoostingEstimator,
"svc": SVCEstimator,
"sgd": SGDEstimator,
"nb_spark": SparkNaiveBayesEstimator,
"enet": ElasticNetEstimator,
"lassolars": LassoLarsEstimator,
"glr_spark": SparkGLREstimator,
"lr_spark": SparkLinearRegressionEstimator,
"svc_spark": SparkLinearSVCEstimator,
"gbt_spark": SparkGBTEstimator,
"aft_spark": SparkAFTSurvivalRegressionEstimator,
}
return self._estimators
@@ -271,8 +288,8 @@ class GenericTask(Task):
seed=RANDOM_SEED,
)
columns_to_drop = [c for c in df_all_train.columns if c in [stratify_column, "sample_weight"]]
X_train = df_all_train.drop(columns_to_drop)
X_val = df_all_val.drop(columns_to_drop)
X_train = df_all_train.drop(columns=columns_to_drop)
X_val = df_all_val.drop(columns=columns_to_drop)
y_train = df_all_train[stratify_column]
y_val = df_all_val[stratify_column]
@@ -425,8 +442,8 @@ class GenericTask(Task):
X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
if data_is_df:
X_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
X_train, y_train = X_train_all, y_train_all
state.groups_all = state.groups
@@ -497,14 +514,37 @@ class GenericTask(Task):
last = first[i] + 1
rest.extend(range(last, len(y_train_all)))
X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
X_rest = X_train_all.iloc[rest] if data_is_df else X_train_all[rest]
y_rest = (
y_train_all[rest]
if isinstance(y_train_all, np.ndarray)
else iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
)
if len(first) < len(y_train_all) / 2:
# Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
X_rest = (
np.delete(X_train_all, first, axis=0)
if isinstance(X_train_all, np.ndarray)
else X_train_all.drop(first.tolist())
if data_is_df
else X_train_all[rest]
)
y_rest = (
np.delete(y_train_all, first, axis=0)
if isinstance(y_train_all, np.ndarray)
else y_train_all.drop(first.tolist())
if data_is_df
else y_train_all[rest]
)
else:
X_rest = (
iloc_pandas_on_spark(X_train_all, rest)
if is_spark_dataframe
else X_train_all.iloc[rest]
if data_is_df
else X_train_all[rest]
)
y_rest = (
iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
if data_is_df
else y_train_all[rest]
)
stratify = y_rest if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_rest, y_rest, first, rest, split_ratio, stratify
@@ -513,6 +553,12 @@ class GenericTask(Task):
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
y_train = y_train[y_train.columns[0]]
y_val = y_val[y_val.columns[0]]
y_train.name = y_val.name = y_rest.name
elif self.is_regression():
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_train_all, y_train_all, split_ratio=split_ratio
@@ -659,7 +705,6 @@ class GenericTask(Task):
fit_kwargs = {}
if cv_score_agg_func is None:
cv_score_agg_func = default_cv_score_agg_func
start_time = time.time()
val_loss_folds = []
log_metric_folds = []
metric = None
@@ -701,7 +746,10 @@ class GenericTask(Task):
elif isinstance(kf, TimeSeriesSplit):
kf = kf.split(X_train_split, y_train_split)
else:
kf = kf.split(X_train_split)
try:
kf = kf.split(X_train_split)
except TypeError:
kf = kf.split(X_train_split, y_train_split)
for train_index, val_index in kf:
if shuffle:
@@ -724,10 +772,10 @@ class GenericTask(Task):
if not is_spark_dataframe:
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
if weight is not None:
fit_kwargs["sample_weight"], weight_val = (
weight[train_index],
weight[val_index],
fit_kwargs["sample_weight"] = (
weight[train_index] if isinstance(weight, np.ndarray) else weight.iloc[train_index]
)
weight_val = weight[val_index] if isinstance(weight, np.ndarray) else weight.iloc[val_index]
if groups is not None:
fit_kwargs["groups"] = (
groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]
@@ -766,8 +814,6 @@ class GenericTask(Task):
if is_spark_dataframe:
X_train.spark.unpersist() # uncache data to free memory
X_val.spark.unpersist() # uncache data to free memory
if budget and time.time() - start_time >= budget:
break
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
@@ -810,27 +856,23 @@ class GenericTask(Task):
elif self.is_ts_forecastpanel():
estimator_list = ["tft"]
else:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
"rf_spark",
"sgd",
]
try:
import catboost
estimator_list = [
"lgbm",
"rf",
"catboost",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
estimator_list += ["catboost"]
except ImportError:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
pass
# if self.is_ts_forecast():
# # catboost is removed because it has a `name` parameter, making it incompatible with hcrystalball
# if "catboost" in estimator_list:
@@ -862,9 +904,7 @@ class GenericTask(Task):
return metric
if self.is_nlp():
from flaml.automl.nlp.utils import (
load_default_huggingface_metric_for_task,
)
from flaml.automl.nlp.utils import load_default_huggingface_metric_for_task
return load_default_huggingface_metric_for_task(self.name)
elif self.is_binary():
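The try/except added around `kf.split(X_train_split)` earlier in this file covers custom splitters whose `split` signature requires the labels. A hedged sketch of such a splitter (names are illustrative, not part of FLAML); FLAML's fit accepts a splitter instance via `split_type`, which is how the fallback path would be exercised:

```python
from sklearn.model_selection import StratifiedKFold


class LabelAwareKFold:
    """Illustrative splitter whose split() requires y, triggering the TypeError fallback."""

    def __init__(self, n_splits=5):
        self._kf = StratifiedKFold(n_splits=n_splits)

    def split(self, X, y):  # y is mandatory here, unlike sklearn.model_selection.KFold.split
        return self._kf.split(X, y)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self._kf.get_n_splits()


# hypothetical usage:
# automl.fit(X_train, y_train, task="classification", eval_method="cv", split_type=LabelAwareKFold(5))
```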

View File

@@ -192,7 +192,7 @@ class Task(ABC):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.

View File

@@ -36,11 +36,17 @@ class TimeSeriesTask(Task):
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TCNEstimator,
TemporalFusionTransformerEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
@@ -57,8 +63,19 @@ class TimeSeriesTask(Task):
"holt-winters": HoltWinters,
"catboost": CatBoost_TS,
"tft": TemporalFusionTransformerEstimator,
"lassolars": LassoLars_TS,
"tcn": TCNEstimator,
"snaive": SeasonalNaive,
"naive": Naive,
"savg": SeasonalAverage,
"avg": Average,
}
if self._estimators["tcn"] is None:
# remove TCN if import failed
del self._estimators["tcn"]
logger.info("Couldn't import pytorch_lightning, skipping TCN estimator")
try:
from prophet import Prophet as foo
@@ -71,7 +88,7 @@ class TimeSeriesTask(Task):
self._estimators["orbit"] = Orbit
except ImportError:
logger.info("Couldn't import Prophet, skipping")
logger.info("Couldn't import orbit, skipping")
return self._estimators

View File

@@ -1,16 +1,27 @@
from .tft import TemporalFusionTransformerEstimator
from .ts_data import TimeSeriesDataset
from .ts_model import (
ARIMA,
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TimeSeriesEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
)
try:
from .tcn import TCNEstimator
except ImportError:
TCNEstimator = None
from .ts_data import TimeSeriesDataset

View File

@@ -0,0 +1,285 @@
# This file is adapted from
# https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
# https://github.com/locuslab/TCN/blob/master/TCN/adding_problem/add_test.py
import datetime
import logging
import time
import pandas as pd
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.nn.utils import weight_norm
from torch.utils.data import DataLoader, TensorDataset
from flaml import tune
from flaml.automl.data import add_time_idx_col
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.time_series.ts_model import TimeSeriesEstimator
class Chomp1d(nn.Module):
def __init__(self, chomp_size):
super().__init__()
self.chomp_size = chomp_size
def forward(self, x):
return x[:, :, : -self.chomp_size].contiguous()
class TemporalBlock(nn.Module):
def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
super().__init__()
self.conv1 = weight_norm(
nn.Conv1d(n_inputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout(dropout)
self.conv2 = weight_norm(
nn.Conv1d(n_outputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp2 = Chomp1d(padding)
self.relu2 = nn.ReLU()
self.dropout2 = nn.Dropout(dropout)
self.net = nn.Sequential(
self.conv1, self.chomp1, self.relu1, self.dropout1, self.conv2, self.chomp2, self.relu2, self.dropout2
)
self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
self.relu = nn.ReLU()
self.init_weights()
def init_weights(self):
self.conv1.weight.data.normal_(0, 0.01)
self.conv2.weight.data.normal_(0, 0.01)
if self.downsample is not None:
self.downsample.weight.data.normal_(0, 0.01)
def forward(self, x):
out = self.net(x)
res = x if self.downsample is None else self.downsample(x)
return self.relu(out + res)
class TCNForecaster(nn.Module):
def __init__(
self,
input_feature_num,
num_outputs,
num_channels,
kernel_size=2,
dropout=0.2,
):
super().__init__()
layers = []
num_levels = len(num_channels)
for i in range(num_levels):
dilation_size = 2**i
in_channels = input_feature_num if i == 0 else num_channels[i - 1]
out_channels = num_channels[i]
layers += [
TemporalBlock(
in_channels,
out_channels,
kernel_size,
stride=1,
dilation=dilation_size,
padding=(kernel_size - 1) * dilation_size,
dropout=dropout,
)
]
self.network = nn.Sequential(*layers)
self.linear = nn.Linear(num_channels[-1], num_outputs)
def forward(self, x):
y1 = self.network(x)
return self.linear(y1[:, :, -1])
class TCNForecasterLightningModule(pl.LightningModule):
def __init__(self, model: TCNForecaster, learning_rate: float = 1e-3):
super().__init__()
self.model = model
self.learning_rate = learning_rate
self.loss_fn = nn.MSELoss()
def forward(self, x):
return self.model(x)
def step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = self.loss_fn(y_hat, y)
return loss
def training_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("train_loss", loss)
return loss
def validation_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("val_loss", loss)
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
class DataframeDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, target_column, features_columns, sequence_length, train=True):
self.data = torch.tensor(dataframe[features_columns].to_numpy(), dtype=torch.float)
self.sequence_length = sequence_length
if train:
self.labels = torch.tensor(dataframe[target_column].to_numpy(), dtype=torch.float)
self.is_train = train
def __len__(self):
return len(self.data) - self.sequence_length + 1
def __getitem__(self, idx):
data = self.data[idx : idx + self.sequence_length]
data = data.permute(1, 0)
if self.is_train:
label = self.labels[idx : idx + self.sequence_length]
return data, label
else:
return data
class TCNEstimator(TimeSeriesEstimator):
"""The class for tuning TCN Forecaster"""
@classmethod
def search_space(cls, data, task, pred_horizon, **params):
space = {
"num_levels": {
"domain": tune.randint(lower=4, upper=20), # hidden = 2^num_hidden
"init_value": 4,
},
"num_hidden": {
"domain": tune.randint(lower=4, upper=8), # hidden = 2^num_hidden
"init_value": 5,
},
"kernel_size": {
"domain": tune.choice([2, 3, 5, 7]), # common choices for kernel size
"init_value": 3,
},
"dropout": {
"domain": tune.uniform(lower=0.0, upper=0.5), # standard range for dropout
"init_value": 0.1,
},
"learning_rate": {
"domain": tune.loguniform(lower=1e-4, upper=1e-1), # typical range for learning rate
"init_value": 1e-3,
},
}
return space
def __init__(self, task="ts_forecast", n_jobs=1, **params):
super().__init__(task, **params)
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
start_time = time.time()
if budget is not None:
deltabudget = datetime.timedelta(seconds=budget)
else:
deltabudget = None
X_train = self.enrich(X_train)
super().fit(X_train, y_train, budget, **kwargs)
self.batch_size = kwargs.get("batch_size", 64)
self.horizon = kwargs.get("period", 1)
self.feature_cols = X_train.time_varying_known_reals
self.target_col = X_train.target_names[0]
train_dataset = DataframeDataset(
X_train.train_data,
self.target_col,
self.feature_cols,
self.horizon,
)
train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=False)
if not X_train.test_data.empty:
val_dataset = DataframeDataset(
X_train.test_data,
self.target_col,
self.feature_cols,
self.horizon,
)
else:
val_dataset = DataframeDataset(
X_train.train_data.sample(frac=0.2, random_state=kwargs.get("random_state", 0)),
self.target_col,
self.feature_cols,
self.horizon,
)
val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)
model = TCNForecaster(
len(self.feature_cols),
self.horizon,
[2 ** self.params["num_hidden"]] * self.params["num_levels"],
self.params["kernel_size"],
self.params["dropout"],
)
pl_module = TCNForecasterLightningModule(model, self.params["learning_rate"])
# Training loop
# gpus is deprecated in v1.7 and removed in v2.0
# accelerator="auto" covers all cases (CPU or GPU).
trainer = pl.Trainer(
max_epochs=kwargs.get("max_epochs", 10),
accelerator="auto",
callbacks=[
EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"),
LearningRateMonitor(),
],
logger=TensorBoardLogger(kwargs.get("log_dir", "logs/lightning_logs")), # logging results to a tensorboard
max_time=deltabudget,
enable_model_summary=False,
enable_progress_bar=False,
)
trainer.fit(
pl_module,
train_dataloaders=train_loader,
val_dataloaders=val_loader,
)
best_model = trainer.model
self._model = best_model
train_time = time.time() - start_time
return train_time
def predict(self, X):
X = self.enrich(X)
if isinstance(X, TimeSeriesDataset):
df = X.X_val
else:
df = X
dataset = DataframeDataset(
df,
self.target_col,
self.feature_cols,
self.horizon,
train=False,
)
data_loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)
self._model.eval()
raw_preds = []
for batch_x in data_loader:
raw_pred = self._model(batch_x)
raw_preds.append(raw_pred)
raw_preds = torch.cat(raw_preds, dim=0)
preds = pd.Series(raw_preds.detach().numpy().ravel())
return preds
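To make the tensor shapes above concrete, here is a minimal forward pass through the TCN backbone defined in this file, independent of the AutoML wrapper. The import path and sizes are assumptions for illustration:

```python
import torch

from flaml.automl.time_series.tcn import TCNForecaster  # assumed import path

model = TCNForecaster(
    input_feature_num=5,        # time-varying features per step
    num_outputs=1,              # the estimator above sets this to the forecast horizon
    num_channels=[32, 32, 32],  # three TemporalBlocks with dilations 1, 2, 4
    kernel_size=3,
    dropout=0.1,
)
x = torch.randn(8, 5, 30)       # (batch, features, sequence_length), as DataframeDataset emits after permute
with torch.no_grad():
    y_hat = model(x)            # prediction taken from the last time step -> shape (8, 1)
print(y_hat.shape)
```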

View File

@@ -393,7 +393,7 @@ class DataTransformerTS:
for column in X.columns:
# sklearn/utils/validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if (
# drop columns where all values are the same
X[column].nunique() == 1

View File

@@ -26,6 +26,7 @@ from flaml.automl.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.model import (
CatBoostEstimator,
ExtraTreesEstimator,
LassoLarsEstimator,
LGBMEstimator,
RandomForestEstimator,
SKLearnEstimator,
@@ -631,6 +632,125 @@ class HoltWinters(StatsModelsEstimator):
return train_time
class SimpleForecaster(StatsModelsEstimator):
"""Base class for naive forecasters such as SeasonalNaive, Naive, SeasonalAverage, and Average."""
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {
"season": {
"domain": tune.randint(1, pred_horizon),
"init_value": pred_horizon,
}
}
def joint_preprocess(self, X_train, y_train=None):
X_train = self.enrich(X_train)
self.regressors = []
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
# this class only supports univariate regression
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
else:
target_col = TS_VALUE_COL
train_df = self._join(X_train, y_train)
self.time_col = data.time_col
self.target_names = data.target_names
train_df = self._preprocess(train_df)
return train_df, target_col
def fit(self, X_train, y_train=None, budget=None, **kwargs):
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
self.season = self.params.get("season", 1)
current_time = time.time()
super().fit(X_train, y_train, budget=budget, **kwargs)
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = SimpleExpSmoothing(
train_df[[target_col]],
)
with suppress_stdout_stderr():
model = model.fit(smoothing_level=self.smoothing_level)
train_time = time.time() - current_time
self._model = model
return train_time
class SeasonalNaive(SimpleForecaster):
smoothing_level = 1.0
def predict(self, X, **kwargs):
if isinstance(X, int):
forecasts = []
for i in range(X):
forecast = self._model.forecast(steps=self.season)[0]
forecasts.append(forecast)
return pd.Series(forecasts)
else:
return super().predict(X, **kwargs)
class Naive(SimpleForecaster):
smoothing_level = 0.0
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def predict(self, X, **kwargs):
if isinstance(X, int):
last_observation = self._model.params["initial_level"]
return pd.Series([last_observation] * X)
else:
return super().predict(X, **kwargs)
class SeasonalAverage(SimpleForecaster):
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
start_time = time.time()
self.season = kwargs.get("season", 1) # seasonality period
train_df, target_col = self.joint_preprocess(X_train, y_train)
selection_res = ar_select_order(train_df[target_col], maxlag=self.season)
# Fit autoregressive model with optimal order
model = AutoReg(train_df[target_col], lags=selection_res.ar_lags)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
class Average(SimpleForecaster):
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg
start_time = time.time()
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = AutoReg(train_df[target_col], lags=0)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
class TS_SKLearn(TimeSeriesEstimator):
"""The class for tuning SKLearn Regressors for time-series forecasting"""
@@ -757,3 +877,7 @@ class XGBoostLimitDepth_TS(TS_SKLearn):
# catboost regressor is invalid because it has a `name` parameter, making it incompatible with hcrystalball
class CatBoost_TS(TS_SKLearn):
base_class = CatBoostEstimator
class LassoLars_TS(TS_SKLearn):
base_class = LassoLarsEstimator
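The test file later in this diff exercises these estimators under the names `snaive`, `naive`, `savg`, `avg`, and `lassolars`. A hedged sketch of requesting them through the public AutoML API (the synthetic series below is an assumption):

```python
import numpy as np
import pandas as pd
from flaml import AutoML

# Hypothetical univariate daily series with a timestamp column and a target column.
df = pd.DataFrame(
    {
        "ds": pd.date_range("2023-01-01", periods=200, freq="D"),
        "y": np.random.rand(200),
    }
)

automl = AutoML()
automl.fit(
    dataframe=df,
    label="y",
    task="ts_forecast",
    period=30,  # forecast horizon
    estimator_list=["snaive", "naive", "savg", "avg"],  # the new simple forecasters
    time_budget=10,
)
print(automl.best_estimator)
```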


@@ -11,7 +11,7 @@ from typing import IO
logger = logging.getLogger("flaml.automl")
class TrainingLogRecord(object):
class TrainingLogRecord:
def __init__(
self,
record_id: int,
@@ -52,7 +52,7 @@ class TrainingLogCheckPoint(TrainingLogRecord):
self.curr_best_record_id = curr_best_record_id
class TrainingLogWriter(object):
class TrainingLogWriter:
def __init__(self, output_filename: str):
self.output_filename = output_filename
self.file = None
@@ -79,7 +79,7 @@ class TrainingLogWriter(object):
sample_size,
):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if validation_loss is None:
raise ValueError("TEST LOSS NONE ERROR!!!")
record = TrainingLogRecord(
@@ -109,7 +109,7 @@ class TrainingLogWriter(object):
def checkpoint(self):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if self.current_best_loss_record_id is None:
logger.warning("flaml.training_log: checkpoint() called before any record is written, skipped.")
return
@@ -124,7 +124,7 @@ class TrainingLogWriter(object):
self.file = None # for pickle
class TrainingLogReader(object):
class TrainingLogReader:
def __init__(self, filename: str):
self.filename = filename
self.file = None
@@ -134,7 +134,7 @@ class TrainingLogReader(object):
def records(self):
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for line in self.file:
data = json.loads(line)
if len(data) == 1:
@@ -149,7 +149,7 @@ class TrainingLogReader(object):
def get_record(self, record_id) -> TrainingLogRecord:
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for rec in self.records():
if rec.record_id == record_id:
return rec


@@ -69,7 +69,7 @@ def build_portfolio(meta_features, regret, strategy):
def load_json(filename):
"""Returns the contents of json file filename."""
with open(filename, "r") as f:
with open(filename) as f:
return json.load(f)


@@ -43,7 +43,7 @@ def meta_feature(task, X_train, y_train, meta_feature_names):
# 'numpy.ndarray' object has no attribute 'select_dtypes'
this_feature.append(1) # all features are numeric
else:
raise ValueError("Feature {} not implemented. ".format(each_feature_name))
raise ValueError(f"Feature {each_feature_name} not implemented. ")
return this_feature
@@ -57,7 +57,7 @@ def load_config_predictor(estimator_name, task, location=None):
task = "multiclass" if task == "multi" else task # TODO: multi -> multiclass?
try:
location = location or LOCATION
with open(f"{location}/{estimator_name}/{task}.json", "r") as f:
with open(f"{location}/{estimator_name}/{task}.json") as f:
CONFIG_PREDICTORS[key] = predictor = json.load(f)
except FileNotFoundError:
raise FileNotFoundError(f"Portfolio has not been built for {estimator_name} on {task} task.")

flaml/fabric/__init__.py (new empty file)

flaml/fabric/mlflow.py (new file, 1021 lines; diff suppressed because it is too large)

flaml/tune/logger.py (new file, 37 lines)

@@ -0,0 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
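A short sketch of how the new module is meant to be consumed: attach a handler carrying `logger_formatter`, and optionally disable ANSI colors through the `FLAML_LOG_NO_COLOR` variable read above (assumption: the variable must be set before the module is first imported, since `use_color` is evaluated at import time).

```python
import logging
import os
import sys

os.environ["FLAML_LOG_NO_COLOR"] = "1"  # opt out of colored output

from flaml.tune.logger import logger, logger_formatter  # the module added in this diff

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logger_formatter)  # ColoredFormatter instance
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.warning("rendered in yellow unless FLAML_LOG_NO_COLOR is set")
```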


@@ -109,7 +109,7 @@ class FLOW2(Searcher):
else:
mode = "min"
super(FLOW2, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
# internally minimizes, so "max" => -1
if mode == "max":
self.metric_op = -1.0
@@ -350,7 +350,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
@@ -385,7 +385,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
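The assertions above require string tolerances to end with a `%` suffix, while numeric tolerances act as absolute bounds. A hedged example of a `lexico_objectives` dict that would satisfy this check (key layout per FLAML's lexicographic optimization API):

```python
# Percentage tolerances are strings ending in "%"; absolute tolerances are plain floats.
lexico_objectives = {
    "metrics": ["val_loss", "pred_time"],
    "modes": ["min", "min"],
    "tolerances": {"val_loss": "5%", "pred_time": 0.0},
    "targets": {"val_loss": 0.0, "pred_time": 0.0},
}
```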


@@ -319,7 +319,7 @@ class ChampionFrontierSearcher(BaseSearcher):
candidate_configs = [set(seed_interactions) | set(item) for item in space]
final_candidate_configs = []
for c in candidate_configs:
new_c = set([e for e in c if len(e) > 1])
new_c = {e for e in c if len(e) > 1}
final_candidate_configs.append(new_c)
return final_candidate_configs


@@ -191,7 +191,7 @@ class ConcurrencyLimiter(Searcher):
self.batch = batch
self.live_trials = set()
self.cached_results = {}
super(ConcurrencyLimiter, self).__init__(metric=self.searcher.metric, mode=self.searcher.mode)
super().__init__(metric=self.searcher.metric, mode=self.searcher.mode)
def suggest(self, trial_id: str) -> Optional[Dict]:
assert trial_id not in self.live_trials, f"Trial ID {trial_id} must be unique: already found in set."
@@ -285,25 +285,21 @@ def validate_warmstart(
"""
if points_to_evaluate:
if not isinstance(points_to_evaluate, list):
raise TypeError("points_to_evaluate expected to be a list, got {}.".format(type(points_to_evaluate)))
raise TypeError(f"points_to_evaluate expected to be a list, got {type(points_to_evaluate)}.")
for point in points_to_evaluate:
if not isinstance(point, (dict, list)):
raise TypeError(f"points_to_evaluate expected to include list or dict, " f"got {point}.")
if validate_point_name_lengths and (not len(point) == len(parameter_names)):
raise ValueError(
"Dim of point {}".format(point)
+ " and parameter_names {}".format(parameter_names)
+ " do not match."
)
raise ValueError(f"Dim of point {point}" + f" and parameter_names {parameter_names}" + " do not match.")
if points_to_evaluate and evaluated_rewards:
if not isinstance(evaluated_rewards, list):
raise TypeError("evaluated_rewards expected to be a list, got {}.".format(type(evaluated_rewards)))
raise TypeError(f"evaluated_rewards expected to be a list, got {type(evaluated_rewards)}.")
if not len(evaluated_rewards) == len(points_to_evaluate):
raise ValueError(
"Dim of evaluated_rewards {}".format(evaluated_rewards)
+ " and points_to_evaluate {}".format(points_to_evaluate)
f"Dim of evaluated_rewards {evaluated_rewards}"
+ f" and points_to_evaluate {points_to_evaluate}"
+ " do not match."
)
@@ -547,7 +543,7 @@ class OptunaSearch(Searcher):
evaluated_rewards: Optional[List] = None,
):
assert ot is not None, "Optuna must be installed! Run `pip install optuna`."
super(OptunaSearch, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
if isinstance(space, dict) and space:
resolved_vars, domain_vars, grid_vars = parse_spec_vars(space)


@@ -252,7 +252,7 @@ def _try_resolve(v) -> Tuple[bool, Any]:
# Grid search values
grid_values = v["grid_search"]
if not isinstance(grid_values, list):
raise TuneError("Grid search expected list of values, got: {}".format(grid_values))
raise TuneError(f"Grid search expected list of values, got: {grid_values}")
return False, Categorical(grid_values).grid()
return True, v
@@ -302,13 +302,13 @@ def has_unresolved_values(spec: Dict) -> bool:
class _UnresolvedAccessGuard(dict):
def __init__(self, *args, **kwds):
super(_UnresolvedAccessGuard, self).__init__(*args, **kwds)
super().__init__(*args, **kwds)
self.__dict__ = self
def __getattribute__(self, item):
value = dict.__getattribute__(self, item)
if not _is_resolved(value):
raise RecursiveDependencyError("`{}` recursively depends on {}".format(item, value))
raise RecursiveDependencyError(f"`{item}` recursively depends on {value}")
elif isinstance(value, dict):
return _UnresolvedAccessGuard(value)
else:


@@ -162,6 +162,10 @@ def broadcast_code(custom_code="", file_name="mylearner"):
assert isinstance(MyLargeLGBM(), LGBMEstimator)
```
"""
# Check if Spark is available
spark_available, _ = check_spark()
# Write to local driver file system
flaml_path = os.path.dirname(os.path.abspath(__file__))
custom_code = textwrap.dedent(custom_code)
custom_path = os.path.join(flaml_path, file_name + ".py")
@@ -169,6 +173,24 @@ def broadcast_code(custom_code="", file_name="mylearner"):
with open(custom_path, "w") as f:
f.write(custom_code)
# If using Spark, broadcast the code content to executors
if spark_available:
spark = SparkSession.builder.getOrCreate()
bc_code = spark.sparkContext.broadcast(custom_code)
# Execute a job to ensure the code is distributed to all executors
def _write_code(bc):
code = bc.value
import os
module_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), file_name + ".py")
os.makedirs(os.path.dirname(module_path), exist_ok=True)
with open(module_path, "w") as f:
f.write(code)
return True
spark.sparkContext.parallelize(range(1)).map(lambda _: _write_code(bc_code)).collect()
return custom_path
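A hedged usage sketch of `broadcast_code` after this change (the import path is assumed from the surrounding file): the custom learner source is still written on the driver, and when a Spark session is available the same source is written on every executor via a broadcast variable.

```python
from flaml.tune.spark.utils import broadcast_code  # assumed location of this helper

custom_code = (
    "from flaml.automl.model import LGBMEstimator\n"
    "\n"
    "class MyLargeLGBM(LGBMEstimator):\n"
    "    pass\n"
)

# Writes mylearner.py next to the helper module on the driver and, if Spark is up,
# runs a trivial job so each executor materializes the same file from the broadcast.
path = broadcast_code(custom_code=custom_code, file_name="mylearner")
print(path)
```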


@@ -110,7 +110,7 @@ class Trial:
}
self.metric_n_steps[metric] = {}
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_analysis[metric][key] = value
# Store n as string for correct restore.
self.metric_n_steps[metric][str(n)] = deque([value], maxlen=n)
@@ -124,7 +124,7 @@ class Trial:
self.metric_analysis[metric]["last"] = value
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_n_steps[metric][str(n)].append(value)
self.metric_analysis[metric][key] = sum(self.metric_n_steps[metric][str(n)]) / len(
self.metric_n_steps[metric][str(n)]


@@ -21,16 +21,26 @@ except (ImportError, AssertionError):
from .analysis import ExperimentAnalysis as EA
else:
ray_available = True
import logging
from flaml.tune.spark.utils import PySparkOvertimeMonitor, check_spark
from .logger import logger, logger_formatter
from .result import DEFAULT_METRIC
from .trial import Trial
logger = logging.getLogger(__name__)
logger.propagate = False
try:
import mlflow
except ImportError:
mlflow = None
try:
from flaml.fabric.mlflow import MLflowIntegration, is_autolog_enabled
internal_mlflow = True
except ImportError:
internal_mlflow = False
_use_ray = True
_runner = None
_verbose = 0
@@ -44,6 +54,7 @@ class ExperimentAnalysis(EA):
"""Class for storing the experiment results."""
def __init__(self, trials, metric, mode, lexico_objectives=None):
self.best_run_id = None
try:
super().__init__(self, None, trials, metric, mode)
self.lexico_objectives = lexico_objectives
@@ -128,6 +139,16 @@ class ExperimentAnalysis(EA):
else:
return self.best_trial.last_result
@property
def best_iteration(self) -> Optional[int]:
"""The index of the best trial in ``self.trials``, or None if it cannot be found."""
best_trial = self.best_trial
best_trial_id = best_trial.trial_id
for i, trial in enumerate(self.trials):
if trial.trial_id == best_trial_id:
return i
return None
def report(_metric=None, **kwargs):
"""A function called by the HPO application to report final or intermediate
@@ -174,9 +195,16 @@ def report(_metric=None, **kwargs):
global _training_iteration
if _use_ray:
try:
from ray import tune
from ray import __version__ as ray_version
return tune.report(_metric, **kwargs)
if ray_version.startswith("1."):
from ray import tune
return tune.report(_metric, **kwargs)
else: # ray>=2
from ray.air import session
return session.report(metrics={"metric": _metric, **kwargs})
except ImportError:
# calling tune.report() outside tune.run()
return
@@ -234,6 +262,11 @@ def run(
lexico_objectives: Optional[dict] = None,
force_cancel: Optional[bool] = False,
n_concurrent_trials: Optional[int] = 0,
mlflow_exp_name: Optional[str] = None,
automl_info: Optional[Tuple[float]] = None,
extra_tag: Optional[dict] = None,
cost_attr: Optional[str] = "auto",
cost_budget: Optional[float] = None,
**ray_args,
):
"""The function-based way of performing HPO.
@@ -424,6 +457,10 @@ def run(
}
```
force_cancel: boolean, default=False | Whether to forcibly cancel the PySpark job if it runs over the time budget.
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified when
mlflow autologging is enabled on Spark. Otherwise all results are logged to an experiment named after the
basename of the main entry file.
automl_info: tuple, default=None | Information about the automl run, as a tuple of (mlflow_log_latency,).
n_concurrent_trials: int, default=0 | The number of concurrent trials when performing hyperparameter
tuning with Spark. Only valid when use_spark=True and spark is required:
`pip install flaml[spark]`. Please check
@@ -431,6 +468,13 @@ def run(
for more details about installing Spark. When tune.run() is called from AutoML, it will be
overwritten by the value of `n_concurrent_trials` in AutoML. When <= 0, the concurrent trials
will be set to the number of executors.
extra_tag: dict, default=None | Extra tags to be added to the mlflow runs created by autologging.
cost_attr: None or str to specify the attribute to evaluate the cost of different trials.
Default is "auto", which means that we will automatically choose the cost attribute to use (depending
on the nature of the resource budget). When cost_attr is set to None, cost differences between different trials will be omitted
in our search algorithm. When cost_attr is set to a str different from "auto" and "time_total_s",
this cost_attr must be available in the result dict of the trial.
cost_budget: A float of the cost budget. Only valid when cost_attr is a str different from "auto" and "time_total_s".
**ray_args: keyword arguments to pass to ray.tune.run().
Only valid when use_ray=True.
"""
@@ -438,10 +482,12 @@ def run(
global _verbose
global _running_trial
global _training_iteration
global internal_mlflow
old_use_ray = _use_ray
old_verbose = _verbose
old_running_trial = _running_trial
old_training_iteration = _training_iteration
if log_file_name:
dir_name = os.path.dirname(log_file_name)
if dir_name:
@@ -473,10 +519,6 @@ def run(
elif not logger.hasHandlers():
# Add the console handler.
_ch = logging.StreamHandler(stream=sys.stdout)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
"%m-%d %H:%M:%S",
)
_ch.setFormatter(logger_formatter)
logger.addHandler(_ch)
if verbose <= 2:
@@ -486,6 +528,13 @@ def run(
else:
logger.setLevel(logging.CRITICAL)
if internal_mlflow and not automl_info and (mlflow.active_run() or is_autolog_enabled()):
mlflow_integration = MLflowIntegration("tune", mlflow_exp_name, extra_tag)
evaluation_function = mlflow_integration.wrap_evaluation_function(evaluation_function)
_internal_mlflow = not automl_info # True if mlflow_integration will be used for logging
else:
_internal_mlflow = False
from .searcher.blendsearch import CFO, BlendSearch, RandomSearch
if lexico_objectives is not None:
@@ -531,7 +580,7 @@ def run(
import optuna as _
SearchAlgorithm = BlendSearch
logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
except ImportError:
if search_alg == "BlendSearch":
raise ValueError("To use BlendSearch, run: pip install flaml[blendsearch]")
@@ -540,7 +589,7 @@ def run(
logger.warning("Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]")
else:
SearchAlgorithm = locals()[search_alg]
logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
metric = metric or DEFAULT_METRIC
search_alg = SearchAlgorithm(
metric=metric,
@@ -560,6 +609,8 @@ def run(
metric_constraints=metric_constraints,
use_incumbent_result_in_evaluation=use_incumbent_result_in_evaluation,
lexico_objectives=lexico_objectives,
cost_attr=cost_attr,
cost_budget=cost_budget,
)
else:
if metric is None or mode is None:
@@ -695,10 +746,16 @@ def run(
max_concurrent = max(1, search_alg.max_concurrent)
else:
max_concurrent = max(1, max_spark_parallelism)
passed_in_n_concurrent_trials = max(n_concurrent_trials, max_concurrent)
n_concurrent_trials = min(
n_concurrent_trials if n_concurrent_trials > 0 else num_executors,
max_concurrent,
)
if n_concurrent_trials < passed_in_n_concurrent_trials:
logger.warning(
f"The actual concurrent trials is {n_concurrent_trials}. You can set the environment "
f"variable `FLAML_MAX_CONCURRENT` to '{passed_in_n_concurrent_trials}' to override the detected num of executors."
)
with parallel_backend("spark"):
with Parallel(n_jobs=n_concurrent_trials, verbose=max(0, (verbose - 1) * 50)) as parallel:
try:
@@ -713,11 +770,15 @@ def run(
time_budget_s = np.inf
num_failures = 0
upperbound_num_failures = (len(evaluated_rewards) if evaluated_rewards else 0) + max_failure
logger.debug(f"automl_info: {automl_info}")
while (
time.time() - time_start < time_budget_s
and (num_samples < 0 or num_trials < num_samples)
and num_failures < upperbound_num_failures
):
if automl_info and automl_info[0] > 0 and time_budget_s < np.inf:
time_budget_s -= automl_info[0] * n_concurrent_trials
logger.debug(f"Remaining time budget with mlflow log latency: {time_budget_s} seconds.")
while len(_runner.running_trials) < n_concurrent_trials:
# suggest trials for spark
trial_next = _runner.step()
@@ -750,6 +811,9 @@ def run(
trial_to_run = trials_to_run[0]
_runner.running_trial = trial_to_run
if result is not None:
if _internal_mlflow:
mlflow_integration.record_trial(result, trial_to_run, metric)
if isinstance(result, dict):
if result:
logger.info(f"Brief result: {result}")
@@ -758,7 +822,7 @@ def run(
# When the result returned is an empty dict, set the trial status to error
trial_to_run.set_status(Trial.ERROR)
else:
logger.info("Brief result: {}".format({metric: result}))
logger.info("Brief result: {metric: result}")
report(_metric=result)
_runner.stop_trial(trial_to_run)
num_failures = 0
@@ -768,6 +832,20 @@ def run(
mode=mode,
lexico_objectives=lexico_objectives,
)
analysis.search_space = config
if _internal_mlflow:
mlflow_integration.log_tune(analysis, metric)
# try:
# _best_config = analysis.best_config
# except Exception:
# _best_config = None
# if _best_config:
# parallel(
# delayed(mlflow_integration.retrain)(evaluation_function, analysis.best_config)
# for dummy in [0]
# )
return analysis
finally:
# recover the global variables in case of nested run
@@ -779,6 +857,8 @@ def run(
_runner = old_runner
logger.handlers = old_handlers
logger.setLevel(old_level)
if _internal_mlflow:
mlflow_integration.adopt_children()
# simple sequential run without using tune.run() from ray
time_start = time.time()
@@ -812,7 +892,11 @@ def run(
result = None
with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
result = evaluation_function(trial_to_run.config)
logger.debug(f"result in tune: {trial_to_run}, {result}")
if result is not None:
if _internal_mlflow:
mlflow_integration.record_trial(result, trial_to_run, metric)
if isinstance(result, dict):
if result:
report(**result)
@@ -838,6 +922,19 @@ def run(
mode=mode,
lexico_objectives=lexico_objectives,
)
analysis.search_space = config
if _internal_mlflow:
mlflow_integration.log_tune(analysis, metric)
if analysis.best_run_id is not None:
logger.info(f"Best MLflow run name: {analysis.best_run_name}")
logger.info(f"Best MLflow run id: {analysis.best_run_id}")
# try:
# _best_config = analysis.best_config
# except Exception:
# _best_config = None
# if _best_config:
# mlflow_integration.retrain(evaluation_function, analysis.best_config)
return analysis
finally:
# recover the global variables in case of nested run
@@ -849,6 +946,8 @@ def run(
_runner = old_runner
logger.handlers = old_handlers
logger.setLevel(old_level)
if _internal_mlflow:
mlflow_integration.adopt_children()
class Tuner:
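Taken together, the mlflow-related additions in this file mean that running `tune.run` inside an active MLflow run (or with autologging enabled) wraps the evaluation function, records trials, and reports the best run name/id. A hedged sketch of that flow, assuming a local MLflow tracking setup:

```python
import mlflow
from flaml import tune

def evaluate(config):
    return {"score": (config["x"] - 1) ** 2}

with mlflow.start_run(run_name="tune-parent"):
    analysis = tune.run(
        evaluate,
        config={"x": tune.uniform(-5, 5)},
        metric="score",
        mode="min",
        num_samples=10,
        mlflow_exp_name="flaml-tune-demo",  # new argument; defaults to the entry-file basename
    )

# Per the logging added above, the best MLflow run name and id are reported when available.
print(analysis.best_config)
```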


@@ -1 +1 @@
__version__ = "2.2.0"
__version__ = "2.3.6"


@@ -174,7 +174,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
@@ -390,7 +390,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[32m[I 2023-08-01 22:38:01,549]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
"\u001B[32m[I 2023-08-01 22:38:01,549]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
]
},
{


@@ -196,7 +196,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
"data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
"tune_data = [\n",
" {\n",
@@ -444,8 +444,8 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[32m[I 2023-07-30 04:19:08,150]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n",
"\u001b[32m[I 2023-07-30 04:19:08,153]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
"\u001B[32m[I 2023-07-30 04:19:08,150]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n",
"\u001B[32m[I 2023-07-30 04:19:08,153]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
]
},
{


@@ -152,7 +152,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
"data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
"data = data.select(range(len(data))).rename_column(\"prompt\", \"definition\").remove_columns([\"task_id\", \"canonical_solution\"])"
]
},


@@ -121,7 +121,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",


@@ -112,9 +112,7 @@
]
}
],
"source": [
"raw_dataset = datasets.load_dataset(\"glue\", TASK)"
]
"source": "raw_dataset = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)"
},
{
"cell_type": "code",
@@ -425,9 +423,7 @@
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"metric = datasets.load_metric(\"glue\", TASK)"
]
"source": "metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)"
},
{
"cell_type": "code",
@@ -646,7 +642,7 @@
"def train_distilbert(config: dict):\n",
"\n",
" # Load CoLA dataset and apply tokenizer\n",
" cola_raw = datasets.load_dataset(\"glue\", TASK)\n",
" cola_raw = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)\n",
" cola_encoded = cola_raw.map(tokenize, batched=True)\n",
" train_dataset, eval_dataset = cola_encoded[\"train\"], cola_encoded[\"validation\"]\n",
"\n",
@@ -654,7 +650,7 @@
" MODEL_CHECKPOINT, num_labels=NUM_LABELS\n",
" )\n",
"\n",
" metric = datasets.load_metric(\"glue\", TASK)\n",
" metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)\n",
" def compute_metrics(eval_pred):\n",
" predictions, labels = eval_pred\n",
" predictions = np.argmax(predictions, axis=1)\n",
@@ -847,7 +843,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
" 0%| | 0/9 [00:00<?, ?ba/s]\n",
" 22%|██▏ | 2/9 [00:00<00:00, 19.41ba/s]\n",
" 56%|█████▌ | 5/9 [00:00<00:00, 20.98ba/s]\n",
@@ -856,25 +852,25 @@
"100%|██████████| 2/2 [00:00<00:00, 42.79ba/s]\n",
" 0%| | 0/2 [00:00<?, ?ba/s]\n",
"100%|██████████| 2/2 [00:00<00:00, 41.48ba/s]\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],

pytest.ini (new file, 3 lines)

@@ -0,0 +1,3 @@
[pytest]
markers =
spark: mark a test as requiring Spark
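The marker is applied module-wide in the new Spark test file below via `pytestmark = pytest.mark.spark`. A hedged sketch of selecting or excluding those tests programmatically (the `test/` directory is an assumption):

```python
import pytest

# Run only the tests that require Spark.
pytest.main(["-m", "spark", "test/"])

# Or skip them on machines without a Spark installation.
pytest.main(["-m", "not spark", "test/"])
```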


@@ -4,7 +4,7 @@ import setuptools
here = os.path.abspath(os.path.dirname(__file__))
with open("README.md", "r", encoding="UTF-8") as fh:
with open("README.md", encoding="UTF-8") as fh:
long_description = fh.read()
@@ -55,7 +55,8 @@ setuptools.setup(
"lightgbm>=2.3.1",
"xgboost>=0.90,<2.0.0",
"scipy>=1.4.1",
"pandas>=1.1.4",
"pandas>=1.1.4,<2.0.0; python_version<'3.10'",
"pandas>=1.1.4; python_version>='3.10'",
"scikit-learn>=1.0.0",
"thop",
"pytest>=6.1.1",
@@ -72,14 +73,14 @@ setuptools.setup(
"psutil==5.8.0",
"dataclasses",
"transformers[torch]==4.26",
"datasets",
"nltk",
"datasets<=3.5.0",
"nltk<=3.8.1", # 3.8.2 doesn't work with mlflow
"rouge_score",
"hcrystalball==0.1.10",
"seqeval",
"pytorch-forecasting>=0.9.0,<=0.10.1; python_version<'3.11'",
"mlflow",
"pyspark>=3.2.0",
# "pytorch-forecasting==0.10.1; python_version=='3.11'",
"mlflow==2.15.1",
"joblibspark>=0.5.0",
"joblib<=1.3.2",
"nbconvert",
@@ -92,6 +93,7 @@ setuptools.setup(
"pydantic==1.10.9",
"sympy",
"wolframalpha",
"dill", # a drop in replacement of pickle
],
"catboost": [
"catboost>=0.26,<1.2; python_version<'3.11'",
@@ -117,14 +119,14 @@ setuptools.setup(
"hf": [
"transformers[torch]==4.26",
"datasets",
"nltk",
"nltk<=3.8.1",
"rouge_score",
"seqeval",
],
"nlp": [ # for backward compatibility; hf is the new option name
"transformers[torch]==4.26",
"datasets",
"nltk",
"nltk<=3.8.1",
"rouge_score",
"seqeval",
],
@@ -139,7 +141,8 @@ setuptools.setup(
"prophet>=1.0.1",
"statsmodels>=0.12.2",
"hcrystalball==0.1.10",
"pytorch-forecasting>=0.9.0",
"pytorch-forecasting>=0.9.0; python_version<'3.11'",
# "pytorch-forecasting==0.10.1; python_version=='3.11'",
"pytorch-lightning==1.9.0",
"tensorboardX==2.6",
],
@@ -163,9 +166,13 @@ setuptools.setup(
"autozero": ["scikit-learn", "pandas", "packaging"],
},
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
# Specify the Python versions you support here.
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
],
python_requires=">=3.6",
python_requires=">=3.9",
)
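The split pandas requirement above relies on PEP 508 environment markers. A hedged illustration (not part of the diff) of how such a marker resolves on the running interpreter, using the `packaging` library:

```python
from packaging.requirements import Requirement

# The same conditional pin added to setup.py above.
req = Requirement("pandas>=1.1.4,<2.0.0; python_version<'3.10'")

# marker.evaluate() is True on Python 3.9 and False on 3.10+, so the <2.0.0 cap
# only applies to older interpreters.
print(req.name, req.specifier, req.marker.evaluate())
```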


@@ -178,7 +178,7 @@ def test_tsp(human_input_mode="NEVER", max_consecutive_auto_reply=10):
class TSPUserProxyAgent(UserProxyAgent):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
with open(f"{here}/tsp_prompt.txt", "r") as f:
with open(f"{here}/tsp_prompt.txt") as f:
self._prompt = f.read()
def generate_init_message(self, question) -> str:


@@ -187,7 +187,7 @@ def test_humaneval(num_samples=1):
)
seed = 41
data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
data = datasets.load_dataset("openai_humaneval", trust_remote_code=True)["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
{
@@ -334,7 +334,7 @@ def test_math(num_samples=-1):
return
seed = 41
data = datasets.load_dataset("competition_math")
data = datasets.load_dataset("competition_math", trust_remote_code=True)
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20
@@ -356,7 +356,7 @@ def test_math(num_samples=-1):
]
print(
"max tokens in tuning data's canonical solutions",
max([len(x["solution"].split()) for x in tune_data]),
max(len(x["solution"].split()) for x in tune_data),
)
print(len(tune_data), len(test_data))
# prompt template


@@ -1,11 +1,15 @@
import unittest
from datetime import datetime
from test.conftest import evaluate_cv_folds_with_underlying_model
import numpy as np
import pandas as pd
import pytest
import scipy.sparse
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import (
train_test_split,
)
from flaml import AutoML, tune
from flaml.automl.model import LGBMEstimator
@@ -420,6 +424,122 @@ class TestClassification(unittest.TestCase):
print(automl_experiment.best_estimator)
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
# "lrl1",
"lrl2",
"rf",
"svc",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_classification_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "classification",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "f1",
"keep_search_state": True,
"skip_transform": True,
}
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="f1",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
# "lrl1",
"lrl2",
"svc",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_underlying_classification_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
Ideally, FLAMLised models should perform identically to the underlying model, when fitted
to the same data, with no budget. This verifies that this is the case for classification models.
In this test we take the best model which FLAML provided us, extract the underlying model,
before retraining and testing it on the same folds - to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "classification",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "f1",
"keep_search_state": True,
"skip_transform": True,
}
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
val_loss_flaml = automl.best_result["val_loss"]
reproduced_val_loss_underlying_model = np.mean(
evaluate_cv_folds_with_underlying_model(
automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "classification"
)
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model
if __name__ == "__main__":
test = TestClassification()
test.test_preprocess()


@@ -125,14 +125,12 @@ def test_metric_constraints_custom():
print(automl.estimator_list)
print(automl.search_space)
print(automl.points_to_evaluate)
print("Best minimization objective on validation data: {0:.4g}".format(automl.best_loss))
print(f"Best minimization objective on validation data: {automl.best_loss:.4g}")
print(
"pred_time of the best config on validation data: {0:.4g}".format(
automl.metrics_for_best_config[1]["pred_time"]
)
"pred_time of the best config on validation data: {:.4g}".format(automl.metrics_for_best_config[1]["pred_time"])
)
print(
"val_train_loss_gap of the best config on validation data: {0:.4g}".format(
"val_train_loss_gap of the best config on validation data: {:.4g}".format(
automl.metrics_for_best_config[1]["val_train_loss_gap"]
)
)


@@ -0,0 +1,312 @@
import os
import sys
import unittest
import warnings
from collections import defaultdict
import mlflow
import numpy as np
import pandas as pd
import pytest
import scipy
from packaging.version import Version
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML
from flaml.automl.ml import sklearn_metric_loss_score
from flaml.tune.spark.utils import check_spark
pytestmark = pytest.mark.spark
leaderboard = defaultdict(dict)
warnings.simplefilter(action="ignore")
if sys.platform == "darwin" or "nt" in os.name:
# skip this test if the platform is not linux
skip_spark = True
else:
try:
import pyspark
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from flaml.automl.spark.utils import to_pandas_on_spark
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
"com.microsoft.azure:synapseml_2.12:1.0.2,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
.getOrCreate()
)
spark.sparkContext._conf.set(
"spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
"https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
)
# spark.sparkContext.setLogLevel("ERROR")
spark_available, _ = check_spark()
skip_spark = not spark_available
except ImportError:
skip_spark = True
def _test_regular_models(estimator_list, task):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
if task == "classification":
load_dataset_func = load_iris
metric = "accuracy"
else:
load_dataset_func = load_diabetes
metric = "r2"
x, y = load_dataset_func(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7654321)
automl_experiment = AutoML()
automl_settings = {
"max_iter": 5,
"task": task,
"estimator_list": estimator_list,
"metric": metric,
}
automl_experiment.fit(X_train=x_train, y_train=y_train, **automl_settings)
predictions = automl_experiment.predict(x_test)
score = sklearn_metric_loss_score(metric, predictions, y_test)
for estimator_name in estimator_list:
leaderboard[task][estimator_name] = score
def _test_spark_models(estimator_list, task):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
if task == "classification":
load_dataset_func = load_iris
evaluator = MulticlassClassificationEvaluator(
labelCol="target", predictionCol="prediction", metricName="accuracy"
)
metric = "accuracy"
elif task == "regression":
load_dataset_func = load_diabetes
evaluator = RegressionEvaluator(labelCol="target", predictionCol="prediction", metricName="r2")
metric = "r2"
elif task == "binary":
load_dataset_func = load_breast_cancer
evaluator = MulticlassClassificationEvaluator(
labelCol="target", predictionCol="prediction", metricName="accuracy"
)
metric = "accuracy"
final_cols = ["target", "features"]
extra_args = {}
if estimator_list is not None and "aft_spark" in estimator_list:
# survival analysis task
pd_df = pd.read_csv(
"https://raw.githubusercontent.com/CamDavidsonPilon/lifelines/master/lifelines/datasets/rossi.csv"
)
pd_df.rename(columns={"week": "target"}, inplace=True)
final_cols += ["arrest"]
extra_args["censorCol"] = "arrest"
else:
pd_df = load_dataset_func(as_frame=True).frame
rename = {}
for attr in pd_df.columns:
rename[attr] = attr.replace(" ", "_")
pd_df = pd_df.rename(columns=rename)
df = spark.createDataFrame(pd_df)
df = df.repartition(4)
train, test = df.randomSplit([0.8, 0.2], seed=7654321)
feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)[final_cols]
test_data = featurizer.transform(test)[final_cols]
automl = AutoML()
settings = {
"max_iter": 1,
"estimator_list": estimator_list, # ML learner we intend to test
"task": task, # task type
"metric": metric, # metric to optimize
}
settings.update(extra_args)
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
automl.fit(
dataframe=df,
label="target",
**settings,
)
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show(5)
score = evaluator.evaluate(predictions)
if estimator_list is not None:
for estimator_name in estimator_list:
leaderboard[task][estimator_name] = score
def _test_sparse_matrix_classification(estimator):
automl_experiment = AutoML()
automl_settings = {
"estimator_list": [estimator],
"time_budget": 2,
"metric": "auto",
"task": "classification",
"log_file_name": "test/sparse_classification.log",
"split_type": "uniform",
"n_jobs": 1,
"model_history": True,
}
X_train = scipy.sparse.random(1554, 21, dtype=int)
y_train = np.random.randint(3, size=1554)
automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)
def load_multi_dataset():
"""multivariate time series forecasting dataset"""
import pandas as pd
# pd.set_option("display.max_rows", None, "display.max_columns", None)
df = pd.read_csv(
"https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/nyc_energy_consumption.csv"
)
# preprocessing data
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df = df.set_index("timeStamp")
df = df.resample("D").mean()
df["temp"] = df["temp"].fillna(method="ffill")
df["precip"] = df["precip"].fillna(method="ffill")
df = df[:-2] # last two rows are NaN for 'demand' column so remove them
df = df.reset_index()
return df
def _test_forecast(estimator_list, budget=10):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
df = load_multi_dataset()
# split data into train and test
time_horizon = 180
num_samples = df.shape[0]
split_idx = num_samples - time_horizon
train_df = df[:split_idx]
test_df = df[split_idx:]
# test dataframe must contain values for the regressors / multivariate variables
X_test = test_df[["timeStamp", "precip", "temp"]]
y_test = test_df["demand"]
# return
automl = AutoML()
settings = {
"time_budget": budget, # total running time in seconds
"metric": "mape", # primary metric
"task": "ts_forecast", # task type
"log_file_name": "test/energy_forecast_numerical.log", # flaml log file
"log_dir": "logs/forecast_logs", # tcn/tft log folder
"eval_method": "holdout",
"log_type": "all",
"label": "demand",
"estimator_list": estimator_list,
}
"""The main flaml automl API"""
automl.fit(dataframe=train_df, **settings, period=time_horizon)
print(automl.best_config)
pred_y = automl.predict(X_test)
mape = sklearn_metric_loss_score("mape", pred_y, y_test)
for estimator_name in estimator_list:
leaderboard["forecast"][estimator_name] = mape
class TestExtraModel(unittest.TestCase):
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_rf_spark(self):
tasks = ["classification", "regression"]
for task in tasks:
_test_spark_models("rf_spark", task)
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_nb_spark(self):
_test_spark_models("nb_spark", "classification")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_glr(self):
_test_spark_models("glr_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_lr(self):
_test_spark_models("lr_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_svc_spark(self):
_test_spark_models("svc_spark", "binary")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_gbt_spark(self):
tasks = ["binary", "regression"]
for task in tasks:
_test_spark_models("gbt_spark", task)
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_aft(self):
_test_spark_models("aft_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_default_spark(self):
_test_spark_models(None, "classification")
def test_svc(self):
_test_regular_models("svc", "classification")
_test_sparse_matrix_classification("svc")
def test_sgd(self):
tasks = ["classification", "regression"]
for task in tasks:
_test_regular_models("sgd", task)
_test_sparse_matrix_classification("sgd")
def test_enet(self):
_test_regular_models("enet", "regression")
def test_lassolars(self):
_test_regular_models("lassolars", "regression")
_test_forecast("lassolars")
def test_seasonal_naive(self):
_test_forecast("snaive")
def test_naive(self):
_test_forecast("naive")
def test_seasonal_avg(self):
_test_forecast("savg")
def test_avg(self):
_test_forecast("avg")
@unittest.skipIf(skip_spark, reason="Skip on Mac or Windows")
def test_tcn(self):
_test_forecast("tcn")
if __name__ == "__main__":
unittest.main()
print(leaderboard)


@@ -1,4 +1,5 @@
import datetime
import os
import sys
import numpy as np
@@ -95,6 +96,7 @@ def test_forecast_automl(budget=10, estimators_when_no_prophet=["arima", "sarima
)
@pytest.mark.skipif(sys.platform == "darwin" or "nt" in os.name, reason="skip on mac or windows")
def test_models(budget=3):
n = 200
X = pd.DataFrame(
@@ -151,6 +153,10 @@ def test_numpy():
print(automl.predict(12))
@pytest.mark.skipif(
sys.platform in ["darwin"],
reason="do not run on mac os",
)
def test_numpy_large():
import numpy as np
import pandas as pd
@@ -471,7 +477,10 @@ def test_forecast_classification(budget=5):
def get_stalliion_data():
from pytorch_forecasting.data.examples import get_stallion_data
data = get_stallion_data()
# data = get_stallion_data()
data = pd.read_parquet(
"https://raw.githubusercontent.com/sktime/pytorch-forecasting/refs/heads/main/examples/data/stallion.parquet"
)
# add time index - For datasets with no missing values, FLAML will automate this process
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()
@@ -567,7 +576,7 @@ def test_forecast_panel(budget=5):
print(f"Training duration of best run: {automl.best_config_train_time}s")
print(automl.model.estimator)
""" pickle and save the automl object """
import pickle
import dill as pickle
with open("automl.pkl", "wb") as f:
pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)


@@ -0,0 +1,51 @@
import mlflow
import numpy as np
import pandas as pd
from flaml import AutoML
def test_max_iter_1():
date_rng = pd.date_range(start="2024-01-01", periods=100, freq="H")
X = pd.DataFrame({"ds": date_rng})
y_train_24h = np.random.rand(len(X)) * 100
# AutoML
settings = {
"max_iter": 1,
"estimator_list": ["xgboost", "lgbm"],
"starting_points": {"xgboost": {}, "lgbm": {}},
"task": "ts_forecast",
"log_file_name": "test_max_iter_1.log",
"seed": 41,
"mlflow_exp_name": "TestExp-max_iter-1",
"use_spark": False,
"n_concurrent_trials": 1,
"verbose": 1,
"featurization": "off",
"metric": "rmse",
"mlflow_logging": True,
}
automl = AutoML(**settings)
with mlflow.start_run(run_name="AutoMLModel-XGBoost-and-LGBM-max_iter_1"):
automl.fit(
X_train=X,
y_train=y_train_24h,
period=24,
X_val=X,
y_val=y_train_24h,
split_ratio=0,
force_cancel=False,
)
assert automl.model is not None, "AutoML failed to return a model"
assert automl.best_run_id is not None, "Best run ID should not be None with mlflow logging"
print("Best model:", automl.model)
print("Best run ID:", automl.best_run_id)
if __name__ == "__main__":
test_max_iter_1()


@@ -1,3 +1,5 @@
import pickle
import mlflow
import mlflow.entities
import pytest
@@ -8,58 +10,113 @@ from flaml import AutoML
class TestMLFlowLoggingParam:
def test_update_and_install_requirements(self):
import mlflow
from sklearn import tree
from flaml.fabric.mlflow import update_and_install_requirements
with mlflow.start_run(run_name="test") as run:
sk_model = tree.DecisionTreeClassifier()
mlflow.sklearn.log_model(sk_model, "model", registered_model_name="test")
update_and_install_requirements(run_id=run.info.run_id)
def test_should_start_new_run_by_default(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML()
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"
def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_init(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML(mlflow_logging=False)
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"
def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_fit(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML()
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=False, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"
def test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML(mlflow_logging=False)
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=True, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"
@staticmethod
def _get_child_runs(parent_run: mlflow.entities.Run) -> DataFrame:
experiment_id = parent_run.info.experiment_id
return mlflow.search_runs(
[experiment_id], filter_string="tags.mlflow.parentRunId = '{}'".format(parent_run.info.run_id)
[experiment_id], filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
)
@staticmethod
def _check_mlflow_parameters(automl: AutoML, run_info: mlflow.entities.RunInfo):
with open(
f"./mlruns/{run_info.experiment_id}/{run_info.run_id}/artifacts/automl_pipeline/model.pkl", "rb"
) as f:
t = pickle.load(f)
if __name__ == "__main__":
print(t)
if not hasattr(automl.model._model, "_get_param_names"):
return
for param in automl.model._model._get_param_names():
assert eval("t._final_estimator._model" + f".{param}") == eval(
"automl.model._model" + f".{param}"
), "The mlflow logging not consistent with automl model"
if __name__ == "__main__":
print(param, "\t", eval("automl.model._model" + f".{param}"))
print("[INFO]: Successfully Logged")
@pytest.fixture(scope="class")
def automl_settings(self):
mlflow.end_run()
return {
"time_budget": 2, # in seconds
"time_budget": 5, # in seconds
"metric": "accuracy",
"task": "classification",
"log_file_name": "iris.log",
}
if __name__ == "__main__":
s = TestMLFlowLoggingParam()
automl_settings = {
"time_budget": 5, # in seconds
"metric": "accuracy",
"task": "classification",
"log_file_name": "iris.log",
}
s.test_should_start_new_run_by_default(automl_settings)
s.test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(automl_settings)


@@ -143,4 +143,5 @@ def test_prep():
if __name__ == "__main__":
test_lrl2()
test_prep()


@@ -187,7 +187,6 @@ class TestMultiClass(unittest.TestCase):
def test_custom_metric(self):
df, y = load_iris(return_X_y=True, as_frame=True)
df["label"] = y
automl = AutoML()
settings = {
"dataframe": df,
"label": "label",
@@ -204,7 +203,8 @@ class TestMultiClass(unittest.TestCase):
"pred_time_limit": 1e-5,
"ensemble": True,
}
automl.fit(**settings)
automl = AutoML(**settings) # test safe_json_dumps
automl.fit(dataframe=df, label="label")
print(automl.classes_)
print(automl.model)
print(automl.config_history)
@@ -438,8 +438,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
starting_points = automl.best_config_per_estimator
print("starting_points", starting_points)
@@ -461,8 +461,8 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
print("Best ML leaner:", new_automl.best_estimator)
print("Best hyperparmeter config:", new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")
def test_fit_w_starting_point_2(self, as_frame=True):
try:
@@ -493,8 +493,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
starting_points = {}
log_file_name = settings["log_file_name"]
@@ -508,7 +508,7 @@ class TestMultiClass(unittest.TestCase):
if learner not in starting_points:
starting_points[learner] = []
starting_points[learner].append(config)
max_iter = sum([len(s) for k, s in starting_points.items()])
max_iter = sum(len(s) for k, s in starting_points.items())
settings_resume = {
"time_budget": 2,
"metric": "accuracy",
@@ -528,7 +528,7 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
# print('Best ML leaner:', new_automl.best_estimator)
# print('Best hyperparmeter config:', new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))

View File
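The warm-start hunks above boil down to one pattern: run a first search, take `best_config_per_estimator`, and feed it back in as `starting_points` for a second search. A minimal sketch of that pattern, assuming the iris data and deliberately tiny budgets (the values are illustrative, not taken from the tests):

# Minimal sketch of FLAML warm starting: reuse the best configs of a first search
# as starting points for a second search. Budgets are illustrative, not tuned.
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

first = AutoML()
first.fit(X_train=X, y_train=y, task="classification", metric="accuracy", time_budget=2)
starting_points = first.best_config_per_estimator  # dict: estimator name -> best config (or None)

second = AutoML()
second.fit(
    X_train=X,
    y_train=y,
    task="classification",
    metric="accuracy",
    time_budget=2,
    starting_points=starting_points,
)
print("Best ML learner:", second.best_estimator)
print("Best hyperparameter config:", second.best_config)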

@@ -65,8 +65,8 @@ def test_automl(budget=5, dataset_format="dataframe", hpo_method=None):
""" retrieve best config and best learner """
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
print(automl.model.estimator)
print(automl.best_config_per_estimator)
print("time taken to find best model:", automl.time_to_find_best_model)

View File

@@ -1,9 +1,12 @@
import unittest
from test.conftest import evaluate_cv_folds_with_underlying_model
import numpy as np
import pytest
import scipy.sparse
from sklearn.datasets import (
fetch_california_housing,
make_regression,
)
from flaml import AutoML
@@ -205,7 +208,6 @@ class TestRegression(unittest.TestCase):
def test_multioutput():
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
@@ -230,5 +232,210 @@ def test_multioutput():
print(model.predict(X_test))
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"enet",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_regression_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 2,
"time_budget": -1,
"task": "regression",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 3,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
def test_reproducibility_of_catboost_regression_model():
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues around the catboost model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best catboost regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"time_budget": 7,
"task": "regression",
"n_jobs": 1,
"estimator_list": ["catboost"],
"eval_method": "cv",
"n_splits": 10,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
def test_reproducibility_of_lgbm_regression_model():
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues around LGBMs - see here:
https://github.com/microsoft/FLAML/issues/1368
In this test we take the best LGBM regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"time_budget": 3,
"task": "regression",
"n_jobs": 1,
"estimator_list": ["lgbm"],
"eval_method": "cv",
"n_splits": 9,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss or val_loss_flaml > reproduced_val_loss
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"enet",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_underlying_regression_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
Ideally, FLAMLised models should perform identically to the underlying model, when fitted
to the same data, with no budget. This verifies that this is the case for regression models.
In this test we take the best model which FLAML provided us, extract the underlying model,
before retraining and testing it on the same folds - to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "regression",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": False,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
val_loss_flaml = automl.best_result["val_loss"]
reproduced_val_loss_underlying_model = np.mean(
evaluate_cv_folds_with_underlying_model(
automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "regression"
)
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model
if __name__ == "__main__":
unittest.main()

View File
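The reproducibility tests above all follow the same recipe: keep the search state, re-run cross-validation with the exact same splitter, and compare against `automl.best_result["val_loss"]` with `pytest.approx`. A stripped-down sketch of that recipe with a plain scikit-learn estimator (illustrative only, not FLAML's internal `evaluate_model_CV`):

# Minimal sketch: check that a model's CV score is reproducible when the folds
# and the estimator's random_state are held fixed. Plain scikit-learn only.
import numpy as np
import pytest
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = fetch_california_housing(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=20, random_state=0)

first = np.mean(cross_val_score(model, X, y, cv=kf, scoring="r2"))
second = np.mean(cross_val_score(model, X, y, cv=kf, scoring="r2"))
assert pytest.approx(first) == second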

@@ -195,7 +195,7 @@ class TestScore:
automl_settings = {
"time_budget": 2,
"task": "rank",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"groups": np.array([0] * 200 + [1] * 200 + [2] * 100), # group labels
"learner_selector": "roundrobin",

View File

@@ -1,4 +1,6 @@
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml, load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold, KFold, train_test_split
@@ -16,7 +18,7 @@ def _test(split_type):
"time_budget": 2,
# "metric": 'accuracy',
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"log_training_metric": True,
"split_type": split_type,
@@ -48,7 +50,7 @@ def test_time():
_test(split_type="time")
def test_groups():
def test_groups_for_classification_task():
from sklearn.externals._arff import ArffException
try:
@@ -58,17 +60,15 @@ def test_groups():
X, y = load_wine(return_X_y=True)
import numpy as np
automl = AutoML()
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"eval_method": "cv",
"groups": np.random.randint(low=0, high=10, size=len(y)),
"estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
"estimator_list": ["catboost", "lgbm", "rf", "xgboost", "kneighbor"],
"learner_selector": "roundrobin",
}
automl.fit(X, y, **automl_settings)
@@ -88,6 +88,72 @@ def test_groups():
automl.fit(X, y, **automl_settings)
def test_groups_for_regression_task():
"""Append nonsensical groups to iris dataset and use it to test that GroupKFold works for regression tasks"""
iris_dict_data = load_iris(as_frame=True) # sklearn Bunch containing a pandas frame
iris_data = iris_dict_data["frame"] # pandas dataframe data + target
rng = np.random.default_rng(42)
iris_data["cluster"] = rng.integers(
low=0, high=5, size=iris_data.shape[0]
) # np.random.randint(0, 5, iris_data.shape[0])
automl = AutoML()
X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
y = iris_data["petal width (cm)"]
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
X, y, iris_data["cluster"], random_state=42
)
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"metric": "r2",
"task": "regression",
"estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
"eval_method": "cv",
"split_type": "uniform",
"groups": groups_train,
}
automl.fit(X_train, y_train, **automl_settings)
def test_groups_with_sample_weights():
"""Verifies that sample weights can be used with group splits i.e. that https://github.com/microsoft/FLAML/issues/1396 remains fixed"""
iris_dict_data = load_iris(as_frame=True) # sklearn Bunch containing a pandas frame
iris_data = iris_dict_data["frame"] # pandas dataframe data + target
iris_data["cluster"] = np.random.randint(0, 5, iris_data.shape[0])
automl = AutoML()
X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
y = iris_data["petal width (cm)"]
sample_weight = pd.Series(np.random.rand(X.shape[0]))
(
X_train,
X_test,
y_train,
y_test,
groups_train,
groups_test,
sample_weight_train,
sample_weight_test,
) = train_test_split(X, y, iris_data["cluster"], sample_weight, random_state=42)
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"metric": "r2",
"task": "regression",
"log_file_name": "error.log",
"log_type": "all",
"estimator_list": ["lgbm"],
"eval_method": "cv",
"split_type": "group",
"groups": groups_train,
"sample_weight": sample_weight_train,
}
automl.fit(X_train, y_train, **automl_settings)
assert automl.model is not None
def test_stratified_groupkfold():
from minio.error import ServerError
from sklearn.model_selection import StratifiedGroupKFold
@@ -108,6 +174,7 @@ def test_stratified_groupkfold():
"split_type": splitter,
"groups": X_train["Airline"],
"estimator_list": [
"catboost",
"lgbm",
"rf",
"xgboost",
@@ -136,7 +203,7 @@ def test_rank():
automl_settings = {
"time_budget": 2,
"task": "rank",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"eval_method": "cv",
"groups": np.array([0] * 200 + [1] * 200 + [2] * 200 + [3] * 200 + [4] * 100 + [5] * 100), # group labels
@@ -149,7 +216,7 @@ def test_rank():
"time_budget": 2,
"task": "rank",
"metric": "ndcg@5", # 5 can be replaced by any number
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"groups": [200] * 4 + [100] * 2, # alternative way: group counts
# "estimator_list": ['lgbm', 'xgboost'], # list of ML learners
@@ -188,7 +255,7 @@ def test_object():
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"log_training_metric": True,
"split_type": TestKFold(5),
@@ -203,4 +270,4 @@ def test_object():
if __name__ == "__main__":
test_groups()
test_groups_for_classification_task()

View File
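`test_groups_for_regression_task` and `test_groups_with_sample_weights` both rely on group-aware splitting, i.e. no group ever appears on both the training and validation side of a fold. A small scikit-learn-only sketch of that guarantee, using synthetic groups just as the tests do:

# Minimal sketch: GroupKFold keeps every group entirely on one side of a split.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
y = rng.normal(size=150)
groups = rng.integers(low=0, high=5, size=150)  # synthetic cluster labels, like the tests above

for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))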

@@ -29,8 +29,8 @@ class TestWarmStart(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
# 1. Get starting points from previous experiments.
starting_points = automl.best_config_per_estimator
print("starting_points", starting_points)
@@ -97,8 +97,8 @@ class TestWarmStart(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
print("Best ML leaner:", new_automl.best_estimator)
print("Best hyperparmeter config:", new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")
def test_nobudget(self):
automl = AutoML()

test/conftest.py (new file, 42 lines)
View File

@@ -0,0 +1,42 @@
from typing import Any, Dict, List, Union
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
from sklearn.metrics import f1_score, r2_score
def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model: Any, task: str) -> pd.DataFrame:
"""Mimic the FLAML CV process to calculate the metrics across each fold.
:param X_train_all: X training data
:param y_train_all: y training data
:param kf: The splitter object to use to generate the folds
:param model: The estimator to fit to the data during the CV process
:param task: classification or regression
:return: A list containing the metric value for each fold
"""
rng = np.random.RandomState(2020)
all_fold_metrics: List[Dict[str, Union[int, float]]] = []
for train_index, val_index in kf.split(X_train_all, y_train_all):
X_train_split, y_train_split = X_train_all, y_train_all
train_index = rng.permutation(train_index)
X_train = X_train_split.iloc[train_index]
X_val = X_train_split.iloc[val_index]
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
model_type = type(model)
if model_type is not CatBoostClassifier and model_type is not CatBoostRegressor:
model.fit(X_train, y_train)
else:
use_best_model = True
n = max(int(len(y_train) * 0.9), len(y_train) - 1000) if use_best_model else len(y_train)
X_tr, y_tr = (X_train)[:n], y_train[:n]
eval_set = Pool(data=X_train[n:], label=y_train[n:], cat_features=[]) if use_best_model else None
model.fit(X_tr, y_tr, eval_set=eval_set, use_best_model=True)
y_pred_classes = model.predict(X_val)
if task == "classification":
reproduced_metric = 1 - f1_score(y_val, y_pred_classes)
else:
reproduced_metric = 1 - r2_score(y_val, y_pred_classes)
all_fold_metrics.append(reproduced_metric)
return all_fold_metrics

View File
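A hedged usage sketch for the new `evaluate_cv_folds_with_underlying_model` helper, assuming it is importable as `test.conftest` (as in the regression tests above), with a plain scikit-learn regressor standing in for the underlying model:

# Minimal usage sketch for evaluate_cv_folds_with_underlying_model.
# Assumes it is run from the repository root so that test.conftest is importable,
# as in the regression tests above.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from test.conftest import evaluate_cv_folds_with_underlying_model

X, y = load_diabetes(return_X_y=True, as_frame=True)
y = y.to_numpy()  # the helper indexes y positionally, so pass a numpy array
kf = KFold(n_splits=3, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0)

fold_losses = evaluate_cv_folds_with_underlying_model(X, y, kf, model, "regression")
print(np.mean(fold_losses))  # mean of (1 - r2) across the folds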

@@ -30,7 +30,7 @@ def test_hf_data():
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:

View File

@@ -24,6 +24,8 @@ model_path_list = [
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)
pytestmark = pytest.mark.spark # set to spark as parallel testing raised RuntimeError
def test_switch_1_1():
data_idx, model_path_idx = 0, 0

View File

@@ -5,6 +5,8 @@ import sys
import pytest
from utils import get_automl_settings, get_toy_data_seqclassification
pytestmark = pytest.mark.spark # set to spark as parallel testing raised MlflowException of changing parameter
@pytest.mark.skipif(sys.platform in ["darwin", "win32"], reason="do not run on mac os or windows")
def test_cv():

View File

@@ -44,7 +44,7 @@ def test_tokenclassification_idlabel():
# perf test
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:
@@ -86,7 +86,7 @@ def test_tokenclassification_tokenlabel():
# perf test
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:

View File

@@ -10,6 +10,10 @@ from flaml.default import portfolio
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)
pytestmark = (
pytest.mark.spark
) # set to spark as parallel testing raised ValueError: Feature NonExisting not implemented.
def pop_args(fit_kwargs):
fit_kwargs.pop("max_iter", None)

View File

@@ -25,7 +25,7 @@ logger = logging.getLogger("mnist_AutoML")
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
super().__init__()
self.conv1 = nn.Conv2d(1, 20, 5, 1)
self.conv2 = nn.Conv2d(20, 50, 5, 1)
self.fc1 = nn.Linear(4 * 4 * 50, hidden_size)

View File

@@ -3,10 +3,13 @@ import sys
import warnings
import mlflow
import numpy as np
import pytest
import sklearn.datasets as skds
from packaging.version import Version
from flaml import AutoML
from flaml.automl.data import auto_convert_dtypes_pandas, auto_convert_dtypes_spark, get_random_dataframe
from flaml.tune.spark.utils import check_spark
warnings.simplefilter(action="ignore")
@@ -20,23 +23,26 @@ else:
from flaml.automl.spark.utils import to_pandas_on_spark
postfix_version = "-spark3.3," if pyspark.__version__ > "3.2" else ","
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
f"com.microsoft.azure:synapseml_2.12:0.11.3{postfix_version}"
"com.microsoft.azure:synapseml_2.12:1.0.4,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark:2.6.0"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
# .config("spark.executor.memory", "48G")
# .config("spark.driver.memory", "48G")
.getOrCreate()
)
spark.sparkContext._conf.set(
@@ -49,8 +55,12 @@ else:
except ImportError:
skip_spark = True
if sys.version_info >= (3, 11):
skip_py311 = True
else:
skip_py311 = False
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def _test_spark_synapseml_lightgbm(spark=None, task="classification"):
@@ -159,10 +169,11 @@ def test_spark_input_df():
settings = {
"time_budget": 30, # total running time in seconds
"metric": "roc_auc",
"estimator_list": ["lgbm_spark"], # list of ML learners; we tune lightgbm in this example
# "estimator_list": ["lgbm_spark"], # list of ML learners; we tune lightgbm in this example
"task": "classification", # task type
"log_file_name": "flaml_experiment.log", # flaml log file
"seed": 7654321, # random seed
"eval_method": "holdout",
}
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
@@ -176,17 +187,17 @@ def test_spark_input_df():
try:
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show()
# from synapse.ml.train import ComputeModelStatistics
# metrics = ComputeModelStatistics(
# evaluationMetric="classification",
# labelCol="Bankrupt?",
# scoredLabelsCol="prediction",
# ).transform(predictions)
# metrics.show()
from synapse.ml.train import ComputeModelStatistics
if not skip_py311:
# ComputeModelStatistics doesn't support python 3.11
metrics = ComputeModelStatistics(
evaluationMetric="classification",
labelCol="Bankrupt?",
scoredLabelsCol="prediction",
).transform(predictions)
metrics.show()
except AttributeError:
print("No fitted model because of too short training time.")
@@ -207,16 +218,173 @@ def test_spark_input_df():
assert "No estimator is left." in str(excinfo.value)
def _test_spark_large_df():
"""Test with large dataframe, should not run in pipeline."""
import os
import time
import pandas as pd
from pyspark.sql import functions as F
import flaml
os.environ["FLAML_MAX_CONCURRENT"] = "8"
start_time = time.time()
def load_higgs():
# 11M rows, 29 columns, 1.1GB
df = (
spark.read.format("csv")
.option("header", False)
.option("inferSchema", True)
.load("/datadrive/datasets/HIGGS.csv")
.withColumnRenamed("_c0", "target")
.withColumn("target", F.col("target").cast("integer"))
.limit(1000000)
.fillna(0)
.na.drop(how="any")
.repartition(64)
.cache()
)
print("Number of rows in data: ", df.count())
return df
def load_bosch():
# 1.184M rows, 969 cols, 1.5GB
df = (
spark.read.format("csv")
.option("header", True)
.option("inferSchema", True)
.load("/datadrive/datasets/train_numeric.csv")
.withColumnRenamed("Response", "target")
.withColumn("target", F.col("target").cast("integer"))
.limit(1000000)
.fillna(0)
.drop("Id")
.repartition(64)
.cache()
)
print("Number of rows in data: ", df.count())
return df
def prepare_data(dataset_name="higgs"):
df = load_higgs() if dataset_name == "higgs" else load_bosch()
train, test = df.randomSplit([0.75, 0.25], seed=7654321)
feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
final_cols = ["target", "features"]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)[final_cols]
test_data = featurizer.transform(test)[final_cols]
train_data = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
return train_data, test_data
train_data, test_data = prepare_data("higgs")
end_time = time.time()
print("time cost in minutes for prepare data: ", (end_time - start_time) / 60)
automl = flaml.AutoML()
automl_settings = {
"max_iter": 3,
"time_budget": 7200,
"metric": "accuracy",
"task": "classification",
"seed": 1234,
"eval_method": "holdout",
}
automl.fit(dataframe=train_data, label="target", ensemble=False, **automl_settings)
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show(5)
end_time = time.time()
print("time cost in minutes: ", (end_time - start_time) / 60)
def test_get_random_dataframe():
# Test with default parameters
df = get_random_dataframe(n_rows=50, ratio_none=0.2, seed=123)
assert df.shape == (50, 14) # Default is 200 rows and 14 columns
# Test column types
assert "timestamp" in df.columns and np.issubdtype(df["timestamp"].dtype, np.datetime64)
assert "id" in df.columns and np.issubdtype(df["id"].dtype, np.integer)
assert "score" in df.columns and np.issubdtype(df["score"].dtype, np.floating)
assert "category" in df.columns and df["category"].dtype.name == "category"
def test_auto_convert_dtypes_pandas():
# Create a test DataFrame with various types
import pandas as pd
test_df = pd.DataFrame(
{
"int_col": ["1", "2", "3", "4", "5", "6", "6"],
"float_col": ["1.1", "2.2", "3.3", "NULL", "5.5", "6.6", "6.6"],
"date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01", "2021-06-01", "2021-06-01"],
"cat_col": ["A", "B", "A", "A", "B", "A", "B"],
"string_col": ["text1", "text2", "text3", "text4", "text5", "text6", "text7"],
}
)
# Convert dtypes
converted_df, schema = auto_convert_dtypes_pandas(test_df)
# Check conversions
assert schema["int_col"] == "int"
assert schema["float_col"] == "double"
assert schema["date_col"] == "timestamp"
assert schema["cat_col"] == "category"
assert schema["string_col"] == "string"
def test_auto_convert_dtypes_spark():
"""Test auto_convert_dtypes_spark function with various data types."""
import pandas as pd
# Create a test DataFrame with various types
test_pdf = pd.DataFrame(
{
"int_col": ["1", "2", "3", "4", "NA"],
"float_col": ["1.1", "2.2", "3.3", "NULL", "5.5"],
"date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01"],
"cat_col": ["A", "B", "A", "C", "B"],
"string_col": ["text1", "text2", "text3", "text4", "text5"],
}
)
# Convert pandas DataFrame to Spark DataFrame
test_df = spark.createDataFrame(test_pdf)
# Convert dtypes
converted_df, schema = auto_convert_dtypes_spark(test_df)
# Check conversions
assert schema["int_col"] == "int"
assert schema["float_col"] == "double"
assert schema["date_col"] == "timestamp"
assert schema["cat_col"] == "string" # Conceptual category in schema
assert schema["string_col"] == "string"
# Verify the actual data types from the Spark DataFrame
spark_dtypes = dict(converted_df.dtypes)
assert spark_dtypes["int_col"] == "int"
assert spark_dtypes["float_col"] == "double"
assert spark_dtypes["date_col"] == "timestamp"
assert spark_dtypes["cat_col"] == "string" # In Spark, categories are still strings
assert spark_dtypes["string_col"] == "string"
if __name__ == "__main__":
test_spark_synapseml_classification()
test_spark_synapseml_regression()
test_spark_synapseml_rank()
test_spark_input_df()
test_get_random_dataframe()
test_auto_convert_dtypes_pandas()
test_auto_convert_dtypes_spark()
# import cProfile
# import pstats
# from pstats import SortKey
# cProfile.run("test_spark_input_df()", "test_spark_input_df.profile")
# p = pstats.Stats("test_spark_input_df.profile")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats("utils.py")
# cProfile.run("_test_spark_large_df()", "_test_spark_large_df.profile")
# p = pstats.Stats("_test_spark_large_df.profile")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(50)

View File
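`test_auto_convert_dtypes_pandas` above asserts a schema of int / double / timestamp / category / string for the sample frame. Purely as an illustration of the coercions being asserted (this is not the implementation of `auto_convert_dtypes_pandas`), the same conversions can be reproduced with plain pandas:

# Illustration of the dtype coercions asserted above; not FLAML's implementation
# of auto_convert_dtypes_pandas.
import pandas as pd

df = pd.DataFrame(
    {
        "int_col": ["1", "2", "3"],
        "float_col": ["1.1", "2.2", None],
        "date_col": ["2021-01-01", "2021-02-01", None],
        "cat_col": ["A", "B", "A"],
    }
)
converted = df.assign(
    int_col=pd.to_numeric(df["int_col"]),
    float_col=pd.to_numeric(df["float_col"], errors="coerce"),
    date_col=pd.to_datetime(df["date_col"], errors="coerce"),
    cat_col=df["cat_col"].astype("category"),
)
print(converted.dtypes)  # int64, float64, datetime64[ns], category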

@@ -25,7 +25,7 @@ os.environ["FLAML_MAX_CONCURRENT"] = "2"
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_parallel_xgboost(hpo_method=None, data_size=1000):

View File

@@ -1,6 +1,7 @@
import os
import unittest
import pytest
from sklearn.datasets import load_wine
from flaml import AutoML
@@ -24,6 +25,8 @@ if os.path.exists(os.path.join(os.getcwd(), "test", "spark", "custom_mylearner.p
else:
skip_my_learner = True
pytestmark = pytest.mark.spark
class TestEnsemble(unittest.TestCase):
def setUp(self) -> None:

View File

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -41,8 +41,8 @@ def base_automl(n_concurrent_trials=1, use_ray=False, use_spark=False, verbose=0
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
def test_both_ray_spark():

test/spark/test_mlflow.py (new file, 343 lines)
View File

@@ -0,0 +1,343 @@
import importlib
import os
import sys
import time
import warnings
import mlflow
import pytest
from packaging.version import Version
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import flaml
from flaml.automl.spark.utils import to_pandas_on_spark
try:
import pyspark
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
except ImportError:
pass
pytestmark = pytest.mark.spark
warnings.filterwarnings("ignore")
skip_spark = importlib.util.find_spec("pyspark") is None
client = mlflow.tracking.MlflowClient()
if (sys.platform.startswith("darwin") or sys.platform.startswith("win")) and (
sys.version_info[0] == 3 and sys.version_info[1] >= 10
):
# TODO: remove this block when tests are stable
# Below tests will fail, but the functions run without error if run individually.
# test_tune_autolog_parentrun_nonparallel()
# test_tune_autolog_noparentrun_nonparallel()
# test_tune_noautolog_parentrun_nonparallel()
# test_tune_noautolog_noparentrun_nonparallel()
pytest.skip("skipping MacOS and Windows for python 3.10 and 3.11", allow_module_level=True)
"""
The Spark session used in the tests below should be initiated in test_0sparkml.py when run with pytest.
"""
def _sklearn_tune(config):
is_autolog = config.pop("is_autolog")
is_parent_run = config.pop("is_parent_run")
is_parallel = config.pop("is_parallel")
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
rf = RandomForestRegressor(**config)
rf.fit(train_x, train_y)
pred = rf.predict(test_x)
r2 = r2_score(test_y, pred)
if not is_autolog and not is_parent_run and not is_parallel:
with mlflow.start_run(nested=True):
mlflow.log_metric("r2", r2)
return {"r2": r2}
def _test_tune(is_autolog, is_parent_run, is_parallel):
mlflow.end_run()
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
params = {
"n_estimators": flaml.tune.randint(100, 1000),
"min_samples_leaf": flaml.tune.randint(1, 10),
"is_autolog": is_autolog,
"is_parent_run": is_parent_run,
"is_parallel": is_parallel,
}
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"tune_autolog_{is_autolog}_sparktrial_{is_parallel}")
flaml.tune.run(
_sklearn_tune,
params,
metric="r2",
mode="max",
num_samples=3,
use_spark=True if is_parallel else False,
n_concurrent_trials=2 if is_parallel else 1,
mlflow_exp_name=mlflow_exp_name,
)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
def _check_mlflow_logging(possible_num_runs, metric, is_parent_run, experiment_id, is_automl=False, skip_tags=False):
if isinstance(possible_num_runs, int):
possible_num_runs = [possible_num_runs]
if is_parent_run:
parent_run = mlflow.last_active_run()
child_runs = client.search_runs(
experiment_ids=[experiment_id],
filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'",
)
else:
child_runs = client.search_runs(experiment_ids=[experiment_id])
experiment_name = client.get_experiment(experiment_id).name
metrics = [metric in run.data.metrics for run in child_runs]
tags = ["flaml.version" in run.data.tags for run in child_runs]
params = ["learner" in run.data.params for run in child_runs]
assert (
len(child_runs) in possible_num_runs
), f"The number of child runs is not correct on experiment {experiment_name}."
if possible_num_runs[0] > 0:
assert all(metrics), f"The metrics are not logged correctly on experiment {experiment_name}."
assert (
all(tags) if not skip_tags else True
), f"The tags are not logged correctly on experiment {experiment_name}."
assert (
all(params) if is_automl else True
), f"The params are not logged correctly on experiment {experiment_name}."
# mlflow.delete_experiment(experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_parentrun_parallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id)
def test_tune_autolog_parentrun_nonparallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=False)
_check_mlflow_logging(3, "r2", True, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_noparentrun_parallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", False, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_parentrun_parallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id)
def test_tune_autolog_noparentrun_nonparallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=False)
_check_mlflow_logging(3, "r2", False, experiment_id)
def test_tune_noautolog_parentrun_nonparallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=False)
_check_mlflow_logging(3, "r2", True, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_noparentrun_parallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=True)
_check_mlflow_logging(0, "r2", False, experiment_id)
def test_tune_noautolog_noparentrun_nonparallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=False)
_check_mlflow_logging(3, "r2", False, experiment_id, skip_tags=True)
def _test_automl_sparkdata(is_autolog, is_parent_run):
mlflow.end_run()
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"automl_sparkdata_autolog_{is_autolog}")
spark = pyspark.sql.SparkSession.builder.getOrCreate()
pd_df = load_diabetes(as_frame=True).frame
df = spark.createDataFrame(pd_df)
df = df.repartition(4).cache()
train, test = df.randomSplit([0.8, 0.2], seed=1)
feature_cols = df.columns[:-1]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)["target", "features"]
featurizer.transform(test)["target", "features"]
automl = flaml.AutoML()
settings = {
"max_iter": 3,
"metric": "mse",
"task": "regression", # task type
"log_file_name": "flaml_experiment.log", # flaml log file
"mlflow_exp_name": mlflow_exp_name,
"log_type": "all",
"n_splits": 2,
"model_history": True,
}
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
automl.fit(
dataframe=df,
label="target",
**settings,
)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
def _test_automl_nonsparkdata(is_autolog, is_parent_run):
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"automl_nonsparkdata_autolog_{is_autolog}")
automl_experiment = flaml.AutoML()
automl_settings = {
"max_iter": 3,
"metric": "r2",
"task": "regression",
"n_concurrent_trials": 2,
"use_spark": True,
"mlflow_exp_name": None if is_parent_run else mlflow_exp_name,
"log_type": "all",
"n_splits": 2,
"model_history": True,
}
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
automl_experiment.fit(X_train=train_x, y_train=train_y, **automl_settings)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_parentrun():
experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=True)
_check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_noparentrun():
experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=False)
_check_mlflow_logging(3, "mse", False, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_parentrun():
experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=True)
_check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_noparentrun():
experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=False)
_check_mlflow_logging(0, "mse", False, experiment_id, is_automl=True) # no logging
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_parentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_noparentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=False)
_check_mlflow_logging([4, 3], "r2", False, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_parentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_noparentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=False)
_check_mlflow_logging(0, "r2", False, experiment_id, is_automl=True) # no logging
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_exit_pyspark_autolog():
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.sparkContext._gateway.shutdown_callback_server() # this is to avoid getting stuck
mlflow.autolog(disable=True)
def _init_spark_for_main():
import pyspark
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
"com.microsoft.azure:synapseml_2.12:1.0.4,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
.getOrCreate()
)
spark.sparkContext._conf.set(
"spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
"https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
)
if __name__ == "__main__":
_init_spark_for_main()
# test_tune_autolog_parentrun_parallel()
# test_tune_autolog_parentrun_nonparallel()
test_tune_autolog_noparentrun_parallel() # TODO: runs not removed
# test_tune_noautolog_parentrun_parallel()
# test_tune_autolog_noparentrun_nonparallel()
# test_tune_noautolog_parentrun_nonparallel()
# test_tune_noautolog_noparentrun_parallel()
# test_tune_noautolog_noparentrun_nonparallel()
# test_automl_sparkdata_autolog_parentrun()
# test_automl_sparkdata_autolog_noparentrun()
# test_automl_sparkdata_noautolog_parentrun()
# test_automl_sparkdata_noautolog_noparentrun()
# test_automl_nonsparkdata_autolog_parentrun()
# test_automl_nonsparkdata_autolog_noparentrun() # TODO: runs not removed
# test_automl_nonsparkdata_noautolog_parentrun()
# test_automl_nonsparkdata_noautolog_noparentrun()
test_exit_pyspark_autolog()

View File
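The autolog/parent-run matrix above is driven by one core call: `flaml.tune.run` over a small random-forest search space, optionally under an active parent run and with autologging enabled. A pared-down sketch of that call without the MLflow parametrisation (the search space and sample count are illustrative; `analysis.best_config` is assumed to be available on the returned analysis object):

# Pared-down sketch of the flaml.tune.run call exercised by the MLflow tests above.
# Search space and num_samples are illustrative.
import flaml
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def train_rf(config):
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
    rf = RandomForestRegressor(**config)
    rf.fit(train_x, train_y)
    return {"r2": r2_score(test_y, rf.predict(test_x))}


analysis = flaml.tune.run(
    train_rf,
    {
        "n_estimators": flaml.tune.randint(50, 200),
        "min_samples_leaf": flaml.tune.randint(1, 10),
    },
    metric="r2",
    mode="max",
    num_samples=3,
)
print(analysis.best_config)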

@@ -2,6 +2,7 @@ import os
import unittest
import numpy as np
import pytest
import scipy.sparse
from sklearn.datasets import load_iris, load_wine
@@ -12,6 +13,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.spark
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -344,8 +346,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl_experiment.best_loss
print("Best ML leaner:", automl_experiment.best_estimator)
print("Best hyperparmeter config:", automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")
starting_points = automl_experiment.best_config_per_estimator
print("starting_points", starting_points)
@@ -369,8 +371,8 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
print("Best ML leaner:", new_automl_experiment.best_estimator)
print("Best hyperparmeter config:", new_automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl_experiment.best_config_train_time:.4g} s")
def test_fit_w_starting_points_list(self, as_frame=True):
automl_experiment = AutoML()
@@ -394,8 +396,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl_experiment.best_loss
print("Best ML leaner:", automl_experiment.best_estimator)
print("Best hyperparmeter config:", automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")
starting_points = {}
log_file_name = automl_settings["log_file_name"]
@@ -409,7 +411,7 @@ class TestMultiClass(unittest.TestCase):
if learner not in starting_points:
starting_points[learner] = []
starting_points[learner].append(config)
max_iter = sum([len(s) for k, s in starting_points.items()])
max_iter = sum(len(s) for k, s in starting_points.items())
automl_settings_resume = {
"time_budget": 2,
"metric": "accuracy",
@@ -431,7 +433,7 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
# print('Best ML leaner:', new_automl_experiment.best_estimator)
# print('Best hyperparmeter config:', new_automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))

View File

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
here = os.path.abspath(os.path.dirname(__file__))
os.environ["FLAML_MAX_CONCURRENT"] = "2"

View File

@@ -25,7 +25,7 @@ try:
except ImportError:
skip_spark = True
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_overtime():
@@ -55,7 +55,7 @@ def test_overtime():
start_time = time.time()
automl_experiment.fit(**automl_settings)
elapsed_time = time.time() - start_time
print("time budget: {:.2f}s, actual elapsed time: {:.2f}s".format(time_budget, elapsed_time))
print(f"time budget: {time_budget:.2f}s, actual elapsed time: {elapsed_time:.2f}s")
# assert abs(elapsed_time - time_budget) < 5 # cancel assertion because github VM sometimes is super slow, causing the test to fail
print(automl_experiment.predict(df))
print(automl_experiment.model)

View File

@@ -11,7 +11,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -75,8 +75,8 @@ def run_automl(budget=3, dataset_format="dataframe", hpo_method=None):
""" retrieve best config and best learner """
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
print(automl.model.estimator)
print(automl.best_config_per_estimator)
print("time taken to find best model:", automl.time_to_find_best_model)

View File

@@ -14,7 +14,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
X, y = load_breast_cancer(return_X_y=True)

View File

@@ -36,7 +36,7 @@ except ImportError:
print("Spark is not installed. Skip all spark tests.")
skip_spark = True
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_with_parameters_spark():
@@ -167,7 +167,7 @@ def test_len_labels():
assert len_labels(y1) == 4
ll, la = len_labels(y2, return_labels=True)
assert ll == 4
assert set(la.to_numpy()) == set([1, 2, 5, 4])
assert set(la.to_numpy()) == {1, 2, 5, 4}
def test_unique_value_first_index():

View File

@@ -50,11 +50,11 @@ def oml_to_vw_w_grouping(X, y, ds_dir, fname, orginal_dim, group_num, grouping_m
for i in range(len(X)):
NS_content = []
for zz in range(len(group_indexes)):
ns_features = " ".join("{}:{:.6f}".format(ind, X[i][ind]) for ind in group_indexes[zz])
ns_features = " ".join(f"{ind}:{X[i][ind]:.6f}" for ind in group_indexes[zz])
NS_content.append(ns_features)
ns_line = "{} |{}".format(
str(y[i]),
"|".join("{} {}".format(NS_LIST[j], NS_content[j]) for j in range(len(group_indexes))),
"|".join(f"{NS_LIST[j]} {NS_content[j]}" for j in range(len(group_indexes))),
)
f.write(ns_line)
f.write("\n")
@@ -67,7 +67,7 @@ def save_vw_dataset_w_ns(X, y, did, ds_dir, max_ns_num, is_regression):
"""convert openml dataset to vw example and save to file"""
print("is_regression", is_regression)
if is_regression:
fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
fname = f"ds_{did}_{max_ns_num}_{0}.vw"
print("dataset size", X.shape[0], X.shape[1])
print("saving data", did, ds_dir, fname)
dim = X.shape[1]
@@ -131,7 +131,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):
if is_regression:
# the second field specifies the largest number of namespaces using.
fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
fname = f"ds_{did}_{max_ns_num}_{0}.vw"
vw_dataset_file = os.path.join(ds_dir, fname)
# if file does not exist, generate and save the datasets
if not os.path.exists(vw_dataset_file) or os.stat(vw_dataset_file).st_size < 1000:
@@ -139,7 +139,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):
print(ds_dir, vw_dataset_file)
if not os.path.exists(ds_dir):
os.makedirs(ds_dir)
with open(os.path.join(ds_dir, fname), "r") as f:
with open(os.path.join(ds_dir, fname)) as f:
vw_content = f.read().splitlines()
print(type(vw_content), len(vw_content))
return vw_content

View File
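`oml_to_vw_w_grouping` above emits one Vowpal Wabbit text line per example, with features bucketed into namespaces. A tiny self-contained sketch of the same line format, assuming two namespaces named `a` and `b` and made-up feature values:

# Minimal sketch of the VW line format produced above: "<label> |a i:v ... |b i:v ...".
import numpy as np

NS_LIST = ["a", "b"]
X = np.array([[0.5, 1.25, 3.0], [2.0, 0.0, -1.5]])
y = np.array([1, -1])
group_indexes = [[0, 1], [2]]  # feature indices assigned to each namespace

for i in range(len(X)):
    NS_content = [
        " ".join(f"{ind}:{X[i][ind]:.6f}" for ind in group_indexes[zz]) for zz in range(len(group_indexes))
    ]
    ns_line = "{} |{}".format(
        str(y[i]),
        "|".join(f"{NS_LIST[j]} {NS_content[j]}" for j in range(len(group_indexes))),
    )
    print(ns_line)  # e.g. "1 |a 0:0.500000 1:1.250000|b 2:3.000000"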

@@ -59,6 +59,17 @@ def _test_hf_data():
except requests.exceptions.ConnectionError:
return
# Tests will only run if there is a GPU available
try:
import ray
pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
if not pg.wait(timeout_seconds=10): # Wait 10 seconds for resources
raise RuntimeError("No available node types can fulfill resource request!")
except RuntimeError:
return
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"

View File
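The GPU gate added above can be read as a small reusable check: request a 1-CPU/1-GPU placement group and bail out if Ray cannot schedule it within a timeout. A hedged standalone sketch of that check (`gpu_available` is a hypothetical helper name, not part of the test suite):

# Hedged sketch of the Ray GPU-availability gate used above; gpu_available is a
# hypothetical helper name. Assumes Ray is installed; ray.init starts a local
# cluster if none is running.
import ray


def gpu_available(timeout_seconds: int = 10) -> bool:
    ray.init(ignore_reinit_error=True)
    pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
    ok = pg.wait(timeout_seconds=timeout_seconds)
    ray.util.remove_placement_group(pg)
    return ok


if __name__ == "__main__":
    print("GPU available:", gpu_available())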

@@ -75,10 +75,10 @@ def test_lexiflow():
layers = []
in_features = 28 * 28
for i in range(n_layers):
out_features = configuration["n_units_l{}".format(i)]
out_features = configuration[f"n_units_l{i}"]
layers.append(nn.Linear(in_features, out_features))
layers.append(nn.ReLU())
p = configuration["dropout_{}".format(i)]
p = configuration[f"dropout_{i}"]
layers.append(nn.Dropout(p))
in_features = out_features
layers.append(nn.Linear(in_features, 10))

View File
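The lexiflow hunk above builds a torch MLP from flat configuration keys of the form `n_units_l{i}` and `dropout_{i}`. A self-contained sketch of that construction with an illustrative configuration (the values are not taken from the test):

# Self-contained sketch: build an MLP from flat config keys n_units_l{i} / dropout_{i},
# mirroring the loop in the lexiflow test above. Config values are illustrative.
import torch.nn as nn

configuration = {"n_units_l0": 128, "dropout_0": 0.2, "n_units_l1": 64, "dropout_1": 0.1}
n_layers = 2

layers = []
in_features = 28 * 28
for i in range(n_layers):
    out_features = configuration[f"n_units_l{i}"]
    layers.append(nn.Linear(in_features, out_features))
    layers.append(nn.ReLU())
    layers.append(nn.Dropout(configuration[f"dropout_{i}"]))
    in_features = out_features
layers.append(nn.Linear(in_features, 10))
model = nn.Sequential(*layers)
print(model)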

@@ -24,7 +24,7 @@ try:
# __net_begin__
class Net(nn.Module):
def __init__(self, l1=120, l2=84):
super(Net, self).__init__()
super().__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
@@ -277,7 +277,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
logger.info(f"#trials={len(result.trials)}")
logger.info(f"time={time.time()-start_time}")
best_trial = result.get_best_trial("loss", "min", "all")
logger.info("Best trial config: {}".format(best_trial.config))
logger.info(f"Best trial config: {best_trial.config}")
logger.info("Best trial final validation loss: {}".format(best_trial.metric_analysis["loss"]["min"]))
logger.info("Best trial final validation accuracy: {}".format(best_trial.metric_analysis["accuracy"]["max"]))
@@ -296,7 +296,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
best_trained_model.load_state_dict(model_state)
test_acc = _test_accuracy(best_trained_model, device)
logger.info("Best trial test set accuracy: {}".format(test_acc))
logger.info(f"Best trial test set accuracy: {test_acc}")
# __main_end__

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

View File

@@ -1,5 +1,6 @@
Please find tutorials on FLAML below:
- [AutoML 2024](flaml-tutorial-automl-24.md)
- [PyData Seattle 2023](flaml-tutorial-pydata-23.md)
- [A hands-on tutorial on FLAML presented at KDD 2022](flaml-tutorial-kdd-22.md)
- [A lab forum on FLAML at AAAI 2023](flaml-tutorial-aaai-23.md)

Some files were not shown because too many files have changed in this diff.