Mirror of https://github.com/microsoft/FLAML.git, synced 2026-02-14 20:59:16 +08:00
Compare commits
69 Commits
| Author | SHA1 | Date |
|---|---|---|
|  | 13aec414ea |  |
|  | bb16dcde93 |  |
|  | be81a76da9 |  |
|  | 2d16089529 |  |
|  | 01c3c83653 |  |
|  | 9b66103f7c |  |
|  | 48dfd72e64 |  |
|  | dec92e5b02 |  |
|  | 22911ea1ef |  |
|  | 12183e5f73 |  |
|  | c2b25310fc |  |
|  | 0f9420590d |  |
|  | 5107c506b4 |  |
|  | 9e219ef8dc |  |
|  | 6e4083743b |  |
|  | 17e95edd9e |  |
|  | 468bc62d27 |  |
|  | 437c239c11 |  |
|  | 8e753f1092 |  |
|  | a3b57e11d4 |  |
|  | a80dcf9925 |  |
|  | 7157af44e0 |  |
|  | 1798c4591e |  |
|  | dd26263330 |  |
|  | 2ba5f8bed1 |  |
|  | d0a11958a5 |  |
|  | 0ef9b00a75 |  |
|  | 840f76e5e5 |  |
|  | d8b7d25b80 |  |
|  | 6d53929803 |  |
|  | c038fbca07 |  |
|  | 6a99202492 |  |
|  | 42d1dcfa0e |  |
|  | b83c8a7d3b |  |
|  | b9194cdcf2 |  |
|  | 9a1f6b0291 |  |
|  | 07f4413aae |  |
|  | 5a74227bc3 |  |
|  | 7644958e21 |  |
|  | a316f84fe1 |  |
|  | 72881d3a2b |  |
|  | 69da685d1e |  |
|  | c01c3910eb |  |
|  | 98d3fd2f48 |  |
|  | 9724c626cc |  |
|  | 0d92400200 |  |
|  | d224218ecf |  |
|  | a2a5e1abb9 |  |
|  | 5c0f18b7bc |  |
|  | e5d95f5674 |  |
|  | 49ba962d47 |  |
|  | 8e171bc402 |  |
|  | c90946f303 |  |
|  | 64f30af603 |  |
|  | f45582d3c7 |  |
|  | bf4bca2195 |  |
|  | efaba26d2e |  |
|  | 62194f321d |  |
|  | 5bfa0b1cd3 |  |
|  | bd34b4e75a |  |
|  | 7670945298 |  |
|  | 43537cb539 |  |
|  | f913b79225 |  |
|  | a092a39b5e |  |
|  | 04bf1b8741 |  |
|  | b348cb1136 |  |
|  | cd0e88e383 |  |
|  | a17c6e392e |  |
|  | 52627ff14b |  |
.github/ISSUE_TEMPLATE.md (vendored, new file, 73 lines)
@@ -0,0 +1,73 @@
### Description

<!-- A clear and concise description of the issue or feature request. -->

### Environment

- FLAML version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Python version: <!-- Specify the Python version (e.g., 3.8) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->

### Steps to Reproduce (for bugs)

<!-- Provide detailed steps to reproduce the issue. Include code snippets, configuration files, or any other relevant information. -->

1. Step 1
2. Step 2
3. ...

### Expected Behavior

<!-- Describe what you expected to happen. -->

### Actual Behavior

<!-- Describe what actually happened. Include any error messages, stack traces, or unexpected behavior. -->

### Screenshots / Logs (if applicable)

<!-- If relevant, include screenshots or logs that help illustrate the issue. -->

### Additional Information

<!-- Include any additional information that might be helpful, such as specific configurations, data samples, or context about the environment. -->

### Possible Solution (if you have one)

<!-- If you have suggestions on how to address the issue, provide them here. -->

### Is this a Bug or Feature Request?

<!-- Choose one: Bug | Feature Request -->

### Priority

<!-- Choose one: High | Medium | Low -->

### Difficulty

<!-- Choose one: Easy | Moderate | Hard -->

### Any related issues?

<!-- If this is related to another issue, reference it here. -->

### Any relevant discussions?

<!-- If there are any discussions or forum threads related to this issue, provide links. -->

### Checklist

<!-- Please check the items that you have completed -->

- [ ] I have searched for similar issues and didn't find any duplicates.
- [ ] I have provided a clear and concise description of the issue.
- [ ] I have included the necessary environment details.
- [ ] I have outlined the steps to reproduce the issue.
- [ ] I have included any relevant logs or screenshots.
- [ ] I have indicated whether this is a bug or a feature request.
- [ ] I have set the priority and difficulty levels.

### Additional Comments

<!-- Any additional comments or context that you think would be helpful. -->
.github/ISSUE_TEMPLATE/bug_report.yml (vendored, new file, 53 lines)
@@ -0,0 +1,53 @@
name: Bug Report
description: File a bug report
title: "[Bug]: "
labels: ["bug"]

body:
  - type: textarea
    id: description
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:

        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: modelused
    attributes:
      label: Model Used
      description: A description of the model that was used when the error was encountered
      placeholder: gpt-4, mistral-7B etc
  - type: textarea
    id: expected_behavior
    attributes:
      label: Expected Behavior
      description: A clear and concise description of what you expected to happen.
      placeholder: What should have happened?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details
.github/ISSUE_TEMPLATE/config.yml (vendored, new file, 1 line)
@@ -0,0 +1 @@
blank_issues_enabled: true
.github/ISSUE_TEMPLATE/feature_request.yml (vendored, new file, 26 lines)
@@ -0,0 +1,26 @@
name: Feature Request
description: File a feature request
labels: ["enhancement"]
title: "[Feature Request]: "

body:
  - type: textarea
    id: problem_description
    attributes:
      label: Is your feature request related to a problem? Please describe.
      description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
      placeholder: What problem are you trying to solve?

  - type: textarea
    id: solution_description
    attributes:
      label: Describe the solution you'd like
      description: A clear and concise description of what you want to happen.
      placeholder: How do you envision the solution?

  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context or screenshots about the feature request here.
      placeholder: Any additional information
.github/ISSUE_TEMPLATE/general_issue.yml (vendored, new file, 41 lines)
@@ -0,0 +1,41 @@
name: General Issue
description: File a general issue
title: "[Issue]: "
labels: []

body:
  - type: textarea
    id: description
    attributes:
      label: Describe the issue
      description: A clear and concise description of what the issue is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:

        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details
.github/PULL_REQUEST_TEMPLATE.md (vendored, 3 lines changed)
@@ -12,8 +12,7 @@

## Checks

<!-- - I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same is integrated in our CI checks). -->
- [ ] I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same is integrated in our CI checks).
- [ ] I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
.github/workflows/CD.yml (vendored, 21 lines changed)
@@ -12,26 +12,17 @@ jobs:
  deploy:
    strategy:
      matrix:
        os: ['ubuntu-latest']
        python-version: [3.8]
        os: ["ubuntu-latest"]
        python-version: ["3.10"]
    runs-on: ${{ matrix.os }}
    environment: package
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Cache conda
        uses: actions/cache@v3
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          path: ~/conda_pkgs_dir
          key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
      - name: Setup Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          auto-activate-base: false
          activate-environment: hcrystalball
          python-version: ${{ matrix.python-version }}
          use-only-tar-bz2: true
      - name: Install from source
        # This is required for the pre-commit tests
        shell: pwsh
@@ -42,7 +33,7 @@ jobs:
      - name: Build
        shell: pwsh
        run: |
          pip install twine
          pip install twine wheel setuptools
          python setup.py sdist bdist_wheel
      - name: Publish to PyPI
        env:
.github/workflows/deploy-website.yml (vendored, 8 lines changed)
@@ -37,11 +37,11 @@ jobs:
      - name: setup python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
          python-version: "3.10"
      - name: pydoc-markdown install
        run: |
          python -m pip install --upgrade pip
          pip install pydoc-markdown==4.5.0
          pip install pydoc-markdown==4.7.0
      - name: pydoc-markdown run
        run: |
          pydoc-markdown
@@ -73,11 +73,11 @@ jobs:
      - name: setup python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
          python-version: "3.10"
      - name: pydoc-markdown install
        run: |
          python -m pip install --upgrade pip
          pip install pydoc-markdown==4.5.0
          pip install pydoc-markdown==4.7.0
      - name: pydoc-markdown run
        run: |
          pydoc-markdown
.github/workflows/python-package.yml (vendored, 32 lines changed)
@@ -14,6 +14,12 @@ on:
      - 'setup.py'
  pull_request:
    branches: ['main']
    paths:
      - 'flaml/**'
      - 'test/**'
      - 'notebook/**'
      - '.github/workflows/python-package.yml'
      - 'setup.py'
  merge_group:
    types: [checks_requested]

@@ -29,8 +35,8 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-2019]
        python-version: ["3.8", "3.9", "3.10", "3.11"]
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
@@ -50,14 +56,19 @@ jobs:
          export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
      - name: Install packages and dependencies
        run: |
          python -m pip install --upgrade pip wheel
          python -m pip install --upgrade pip wheel setuptools
          pip install -e .
          python -c "import flaml"
          pip install -e .[test]
      - name: On Ubuntu python 3.8, install pyspark 3.2.3
        if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-latest'
      - name: On Ubuntu python 3.10, install pyspark 3.4.1
        if: matrix.python-version == '3.10' && matrix.os == 'ubuntu-latest'
        run: |
          pip install pyspark==3.2.3
          pip install pyspark==3.4.1
          pip list | grep "pyspark"
      - name: On Ubuntu python 3.11, install pyspark 3.5.1
        if: matrix.python-version == '3.11' && matrix.os == 'ubuntu-latest'
        run: |
          pip install pyspark==3.5.1
          pip list | grep "pyspark"
      - name: If linux and python<3.11, install ray 2
        if: matrix.os == 'ubuntu-latest' && matrix.python-version != '3.11'
@@ -77,20 +88,15 @@ jobs:
        if: matrix.python-version == '3.8' || matrix.python-version == '3.9'
        run: |
          pip install -e .[vw]
      - name: Uninstall pyspark on (python 3.9) or windows
        if: matrix.python-version == '3.9' || matrix.os == 'windows-2019'
        run: |
          # Uninstall pyspark to test env without pyspark
          pip uninstall -y pyspark
      - name: Test with pytest
        if: matrix.python-version != '3.10'
        run: |
          pytest test
          pytest test/ --ignore=test/autogen
      - name: Coverage
        if: matrix.python-version == '3.10'
        run: |
          pip install coverage
          coverage run -a -m pytest test
          coverage run -a -m pytest test --ignore=test/autogen
          coverage xml
      - name: Upload coverage to Codecov
        if: matrix.python-version == '3.10'
.gitignore (vendored, 18 lines changed)
@@ -163,6 +163,24 @@ output/
flaml/tune/spark/mylearner.py
*.pkl

data/
benchmark/pmlb/csv_datasets
benchmark/*.csv

checkpoints/
test/default
test/housing.json
test/nlp/default/transformer_ms/seq-classification.json

flaml/fabric/fanova/_fanova.c
# local config files
*.config.local

local_debug/
patch.diff

# Test things
notebook/lightning_logs/
lightning_logs/
flaml/autogen/extensions/tmp/
test/autogen/my_tmp/
@@ -23,6 +23,13 @@ repos:
      - id: end-of-file-fixer
      - id: no-commit-to-branch

  - repo: https://github.com/asottile/pyupgrade
    rev: v2.31.1
    hooks:
      - id: pyupgrade
        args: [--py38-plus]
        name: Upgrade code

  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
@@ -1,5 +1,5 @@
# basic setup
FROM mcr.microsoft.com/devcontainers/python:3.8
FROM mcr.microsoft.com/devcontainers/python:3.10
RUN apt-get update && apt-get -y update
RUN apt-get install -y sudo git npm
README.md (14 lines changed)
@@ -1,7 +1,7 @@
[](https://badge.fury.io/py/FLAML)
[](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml)
[](https://pypi.org/project/FLAML/)
[](https://pepy.tech/project/flaml)
[](https://discord.gg/Cppx2vSPVP)

@@ -14,6 +14,8 @@
<br>
</p>

:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 support, along with a range of new estimators and comprehensive integration with MLflow, thanks to contributions from the Microsoft Fabric product team.

:fire: Heads-up: We have migrated [AutoGen](https://microsoft.github.io/autogen/) into a dedicated [GitHub repository](https://github.com/microsoft/autogen). Alongside this move, we have also launched a dedicated [Discord](https://discord.gg/pAbnFJrkgZ) server and a [website](https://microsoft.github.io/autogen/) for comprehensive documentation.

:fire: The automated multi-agent chat framework in [AutoGen](https://microsoft.github.io/autogen/) is in preview from v2.0.0.
@@ -22,8 +24,6 @@

:fire: [autogen](https://microsoft.github.io/autogen/) is released with support for ChatGPT and GPT-4, based on [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673).

:fire: FLAML supports Code-First AutoML & Tuning – Private Preview in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/).

## What is FLAML

FLAML is a lightweight Python library for efficient automation of machine
@@ -40,7 +40,7 @@ FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source,

## Installation

FLAML requires **Python version >= 3.8**. It can be installed from pip:
FLAML requires **Python version >= 3.9**. It can be installed from pip:

```bash
pip install flaml
```
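For orientation, a minimal end-to-end run after installation might look like the sketch below (illustrative only; it assumes the `[automl]` extra and scikit-learn are available, and the dataset choice is arbitrary):

```python
from sklearn.datasets import load_iris

from flaml import AutoML

X, y = load_iris(return_X_y=True)
automl = AutoML()
# Search for a classifier for at most 30 seconds.
automl.fit(X_train=X, y_train=y, task="classification", time_budget=30)
print(automl.best_estimator, automl.best_config)
```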
@@ -154,3 +154,9 @@ provided by the bot. You will only need to do this once across all repos using o
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Contributors Wall

<a href="https://github.com/microsoft/flaml/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=microsoft/flaml&max=204" />
</a>
@@ -1,10 +1,20 @@
|
||||
import logging
|
||||
import warnings
|
||||
|
||||
from flaml.automl import AutoML, logger_formatter
|
||||
try:
|
||||
from flaml.automl import AutoML, logger_formatter
|
||||
|
||||
has_automl = True
|
||||
except ImportError:
|
||||
has_automl = False
|
||||
from flaml.onlineml.autovw import AutoVW
|
||||
from flaml.tune.searcher import CFO, FLOW2, BlendSearch, BlendSearchTuner, RandomSearch
|
||||
from flaml.version import __version__
|
||||
|
||||
# Set the root logger.
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.setLevel(logging.INFO)
|
||||
if logger.level == logging.NOTSET:
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
if not has_automl:
|
||||
warnings.warn("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")
|
||||
|
||||
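For context, the guarded import above is the standard optional-dependency pattern: attempt the import once at module load, record the outcome in a flag, and downgrade a missing extra from a hard crash to a warning. A self-contained sketch of the same pattern (the module name here is hypothetical):

```python
import warnings

try:
    import optional_heavy_dep  # hypothetical optional dependency

    HAS_OPTIONAL = True
except ImportError:
    HAS_OPTIONAL = False

if not HAS_OPTIONAL:
    warnings.warn("optional_heavy_dep is not available; related features are disabled.")
```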
@@ -156,7 +156,7 @@ class MathUserProxyAgent(UserProxyAgent):
                when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
            default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
            max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
            **kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
            **kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
        """
        super().__init__(
            name=name,
@@ -123,7 +123,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
                can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
                fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
            - customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
            **kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
            **kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
        """
        super().__init__(
            name=name,
@@ -125,7 +125,7 @@ def improve_function(file_name, func_name, objective, **config):
    """(work in progress) Improve the function to achieve the objective."""
    params = {**_IMPROVE_FUNCTION_CONFIG, **config}
    # read the entire file into a str
    with open(file_name, "r") as f:
    with open(file_name) as f:
        file_string = f.read()
    response = oai.Completion.create(
        {"func_name": func_name, "objective": objective, "file_string": file_string}, **params
@@ -158,7 +158,7 @@ def improve_code(files, objective, suggest_only=True, **config):
    code = ""
    for file_name in files:
        # read the entire file into a string
        with open(file_name, "r") as f:
        with open(file_name) as f:
            file_string = f.read()
        code += f"""{file_name}:
{file_string}
@@ -130,7 +130,7 @@ def _fix_a_slash_b(string: str) -> str:
    try:
        a = int(a_str)
        b = int(b_str)
        assert string == "{}/{}".format(a, b)
        assert string == f"{a}/{b}"
        new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
        return new_string
    except Exception:
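For reference, this helper rewrites a plain `a/b` fraction into LaTeX `\frac{a}{b}` form only when the string is exactly an integer ratio. A standalone sketch of the same idea (names are illustrative, not the library's API):

```python
def fraction_to_latex(string: str) -> str:
    """Rewrite 'a/b' (two integers) as '\\frac{a}{b}'; otherwise return the input unchanged."""
    try:
        a_str, b_str = string.split("/")
        a, b = int(a_str), int(b_str)
        # Guard against inputs like ' 1/2' that parse but differ textually.
        assert string == f"{a}/{b}"
        return "\\frac{" + str(a) + "}{" + str(b) + "}"
    except (ValueError, AssertionError):
        return string


print(fraction_to_latex("3/4"))  # \frac{3}{4}
print(fraction_to_latex("x/y"))  # x/y (unchanged)
```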
@@ -126,7 +126,7 @@ def split_files_to_chunks(
    """Split a list of files into chunks of max_tokens."""
    chunks = []
    for file in files:
        with open(file, "r") as f:
        with open(file) as f:
            text = f.read()
        chunks += split_text_to_chunks(text, max_tokens, chunk_mode, must_break_at_empty_line)
    return chunks
@@ -1,5 +1,9 @@
from flaml.automl.automl import AutoML, size
from flaml.automl.logger import logger_formatter
from flaml.automl.state import AutoMLState, SearchState

__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
try:
    from flaml.automl.automl import AutoML, size
    from flaml.automl.state import AutoMLState, SearchState

    __all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
except ImportError:
    __all__ = ["logger_formatter"]
@@ -7,8 +7,10 @@ from __future__ import annotations
import json
import logging
import os
import random
import sys
import time
from concurrent.futures import as_completed
from functools import partial
from typing import Callable, List, Optional, Union
@@ -16,7 +18,7 @@ import numpy as np

from flaml import tune
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.ml import train_estimator
from flaml.automl.ml import huggingface_metric_to_mode, sklearn_metric_name_set, spark_metric_name_dict, train_estimator
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.state import AutoMLState, SearchState
from flaml.automl.task.factory import task_factory
@@ -45,6 +47,7 @@ ERROR = (

try:
    from sklearn.base import BaseEstimator
    from sklearn.pipeline import Pipeline
except ImportError:
    BaseEstimator = object
    ERROR = ERROR or ImportError("please install flaml[automl] option to use the flaml.automl package.")
@@ -54,6 +57,14 @@ try:
except ImportError:
    mlflow = None

try:
    from flaml.fabric.mlflow import MLflowIntegration, get_mlflow_log_latency, infer_signature, is_autolog_enabled

    internal_mlflow = True
except ImportError:
    internal_mlflow = False


try:
    from ray import __version__ as ray_version
@@ -171,15 +182,22 @@ class AutoML(BaseEstimator):
                'better' only logs configs with better loss than previous iters
                'all' logs all the tried configs.
            model_history: A boolean of whether to keep the best
                model per estimator. Make sure memory is large enough if setting to True.
                model per estimator. Make sure memory is large enough if setting to True. Default False.
            log_training_metric: A boolean of whether to log the training
                metric for each model.
            mem_thres: A float of the memory size constraint in bytes.
            pred_time_limit: A float of the prediction latency constraint in seconds.
                It refers to the average prediction time per row in validation data.
            train_time_limit: A float of the training time constraint in seconds.
            train_time_limit: None or a float of the training time constraint in seconds for each trial.
                Only valid for sequential search.
            verbose: int, default=3 | Controls the verbosity, higher means more
                messages.
                verbose=0: logger level = CRITICAL
                verbose=1: logger level = ERROR
                verbose=2: logger level = WARNING
                verbose=3: logger level = INFO
                verbose=4: logger level = DEBUG
                verbose>5: logger level = NOTSET
            retrain_full: bool or str, default=True | whether to retrain the
                selected model on the full training data when using holdout.
                True - retrain only after search finishes; False - no retraining;
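For context, the verbosity values above map one-to-one onto standard Python logging levels, so quieting a run is purely a constructor setting. A hedged usage sketch (argument values are illustrative):

```python
from flaml import AutoML

# verbose=1 keeps only ERROR-level messages; model_history=False (the default)
# avoids retaining every intermediate model in memory; train_time_limit caps
# the training time of each trial in sequential search.
automl = AutoML(verbose=1, model_history=False, train_time_limit=10)
```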
@@ -193,7 +211,7 @@ class AutoML(BaseEstimator):
                * Valid str options depend on different tasks.
                For classification tasks, valid choices are
                ["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
                For regression tasks, valid choices are ["auto", 'uniform', 'time'].
                For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
                "auto" -> uniform.
                For time series forecast tasks, must be "auto" or 'time'.
                For ranking task, must be "auto" or 'group'.
@@ -247,7 +265,10 @@ class AutoML(BaseEstimator):
                search is considered to converge.
            force_cancel: boolean, default=False | Whether to forcibly cancel Spark jobs if the
                search time exceeds the time budget.
            mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
                enabling mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
                same name as the basename of the main entry file.
            append_log: boolean, default=False | Whether to directly append the log
                records to the input log file if it exists.
            auto_augment: boolean, default=True | Whether to automatically
                augment rare classes.
@@ -320,9 +341,7 @@ class AutoML(BaseEstimator):
                }
            }
            ```
            mlflow_logging: boolean, default=True | Whether to log the training results to mlflow.
                This requires mlflow to be installed and to have an active mlflow run.
                FLAML will create nested runs.
            mlflow_logging: boolean, default=True | Whether to log the training results to mlflow. Not valid if mlflow is not installed.

        """
        if ERROR:
@@ -331,6 +350,8 @@ class AutoML(BaseEstimator):
        self._state = AutoMLState()
        self._state.learner_classes = {}
        self._settings = settings
        self._automl_user_configurations = settings.copy()
        self._settings.pop("automl_user_configurations", None)
        # no budget by default
        settings["time_budget"] = settings.get("time_budget", -1)
        settings["task"] = settings.get("task", "classification")
@@ -362,6 +383,7 @@ class AutoML(BaseEstimator):
        settings["preserve_checkpoint"] = settings.get("preserve_checkpoint", True)
        settings["early_stop"] = settings.get("early_stop", False)
        settings["force_cancel"] = settings.get("force_cancel", False)
        settings["mlflow_exp_name"] = settings.get("mlflow_exp_name", None)
        settings["append_log"] = settings.get("append_log", False)
        settings["min_sample_size"] = settings.get("min_sample_size", MIN_SAMPLE_TRAIN)
        settings["use_ray"] = settings.get("use_ray", False)
@@ -377,6 +399,7 @@ class AutoML(BaseEstimator):
        settings["mlflow_logging"] = settings.get("mlflow_logging", True)

        self._estimator_type = "classifier" if settings["task"] in CLASSIFICATION else "regressor"
        self.best_run_id = None

    def get_params(self, deep: bool = False) -> dict:
        return self._settings.copy()
@@ -409,6 +432,8 @@ class AutoML(BaseEstimator):
        If `model_history` was set to True, then the returned model is trained.
        """
        state = self._search_states.get(estimator_name)
        if state and estimator_name == self._best_estimator:
            return self.model
        return state and getattr(state, "trained_estimator", None)

    @property
@@ -475,14 +500,29 @@ class AutoML(BaseEstimator):
        with open(filename, "w") as f:
            json.dump(best, f)

    @property
    def supported_metrics(self):
        """
        Returns a tuple of supported metrics for the task.

        Returns:
            metrics (Tuple): sklearn metrics from the sklearn package;
                huggingface metrics from the datasets package;
                spark metrics from the pyspark package

        """

        return sklearn_metric_name_set, huggingface_metric_to_mode.keys(), spark_metric_name_dict

    @property
    def feature_transformer(self):
        """Returns feature transformer which is used to preprocess data before applying training or inference."""
        return getattr(self, "_transformer", None)
        """Returns the AutoML feature transformer."""
        data_preprocessor = getattr(self, "_transformer", None)
        return data_preprocessor

    @property
    def label_transformer(self):
        """Returns label transformer which is used to preprocess labels before scoring, and inverse transform labels after inference."""
        """Returns the AutoML label transformer."""
        return getattr(self, "_label_transformer", None)

    @property
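The new `supported_metrics` property simply exposes the three backend metric registries imported above (`sklearn_metric_name_set`, `huggingface_metric_to_mode`, `spark_metric_name_dict`). A small hedged illustration of querying it:

```python
from flaml import AutoML

automl = AutoML()
sk_metrics, hf_metrics, spark_metrics = automl.supported_metrics
# e.g., validate a metric name before passing it to fit()
print("accuracy" in sk_metrics)
```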
@@ -521,8 +561,8 @@ class AutoML(BaseEstimator):

    def score(
        self,
        X: Union[DataFrame, psDataFrame],
        y: Union[Series, psSeries],
        X: DataFrame | psDataFrame,
        y: Series | psSeries,
        **kwargs,
    ):
        estimator = getattr(self, "_trained_estimator", None)
@@ -536,7 +576,7 @@ class AutoML(BaseEstimator):

    def predict(
        self,
        X: Union[np.array, DataFrame, List[str], List[List[str]], psDataFrame],
        X: np.array | DataFrame | list[str] | list[list[str]] | psDataFrame,
        **pred_kwargs,
    ):
        """Predict label from features.
@@ -611,7 +651,7 @@ class AutoML(BaseEstimator):
        """
        self._state.learner_classes[learner_name] = learner_class

    def get_estimator_from_log(self, log_file_name: str, record_id: int, task: Union[str, Task]):
    def get_estimator_from_log(self, log_file_name: str, record_id: int, task: str | Task):
        """Get the estimator from log file.

        Args:
@@ -653,7 +693,7 @@ class AutoML(BaseEstimator):
        dataframe=None,
        label=None,
        time_budget=np.inf,
        task: Optional[Union[str, Task]] = None,
        task: str | Task | None = None,
        eval_method=None,
        split_ratio=None,
        n_splits=None,
@@ -709,7 +749,7 @@ class AutoML(BaseEstimator):
                * Valid str options depend on different tasks.
                For classification tasks, valid choices are
                ["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
                For regression tasks, valid choices are ["auto", 'uniform', 'time'].
                For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
                "auto" -> uniform.
                For time series forecast tasks, must be "auto" or 'time'.
                For ranking task, must be "auto" or 'group'.
@@ -779,7 +819,7 @@ class AutoML(BaseEstimator):
            max_epochs: int, default = 20 | Maximum number of epochs to run training,
                only used by TemporalFusionTransformerEstimator.
            batch_size: int, default = 64 | Batch size for training model, only
                used by TemporalFusionTransformerEstimator.
                used by TemporalFusionTransformerEstimator and TCNEstimator.
        """
        task = task or self._settings.get("task")
        if isinstance(task, str):
@@ -802,7 +842,7 @@ class AutoML(BaseEstimator):
        )
        task.validate_data(self, self._state, X_train, y_train, dataframe, label, groups=groups)

        logger.info("log file name {}".format(log_file_name))
        logger.info(f"log file name {log_file_name}")

        best_config = None
        best_val_loss = float("+inf")
@@ -855,9 +895,7 @@ class AutoML(BaseEstimator):
        else:
            self._state.fit_kwargs_by_estimator[best_estimator] = self._state.fit_kwargs

        logger.info(
            "estimator = {}, config = {}, #training instances = {}".format(best_estimator, best_config, sample_size)
        )
        logger.info(f"estimator = {best_estimator}, config = {best_config}, #training instances = {sample_size}")
        # Partially copied from fit() function
        # Initialize some attributes required for retrain_from_log
        self._split_type = task.decide_split_type(
@@ -1028,7 +1066,7 @@ class AutoML(BaseEstimator):
        return points

    @property
    def resource_attr(self) -> Optional[str]:
    def resource_attr(self) -> str | None:
        """Attribute of the resource dimension.

        Returns:
@@ -1038,7 +1076,7 @@ class AutoML(BaseEstimator):
        return "FLAML_sample_size" if self._sample else None

    @property
    def min_resource(self) -> Optional[float]:
    def min_resource(self) -> float | None:
        """Attribute for pruning.

        Returns:
@@ -1047,7 +1085,7 @@ class AutoML(BaseEstimator):
        return self._min_sample_size if self._sample else None

    @property
    def max_resource(self) -> Optional[float]:
    def max_resource(self) -> float | None:
        """Attribute for pruning.

        Returns:
@@ -1069,7 +1107,7 @@ class AutoML(BaseEstimator):
            pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)

    @property
    def trainable(self) -> Callable[[dict], Optional[float]]:
    def trainable(self) -> Callable[[dict], float | None]:
        """Training function.
        Returns:
            A function that evaluates each config and returns the loss.
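The `trainable` contract here is the same one `flaml.tune` uses throughout: a callable that receives a config dict and reports a loss. A toy hedged sketch of that contract (not FLAML's internal trainable):

```python
from flaml import tune


def toy_trainable(config: dict):
    # Toy objective: minimize (x - 3)^2.
    return {"val_loss": (config["x"] - 3) ** 2}


analysis = tune.run(
    toy_trainable,
    config={"x": tune.uniform(-10, 10)},
    metric="val_loss",
    mode="min",
    num_samples=20,
)
print(analysis.best_config)
```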
@@ -1155,7 +1193,7 @@ class AutoML(BaseEstimator):
        dataframe=None,
        label=None,
        metric=None,
        task: Optional[Union[str, Task]] = None,
        task: str | Task | None = None,
        n_jobs=None,
        # gpu_per_trial=0,
        log_file_name=None,
@@ -1203,6 +1241,7 @@ class AutoML(BaseEstimator):
        skip_transform=None,
        mlflow_logging=None,
        fit_kwargs_by_estimator=None,
        mlflow_exp_name=None,
        **fit_kwargs,
    ):
        """Find a model for a given task.
@@ -1296,14 +1335,15 @@ class AutoML(BaseEstimator):
                'all' logs all the tried configs.
            model_history: A boolean of whether to keep the trained best
                model per estimator. Make sure memory is large enough if setting to True.
                Default value is False: best_model_for_estimator would return an
                Default value is False. If False, best_model_for_estimator would return an
                untrained model for non-best learner.
            log_training_metric: A boolean of whether to log the training
                metric for each model.
            mem_thres: A float of the memory size constraint in bytes.
            pred_time_limit: A float of the prediction latency constraint in seconds.
                It refers to the average prediction time per row in validation data.
            train_time_limit: None or a float of the training time constraint in seconds.
            train_time_limit: None or a float of the training time constraint in seconds for each trial.
                Only valid for sequential search.
            X_val: None or a numpy array or a pandas dataframe of validation data.
            y_val: None or a numpy array or a pandas series of validation labels.
            sample_weight_val: None or a numpy array of the sample weight of
@@ -1316,6 +1356,12 @@ class AutoML(BaseEstimator):
                for training data.
            verbose: int, default=3 | Controls the verbosity, higher means more
                messages.
                verbose=0: logger level = CRITICAL
                verbose=1: logger level = ERROR
                verbose=2: logger level = WARNING
                verbose=3: logger level = INFO
                verbose=4: logger level = DEBUG
                verbose>5: logger level = NOTSET
            retrain_full: bool or str, default=True | whether to retrain the
                selected model on the full training data when using holdout.
                True - retrain only after search finishes; False - no retraining;
@@ -1329,7 +1375,7 @@ class AutoML(BaseEstimator):
                * Valid str options depend on different tasks.
                For classification tasks, valid choices are
                ["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
                For regression tasks, valid choices are ["auto", 'uniform', 'time'].
                For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
                "auto" -> uniform.
                For time series forecast tasks, must be "auto" or 'time'.
                For ranking task, must be "auto" or 'group'.
@@ -1382,7 +1428,10 @@ class AutoML(BaseEstimator):
            early_stop: boolean, default=False | Whether to stop early if the
                search is considered to converge.
            force_cancel: boolean, default=False | Whether to forcibly cancel the PySpark job if overtime.
            mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
                enabling mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
                same name as the basename of the main entry file.
            append_log: boolean, default=False | Whether to directly append the log
                records to the input log file if it exists.
            auto_augment: boolean, default=True | Whether to automatically
                augment rare classes.
@@ -1467,9 +1516,7 @@ class AutoML(BaseEstimator):
            skip_transform: boolean, default=False | Whether to pre-process data prior to modeling.
            mlflow_logging: boolean, default=None | Whether to log the training results to mlflow.
                Default value is None, which means the logging decision is made based on
                AutoML.__init__'s mlflow_logging argument.
                This requires mlflow to be installed and to have an active mlflow run.
                FLAML will create nested runs.
                AutoML.__init__'s mlflow_logging argument. Not valid if mlflow is not installed.
            fit_kwargs_by_estimator: dict, default=None | The user specified keyword arguments, grouped by estimator name.
                For TransformersEstimator, available fit_kwargs can be found from
                [TrainingArgumentsForAuto](nlp/huggingface/training_args).
@@ -1519,7 +1566,7 @@ class AutoML(BaseEstimator):
            max_epochs: int, default = 20 | Maximum number of epochs to run training,
                only used by TemporalFusionTransformerEstimator.
            batch_size: int, default = 64 | Batch size for training model, only
                used by TemporalFusionTransformerEstimator.
                used by TemporalFusionTransformerEstimator and TCNEstimator.
        """

        self._state._start_time_flag = self._start_time_flag = time.time()
@@ -1570,6 +1617,7 @@ class AutoML(BaseEstimator):
        )
        early_stop = self._settings.get("early_stop") if early_stop is None else early_stop
        force_cancel = self._settings.get("force_cancel") if force_cancel is None else force_cancel
        mlflow_exp_name = self._settings.get("mlflow_exp_name") if mlflow_exp_name is None else mlflow_exp_name
        # no search budget is provided?
        no_budget = time_budget < 0 and max_iter is None and not early_stop
        append_log = self._settings.get("append_log") if append_log is None else append_log
@@ -1592,6 +1640,13 @@ class AutoML(BaseEstimator):
            _ch.setFormatter(logger_formatter)
            logger.addHandler(_ch)

        if model_history:
            logger.warning(
                "With `model_history` set to `True` by default, all intermediate models are retained in memory, "
                "which may significantly increase memory usage and slow down training. "
                "Consider setting `model_history=False` to optimize memory and accelerate the training process."
            )

        if not use_ray and not use_spark and n_concurrent_trials > 1:
            if ray_available:
                logger.warning(
@@ -1622,7 +1677,6 @@ class AutoML(BaseEstimator):
        self._use_ray = use_ray
        # use the following condition if we have an estimation of average_trial_time and average_trial_overhead
        # self._use_ray = use_ray or n_concurrent_trials > ( average_trial_time + average_trial_overhead) / (average_trial_time)

        if self._use_ray is not False:
            import ray
@@ -1656,11 +1710,29 @@ class AutoML(BaseEstimator):
        self._state.fit_kwargs = fit_kwargs
        custom_hp = custom_hp or self._settings.get("custom_hp")
        self._skip_transform = self._settings.get("skip_transform") if skip_transform is None else skip_transform
        self._mlflow_logging = self._settings.get("mlflow_logging") if mlflow_logging is None else mlflow_logging
        self._mlflow_logging = (
            False
            if mlflow is None
            else self._settings.get("mlflow_logging")
            if mlflow_logging is None
            else mlflow_logging
        )
        fit_kwargs_by_estimator = fit_kwargs_by_estimator or self._settings.get("fit_kwargs_by_estimator")
        self._state.fit_kwargs_by_estimator = fit_kwargs_by_estimator.copy()  # shallow copy of fit_kwargs_by_estimator
        self._state.weight_val = sample_weight_val

        self._mlflow_exp_name = mlflow_exp_name
        self.mlflow_integration = None
        self.autolog_extra_tag = {
            "extra_tag.sid": f"flaml_{flaml_version}_{int(time.time())}_{random.randint(1001, 9999)}"
        }
        if internal_mlflow and self._mlflow_logging and (mlflow.active_run() or is_autolog_enabled()):
            try:
                self.mlflow_integration = MLflowIntegration("automl", mlflow_exp_name, extra_tag=self.autolog_extra_tag)
                self._mlflow_exp_name = self.mlflow_integration.experiment_name
                if not (mlflow.active_run() is not None or is_autolog_enabled()):
                    self.mlflow_integration.only_history = True
            except KeyError:
                logger.info("Not in Fabric, Skipped")
        task.validate_data(
            self,
            self._state,
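The chained conditional expression above resolves the effective logging flag from three sources in priority order: mlflow availability first, then the per-`fit` argument, then the constructor-time setting. Written out as plain if/else for readability (a sketch, not the committed code):

```python
def resolve_mlflow_logging(mlflow_module, fit_arg, init_setting):
    # If mlflow is not importable, logging is impossible regardless of settings.
    if mlflow_module is None:
        return False
    # A per-fit argument, when given, overrides the constructor setting.
    if fit_arg is not None:
        return fit_arg
    # Otherwise fall back to the value chosen at AutoML construction time.
    return init_setting
```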
@@ -1688,7 +1760,7 @@ class AutoML(BaseEstimator):
|
||||
logger.info(f"Data split method: {self._split_type}")
|
||||
eval_method = self._decide_eval_method(eval_method, time_budget)
|
||||
self._state.eval_method = eval_method
|
||||
logger.info("Evaluation method: {}".format(eval_method))
|
||||
logger.info(f"Evaluation method: {eval_method}")
|
||||
self._state.cv_score_agg_func = cv_score_agg_func or self._settings.get("cv_score_agg_func")
|
||||
|
||||
self._retrain_in_budget = retrain_full == "budget" and (eval_method == "holdout" and self._state.X_val is None)
|
||||
@@ -1705,13 +1777,9 @@ class AutoML(BaseEstimator):
|
||||
if sample_size:
|
||||
_sample_size_from_starting_points[_estimator] = sample_size
|
||||
elif _point_per_estimator and isinstance(_point_per_estimator, list):
|
||||
_sample_size_set = set(
|
||||
[
|
||||
config["FLAML_sample_size"]
|
||||
for config in _point_per_estimator
|
||||
if "FLAML_sample_size" in config
|
||||
]
|
||||
)
|
||||
_sample_size_set = {
|
||||
config["FLAML_sample_size"] for config in _point_per_estimator if "FLAML_sample_size" in config
|
||||
}
|
||||
if _sample_size_set:
|
||||
_sample_size_from_starting_points[_estimator] = min(_sample_size_set)
|
||||
if len(_sample_size_set) > 1:
|
||||
@@ -1729,6 +1797,11 @@ class AutoML(BaseEstimator):
|
||||
self._min_sample_size_input = min_sample_size
|
||||
self._prepare_data(eval_method, split_ratio, n_splits)
|
||||
|
||||
# infer the signature of the input/output data
|
||||
if self.mlflow_integration is not None:
|
||||
self.estimator_signature = infer_signature(self._state.X_train, self._state.y_train)
|
||||
self.pipeline_signature = infer_signature(X_train, y_train, dataframe, label)
|
||||
|
||||
# TODO pull this to task as decide_sample_size
|
||||
if isinstance(self._min_sample_size, dict):
|
||||
self._sample = {
|
||||
@@ -1827,6 +1900,11 @@ class AutoML(BaseEstimator):
|
||||
and (max_iter > 0 or retrain_full is True)
|
||||
or max_iter == 1
|
||||
)
|
||||
if self.mlflow_integration is not None and all(
|
||||
[self.mlflow_integration.parent_run_id is None, not self.mlflow_integration.only_history]
|
||||
):
|
||||
# force not retrain if no active run
|
||||
self._state.retrain_final = False
|
||||
# add custom learner
|
||||
for estimator_name in estimator_list:
|
||||
if estimator_name not in self._state.learner_classes:
|
||||
@@ -1898,7 +1976,7 @@ class AutoML(BaseEstimator):
|
||||
max_iter=max_iter / len(estimator_list) if self._learner_selector == "roundrobin" else max_iter,
|
||||
budget=self._state.time_budget,
|
||||
)
|
||||
logger.info("List of ML learners in AutoML Run: {}".format(estimator_list))
|
||||
logger.info(f"List of ML learners in AutoML Run: {estimator_list}")
|
||||
self.estimator_list = estimator_list
|
||||
self._active_estimators = estimator_list.copy()
|
||||
self._ensemble = ensemble
|
||||
@@ -1940,7 +2018,7 @@ class AutoML(BaseEstimator):
|
||||
)
|
||||
):
|
||||
logger.warning(
|
||||
"Time taken to find the best model is {0:.0f}% of the "
|
||||
"Time taken to find the best model is {:.0f}% of the "
|
||||
"provided time budget and not all estimators' hyperparameter "
|
||||
"search converged. Consider increasing the time budget.".format(
|
||||
self._time_taken_best_iter / self._state.time_budget * 100
|
||||
@@ -1959,6 +2037,8 @@ class AutoML(BaseEstimator):
|
||||
) # NOTE: this is after kwargs is updated to fit_kwargs_by_estimator
|
||||
del self._state.groups, self._state.groups_all, self._state.groups_val
|
||||
logger.setLevel(old_level)
|
||||
if self.mlflow_integration is not None:
|
||||
self.mlflow_integration.resume_mlflow()
|
||||
|
||||
def _search_parallel(self):
|
||||
if self._use_ray is not False:
|
||||
@@ -2055,6 +2135,14 @@ class AutoML(BaseEstimator):
|
||||
|
||||
if self._use_spark:
|
||||
# use spark as parallel backend
|
||||
mlflow_log_latency = (
|
||||
get_mlflow_log_latency(model_history=self._state.model_history) if self.mlflow_integration else 0
|
||||
)
|
||||
(
|
||||
logger.info(f"Estimated mlflow_log_latency: {mlflow_log_latency} seconds.")
|
||||
if mlflow_log_latency > 0
|
||||
else None
|
||||
)
|
||||
analysis = tune.run(
|
||||
self.trainable,
|
||||
search_alg=search_alg,
|
||||
@@ -2067,6 +2155,9 @@ class AutoML(BaseEstimator):
|
||||
use_ray=False,
|
||||
use_spark=True,
|
||||
force_cancel=self._force_cancel,
|
||||
mlflow_exp_name=self._mlflow_exp_name,
|
||||
automl_info=(mlflow_log_latency,), # pass automl info to tune.run
|
||||
extra_tag=self.autolog_extra_tag,
|
||||
# raise_on_failed_trial=False,
|
||||
# keep_checkpoints_num=1,
|
||||
# checkpoint_score_attr="min-val_loss",
|
||||
@@ -2127,6 +2218,8 @@ class AutoML(BaseEstimator):
|
||||
self._search_states[estimator].best_config = config
|
||||
if better or self._log_type == "all":
|
||||
self._log_trial(search_state, estimator)
|
||||
if self.mlflow_integration:
|
||||
self.mlflow_integration.record_state(self, search_state, estimator)
|
||||
|
||||
def _log_trial(self, search_state, estimator):
|
||||
if self._training_log:
|
||||
@@ -2140,36 +2233,6 @@ class AutoML(BaseEstimator):
|
||||
estimator,
|
||||
search_state.sample_size,
|
||||
)
|
||||
if self._mlflow_logging and mlflow is not None and mlflow.active_run():
|
||||
with mlflow.start_run(nested=True):
|
||||
mlflow.log_metric("iter_counter", self._track_iter)
|
||||
if (search_state.metric_for_logging is not None) and (
|
||||
"intermediate_results" in search_state.metric_for_logging
|
||||
):
|
||||
for each_entry in search_state.metric_for_logging["intermediate_results"]:
|
||||
with mlflow.start_run(nested=True):
|
||||
mlflow.log_metrics(each_entry)
|
||||
mlflow.log_metric("iter_counter", self._iter_per_learner[estimator])
|
||||
del search_state.metric_for_logging["intermediate_results"]
|
||||
if search_state.metric_for_logging:
|
||||
mlflow.log_metrics(search_state.metric_for_logging)
|
||||
mlflow.log_metric("trial_time", search_state.trial_time)
|
||||
mlflow.log_metric("wall_clock_time", self._state.time_from_start)
|
||||
mlflow.log_metric("validation_loss", search_state.val_loss)
|
||||
mlflow.log_params(search_state.config)
|
||||
mlflow.log_param("learner", estimator)
|
||||
mlflow.log_param("sample_size", search_state.sample_size)
|
||||
mlflow.log_metric("best_validation_loss", search_state.best_loss)
|
||||
mlflow.log_param("best_config", search_state.best_config)
|
||||
mlflow.log_param("best_learner", self._best_estimator)
|
||||
mlflow.log_metric(
|
||||
self._state.metric if isinstance(self._state.metric, str) else self._state.error_metric,
|
||||
1 - search_state.val_loss
|
||||
if self._state.error_metric.startswith("1-")
|
||||
else -search_state.val_loss
|
||||
if self._state.error_metric.startswith("-")
|
||||
else search_state.val_loss,
|
||||
)
|
||||
|
||||
def _search_sequential(self):
|
||||
try:
|
||||
@@ -2323,9 +2386,18 @@ class AutoML(BaseEstimator):
|
||||
verbose=max(self.verbose - 3, 0),
|
||||
use_ray=False,
|
||||
use_spark=False,
|
||||
force_cancel=self._force_cancel,
|
||||
mlflow_exp_name=self._mlflow_exp_name,
|
||||
automl_info=(0,), # pass automl info to tune.run
|
||||
extra_tag=self.autolog_extra_tag,
|
||||
)
|
||||
time_used = time.time() - start_run_time
|
||||
better = False
|
||||
(
|
||||
logger.debug(f"result in automl: {analysis.trials}, {analysis.trials[-1].last_result}")
|
||||
if analysis.trials
|
||||
else logger.debug("result in automl: [], None")
|
||||
)
|
||||
if analysis.trials and analysis.trials[-1].last_result:
|
||||
result = analysis.trials[-1].last_result
|
||||
search_state.update(result, time_used=time_used)
|
||||
@@ -2388,6 +2460,8 @@ class AutoML(BaseEstimator):
|
||||
search_state.trained_estimator.cleanup()
|
||||
if better or self._log_type == "all":
|
||||
self._log_trial(search_state, estimator)
|
||||
if self.mlflow_integration:
|
||||
self.mlflow_integration.record_state(self, search_state, estimator)
|
||||
|
||||
logger.info(
|
||||
" at {:.1f}s,\testimator {}'s best error={:.4f},\tbest estimator {}'s best error={:.4f}".format(
|
||||
@@ -2440,7 +2514,7 @@ class AutoML(BaseEstimator):
|
||||
state.best_config,
|
||||
self.data_size_full,
|
||||
)
|
||||
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
|
||||
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
|
||||
self._retrained_config[best_config_sig] = state.best_config_train_time = retrain_time
|
||||
est_retrain_time = 0
|
||||
self._state.time_from_start = time.time() - self._start_time_flag
|
||||
@@ -2462,8 +2536,8 @@ class AutoML(BaseEstimator):
|
||||
self._time_taken_best_iter = 0
|
||||
self._config_history = {}
|
||||
self._max_iter_per_learner = 10000
|
||||
self._iter_per_learner = dict([(e, 0) for e in self.estimator_list])
|
||||
self._iter_per_learner_fullsize = dict([(e, 0) for e in self.estimator_list])
|
||||
self._iter_per_learner = {e: 0 for e in self.estimator_list}
|
||||
self._iter_per_learner_fullsize = {e: 0 for e in self.estimator_list}
|
||||
self._fullsize_reached = False
|
||||
self._trained_estimator = None
|
||||
self._best_estimator = None
|
||||
@@ -2479,6 +2553,21 @@ class AutoML(BaseEstimator):
|
||||
self._selected = state = self._search_states[estimator]
|
||||
state.best_config_sample_size = self._state.data_size[0]
|
||||
state.best_config = state.init_config[0] if state.init_config else {}
|
||||
self._track_iter = 0
|
||||
self._config_history[self._track_iter] = (estimator, state.best_config, self._state.time_from_start)
|
||||
self._best_iteration = self._track_iter
|
||||
state.val_loss = getattr(state, "val_loss", float("inf"))
|
||||
state.best_loss = getattr(state, "best_loss", float("inf"))
|
||||
state.config = getattr(state, "config", state.best_config.copy())
|
||||
state.metric_for_logging = getattr(state, "metric_for_logging", None)
|
||||
state.sample_size = getattr(state, "sample_size", self._state.data_size[0])
|
||||
state.learner_class = getattr(state, "learner_class", self._state.learner_classes.get(estimator))
|
||||
if hasattr(self, "mlflow_integration") and self.mlflow_integration:
|
||||
self.mlflow_integration.record_state(
|
||||
automl=self,
|
||||
search_state=state,
|
||||
estimator=estimator,
|
||||
)
|
||||
elif self._use_ray is False and self._use_spark is False:
|
||||
self._search_sequential()
|
||||
else:
|
||||
@@ -2488,6 +2577,12 @@ class AutoML(BaseEstimator):
|
||||
self._training_log.checkpoint()
|
||||
self._state.time_from_start = time.time() - self._start_time_flag
|
||||
if self._best_estimator:
|
||||
if self.mlflow_integration:
|
||||
self.mlflow_integration.log_automl(self)
|
||||
if mlflow.active_run() is None:
|
||||
if self.mlflow_integration.parent_run_id is not None and self.mlflow_integration.autolog:
|
||||
# ensure result of retrain autolog to parent run
|
||||
mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
|
||||
self._selected = self._search_states[self._best_estimator]
|
||||
self.modelcount = sum(search_state.total_iter for search_state in self._search_states.values())
|
||||
if self._trained_estimator:
|
||||
@@ -2624,13 +2719,67 @@ class AutoML(BaseEstimator):
    self._best_estimator,
    state.best_config,
    self.data_size_full,
    is_retrain=True,
)
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
state.best_config_train_time = retrain_time
if self._trained_estimator:
    logger.info(f"retrained model: {self._trained_estimator.model}")
if self.best_run_id is not None:
    logger.info(f"Best MLflow run name: {self.best_run_name}")
    logger.info(f"Best MLflow run id: {self.best_run_id}")
if self.mlflow_integration is not None:
    # try to log the retrained model
    if all(
        [
            self.mlflow_integration.manual_log,
            not self.mlflow_integration.has_model,
            self.mlflow_integration.parent_run_id is not None,
        ]
    ):
        if mlflow.active_run() is None:
            mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
        if self.best_estimator.endswith("_spark"):
            self.mlflow_integration.log_model(
                self._trained_estimator.model,
                self.best_estimator,
                signature=self.estimator_signature,
                run_id=self.mlflow_integration.parent_run_id,
            )
        else:
            self.mlflow_integration.pickle_and_log_automl_artifacts(
                self,
                self.model,
                self.best_estimator,
                signature=self.pipeline_signature,
                run_id=self.mlflow_integration.parent_run_id,
            )
else:
    logger.info("not retraining because the time budget is too small.")
    logger.warning("not retraining because the time budget is too small.")
self.wait_futures()

def wait_futures(self):
    if self.mlflow_integration is not None:
        logger.debug("Collecting results from submitted record_state tasks")
        t1 = time.perf_counter()
        for future in as_completed(self.mlflow_integration.futures):
            _task = self.mlflow_integration.futures[future]
            try:
                result = future.result()
                logger.debug(f"Result for record_state task {_task}: {result}")
            except Exception as e:
                logger.warning(f"Exception for record_state task {_task}: {e}")
        for future in as_completed(self.mlflow_integration.futures_log_model):
            _task = self.mlflow_integration.futures_log_model[future]
            try:
                result = future.result()
                logger.debug(f"Result for log_model task {_task}: {result}")
            except Exception as e:
                logger.warning(f"Exception for log_model task {_task}: {e}")
        t2 = time.perf_counter()
        logger.debug(f"Collecting results from tasks submitted to executors costs {t2-t1} seconds.")
    else:
        logger.debug("No futures to wait for.")

def __del__(self):
    if (

@@ -2702,3 +2851,7 @@ class AutoML(BaseEstimator):
        q += inv[i] / s
        if p < q:
            return estimator_list[i]

    @property
    def automl_pipeline(self):
        return None

@@ -1,7 +1,7 @@
try:
    from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
except ImportError:
    pass
except ImportError as e:
    print(f"scikit-learn is required for HistGradientBoostingEstimator. Please install it; error: {e}")

from flaml import tune
from flaml.automl.model import SKLearnEstimator

@@ -2,13 +2,17 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import json
import os
from datetime import datetime
import random
import uuid
from datetime import datetime, timedelta
from decimal import ROUND_HALF_UP, Decimal
from typing import TYPE_CHECKING, Union

import numpy as np

from flaml.automl.spark import DataFrame, Series, pd, ps, psDataFrame, psSeries
from flaml.automl.spark import DataFrame, F, Series, T, pd, ps, psDataFrame, psSeries
from flaml.automl.training_log import training_log_reader

try:
@@ -19,6 +23,7 @@ except ImportError:
if TYPE_CHECKING:
    from flaml.automl.task import Task


TS_TIMESTAMP_COL = "ds"
TS_VALUE_COL = "y"

@@ -293,7 +298,7 @@ class DataTransformer:
        y = y.rename(TS_VALUE_COL)
    for column in X.columns:
        # sklearn\utils\validation.py needs int/float values
        if X[column].dtype.name in ("object", "category"):
        if X[column].dtype.name in ("object", "category", "string"):
            if X[column].nunique() == 1 or X[column].nunique(dropna=True) == n - X[column].isnull().sum():
                X.drop(columns=column, inplace=True)
                drop = True
@@ -445,3 +450,331 @@ class DataTransformer:

def group_counts(groups):
    _, i, c = np.unique(groups, return_counts=True, return_index=True)
    return c[np.argsort(i)]

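As an aside (not part of the diff): a minimal, standalone sketch of what group_counts computes, counts per group in first-appearance order. The sample data is made up.

import numpy as np

groups = np.array(["b", "b", "a", "a", "a", "c"])
# np.unique sorts the groups; argsort over the first-occurrence indices
# restores the original appearance order of the counts.
_, i, c = np.unique(groups, return_counts=True, return_index=True)
print(c[np.argsort(i)])  # [2 3 1]: "b" twice, then "a" three times, then "c" once
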
def get_random_dataframe(n_rows: int = 200, ratio_none: float = 0.1, seed: int = 42) -> DataFrame:
    """Generate a random pandas DataFrame with various data types for testing.

    This function creates a DataFrame with multiple column types including:
    - Timestamps
    - Integers
    - Floats
    - Categorical values
    - Booleans
    - Lists (tags)
    - Decimal strings
    - UUIDs
    - Binary data (as hex strings)
    - JSON blobs
    - Nullable text fields

    Parameters
    ----------
    n_rows : int, default=200
        Number of rows in the generated DataFrame
    ratio_none : float, default=0.1
        Probability of generating None values in applicable columns
    seed : int, default=42
        Random seed for reproducibility

    Returns
    -------
    pd.DataFrame
        A DataFrame with 14 columns of various data types

    Examples
    --------
    >>> df = get_random_dataframe(100, 0.05, 123)
    >>> df.shape
    (100, 14)
    >>> df.dtypes
    timestamp        datetime64[ns]
    id                        int64
    score                   float64
    status                   object
    flag                     object
    count                    object
    value                    object
    tags                     object
    rating                   object
    uuid                     object
    binary                   object
    json_blob                object
    category               category
    nullable_text            object
    dtype: object
    """

    np.random.seed(seed)
    random.seed(seed)

    def random_tags():
        tags = ["AI", "ML", "data", "robotics", "vision"]
        return random.sample(tags, k=random.randint(1, 3)) if random.random() > ratio_none else None

    def random_decimal():
        return (
            str(Decimal(random.uniform(1, 5)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
            if random.random() > ratio_none
            else None
        )

    def random_json_blob():
        blob = {"a": random.randint(1, 10), "b": random.random()}
        return json.dumps(blob) if random.random() > ratio_none else None

    def random_binary():
        return bytes(random.randint(0, 255) for _ in range(4)).hex() if random.random() > ratio_none else None

    data = {
        "timestamp": [
            datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1000)) if np.random.rand() > ratio_none else None
            for _ in range(n_rows)
        ],
        "id": range(1, n_rows + 1),
        "score": np.random.uniform(0, 100, n_rows),
        "status": np.random.choice(
            ["active", "inactive", "pending", None],
            size=n_rows,
            p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
        ),
        "flag": np.random.choice(
            [True, False, None], size=n_rows, p=[(1 - ratio_none) / 2, (1 - ratio_none) / 2, ratio_none]
        ),
        "count": [np.random.randint(0, 100) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "value": [round(np.random.normal(50, 15), 2) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "tags": [random_tags() for _ in range(n_rows)],
        "rating": [random_decimal() for _ in range(n_rows)],
        "uuid": [str(uuid.uuid4()) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "binary": [random_binary() for _ in range(n_rows)],
        "json_blob": [random_json_blob() for _ in range(n_rows)],
        "category": pd.Categorical(
            np.random.choice(
                ["A", "B", "C", None],
                size=n_rows,
                p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
            )
        ),
        "nullable_text": [random.choice(["Good", "Bad", "Average", None]) for _ in range(n_rows)],
    }

    return pd.DataFrame(data)


def auto_convert_dtypes_spark(
    df: psDataFrame,
    na_values: list = None,
    category_threshold: float = 0.3,
    convert_threshold: float = 0.6,
    sample_ratio: float = 0.1,
) -> tuple[psDataFrame, dict]:
    """Automatically convert data types in a PySpark DataFrame using heuristics.

    This function analyzes a sample of the DataFrame to infer appropriate data types
    and applies the conversions. It handles timestamps, numeric values, booleans,
    and categorical fields.

    Args:
        df: A PySpark DataFrame to convert.
        na_values: List of strings to be considered as NA/NaN. Defaults to
            ['NA', 'na', 'NULL', 'null', ''].
        category_threshold: Maximum ratio of unique values to total values
            to consider a column categorical. Defaults to 0.3.
        convert_threshold: Minimum ratio of successfully converted values required
            to apply a type conversion. Defaults to 0.6.
        sample_ratio: Fraction of data to sample for type inference. Defaults to 0.1.

    Returns:
        tuple: (The DataFrame with converted types, A dictionary mapping column names to
            their inferred types as strings)

    Note:
        - 'category' in the schema dict is conceptual as PySpark doesn't have a true
          category type like pandas
        - The function uses sampling for efficiency with large datasets
    """
    n_rows = df.count()
    if na_values is None:
        na_values = ["NA", "na", "NULL", "null", ""]

    # Normalize NA-like values
    for colname, coltype in df.dtypes:
        if coltype == "string":
            df = df.withColumn(
                colname,
                F.when(F.trim(F.lower(F.col(colname))).isin([v.lower() for v in na_values]), None).otherwise(
                    F.col(colname)
                ),
            )

    schema = {}
    for colname in df.columns:
        # Sample once at an appropriate ratio
        sample_ratio_to_use = min(1.0, sample_ratio if n_rows * sample_ratio > 100 else 100 / n_rows)
        col_sample = df.select(colname).sample(withReplacement=False, fraction=sample_ratio_to_use).dropna()
        sample_count = col_sample.count()

        inferred_type = "string"  # Default

        if col_sample.dtypes[0][1] != "string":
            schema[colname] = col_sample.dtypes[0][1]
            continue

        if sample_count == 0:
            schema[colname] = "string"
            continue

        # Check if timestamp
        ts_col = col_sample.withColumn("parsed", F.to_timestamp(F.col(colname)))

        # Check numeric
        if (
            col_sample.withColumn("n", F.col(colname).cast("double")).filter("n is not null").count()
            >= sample_count * convert_threshold
        ):
            # All whole numbers?
            all_whole = (
                col_sample.withColumn("n", F.col(colname).cast("double"))
                .filter("n is not null")
                .withColumn("frac", F.abs(F.col("n") % 1))
                .filter("frac > 0.000001")
                .count()
                == 0
            )
            inferred_type = "int" if all_whole else "double"

        # Check low-cardinality (category-like)
        elif (
            sample_count > 0
            and col_sample.select(F.countDistinct(F.col(colname))).collect()[0][0] / sample_count <= category_threshold
        ):
            inferred_type = "category"  # Will just be string, but marked as such

        # Check if timestamp
        elif ts_col.filter(F.col("parsed").isNotNull()).count() >= sample_count * convert_threshold:
            inferred_type = "timestamp"

        schema[colname] = inferred_type

    # Apply inferred schema
    for colname, inferred_type in schema.items():
        if inferred_type == "int":
            df = df.withColumn(colname, F.col(colname).cast(T.IntegerType()))
        elif inferred_type == "double":
            df = df.withColumn(colname, F.col(colname).cast(T.DoubleType()))
        elif inferred_type == "boolean":
            df = df.withColumn(
                colname,
                F.when(F.lower(F.col(colname)).isin("true", "yes", "1"), True)
                .when(F.lower(F.col(colname)).isin("false", "no", "0"), False)
                .otherwise(None),
            )
        elif inferred_type == "timestamp":
            df = df.withColumn(colname, F.to_timestamp(F.col(colname)))
        elif inferred_type == "category":
            df = df.withColumn(colname, F.col(colname).cast(T.StringType()))  # Marked conceptually

        # otherwise keep as string (or original type)

    return df, schema

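As an aside (not part of the diff): a hypothetical usage sketch for auto_convert_dtypes_spark. It assumes a running SparkSession named spark; note the function body calls Spark SQL DataFrame methods (withColumn, select), so a pyspark.sql.DataFrame is passed despite the psDataFrame annotation.

# Hypothetical data; with the default thresholds, "id" should come back as int
# and "when" as timestamp, while the all-distinct failures stay string.
sdf = spark.createDataFrame(
    [("1", "2020-01-01"), ("2", "2020-02-01"), ("3", "not a date")],
    ["id", "when"],
)
converted, schema = auto_convert_dtypes_spark(sdf)
print(schema)  # roughly {'id': 'int', 'when': 'timestamp'}
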
def auto_convert_dtypes_pandas(
    df: DataFrame,
    na_values: list = None,
    category_threshold: float = 0.3,
    convert_threshold: float = 0.6,
    sample_ratio: float = 1.0,
) -> tuple[DataFrame, dict]:
    """Automatically convert data types in a pandas DataFrame using heuristics.

    This function analyzes the DataFrame to infer appropriate data types
    and applies the conversions. It handles timestamps, timedeltas, numeric values,
    and categorical fields.

    Args:
        df: A pandas DataFrame to convert.
        na_values: List of strings to be considered as NA/NaN. Defaults to
            ['NA', 'na', 'NULL', 'null', ''].
        category_threshold: Maximum ratio of unique values to total values
            to consider a column categorical. Defaults to 0.3.
        convert_threshold: Minimum ratio of successfully converted values required
            to apply a type conversion. Defaults to 0.6.
        sample_ratio: Fraction of data to sample for type inference. Not used in the
            pandas version but included for API compatibility. Defaults to 1.0.

    Returns:
        tuple: (The DataFrame with converted types, A dictionary mapping column names to
            their inferred types as strings)
    """
    if na_values is None:
        na_values = {"NA", "na", "NULL", "null", ""}

    df_converted = df.convert_dtypes()
    schema = {}

    # Sample if needed (for API compatibility)
    if sample_ratio < 1.0:
        df = df.sample(frac=sample_ratio)

    n_rows = len(df)

    for col in df.columns:
        series = df[col]
        # Replace NA-like values if string
        series_cleaned = series.map(lambda x: np.nan if isinstance(x, str) and x.strip() in na_values else x)

        # Skip conversion if already a non-object data type, except bool which can potentially be categorical
        if (
            not isinstance(series_cleaned.dtype, pd.BooleanDtype)
            and not isinstance(series_cleaned.dtype, pd.StringDtype)
            and series_cleaned.dtype != "object"
        ):
            # Keep the original data type for non-object dtypes
            df_converted[col] = series
            schema[col] = str(series_cleaned.dtype)
            continue

        # print(f"type: {series_cleaned.dtype}, column: {series_cleaned.name}")

        if not isinstance(series_cleaned.dtype, pd.BooleanDtype):
            # Try numeric (int or float)
            numeric = pd.to_numeric(series_cleaned, errors="coerce")
            if numeric.notna().sum() >= n_rows * convert_threshold:
                if (numeric.dropna() % 1 == 0).all():
                    try:
                        df_converted[col] = numeric.astype("int")  # plain int; falls back to double below if NaNs are present
                        schema[col] = "int"
                        continue
                    except Exception:
                        pass
                df_converted[col] = numeric.astype("double")
                schema[col] = "double"
                continue

            # Try datetime
            datetime_converted = pd.to_datetime(series_cleaned, errors="coerce")
            if datetime_converted.notna().sum() >= n_rows * convert_threshold:
                df_converted[col] = datetime_converted
                schema[col] = "timestamp"
                continue

            # Try timedelta
            try:
                timedelta_converted = pd.to_timedelta(series_cleaned, errors="coerce")
                if timedelta_converted.notna().sum() >= n_rows * convert_threshold:
                    df_converted[col] = timedelta_converted
                    schema[col] = "timedelta"
                    continue
            except TypeError:
                pass

        # Try category
        try:
            unique_ratio = series_cleaned.nunique(dropna=True) / n_rows if n_rows > 0 else 1.0
            if unique_ratio <= category_threshold:
                df_converted[col] = series_cleaned.astype("category")
                schema[col] = "category"
                continue
        except Exception:
            pass
        df_converted[col] = series_cleaned.astype("string")
        schema[col] = "string"

    return df_converted, schema

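Again as an aside: a minimal sketch of the pandas variant on made-up data. category_threshold is raised here so that a 2-value column in 4 rows still counts as categorical.

import pandas as pd

raw = pd.DataFrame(
    {
        "count": ["1", "2", "3", "4"],
        "when": ["2021-01-01", "2021-01-02", "null", "2021-01-04"],
        "grade": ["A", "B", "A", "A"],
    }
)
converted, schema = auto_convert_dtypes_pandas(raw, category_threshold=0.5)
print(schema)  # {'count': 'int', 'when': 'timestamp', 'grade': 'category'}
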
@@ -1,7 +1,37 @@
import logging
import os


class ColoredFormatter(logging.Formatter):
    # ANSI escape codes for colors
    COLORS = {
        # logging.DEBUG: "\033[36m",  # Cyan
        # logging.INFO: "\033[32m",  # Green
        logging.WARNING: "\033[33m",  # Yellow
        logging.ERROR: "\033[31m",  # Red
        logging.CRITICAL: "\033[1;31m",  # Bright Red
    }
    RESET = "\033[0m"  # Reset to default

    def __init__(self, fmt, datefmt, use_color=True):
        super().__init__(fmt, datefmt)
        self.use_color = use_color

    def format(self, record):
        formatted = super().format(record)
        if self.use_color:
            color = self.COLORS.get(record.levelno, "")
            if color:
                return f"{color}{formatted}{self.RESET}"
        return formatted


logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
    use_color = False

logger_formatter = ColoredFormatter(
    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False

@@ -13,6 +13,7 @@ from flaml.automl.model import BaseEstimator, TransformersEstimator
from flaml.automl.spark import ERROR as SPARK_ERROR
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.task.task import Task
from flaml.automl.time_series import TimeSeriesDataset

try:
    from sklearn.metrics import (
@@ -33,7 +34,6 @@ except ImportError:
if SPARK_ERROR is None:
    from flaml.automl.spark.metrics import spark_metric_loss_score

from flaml.automl.time_series import TimeSeriesDataset

logger = logging.getLogger(__name__)

@@ -89,6 +89,11 @@ huggingface_metric_to_mode = {
    "wer": "min",
}
huggingface_submetric_to_metric = {"rouge1": "rouge", "rouge2": "rouge"}
spark_metric_name_dict = {
    "Regression": ["r2", "rmse", "mse", "mae", "var"],
    "Binary Classification": ["pr_auc", "roc_auc"],
    "Multi-class Classification": ["accuracy", "log_loss", "f1", "micro_f1", "macro_f1"],
}


def metric_loss_score(
@@ -122,7 +127,7 @@ def metric_loss_score(
        import datasets

        datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
        metric = datasets.load_metric(datasets_metric_name)
        metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
        metric_mode = huggingface_metric_to_mode[datasets_metric_name]

        if metric_name.startswith("seqeval"):
@@ -334,6 +339,14 @@ def compute_estimator(
    if fit_kwargs is None:
        fit_kwargs = {}

    fe_params = {}
    for param, value in config_dic.items():
        if param.startswith("fe."):
            fe_params[param] = value

    for param, value in fe_params.items():
        config_dic.pop(param)

    estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
    estimator = estimator_class(
        **config_dic,
@@ -401,12 +414,21 @@ def train_estimator(
    free_mem_ratio=0,
) -> Tuple[EstimatorSubclass, float]:
    start_time = time.time()
    fe_params = {}
    for param, value in config_dic.items():
        if param.startswith("fe."):
            fe_params[param] = value

    for param, value in fe_params.items():
        config_dic.pop(param)

    estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
    estimator = estimator_class(
        **config_dic,
        task=task,
        n_jobs=n_jobs,
    )

    if fit_kwargs is None:
        fit_kwargs = {}

File diff suppressed because it is too large
@@ -32,7 +32,7 @@ class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
    [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = list(chain(*flattened_features))
batch = super(DataCollatorForMultipleChoiceClassification, self).__call__(flattened_features)
batch = super().__call__(flattened_features)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels

@@ -245,7 +245,7 @@ def tokenize_row(
    return_column_name=False,
):
    if prefix:
        this_row = tuple(["".join(x) for x in zip(prefix, this_row)])
        this_row = tuple("".join(x) for x in zip(prefix, this_row))

    # tokenizer.pad_token = tokenizer.eos_token
    tokenized_example = tokenizer(

@@ -32,7 +32,7 @@ def is_a_list_of_str(this_obj):

def _clean_value(value: Any) -> str:
    if isinstance(value, float):
        return "{:.5}".format(value)
        return f"{value:.5}"
    else:
        return str(value).replace("/", "_")

@@ -86,7 +86,7 @@ class Counter:
    @staticmethod
    def get_trial_fold_name(local_dir, trial_config, trial_id):
        Counter.counter += 1
        experiment_tag = "{0}_{1}".format(str(Counter.counter), format_vars(trial_config))
        experiment_tag = f"{str(Counter.counter)}_{format_vars(trial_config)}"
        logdir = get_logdir_name(_generate_dirname(experiment_tag, trial_id=trial_id), local_dir)
        return logdir

@@ -1,97 +0,0 @@
ParamList_LightGBM_Base = [
    "baggingFraction",
    "baggingFreq",
    "baggingSeed",
    "binSampleCount",
    "boostFromAverage",
    "boostingType",
    "catSmooth",
    "categoricalSlotIndexes",
    "categoricalSlotNames",
    "catl2",
    "chunkSize",
    "dataRandomSeed",
    "defaultListenPort",
    "deterministic",
    "driverListenPort",
    "dropRate",
    "dropSeed",
    "earlyStoppingRound",
    "executionMode",
    "extraSeed" "featureFraction",
    "featureFractionByNode",
    "featureFractionSeed",
    "featuresCol",
    "featuresShapCol",
    "fobj" "improvementTolerance",
    "initScoreCol",
    "isEnableSparse",
    "isProvideTrainingMetric",
    "labelCol",
    "lambdaL1",
    "lambdaL2",
    "leafPredictionCol",
    "learningRate",
    "matrixType",
    "maxBin",
    "maxBinByFeature",
    "maxCatThreshold",
    "maxCatToOnehot",
    "maxDeltaStep",
    "maxDepth",
    "maxDrop",
    "metric",
    "microBatchSize",
    "minDataInLeaf",
    "minDataPerBin",
    "minDataPerGroup",
    "minGainToSplit",
    "minSumHessianInLeaf",
    "modelString",
    "monotoneConstraints",
    "monotoneConstraintsMethod",
    "monotonePenalty",
    "negBaggingFraction",
    "numBatches",
    "numIterations",
    "numLeaves",
    "numTasks",
    "numThreads",
    "objectiveSeed",
    "otherRate",
    "parallelism",
    "passThroughArgs",
    "posBaggingFraction",
    "predictDisableShapeCheck",
    "predictionCol",
    "repartitionByGroupingColumn",
    "seed",
    "skipDrop",
    "slotNames",
    "timeout",
    "topK",
    "topRate",
    "uniformDrop",
    "useBarrierExecutionMode",
    "useMissing",
    "useSingleDatasetMode",
    "validationIndicatorCol",
    "verbosity",
    "weightCol",
    "xGBoostDartMode",
    "zeroAsMissing",
    "objective",
]
ParamList_LightGBM_Classifier = ParamList_LightGBM_Base + [
    "isUnbalance",
    "probabilityCol",
    "rawPredictionCol",
    "thresholds",
]
ParamList_LightGBM_Regressor = ParamList_LightGBM_Base + ["tweedieVariancePower"]
ParamList_LightGBM_Ranker = ParamList_LightGBM_Base + [
    "groupCol",
    "evalAt",
    "labelGain",
    "maxPosition",
]
@@ -1,3 +1,4 @@
import json
from typing import Union

import numpy as np
@@ -9,7 +10,7 @@ from pyspark.ml.evaluation import (
    RegressionEvaluator,
)

from flaml.automl.spark import F, psSeries
from flaml.automl.spark import F, T, psDataFrame, psSeries, sparkDataFrame


def ps_group_counts(groups: Union[psSeries, np.ndarray]) -> np.ndarray:
@@ -36,6 +37,16 @@ def _compute_label_from_probability(df, probability_col, prediction_col):
    return df


def string_to_array(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return []


string_to_array_udf = F.udf(string_to_array, T.ArrayType(T.DoubleType()))

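As an aside: a tiny hypothetical example of the UDF above. JSON strings become array<double> columns, and malformed rows degrade to an empty array (assumes a SparkSession named spark).

df = spark.createDataFrame([("[0.1, 0.9]",), ("not json",)], ["probability"])
df = df.withColumn("probability", string_to_array_udf(df["probability"]))
df.show()  # the second row shows [] because json.loads raised JSONDecodeError
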
def spark_metric_loss_score(
    metric_name: str,
    y_predict: psSeries,
@@ -135,6 +146,11 @@ def spark_metric_loss_score(
    )
elif metric_name == "log_loss":
    # For log_loss, prediction_col should be probability, and we need to convert it to label
    # handle data like "{'type': '1', 'values': '[1, 2, 3]'}"
    # Fix cannot resolve "array_max(prediction)" due to data type mismatch: Parameter 1 requires the "ARRAY" type,
    # however "prediction" has the type "STRUCT<type: TINYINT, size: INT, indices: ARRAY<INT>, values: ARRAY<DOUBLE>>"
    df = df.withColumn(prediction_col, df[prediction_col].cast(T.StringType()))
    df = df.withColumn(prediction_col, string_to_array_udf(df[prediction_col]))
    df = _compute_label_from_probability(df, prediction_col, prediction_col + "_label")
    evaluator = MulticlassClassificationEvaluator(
        metricName="logLoss",

@@ -65,6 +65,7 @@ class SearchState:
    custom_hp=None,
    max_iter=None,
    budget=None,
    featurization="auto",
):
    self.init_eci = learner_class.cost_relative2lgbm() if budget >= 0 else 1
    self._search_space_domain = {}
@@ -82,6 +83,7 @@ class SearchState:
    else:
        data_size = data.shape
    search_space = learner_class.search_space(data_size=data_size, task=task)

    self.data_size = data_size

    if custom_hp is not None:
@@ -91,9 +93,7 @@ class SearchState:
    starting_point = AutoMLState.sanitize(starting_point)
    if max_iter > 1 and not self.valid_starting_point(starting_point, search_space):
        # If the number of iterations is larger than 1, remove invalid point
        logger.warning(
            "Starting point {} removed because it is outside of the search space".format(starting_point)
        )
        logger.warning(f"Starting point {starting_point} removed because it is outside of the search space")
        starting_point = None
    elif isinstance(starting_point, list):
        starting_point = [AutoMLState.sanitize(x) for x in starting_point]
@@ -208,7 +208,7 @@ class SearchState:
        self.val_loss, self.config = obj, config

    def get_hist_config_sig(self, sample_size, config):
        config_values = tuple([config[k] for k in self._hp_names if k in config])
        config_values = tuple(config[k] for k in self._hp_names if k in config)
        config_sig = str(sample_size) + "_" + str(config_values)
        return config_sig

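For illustration (standalone, hypothetical values): the signature is just the sample size joined with the ordered tuple of hyperparameter values, so two trials share a signature only when both agree.

hp_names = ["learning_rate", "num_leaves"]
config = {"learning_rate": 0.1, "num_leaves": 31}
config_values = tuple(config[k] for k in hp_names if k in config)
print(str(10000) + "_" + str(config_values))  # 10000_(0.1, 31)
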
@@ -290,9 +290,11 @@ class AutoMLState:
    budget = (
        None
        if state.time_budget < 0
        else state.time_budget - state.time_from_start
        if sample_size == state.data_size[0]
        else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
        else (
            state.time_budget - state.time_from_start
            if sample_size == state.data_size[0]
            else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
        )
    )

    (
@@ -353,6 +355,7 @@ class AutoMLState:
    estimator: str,
    config_w_resource: dict,
    sample_size: Optional[int] = None,
    is_retrain: bool = False,
):
    if not sample_size:
        sample_size = config_w_resource.get("FLAML_sample_size", len(self.y_train_all))
@@ -378,9 +381,8 @@ class AutoMLState:
    this_estimator_kwargs[
        "groups"
    ] = groups  # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator

    this_estimator_kwargs.update({"is_retrain": is_retrain})
    budget = None if self.time_budget < 0 else self.time_budget - self.time_from_start

    estimator, train_time = train_estimator(
        X_train=sampled_X_train,
        y_train=sampled_y_train,

@@ -16,12 +16,7 @@ from flaml.automl.spark.utils import (
    unique_pandas_on_spark,
    unique_value_first_index,
)
from flaml.automl.task.task import (
    TS_FORECAST,
    TS_FORECASTPANEL,
    Task,
    get_classification_objective,
)
from flaml.automl.task.task import TS_FORECAST, TS_FORECASTPANEL, Task, get_classification_objective
from flaml.config import RANDOM_SEED

try:
@@ -53,13 +48,24 @@ class GenericTask(Task):
    from flaml.automl.contrib.histgb import HistGradientBoostingEstimator
    from flaml.automl.model import (
        CatBoostEstimator,
        ElasticNetEstimator,
        ExtraTreesEstimator,
        KNeighborsEstimator,
        LassoLarsEstimator,
        LGBMEstimator,
        LRL1Classifier,
        LRL2Classifier,
        RandomForestEstimator,
        SGDEstimator,
        SparkAFTSurvivalRegressionEstimator,
        SparkGBTEstimator,
        SparkGLREstimator,
        SparkLGBMEstimator,
        SparkLinearRegressionEstimator,
        SparkLinearSVCEstimator,
        SparkNaiveBayesEstimator,
        SparkRandomForestEstimator,
        SVCEstimator,
        TransformersEstimator,
        TransformersEstimatorModelSelection,
        XGBoostLimitDepthEstimator,
@@ -72,6 +78,7 @@ class GenericTask(Task):
        "rf": RandomForestEstimator,
        "lgbm": LGBMEstimator,
        "lgbm_spark": SparkLGBMEstimator,
        "rf_spark": SparkRandomForestEstimator,
        "lrl1": LRL1Classifier,
        "lrl2": LRL2Classifier,
        "catboost": CatBoostEstimator,
@@ -80,6 +87,16 @@ class GenericTask(Task):
        "transformer": TransformersEstimator,
        "transformer_ms": TransformersEstimatorModelSelection,
        "histgb": HistGradientBoostingEstimator,
        "svc": SVCEstimator,
        "sgd": SGDEstimator,
        "nb_spark": SparkNaiveBayesEstimator,
        "enet": ElasticNetEstimator,
        "lassolars": LassoLarsEstimator,
        "glr_spark": SparkGLREstimator,
        "lr_spark": SparkLinearRegressionEstimator,
        "svc_spark": SparkLinearSVCEstimator,
        "gbt_spark": SparkGBTEstimator,
        "aft_spark": SparkAFTSurvivalRegressionEstimator,
    }
    return self._estimators

@@ -271,8 +288,8 @@ class GenericTask(Task):
    seed=RANDOM_SEED,
)
columns_to_drop = [c for c in df_all_train.columns if c in [stratify_column, "sample_weight"]]
X_train = df_all_train.drop(columns_to_drop)
X_val = df_all_val.drop(columns_to_drop)
X_train = df_all_train.drop(columns=columns_to_drop)
X_val = df_all_val.drop(columns=columns_to_drop)
y_train = df_all_train[stratify_column]
y_val = df_all_val[stratify_column]

@@ -425,8 +442,8 @@ class GenericTask(Task):
    X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
    if data_is_df:
        X_train_all.reset_index(drop=True, inplace=True)
        if isinstance(y_train_all, pd.Series):
            y_train_all.reset_index(drop=True, inplace=True)
    if isinstance(y_train_all, pd.Series):
        y_train_all.reset_index(drop=True, inplace=True)

    X_train, y_train = X_train_all, y_train_all
    state.groups_all = state.groups
@@ -497,14 +514,37 @@ class GenericTask(Task):
    last = first[i] + 1
    rest.extend(range(last, len(y_train_all)))
X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
X_rest = X_train_all.iloc[rest] if data_is_df else X_train_all[rest]
y_rest = (
    y_train_all[rest]
    if isinstance(y_train_all, np.ndarray)
    else iloc_pandas_on_spark(y_train_all, rest)
    if is_spark_dataframe
    else y_train_all.iloc[rest]
)
if len(first) < len(y_train_all) / 2:
    # Get X_rest and y_rest with drop; np.delete cannot be applied to a sparse matrix
    X_rest = (
        np.delete(X_train_all, first, axis=0)
        if isinstance(X_train_all, np.ndarray)
        else X_train_all.drop(first.tolist())
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        np.delete(y_train_all, first, axis=0)
        if isinstance(y_train_all, np.ndarray)
        else y_train_all.drop(first.tolist())
        if data_is_df
        else y_train_all[rest]
    )
else:
    X_rest = (
        iloc_pandas_on_spark(X_train_all, rest)
        if is_spark_dataframe
        else X_train_all.iloc[rest]
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        iloc_pandas_on_spark(y_train_all, rest)
        if is_spark_dataframe
        else y_train_all.iloc[rest]
        if data_is_df
        else y_train_all[rest]
    )
stratify = y_rest if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
    state, X_rest, y_rest, first, rest, split_ratio, stratify
@@ -513,6 +553,12 @@ class GenericTask(Task):
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])

if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
    y_train = y_train[y_train.columns[0]]
    y_val = y_val[y_val.columns[0]]
    y_train.name = y_val.name = y_rest.name

elif self.is_regression():
    X_train, X_val, y_train, y_val = self._train_test_split(
        state, X_train_all, y_train_all, split_ratio=split_ratio
@@ -659,7 +705,6 @@ class GenericTask(Task):
    fit_kwargs = {}
if cv_score_agg_func is None:
    cv_score_agg_func = default_cv_score_agg_func
start_time = time.time()
val_loss_folds = []
log_metric_folds = []
metric = None
@@ -701,7 +746,10 @@ class GenericTask(Task):
elif isinstance(kf, TimeSeriesSplit):
    kf = kf.split(X_train_split, y_train_split)
else:
    kf = kf.split(X_train_split)
    try:
        kf = kf.split(X_train_split)
    except TypeError:
        kf = kf.split(X_train_split, y_train_split)

for train_index, val_index in kf:
    if shuffle:
@@ -724,10 +772,10 @@ class GenericTask(Task):
    if not is_spark_dataframe:
        y_train, y_val = y_train_split[train_index], y_train_split[val_index]
        if weight is not None:
            fit_kwargs["sample_weight"], weight_val = (
                weight[train_index],
                weight[val_index],
            fit_kwargs["sample_weight"] = (
                weight[train_index] if isinstance(weight, np.ndarray) else weight.iloc[train_index]
            )
            weight_val = weight[val_index] if isinstance(weight, np.ndarray) else weight.iloc[val_index]
        if groups is not None:
            fit_kwargs["groups"] = (
                groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]
@@ -766,8 +814,6 @@ class GenericTask(Task):
    if is_spark_dataframe:
        X_train.spark.unpersist()  # uncache data to free memory
        X_val.spark.unpersist()  # uncache data to free memory
    if budget and time.time() - start_time >= budget:
        break
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
@@ -810,27 +856,23 @@ class GenericTask(Task):
elif self.is_ts_forecastpanel():
    estimator_list = ["tft"]
else:
    estimator_list = [
        "lgbm",
        "rf",
        "xgboost",
        "extra_tree",
        "xgb_limitdepth",
        "lgbm_spark",
        "rf_spark",
        "sgd",
    ]
    try:
        import catboost

        estimator_list = [
            "lgbm",
            "rf",
            "catboost",
            "xgboost",
            "extra_tree",
            "xgb_limitdepth",
            "lgbm_spark",
        ]
        estimator_list += ["catboost"]
    except ImportError:
        estimator_list = [
            "lgbm",
            "rf",
            "xgboost",
            "extra_tree",
            "xgb_limitdepth",
            "lgbm_spark",
        ]
        pass

# if self.is_ts_forecast():
#     # catboost is removed because it has a `name` parameter, making it incompatible with hcrystalball
#     if "catboost" in estimator_list:
@@ -862,9 +904,7 @@ class GenericTask(Task):
    return metric

if self.is_nlp():
    from flaml.automl.nlp.utils import (
        load_default_huggingface_metric_for_task,
    )
    from flaml.automl.nlp.utils import load_default_huggingface_metric_for_task

    return load_default_huggingface_metric_for_task(self.name)
elif self.is_binary():

@@ -192,7 +192,7 @@ class Task(ABC):
    * Valid str options depend on different tasks.
    For classification tasks, valid choices are
    ["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
    For regression tasks, valid choices are ["auto", 'uniform', 'time'].
    For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
    "auto" -> uniform.
    For time series forecast tasks, must be "auto" or 'time'.
    For ranking task, must be "auto" or 'group'.

@@ -36,11 +36,17 @@ class TimeSeriesTask(Task):
    LGBM_TS,
    RF_TS,
    SARIMAX,
    Average,
    CatBoost_TS,
    ExtraTrees_TS,
    HoltWinters,
    LassoLars_TS,
    Naive,
    Orbit,
    Prophet,
    SeasonalAverage,
    SeasonalNaive,
    TCNEstimator,
    TemporalFusionTransformerEstimator,
    XGBoost_TS,
    XGBoostLimitDepth_TS,
@@ -57,8 +63,19 @@ class TimeSeriesTask(Task):
    "holt-winters": HoltWinters,
    "catboost": CatBoost_TS,
    "tft": TemporalFusionTransformerEstimator,
    "lassolars": LassoLars_TS,
    "tcn": TCNEstimator,
    "snaive": SeasonalNaive,
    "naive": Naive,
    "savg": SeasonalAverage,
    "avg": Average,
}

if self._estimators["tcn"] is None:
    # remove TCN if import failed
    del self._estimators["tcn"]
    logger.info("Couldn't import pytorch_lightning, skipping TCN estimator")

try:
    from prophet import Prophet as foo

@@ -71,7 +88,7 @@ class TimeSeriesTask(Task):

    self._estimators["orbit"] = Orbit
except ImportError:
    logger.info("Couldn't import Prophet, skipping")
    logger.info("Couldn't import orbit, skipping")

return self._estimators


@@ -1,16 +1,27 @@
from .tft import TemporalFusionTransformerEstimator
from .ts_data import TimeSeriesDataset
from .ts_model import (
    ARIMA,
    LGBM_TS,
    RF_TS,
    SARIMAX,
    Average,
    CatBoost_TS,
    ExtraTrees_TS,
    HoltWinters,
    LassoLars_TS,
    Naive,
    Orbit,
    Prophet,
    SeasonalAverage,
    SeasonalNaive,
    TimeSeriesEstimator,
    XGBoost_TS,
    XGBoostLimitDepth_TS,
)

try:
    from .tcn import TCNEstimator
except ImportError:
    TCNEstimator = None

from .ts_data import TimeSeriesDataset

285 flaml/automl/time_series/tcn.py Normal file
@@ -0,0 +1,285 @@
# This file is adapted from
# https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
# https://github.com/locuslab/TCN/blob/master/TCN/adding_problem/add_test.py

import datetime
import logging
import time

import pandas as pd
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.nn.utils import weight_norm
from torch.utils.data import DataLoader, TensorDataset

from flaml import tune
from flaml.automl.data import add_time_idx_col
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.time_series.ts_model import TimeSeriesEstimator


class Chomp1d(nn.Module):
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, : -self.chomp_size].contiguous()


class TemporalBlock(nn.Module):
    def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
        super().__init__()
        self.conv1 = weight_norm(
            nn.Conv1d(n_inputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
        )
        self.chomp1 = Chomp1d(padding)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(dropout)

        self.conv2 = weight_norm(
            nn.Conv1d(n_outputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
        )
        self.chomp2 = Chomp1d(padding)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(dropout)

        self.net = nn.Sequential(
            self.conv1, self.chomp1, self.relu1, self.dropout1, self.conv2, self.chomp2, self.relu2, self.dropout2
        )
        self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
        self.relu = nn.ReLU()
        self.init_weights()

    def init_weights(self):
        self.conv1.weight.data.normal_(0, 0.01)
        self.conv2.weight.data.normal_(0, 0.01)
        if self.downsample is not None:
            self.downsample.weight.data.normal_(0, 0.01)

    def forward(self, x):
        out = self.net(x)
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(out + res)


class TCNForecaster(nn.Module):
    def __init__(
        self,
        input_feature_num,
        num_outputs,
        num_channels,
        kernel_size=2,
        dropout=0.2,
    ):
        super().__init__()
        layers = []
        num_levels = len(num_channels)
        for i in range(num_levels):
            dilation_size = 2**i
            in_channels = input_feature_num if i == 0 else num_channels[i - 1]
            out_channels = num_channels[i]
            layers += [
                TemporalBlock(
                    in_channels,
                    out_channels,
                    kernel_size,
                    stride=1,
                    dilation=dilation_size,
                    padding=(kernel_size - 1) * dilation_size,
                    dropout=dropout,
                )
            ]

        self.network = nn.Sequential(*layers)
        self.linear = nn.Linear(num_channels[-1], num_outputs)

    def forward(self, x):
        y1 = self.network(x)
        return self.linear(y1[:, :, -1])


class TCNForecasterLightningModule(pl.LightningModule):
    def __init__(self, model: TCNForecaster, learning_rate: float = 1e-3):
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate
        self.loss_fn = nn.MSELoss()

    def forward(self, x):
        return self.model(x)

    def step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = self.loss_fn(y_hat, y)
        return loss

    def training_step(self, batch, batch_idx):
        loss = self.step(batch, batch_idx)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.step(batch, batch_idx)
        self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)


class DataframeDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe, target_column, features_columns, sequence_length, train=True):
        self.data = torch.tensor(dataframe[features_columns].to_numpy(), dtype=torch.float)
        self.sequence_length = sequence_length
        if train:
            self.labels = torch.tensor(dataframe[target_column].to_numpy(), dtype=torch.float)
        self.is_train = train

    def __len__(self):
        return len(self.data) - self.sequence_length + 1

    def __getitem__(self, idx):
        data = self.data[idx : idx + self.sequence_length]
        data = data.permute(1, 0)
        if self.is_train:
            label = self.labels[idx : idx + self.sequence_length]
            return data, label
        else:
            return data


class TCNEstimator(TimeSeriesEstimator):
    """The class for tuning TCN Forecaster"""

    @classmethod
    def search_space(cls, data, task, pred_horizon, **params):
        space = {
            "num_levels": {
                "domain": tune.randint(lower=4, upper=20),
                "init_value": 4,
            },
            "num_hidden": {
                "domain": tune.randint(lower=4, upper=8),  # hidden = 2^num_hidden
                "init_value": 5,
            },
            "kernel_size": {
                "domain": tune.choice([2, 3, 5, 7]),  # common choices for kernel size
                "init_value": 3,
            },
            "dropout": {
                "domain": tune.uniform(lower=0.0, upper=0.5),  # standard range for dropout
                "init_value": 0.1,
            },
            "learning_rate": {
                "domain": tune.loguniform(lower=1e-4, upper=1e-1),  # typical range for learning rate
                "init_value": 1e-3,
            },
        }
        return space

    def __init__(self, task="ts_forecast", n_jobs=1, **params):
        super().__init__(task, **params)
        logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)

    def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
        start_time = time.time()
        if budget is not None:
            deltabudget = datetime.timedelta(seconds=budget)
        else:
            deltabudget = None
        X_train = self.enrich(X_train)
        super().fit(X_train, y_train, budget, **kwargs)

        self.batch_size = kwargs.get("batch_size", 64)
        self.horizon = kwargs.get("period", 1)
        self.feature_cols = X_train.time_varying_known_reals
        self.target_col = X_train.target_names[0]

        train_dataset = DataframeDataset(
            X_train.train_data,
            self.target_col,
            self.feature_cols,
            self.horizon,
        )
        train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=False)
        if not X_train.test_data.empty:
            val_dataset = DataframeDataset(
                X_train.test_data,
                self.target_col,
                self.feature_cols,
                self.horizon,
            )
        else:
            val_dataset = DataframeDataset(
                X_train.train_data.sample(frac=0.2, random_state=kwargs.get("random_state", 0)),
                self.target_col,
                self.feature_cols,
                self.horizon,
            )

        val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)

        model = TCNForecaster(
            len(self.feature_cols),
            self.horizon,
            [2 ** self.params["num_hidden"]] * self.params["num_levels"],
            self.params["kernel_size"],
            self.params["dropout"],
        )

        pl_module = TCNForecasterLightningModule(model, self.params["learning_rate"])

        # Training loop
        # `gpus` is deprecated in v1.7 and removed in v2.0;
        # accelerator="auto" covers all hardware configurations.
        trainer = pl.Trainer(
            max_epochs=kwargs.get("max_epochs", 10),
            accelerator="auto",
            callbacks=[
                EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"),
                LearningRateMonitor(),
            ],
            logger=TensorBoardLogger(kwargs.get("log_dir", "logs/lightning_logs")),  # logging results to a tensorboard
            max_time=deltabudget,
            enable_model_summary=False,
            enable_progress_bar=False,
        )
        trainer.fit(
            pl_module,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader,
        )
        best_model = trainer.model
        self._model = best_model
        train_time = time.time() - start_time
        return train_time

    def predict(self, X):
        X = self.enrich(X)
        if isinstance(X, TimeSeriesDataset):
            df = X.X_val
        else:
            df = X
        dataset = DataframeDataset(
            df,
            self.target_col,
            self.feature_cols,
            self.horizon,
            train=False,
        )
        data_loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)
        self._model.eval()
        raw_preds = []
        for batch_x in data_loader:
            raw_pred = self._model(batch_x)
            raw_preds.append(raw_pred)
        raw_preds = torch.cat(raw_preds, dim=0)
        preds = pd.Series(raw_preds.detach().numpy().ravel())
        return preds

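As an aside (not part of the new file): a standalone shape check for the blocks above, making the expected (batch, features, sequence) layout explicit; the sizes are arbitrary.

import torch

# Batch of 8 windows, 5 input features, 24 time steps per window.
net = TCNForecaster(input_feature_num=5, num_outputs=1, num_channels=[32, 32], kernel_size=3, dropout=0.1)
x = torch.randn(8, 5, 24)  # DataframeDataset.__getitem__ permutes each window to (features, seq_len)
print(net(x).shape)  # torch.Size([8, 1]): one forecast per window, read from the last time step
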
@@ -393,7 +393,7 @@ class DataTransformerTS:

    for column in X.columns:
        # sklearn/utils/validation.py needs int/float values
        if X[column].dtype.name in ("object", "category"):
        if X[column].dtype.name in ("object", "category", "string"):
            if (
                # drop columns where all values are the same
                X[column].nunique() == 1

@@ -26,6 +26,7 @@ from flaml.automl.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.model import (
    CatBoostEstimator,
    ExtraTreesEstimator,
    LassoLarsEstimator,
    LGBMEstimator,
    RandomForestEstimator,
    SKLearnEstimator,
@@ -631,6 +632,125 @@ class HoltWinters(StatsModelsEstimator):
        return train_time


class SimpleForecaster(StatsModelsEstimator):
    """Base class for naive forecasters such as SeasonalNaive, Naive, SeasonalAverage, and Average."""

    @classmethod
    def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
        return {
            "season": {
                "domain": tune.randint(1, pred_horizon),
                "init_value": pred_horizon,
            }
        }

    def joint_preprocess(self, X_train, y_train=None):
        X_train = self.enrich(X_train)

        self.regressors = []

        if isinstance(X_train, TimeSeriesDataset):
            data = X_train
            target_col = data.target_names[0]
            # this class only supports univariate regression
            train_df = data.train_data[self.regressors + [target_col]]
            train_df.index = to_datetime(data.train_data[data.time_col])
        else:
            target_col = TS_VALUE_COL
            train_df = self._join(X_train, y_train)

        self.time_col = data.time_col
        self.target_names = data.target_names

        train_df = self._preprocess(train_df)
        return train_df, target_col

    def fit(self, X_train, y_train=None, budget=None, **kwargs):
        import warnings

        warnings.filterwarnings("ignore")
        from statsmodels.tsa.holtwinters import SimpleExpSmoothing

        self.season = self.params.get("season", 1)
        current_time = time.time()
        super().fit(X_train, y_train, budget=budget, **kwargs)

        train_df, target_col = self.joint_preprocess(X_train, y_train)

        model = SimpleExpSmoothing(
            train_df[[target_col]],
        )
        with suppress_stdout_stderr():
            model = model.fit(smoothing_level=self.smoothing_level)
        train_time = time.time() - current_time
        self._model = model
        return train_time


class SeasonalNaive(SimpleForecaster):
    smoothing_level = 1.0

    def predict(self, X, **kwargs):
        if isinstance(X, int):
            forecasts = []
            for i in range(X):
                forecast = self._model.forecast(steps=self.season)[0]
                forecasts.append(forecast)
            return pd.Series(forecasts)
        else:
            return super().predict(X, **kwargs)


class Naive(SimpleForecaster):
    smoothing_level = 0.0

    @classmethod
    def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
        return {}

    def predict(self, X, **kwargs):
        if isinstance(X, int):
            last_observation = self._model.params["initial_level"]
            return pd.Series([last_observation] * X)
        else:
            return super().predict(X, **kwargs)


class SeasonalAverage(SimpleForecaster):
    def fit(self, X_train, y_train=None, budget=None, **kwargs):
        from statsmodels.tsa.ar_model import AutoReg, ar_select_order

        start_time = time.time()

        self.season = kwargs.get("season", 1)  # seasonality period
        train_df, target_col = self.joint_preprocess(X_train, y_train)
        selection_res = ar_select_order(train_df[target_col], maxlag=self.season)

        # Fit autoregressive model with optimal order
        model = AutoReg(train_df[target_col], lags=selection_res.ar_lags)
        self._model = model.fit()
        end_time = time.time()

        return end_time - start_time


class Average(SimpleForecaster):
    @classmethod
    def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
        return {}

    def fit(self, X_train, y_train=None, budget=None, **kwargs):
        from statsmodels.tsa.ar_model import AutoReg

        start_time = time.time()
        train_df, target_col = self.joint_preprocess(X_train, y_train)
        model = AutoReg(train_df[target_col], lags=0)
        self._model = model.fit()
        end_time = time.time()

        return end_time - start_time

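As a sanity check (standalone sketch): AutoReg with lags=0 fits only an intercept, so Average forecasts the training mean.

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

y = np.array([3.0, 5.0, 4.0, 8.0])
res = AutoReg(y, lags=0).fit()
print(res.params)                   # [5.] -- the intercept is the sample mean
print(res.predict(start=4, end=6))  # [5. 5. 5.] -- constant forecasts
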
class TS_SKLearn(TimeSeriesEstimator):
    """The class for tuning SKLearn Regressors for time-series forecasting"""

@@ -757,3 +877,7 @@ class XGBoostLimitDepth_TS(TS_SKLearn):
# catboost regressor is invalid because it has a `name` parameter, making it incompatible with hcrystalball
class CatBoost_TS(TS_SKLearn):
    base_class = CatBoostEstimator


class LassoLars_TS(TS_SKLearn):
    base_class = LassoLarsEstimator

@@ -11,7 +11,7 @@ from typing import IO
logger = logging.getLogger("flaml.automl")


-class TrainingLogRecord(object):
+class TrainingLogRecord:
    def __init__(
        self,
        record_id: int,

@@ -52,7 +52,7 @@ class TrainingLogCheckPoint(TrainingLogRecord):
        self.curr_best_record_id = curr_best_record_id


-class TrainingLogWriter(object):
+class TrainingLogWriter:
    def __init__(self, output_filename: str):
        self.output_filename = output_filename
        self.file = None

@@ -79,7 +79,7 @@ class TrainingLogWriter(object):
        sample_size,
    ):
        if self.file is None:
-            raise IOError("Call open() to open the output file first.")
+            raise OSError("Call open() to open the output file first.")
        if validation_loss is None:
            raise ValueError("TEST LOSS NONE ERROR!!!")
        record = TrainingLogRecord(

@@ -109,7 +109,7 @@ class TrainingLogWriter(object):

    def checkpoint(self):
        if self.file is None:
-            raise IOError("Call open() to open the output file first.")
+            raise OSError("Call open() to open the output file first.")
        if self.current_best_loss_record_id is None:
            logger.warning("flaml.training_log: checkpoint() called before any record is written, skipped.")
            return

@@ -124,7 +124,7 @@ class TrainingLogWriter(object):
        self.file = None  # for pickle


-class TrainingLogReader(object):
+class TrainingLogReader:
    def __init__(self, filename: str):
        self.filename = filename
        self.file = None

@@ -134,7 +134,7 @@ class TrainingLogReader(object):

    def records(self):
        if self.file is None:
-            raise IOError("Call open() before reading log file.")
+            raise OSError("Call open() before reading log file.")
        for line in self.file:
            data = json.loads(line)
            if len(data) == 1:

@@ -149,7 +149,7 @@ class TrainingLogReader(object):

    def get_record(self, record_id) -> TrainingLogRecord:
        if self.file is None:
-            raise IOError("Call open() before reading log file.")
+            raise OSError("Call open() before reading log file.")
        for rec in self.records():
            if rec.record_id == record_id:
                return rec
@@ -69,7 +69,7 @@ def build_portfolio(meta_features, regret, strategy):

def load_json(filename):
    """Returns the contents of json file filename."""
-    with open(filename, "r") as f:
+    with open(filename) as f:
        return json.load(f)


@@ -43,7 +43,7 @@ def meta_feature(task, X_train, y_train, meta_feature_names):
            # 'numpy.ndarray' object has no attribute 'select_dtypes'
            this_feature.append(1)  # all features are numeric
        else:
-            raise ValueError("Feature {} not implemented. ".format(each_feature_name))
+            raise ValueError(f"Feature {each_feature_name} not implemented. ")

    return this_feature

@@ -57,7 +57,7 @@ def load_config_predictor(estimator_name, task, location=None):
    task = "multiclass" if task == "multi" else task  # TODO: multi -> multiclass?
    try:
        location = location or LOCATION
-        with open(f"{location}/{estimator_name}/{task}.json", "r") as f:
+        with open(f"{location}/{estimator_name}/{task}.json") as f:
            CONFIG_PREDICTORS[key] = predictor = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"Portfolio has not been built for {estimator_name} on {task} task.")
New files in this comparison:
- flaml/fabric/__init__.py (0 lines)
- flaml/fabric/mlflow.py (1021 lines; diff suppressed because it is too large)
- flaml/tune/logger.py (37 lines, shown below)
@@ -0,0 +1,37 @@
import logging
import os


class ColoredFormatter(logging.Formatter):
    # ANSI escape codes for colors
    COLORS = {
        # logging.DEBUG: "\033[36m",  # Cyan
        # logging.INFO: "\033[32m",  # Green
        logging.WARNING: "\033[33m",  # Yellow
        logging.ERROR: "\033[31m",  # Red
        logging.CRITICAL: "\033[1;31m",  # Bright Red
    }
    RESET = "\033[0m"  # Reset to default

    def __init__(self, fmt, datefmt, use_color=True):
        super().__init__(fmt, datefmt)
        self.use_color = use_color

    def format(self, record):
        formatted = super().format(record)
        if self.use_color:
            color = self.COLORS.get(record.levelno, "")
            if color:
                return f"{color}{formatted}{self.RESET}"
        return formatted


logger = logging.getLogger(__name__)
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
    use_color = False

logger_formatter = ColoredFormatter(
    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
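A short usage sketch for the new `ColoredFormatter` (illustrative, assuming a terminal that interprets ANSI escapes):

```python
import logging
import sys

from flaml.tune.logger import ColoredFormatter  # module path per the new file above

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(
    ColoredFormatter("[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S")
)
demo_logger = logging.getLogger("flaml.demo")
demo_logger.addHandler(handler)
demo_logger.warning("rendered in yellow")  # WARNING maps to \033[33m
demo_logger.error("rendered in red")  # ERROR maps to \033[31m
```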
@@ -109,7 +109,7 @@ class FLOW2(Searcher):
        else:
            mode = "min"

-        super(FLOW2, self).__init__(metric=metric, mode=mode)
+        super().__init__(metric=metric, mode=mode)
        # internally minimizes, so "max" => -1
        if mode == "max":
            self.metric_op = -1.0

@@ -350,7 +350,7 @@ class FLOW2(Searcher):
                else:
                    assert (
                        self.lexico_objectives["tolerances"][k_metric][-1] == "%"
-                    ), "String tolerance of {} should use %% as the suffix".format(k_metric)
+                    ), f"String tolerance of {k_metric} should use %% as the suffix"
                    tolerance_bound = self._f_best[k_metric] * (
                        1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
                    )

@@ -385,7 +385,7 @@ class FLOW2(Searcher):
                else:
                    assert (
                        self.lexico_objectives["tolerances"][k_metric][-1] == "%"
-                    ), "String tolerance of {} should use %% as the suffix".format(k_metric)
+                    ), f"String tolerance of {k_metric} should use %% as the suffix"
                    tolerance_bound = self._f_best[k_metric] * (
                        1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
                    )
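To make the tolerance arithmetic above concrete, a standalone sketch with made-up numbers: a best value of 0.20 and a string tolerance of "5%" yield a bound of 0.20 * (1 + 0.01 * 5) = 0.21.

```python
f_best = 0.20     # hypothetical best value recorded for a metric
tolerance = "5%"  # percentage-style tolerance, which must end with "%"
bound = f_best * (1 + 0.01 * float(tolerance.replace("%", "")))
assert abs(bound - 0.21) < 1e-12
```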
@@ -319,7 +319,7 @@ class ChampionFrontierSearcher(BaseSearcher):
        candidate_configs = [set(seed_interactions) | set(item) for item in space]
        final_candidate_configs = []
        for c in candidate_configs:
-            new_c = set([e for e in c if len(e) > 1])
+            new_c = {e for e in c if len(e) > 1}
            final_candidate_configs.append(new_c)
        return final_candidate_configs


@@ -191,7 +191,7 @@ class ConcurrencyLimiter(Searcher):
        self.batch = batch
        self.live_trials = set()
        self.cached_results = {}
-        super(ConcurrencyLimiter, self).__init__(metric=self.searcher.metric, mode=self.searcher.mode)
+        super().__init__(metric=self.searcher.metric, mode=self.searcher.mode)

    def suggest(self, trial_id: str) -> Optional[Dict]:
        assert trial_id not in self.live_trials, f"Trial ID {trial_id} must be unique: already found in set."

@@ -285,25 +285,21 @@ def validate_warmstart(
    """
    if points_to_evaluate:
        if not isinstance(points_to_evaluate, list):
-            raise TypeError("points_to_evaluate expected to be a list, got {}.".format(type(points_to_evaluate)))
+            raise TypeError(f"points_to_evaluate expected to be a list, got {type(points_to_evaluate)}.")
        for point in points_to_evaluate:
            if not isinstance(point, (dict, list)):
                raise TypeError(f"points_to_evaluate expected to include list or dict, got {point}.")

            if validate_point_name_lengths and (not len(point) == len(parameter_names)):
-                raise ValueError(
-                    "Dim of point {}".format(point)
-                    + " and parameter_names {}".format(parameter_names)
-                    + " do not match."
-                )
+                raise ValueError(f"Dim of point {point}" + f" and parameter_names {parameter_names}" + " do not match.")

    if points_to_evaluate and evaluated_rewards:
        if not isinstance(evaluated_rewards, list):
-            raise TypeError("evaluated_rewards expected to be a list, got {}.".format(type(evaluated_rewards)))
+            raise TypeError(f"evaluated_rewards expected to be a list, got {type(evaluated_rewards)}.")
        if not len(evaluated_rewards) == len(points_to_evaluate):
            raise ValueError(
-                "Dim of evaluated_rewards {}".format(evaluated_rewards)
-                + " and points_to_evaluate {}".format(points_to_evaluate)
+                f"Dim of evaluated_rewards {evaluated_rewards}"
+                + f" and points_to_evaluate {points_to_evaluate}"
                + " do not match."
            )
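Shapes that satisfy `validate_warmstart` (parameter names and values are illustrative):

```python
parameter_names = ["learning_rate", "num_leaves"]
points_to_evaluate = [
    {"learning_rate": 0.1, "num_leaves": 31},
    {"learning_rate": 0.05, "num_leaves": 63},
]
evaluated_rewards = [0.92, 0.89]  # one reward per point, so the dimensions match

assert isinstance(points_to_evaluate, list)
assert all(len(p) == len(parameter_names) for p in points_to_evaluate)
assert len(evaluated_rewards) == len(points_to_evaluate)
```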
@@ -547,7 +543,7 @@ class OptunaSearch(Searcher):
        evaluated_rewards: Optional[List] = None,
    ):
        assert ot is not None, "Optuna must be installed! Run `pip install optuna`."
-        super(OptunaSearch, self).__init__(metric=metric, mode=mode)
+        super().__init__(metric=metric, mode=mode)

        if isinstance(space, dict) and space:
            resolved_vars, domain_vars, grid_vars = parse_spec_vars(space)

@@ -252,7 +252,7 @@ def _try_resolve(v) -> Tuple[bool, Any]:
            # Grid search values
            grid_values = v["grid_search"]
            if not isinstance(grid_values, list):
-                raise TuneError("Grid search expected list of values, got: {}".format(grid_values))
+                raise TuneError(f"Grid search expected list of values, got: {grid_values}")
            return False, Categorical(grid_values).grid()
    return True, v

@@ -302,13 +302,13 @@ def has_unresolved_values(spec: Dict) -> bool:

class _UnresolvedAccessGuard(dict):
    def __init__(self, *args, **kwds):
-        super(_UnresolvedAccessGuard, self).__init__(*args, **kwds)
+        super().__init__(*args, **kwds)
        self.__dict__ = self

    def __getattribute__(self, item):
        value = dict.__getattribute__(self, item)
        if not _is_resolved(value):
-            raise RecursiveDependencyError("`{}` recursively depends on {}".format(item, value))
+            raise RecursiveDependencyError(f"`{item}` recursively depends on {value}")
        elif isinstance(value, dict):
            return _UnresolvedAccessGuard(value)
        else:
@@ -162,6 +162,10 @@ def broadcast_code(custom_code="", file_name="mylearner"):
    assert isinstance(MyLargeLGBM(), LGBMEstimator)
    ```
    """
+    # Check if Spark is available
+    spark_available, _ = check_spark()
+
+    # Write to local driver file system
    flaml_path = os.path.dirname(os.path.abspath(__file__))
    custom_code = textwrap.dedent(custom_code)
    custom_path = os.path.join(flaml_path, file_name + ".py")
@@ -169,6 +173,24 @@ def broadcast_code(custom_code="", file_name="mylearner"):
    with open(custom_path, "w") as f:
        f.write(custom_code)

+    # If using Spark, broadcast the code content to executors
+    if spark_available:
+        spark = SparkSession.builder.getOrCreate()
+        bc_code = spark.sparkContext.broadcast(custom_code)
+
+        # Execute a job to ensure the code is distributed to all executors
+        def _write_code(bc):
+            code = bc.value
+            import os
+
+            module_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), file_name + ".py")
+            os.makedirs(os.path.dirname(module_path), exist_ok=True)
+            with open(module_path, "w") as f:
+                f.write(code)
+            return True
+
+        spark.sparkContext.parallelize(range(1)).map(lambda _: _write_code(bc_code)).collect()
+
    return custom_path
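A hedged usage sketch of `broadcast_code` as extended above (the import path and learner name are assumed for illustration): the custom learner source is written on the driver and, when Spark is up, replicated to every executor via a broadcast variable plus a trivial job.

```python
from flaml.tune.spark.utils import broadcast_code  # import path assumed from context

custom_code = """
from flaml.automl.model import LGBMEstimator

class MyLearner(LGBMEstimator):  # hypothetical custom learner
    pass
"""
path = broadcast_code(custom_code=custom_code, file_name="mylearner")
print(path)  # driver-side path of the generated module; executors receive a copy when Spark is available
```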
@@ -110,7 +110,7 @@ class Trial:
                }
                self.metric_n_steps[metric] = {}
                for n in self.n_steps:
-                    key = "last-{:d}-avg".format(n)
+                    key = f"last-{n:d}-avg"
                    self.metric_analysis[metric][key] = value
                    # Store n as string for correct restore.
                    self.metric_n_steps[metric][str(n)] = deque([value], maxlen=n)
@@ -124,7 +124,7 @@ class Trial:
            self.metric_analysis[metric]["last"] = value

            for n in self.n_steps:
-                key = "last-{:d}-avg".format(n)
+                key = f"last-{n:d}-avg"
                self.metric_n_steps[metric][str(n)].append(value)
                self.metric_analysis[metric][key] = sum(self.metric_n_steps[metric][str(n)]) / len(
                    self.metric_n_steps[metric][str(n)]
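The bookkeeping above maintains a bounded rolling window per metric; a standalone sketch of the same mechanism:

```python
from collections import deque

window = deque([0.9], maxlen=3)  # mirrors metric_n_steps[metric]["3"]
for value in (0.8, 0.7, 0.6):
    window.append(value)  # oldest entries fall out once maxlen is reached
last_3_avg = sum(window) / len(window)  # the "last-3-avg" entry in metric_analysis
assert abs(last_3_avg - 0.7) < 1e-12
```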
@@ -21,16 +21,26 @@ except (ImportError, AssertionError):
    from .analysis import ExperimentAnalysis as EA
else:
    ray_available = True

import logging

from flaml.tune.spark.utils import PySparkOvertimeMonitor, check_spark

from .logger import logger, logger_formatter
from .result import DEFAULT_METRIC
from .trial import Trial

logger = logging.getLogger(__name__)
logger.propagate = False
try:
    import mlflow
except ImportError:
    mlflow = None
+try:
+    from flaml.fabric.mlflow import MLflowIntegration, is_autolog_enabled
+
+    internal_mlflow = True
+except ImportError:
+    internal_mlflow = False
+

_use_ray = True
_runner = None
_verbose = 0

@@ -44,6 +54,7 @@ class ExperimentAnalysis(EA):
    """Class for storing the experiment results."""

    def __init__(self, trials, metric, mode, lexico_objectives=None):
+        self.best_run_id = None
        try:
            super().__init__(self, None, trials, metric, mode)
            self.lexico_objectives = lexico_objectives

@@ -128,6 +139,16 @@ class ExperimentAnalysis(EA):
        else:
            return self.best_trial.last_result

+    @property
+    def best_iteration(self) -> Optional[int]:
+        """The index of the best trial in self.trials; helps navigate the results."""
+        best_trial = self.best_trial
+        best_trial_id = best_trial.trial_id
+        for i, trial in enumerate(self.trials):
+            if trial.trial_id == best_trial_id:
+                return i
+        return None
def report(_metric=None, **kwargs):
    """A function called by the HPO application to report final or intermediate

@@ -174,9 +195,16 @@ def report(_metric=None, **kwargs):
    global _training_iteration
    if _use_ray:
        try:
-            from ray import tune
-
-            return tune.report(_metric, **kwargs)
+            from ray import __version__ as ray_version
+
+            if ray_version.startswith("1."):
+                from ray import tune
+
+                return tune.report(_metric, **kwargs)
+            else:  # ray>=2
+                from ray.air import session
+
+                return session.report(metrics={"metric": _metric, **kwargs})
        except ImportError:
            # calling tune.report() outside tune.run()
            return
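For orientation, an evaluation function typically feeds `report` as below; a minimal sketch using the public `flaml.tune` API with a toy objective:

```python
from flaml import tune

def evaluate_config(config):
    loss = (config["x"] - 3) ** 2  # toy objective
    # forwarded to ray 1.x tune.report or ray>=2 session.report when Ray is in use
    tune.report(loss=loss)

analysis = tune.run(
    evaluate_config,
    config={"x": tune.randint(1, 10)},
    metric="loss",
    mode="min",
    num_samples=5,
)
print(analysis.best_config)
```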
@@ -234,6 +262,11 @@ def run(
    lexico_objectives: Optional[dict] = None,
    force_cancel: Optional[bool] = False,
    n_concurrent_trials: Optional[int] = 0,
+    mlflow_exp_name: Optional[str] = None,
+    automl_info: Optional[Tuple[float]] = None,
+    extra_tag: Optional[dict] = None,
    cost_attr: Optional[str] = "auto",
    cost_budget: Optional[float] = None,
    **ray_args,
):
    """The function-based way of performing HPO.

@@ -424,6 +457,10 @@ def run(
        }
        ```
        force_cancel: boolean, default=False | Whether to forcibly cancel the PySpark job if it runs overtime.
+        mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
+            mlflow autologging is enabled on Spark. Otherwise all results are logged to an experiment named
+            after the basename of the main entry file.
+        automl_info: tuple, default=None | Information about the automl run; a tuple of (mlflow_log_latency,).
        n_concurrent_trials: int, default=0 | The number of concurrent trials when performing hyperparameter
            tuning with Spark. Only valid when use_spark=True, and Spark is required:
            `pip install flaml[spark]`. Please check
@@ -431,6 +468,13 @@ def run(
            for more details about installing Spark. When tune.run() is called from AutoML, it will be
            overwritten by the value of `n_concurrent_trials` in AutoML. When <= 0, the concurrent trials
            will be set to the number of executors.
+        extra_tag: dict, default=None | Extra tags to be added to the mlflow runs created by autologging.
+        cost_attr: None or str to specify the attribute to evaluate the cost of different trials.
+            Default is "auto", which means the cost attribute is chosen automatically (depending
+            on the nature of the resource budget). When cost_attr is set to None, cost differences between
+            trials are omitted in the search algorithm. When cost_attr is set to a str other than "auto" and
+            "time_total_s", this cost_attr must be available in the result dict of the trial.
+        cost_budget: A float of the cost budget. Only valid when cost_attr is a str other than "auto" and "time_total_s".
        **ray_args: keyword arguments to pass to ray.tune.run().
            Only valid when use_ray=True.
    """
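A hedged sketch of passing the arguments documented above (the cost attribute name `num_samples_used` and the experiment name are illustrative, not FLAML conventions):

```python
from flaml import tune

def train(config):
    # a custom cost attribute must appear in each trial's result dict, per the docstring above
    return {"loss": (config["x"] - 3) ** 2, "num_samples_used": float(config["x"])}

analysis = tune.run(
    train,
    config={"x": tune.randint(1, 10)},
    metric="loss",
    mode="min",
    num_samples=20,
    cost_attr="num_samples_used",  # a str other than "auto"/"time_total_s"
    cost_budget=50.0,  # only valid together with such a custom cost_attr
    mlflow_exp_name="flaml-tune-demo",  # used when mlflow autologging is enabled
)
```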
@@ -438,10 +482,12 @@ def run(
    global _verbose
    global _running_trial
    global _training_iteration
+    global internal_mlflow
    old_use_ray = _use_ray
    old_verbose = _verbose
    old_running_trial = _running_trial
    old_training_iteration = _training_iteration

    if log_file_name:
        dir_name = os.path.dirname(log_file_name)
        if dir_name:

@@ -473,10 +519,6 @@ def run(
    elif not logger.hasHandlers():
        # Add the console handler.
        _ch = logging.StreamHandler(stream=sys.stdout)
-        logger_formatter = logging.Formatter(
-            "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
-            "%m-%d %H:%M:%S",
-        )
        _ch.setFormatter(logger_formatter)
        logger.addHandler(_ch)
    if verbose <= 2:

@@ -486,6 +528,13 @@ def run(
    else:
        logger.setLevel(logging.CRITICAL)

+    if internal_mlflow and not automl_info and (mlflow.active_run() or is_autolog_enabled()):
+        mlflow_integration = MLflowIntegration("tune", mlflow_exp_name, extra_tag)
+        evaluation_function = mlflow_integration.wrap_evaluation_function(evaluation_function)
+        _internal_mlflow = not automl_info  # True if mlflow_integration will be used for logging
+    else:
+        _internal_mlflow = False
+
    from .searcher.blendsearch import CFO, BlendSearch, RandomSearch

    if lexico_objectives is not None:
@@ -531,7 +580,7 @@ def run(
                import optuna as _

                SearchAlgorithm = BlendSearch
-                logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
+                logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
            except ImportError:
                if search_alg == "BlendSearch":
                    raise ValueError("To use BlendSearch, run: pip install flaml[blendsearch]")
@@ -540,7 +589,7 @@ def run(
                    logger.warning("Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]")
        else:
            SearchAlgorithm = locals()[search_alg]
-            logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
+            logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
        metric = metric or DEFAULT_METRIC
        search_alg = SearchAlgorithm(
            metric=metric,
@@ -560,6 +609,8 @@ def run(
            metric_constraints=metric_constraints,
            use_incumbent_result_in_evaluation=use_incumbent_result_in_evaluation,
            lexico_objectives=lexico_objectives,
+            cost_attr=cost_attr,
+            cost_budget=cost_budget,
        )
    else:
        if metric is None or mode is None:
@@ -695,10 +746,16 @@ def run(
            max_concurrent = max(1, search_alg.max_concurrent)
        else:
            max_concurrent = max(1, max_spark_parallelism)
+        passed_in_n_concurrent_trials = max(n_concurrent_trials, max_concurrent)
        n_concurrent_trials = min(
            n_concurrent_trials if n_concurrent_trials > 0 else num_executors,
            max_concurrent,
        )
+        if n_concurrent_trials < passed_in_n_concurrent_trials:
+            logger.warning(
+                f"The actual concurrent trials is {n_concurrent_trials}. You can set the environment "
+                f"variable `FLAML_MAX_CONCURRENT` to '{passed_in_n_concurrent_trials}' to override the detected num of executors."
+            )
        with parallel_backend("spark"):
            with Parallel(n_jobs=n_concurrent_trials, verbose=max(0, (verbose - 1) * 50)) as parallel:
                try:
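As the warning above indicates, the detected executor count can be overridden through an environment variable; set it before tuning starts (value illustrative):

```python
import os

# FLAML_MAX_CONCURRENT overrides the detected number of executors when capping concurrent Spark trials
os.environ["FLAML_MAX_CONCURRENT"] = "8"
```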
@@ -713,11 +770,15 @@ def run(
                    time_budget_s = np.inf
                    num_failures = 0
                    upperbound_num_failures = (len(evaluated_rewards) if evaluated_rewards else 0) + max_failure
+                    logger.debug(f"automl_info: {automl_info}")
                    while (
                        time.time() - time_start < time_budget_s
                        and (num_samples < 0 or num_trials < num_samples)
                        and num_failures < upperbound_num_failures
                    ):
+                        if automl_info and automl_info[0] > 0 and time_budget_s < np.inf:
+                            time_budget_s -= automl_info[0] * n_concurrent_trials
+                            logger.debug(f"Remaining time budget with mlflow log latency: {time_budget_s} seconds.")
                        while len(_runner.running_trials) < n_concurrent_trials:
                            # suggest trials for spark
                            trial_next = _runner.step()
@@ -750,6 +811,9 @@ def run(
                        trial_to_run = trials_to_run[0]
                        _runner.running_trial = trial_to_run
                        if result is not None:
+                            if _internal_mlflow:
+                                mlflow_integration.record_trial(result, trial_to_run, metric)
+
                            if isinstance(result, dict):
                                if result:
                                    logger.info(f"Brief result: {result}")
@@ -758,7 +822,7 @@ def run(
                                    # When the result returned is an empty dict, set the trial status to error
                                    trial_to_run.set_status(Trial.ERROR)
                            else:
-                                logger.info("Brief result: {}".format({metric: result}))
+                                logger.info(f"Brief result: { {metric: result} }")
                            report(_metric=result)
                        _runner.stop_trial(trial_to_run)
                        num_failures = 0
@@ -768,6 +832,20 @@ def run(
                        mode=mode,
                        lexico_objectives=lexico_objectives,
                    )
+                    analysis.search_space = config
+
+                    if _internal_mlflow:
+                        mlflow_integration.log_tune(analysis, metric)
+                        # try:
+                        #     _best_config = analysis.best_config
+                        # except Exception:
+                        #     _best_config = None
+                        # if _best_config:
+                        #     parallel(
+                        #         delayed(mlflow_integration.retrain)(evaluation_function, analysis.best_config)
+                        #         for dummy in [0]
+                        #     )

                    return analysis
                finally:
                    # recover the global variables in case of nested run
@@ -779,6 +857,8 @@ def run(
                    _runner = old_runner
                    logger.handlers = old_handlers
                    logger.setLevel(old_level)
+                    if _internal_mlflow:
+                        mlflow_integration.adopt_children()

    # simple sequential run without using tune.run() from ray
    time_start = time.time()
@@ -812,7 +892,11 @@ def run(
            result = None
            with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
                result = evaluation_function(trial_to_run.config)
+            logger.debug(f"result in tune: {trial_to_run}, {result}")
            if result is not None:
+                if _internal_mlflow:
+                    mlflow_integration.record_trial(result, trial_to_run, metric)
+
                if isinstance(result, dict):
                    if result:
                        report(**result)
@@ -838,6 +922,19 @@ def run(
            mode=mode,
            lexico_objectives=lexico_objectives,
        )
        analysis.search_space = config
+        if _internal_mlflow:
+            mlflow_integration.log_tune(analysis, metric)
+            if analysis.best_run_id is not None:
+                logger.info(f"Best MLflow run name: {analysis.best_run_name}")
+                logger.info(f"Best MLflow run id: {analysis.best_run_id}")
+            # try:
+            #     _best_config = analysis.best_config
+            # except Exception:
+            #     _best_config = None
+            # if _best_config:
+            #     mlflow_integration.retrain(evaluation_function, analysis.best_config)

        return analysis
    finally:
        # recover the global variables in case of nested run
@@ -849,6 +946,8 @@ def run(
        _runner = old_runner
        logger.handlers = old_handlers
        logger.setLevel(old_level)
+        if _internal_mlflow:
+            mlflow_integration.adopt_children()


class Tuner:
@@ -1 +1 @@
-__version__ = "2.2.0"
+__version__ = "2.3.6"
@@ -174,7 +174,7 @@
    "import datasets\n",
    "\n",
    "seed = 41\n",
-    "data = datasets.load_dataset(\"competition_math\")\n",
+    "data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
    "train_data = data[\"train\"].shuffle(seed=seed)\n",
    "test_data = data[\"test\"].shuffle(seed=seed)\n",
    "n_tune_data = 20\n",
@@ -390,7 +390,7 @@
    "name": "stderr",
    "output_type": "stream",
    "text": [
-     "\u001b[32m[I 2023-08-01 22:38:01,549]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
+     "\u001B[32m[I 2023-08-01 22:38:01,549]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
    ]
   },
   {

@@ -196,7 +196,7 @@
    "import datasets\n",
    "\n",
    "seed = 41\n",
-    "data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
+    "data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
    "n_tune_data = 20\n",
    "tune_data = [\n",
    "    {\n",
@@ -444,8 +444,8 @@
    "name": "stderr",
    "output_type": "stream",
    "text": [
-     "\u001b[32m[I 2023-07-30 04:19:08,150]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n",
-     "\u001b[32m[I 2023-07-30 04:19:08,153]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
+     "\u001B[32m[I 2023-07-30 04:19:08,150]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n",
+     "\u001B[32m[I 2023-07-30 04:19:08,153]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
    ]
   },
   {

@@ -152,7 +152,7 @@
    "import datasets\n",
    "\n",
    "seed = 41\n",
-    "data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
+    "data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
    "data = data.select(range(len(data))).rename_column(\"prompt\", \"definition\").remove_columns([\"task_id\", \"canonical_solution\"])"
   ]
  },

@@ -121,7 +121,7 @@
    "import datasets\n",
    "\n",
    "seed = 41\n",
-    "data = datasets.load_dataset(\"competition_math\")\n",
+    "data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
    "train_data = data[\"train\"].shuffle(seed=seed)\n",
    "test_data = data[\"test\"].shuffle(seed=seed)\n",
    "n_tune_data = 20\n",

@@ -112,9 +112,7 @@
    ]
   }
  ],
-  "source": [
-   "raw_dataset = datasets.load_dataset(\"glue\", TASK)"
-  ]
+  "source": "raw_dataset = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)"
 },
 {
  "cell_type": "code",
@@ -425,9 +423,7 @@
  "execution_count": 14,
  "metadata": {},
  "outputs": [],
-  "source": [
-   "metric = datasets.load_metric(\"glue\", TASK)"
-  ]
+  "source": "metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)"
 },
 {
  "cell_type": "code",
@@ -646,7 +642,7 @@
    "def train_distilbert(config: dict):\n",
    "\n",
    "    # Load CoLA dataset and apply tokenizer\n",
-    "    cola_raw = datasets.load_dataset(\"glue\", TASK)\n",
+    "    cola_raw = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)\n",
    "    cola_encoded = cola_raw.map(tokenize, batched=True)\n",
    "    train_dataset, eval_dataset = cola_encoded[\"train\"], cola_encoded[\"validation\"]\n",
    "\n",
@@ -654,7 +650,7 @@
    "        MODEL_CHECKPOINT, num_labels=NUM_LABELS\n",
    "    )\n",
    "\n",
-    "    metric = datasets.load_metric(\"glue\", TASK)\n",
+    "    metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)\n",
    "    def compute_metrics(eval_pred):\n",
    "        predictions, labels = eval_pred\n",
    "        predictions = np.argmax(predictions, axis=1)\n",
@@ -847,7 +843,7 @@
    "name": "stderr",
    "output_type": "stream",
    "text": [
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
    "  0%|          | 0/9 [00:00<?, ?ba/s]\n",
    " 22%|██▏       | 2/9 [00:00<00:00, 19.41ba/s]\n",
    " 56%|█████▌    | 5/9 [00:00<00:00, 20.98ba/s]\n",
@@ -856,25 +852,25 @@
    "100%|██████████| 2/2 [00:00<00:00, 42.79ba/s]\n",
    "  0%|          | 0/2 [00:00<?, ?ba/s]\n",
    "100%|██████████| 2/2 [00:00<00:00, 41.48ba/s]\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
    ]
   },
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
-     "\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
+     "\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
    ]
   }
  ],
pytest.ini (new file, 3 lines)
@@ -0,0 +1,3 @@
[pytest]
markers =
    spark: mark a test as requiring Spark
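With the marker registered, Spark-dependent tests can be selected or excluded via standard pytest selection; a programmatic equivalent of running `pytest -m spark`:

```python
import pytest

pytest.main(["-m", "spark"])  # run only tests marked as requiring Spark
# pytest.main(["-m", "not spark"])  # or skip them where Spark is unavailable
```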
setup.py (29 lines changed)
@@ -4,7 +4,7 @@ import setuptools

here = os.path.abspath(os.path.dirname(__file__))

-with open("README.md", "r", encoding="UTF-8") as fh:
+with open("README.md", encoding="UTF-8") as fh:
    long_description = fh.read()


@@ -55,7 +55,8 @@ setuptools.setup(
        "lightgbm>=2.3.1",
        "xgboost>=0.90,<2.0.0",
        "scipy>=1.4.1",
-        "pandas>=1.1.4",
+        "pandas>=1.1.4,<2.0.0; python_version<'3.10'",
+        "pandas>=1.1.4; python_version>='3.10'",
        "scikit-learn>=1.0.0",
        "thop",
        "pytest>=6.1.1",
@@ -72,14 +73,14 @@ setuptools.setup(
            "psutil==5.8.0",
            "dataclasses",
            "transformers[torch]==4.26",
-            "datasets",
-            "nltk",
+            "datasets<=3.5.0",
+            "nltk<=3.8.1",  # 3.8.2 doesn't work with mlflow
            "rouge_score",
            "hcrystalball==0.1.10",
            "seqeval",
+            "pytorch-forecasting>=0.9.0,<=0.10.1; python_version<'3.11'",
+            # "pytorch-forecasting==0.10.1; python_version=='3.11'",
-            "mlflow",
+            "mlflow==2.15.1",
            "pyspark>=3.2.0",
            "joblibspark>=0.5.0",
            "joblib<=1.3.2",
            "nbconvert",
@@ -92,6 +93,7 @@ setuptools.setup(
            "pydantic==1.10.9",
            "sympy",
            "wolframalpha",
+            "dill",  # a drop in replacement of pickle
        ],
        "catboost": [
            "catboost>=0.26,<1.2; python_version<'3.11'",
@@ -117,14 +119,14 @@ setuptools.setup(
        "hf": [
            "transformers[torch]==4.26",
            "datasets",
-            "nltk",
+            "nltk<=3.8.1",
            "rouge_score",
            "seqeval",
        ],
        "nlp": [  # for backward compatibility; hf is the new option name
            "transformers[torch]==4.26",
            "datasets",
-            "nltk",
+            "nltk<=3.8.1",
            "rouge_score",
            "seqeval",
        ],
@@ -139,7 +141,8 @@ setuptools.setup(
            "prophet>=1.0.1",
            "statsmodels>=0.12.2",
            "hcrystalball==0.1.10",
-            "pytorch-forecasting>=0.9.0",
+            "pytorch-forecasting>=0.9.0; python_version<'3.11'",
+            # "pytorch-forecasting==0.10.1; python_version=='3.11'",
            "pytorch-lightning==1.9.0",
            "tensorboardX==2.6",
        ],
@@ -163,9 +166,13 @@ setuptools.setup(
        "autozero": ["scikit-learn", "pandas", "packaging"],
    },
    classifiers=[
-        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
+        # Specify the Python versions you support here.
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+        "Programming Language :: Python :: 3.11",
    ],
-    python_requires=">=3.6",
+    python_requires=">=3.9",
)
@@ -178,7 +178,7 @@ def test_tsp(human_input_mode="NEVER", max_consecutive_auto_reply=10):
class TSPUserProxyAgent(UserProxyAgent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
-        with open(f"{here}/tsp_prompt.txt", "r") as f:
+        with open(f"{here}/tsp_prompt.txt") as f:
            self._prompt = f.read()

    def generate_init_message(self, question) -> str:
@@ -187,7 +187,7 @@ def test_humaneval(num_samples=1):
    )

    seed = 41
-    data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
+    data = datasets.load_dataset("openai_humaneval", trust_remote_code=True)["test"].shuffle(seed=seed)
    n_tune_data = 20
    tune_data = [
        {
@@ -334,7 +334,7 @@ def test_math(num_samples=-1):
        return

    seed = 41
-    data = datasets.load_dataset("competition_math")
+    data = datasets.load_dataset("competition_math", trust_remote_code=True)
    train_data = data["train"].shuffle(seed=seed)
    test_data = data["test"].shuffle(seed=seed)
    n_tune_data = 20
@@ -356,7 +356,7 @@ def test_math(num_samples=-1):
    ]
    print(
        "max tokens in tuning data's canonical solutions",
-        max([len(x["solution"].split()) for x in tune_data]),
+        max(len(x["solution"].split()) for x in tune_data),
    )
    print(len(tune_data), len(test_data))
    # prompt template
@@ -1,11 +1,15 @@
import unittest
from datetime import datetime
+from test.conftest import evaluate_cv_folds_with_underlying_model

import numpy as np
import pandas as pd
+import pytest
import scipy.sparse
from sklearn.datasets import load_breast_cancer
-from sklearn.model_selection import train_test_split
+from sklearn.model_selection import (
+    train_test_split,
+)

from flaml import AutoML, tune
from flaml.automl.model import LGBMEstimator
@@ -420,6 +424,122 @@ class TestClassification(unittest.TestCase):
        print(automl_experiment.best_estimator)


+@pytest.mark.parametrize(
+    "estimator",
+    [
+        "catboost",
+        "extra_tree",
+        "histgb",
+        "kneighbor",
+        "lgbm",
+        # "lrl1",
+        "lrl2",
+        "rf",
+        "svc",
+        "xgboost",
+        "xgb_limitdepth",
+    ],
+)
+def test_reproducibility_of_classification_models(estimator: str):
+    """FLAML finds the best model for a given dataset, which it then provides to users.
+
+    However, there are reported issues where FLAML was providing an incorrect model - see here:
+    https://github.com/microsoft/FLAML/issues/1317
+    In this test we take the best model which FLAML provided us, and then retrain and test it on the
+    same folds, to verify that the result is reproducible.
+    """
+    automl = AutoML()
+    automl_settings = {
+        "max_iter": 5,
+        "time_budget": -1,
+        "task": "classification",
+        "n_jobs": 1,
+        "estimator_list": [estimator],
+        "eval_method": "cv",
+        "n_splits": 10,
+        "metric": "f1",
+        "keep_search_state": True,
+        "skip_transform": True,
+    }
+    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
+    automl.fit(X_train=X, y_train=y, **automl_settings)
+    best_model = automl.model
+    assert best_model is not None
+    config = best_model.get_params()
+    val_loss_flaml = automl.best_result["val_loss"]
+
+    # Take the best model, and see if we can reproduce the best result
+    reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
+        config=config,
+        estimator=best_model,
+        X_train_all=automl._state.X_train_all,
+        y_train_all=automl._state.y_train_all,
+        budget=None,
+        kf=automl._state.kf,
+        eval_metric="f1",
+        best_val_loss=None,
+        cv_score_agg_func=None,
+        log_training_metric=False,
+        fit_kwargs=None,
+        free_mem_ratio=0,
+    )
+    assert pytest.approx(val_loss_flaml) == reproduced_val_loss
+
+
+@pytest.mark.parametrize(
+    "estimator",
+    [
+        "catboost",
+        "extra_tree",
+        "histgb",
+        "kneighbor",
+        "lgbm",
+        # "lrl1",
+        "lrl2",
+        "svc",
+        "rf",
+        "xgboost",
+        "xgb_limitdepth",
+    ],
+)
+def test_reproducibility_of_underlying_classification_models(estimator: str):
+    """FLAML finds the best model for a given dataset, which it then provides to users.
+
+    However, there are reported issues where FLAML was providing an incorrect model - see here:
+    https://github.com/microsoft/FLAML/issues/1317
+    FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
+    Ideally, FLAMLised models should perform identically to the underlying model, when fitted
+    to the same data, with no budget. This verifies that this is the case for classification models.
+    In this test we take the best model which FLAML provided us, extract the underlying model,
+    before retraining and testing it on the same folds - to verify that the result is reproducible.
+    """
+    automl = AutoML()
+    automl_settings = {
+        "max_iter": 5,
+        "time_budget": -1,
+        "task": "classification",
+        "n_jobs": 1,
+        "estimator_list": [estimator],
+        "eval_method": "cv",
+        "n_splits": 10,
+        "metric": "f1",
+        "keep_search_state": True,
+        "skip_transform": True,
+    }
+    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
+    automl.fit(X_train=X, y_train=y, **automl_settings)
+    best_model = automl.model
+    assert best_model is not None
+    val_loss_flaml = automl.best_result["val_loss"]
+    reproduced_val_loss_underlying_model = np.mean(
+        evaluate_cv_folds_with_underlying_model(
+            automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "classification"
+        )
+    )
+
+    assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model


if __name__ == "__main__":
    test = TestClassification()
    test.test_preprocess()
@@ -125,14 +125,12 @@ def test_metric_constraints_custom():
    print(automl.estimator_list)
    print(automl.search_space)
    print(automl.points_to_evaluate)
-    print("Best minimization objective on validation data: {0:.4g}".format(automl.best_loss))
+    print(f"Best minimization objective on validation data: {automl.best_loss:.4g}")
    print(
-        "pred_time of the best config on validation data: {0:.4g}".format(
-            automl.metrics_for_best_config[1]["pred_time"]
-        )
+        "pred_time of the best config on validation data: {:.4g}".format(automl.metrics_for_best_config[1]["pred_time"])
    )
    print(
-        "val_train_loss_gap of the best config on validation data: {0:.4g}".format(
+        "val_train_loss_gap of the best config on validation data: {:.4g}".format(
            automl.metrics_for_best_config[1]["val_train_loss_gap"]
        )
    )
test/automl/test_extra_models.py (new file, 312 lines)
@@ -0,0 +1,312 @@
import os
import sys
import unittest
import warnings
from collections import defaultdict

import mlflow
import numpy as np
import pandas as pd
import pytest
import scipy
from packaging.version import Version
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.model_selection import train_test_split

from flaml import AutoML
from flaml.automl.ml import sklearn_metric_loss_score
from flaml.tune.spark.utils import check_spark

pytestmark = pytest.mark.spark

leaderboard = defaultdict(dict)

warnings.simplefilter(action="ignore")
if sys.platform == "darwin" or "nt" in os.name:
    # skip the spark tests if the platform is not linux
    skip_spark = True
else:
    try:
        import pyspark
        from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator
        from pyspark.ml.feature import VectorAssembler

        from flaml.automl.spark.utils import to_pandas_on_spark

        spark = (
            pyspark.sql.SparkSession.builder.appName("MyApp")
            .master("local[2]")
            .config(
                "spark.jars.packages",
                (
                    "com.microsoft.azure:synapseml_2.12:1.0.2,"
                    "org.apache.hadoop:hadoop-azure:3.3.5,"
                    "com.microsoft.azure:azure-storage:8.6.6,"
                    f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
                    if Version(mlflow.__version__) >= Version("2.9.0")
                    else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
                ),
            )
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
            .config("spark.sql.debug.maxToStringFields", "100")
            .config("spark.driver.extraJavaOptions", "-Xss1m")
            .config("spark.executor.extraJavaOptions", "-Xss1m")
            .getOrCreate()
        )
        spark.sparkContext._conf.set(
            "spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
            "https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
        )
        # spark.sparkContext.setLogLevel("ERROR")
        spark_available, _ = check_spark()
        skip_spark = not spark_available
    except ImportError:
        skip_spark = True


def _test_regular_models(estimator_list, task):
    if isinstance(estimator_list, str):
        estimator_list = [estimator_list]
    if task == "classification":
        load_dataset_func = load_iris
        metric = "accuracy"
    else:
        load_dataset_func = load_diabetes
        metric = "r2"

    x, y = load_dataset_func(return_X_y=True, as_frame=True)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7654321)

    automl_experiment = AutoML()
    automl_settings = {
        "max_iter": 5,
        "task": task,
        "estimator_list": estimator_list,
        "metric": metric,
    }
    automl_experiment.fit(X_train=x_train, y_train=y_train, **automl_settings)
    predictions = automl_experiment.predict(x_test)
    score = sklearn_metric_loss_score(metric, predictions, y_test)
    for estimator_name in estimator_list:
        leaderboard[task][estimator_name] = score


def _test_spark_models(estimator_list, task):
    if isinstance(estimator_list, str):
        estimator_list = [estimator_list]
    if task == "classification":
        load_dataset_func = load_iris
        evaluator = MulticlassClassificationEvaluator(
            labelCol="target", predictionCol="prediction", metricName="accuracy"
        )
        metric = "accuracy"

    elif task == "regression":
        load_dataset_func = load_diabetes
        evaluator = RegressionEvaluator(labelCol="target", predictionCol="prediction", metricName="r2")
        metric = "r2"

    elif task == "binary":
        load_dataset_func = load_breast_cancer
        evaluator = MulticlassClassificationEvaluator(
            labelCol="target", predictionCol="prediction", metricName="accuracy"
        )
        metric = "accuracy"

    final_cols = ["target", "features"]
    extra_args = {}

    if estimator_list is not None and "aft_spark" in estimator_list:
        # survival analysis task
        pd_df = pd.read_csv(
            "https://raw.githubusercontent.com/CamDavidsonPilon/lifelines/master/lifelines/datasets/rossi.csv"
        )
        pd_df.rename(columns={"week": "target"}, inplace=True)
        final_cols += ["arrest"]
        extra_args["censorCol"] = "arrest"
    else:
        pd_df = load_dataset_func(as_frame=True).frame

    rename = {}
    for attr in pd_df.columns:
        rename[attr] = attr.replace(" ", "_")
    pd_df = pd_df.rename(columns=rename)
    df = spark.createDataFrame(pd_df)
    df = df.repartition(4)
    train, test = df.randomSplit([0.8, 0.2], seed=7654321)
    feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
    featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train_data = featurizer.transform(train)[final_cols]
    test_data = featurizer.transform(test)[final_cols]
    automl = AutoML()
    settings = {
        "max_iter": 1,
        "estimator_list": estimator_list,  # ML learner we intend to test
        "task": task,  # task type
        "metric": metric,  # metric to optimize
    }
    settings.update(extra_args)
    df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))

    automl.fit(
        dataframe=df,
        label="target",
        **settings,
    )

    model = automl.model.estimator
    predictions = model.transform(test_data)
    predictions.show(5)

    score = evaluator.evaluate(predictions)
    if estimator_list is not None:
        for estimator_name in estimator_list:
            leaderboard[task][estimator_name] = score


def _test_sparse_matrix_classification(estimator):
    automl_experiment = AutoML()
    automl_settings = {
        "estimator_list": [estimator],
        "time_budget": 2,
        "metric": "auto",
        "task": "classification",
        "log_file_name": "test/sparse_classification.log",
        "split_type": "uniform",
        "n_jobs": 1,
        "model_history": True,
    }
    X_train = scipy.sparse.random(1554, 21, dtype=int)
    y_train = np.random.randint(3, size=1554)
    automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)


def load_multi_dataset():
    """multivariate time series forecasting dataset"""
    import pandas as pd

    # pd.set_option("display.max_rows", None, "display.max_columns", None)
    df = pd.read_csv(
        "https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/nyc_energy_consumption.csv"
    )
    # preprocessing data
    df["timeStamp"] = pd.to_datetime(df["timeStamp"])
    df = df.set_index("timeStamp")
    df = df.resample("D").mean()
    df["temp"] = df["temp"].fillna(method="ffill")
    df["precip"] = df["precip"].fillna(method="ffill")
    df = df[:-2]  # last two rows are NaN for 'demand' column so remove them
    df = df.reset_index()

    return df


def _test_forecast(estimator_list, budget=10):
    if isinstance(estimator_list, str):
        estimator_list = [estimator_list]
    df = load_multi_dataset()
    # split data into train and test
    time_horizon = 180
    num_samples = df.shape[0]
    split_idx = num_samples - time_horizon
    train_df = df[:split_idx]
    test_df = df[split_idx:]
    # test dataframe must contain values for the regressors / multivariate variables
    X_test = test_df[["timeStamp", "precip", "temp"]]
    y_test = test_df["demand"]
    # return
    automl = AutoML()
    settings = {
        "time_budget": budget,  # total running time in seconds
        "metric": "mape",  # primary metric
        "task": "ts_forecast",  # task type
        "log_file_name": "test/energy_forecast_numerical.log",  # flaml log file
        "log_dir": "logs/forecast_logs",  # tcn/tft log folder
        "eval_method": "holdout",
        "log_type": "all",
        "label": "demand",
        "estimator_list": estimator_list,
    }
    # the main flaml automl API
    automl.fit(dataframe=train_df, **settings, period=time_horizon)
    print(automl.best_config)
    pred_y = automl.predict(X_test)
    mape = sklearn_metric_loss_score("mape", pred_y, y_test)
    for estimator_name in estimator_list:
        leaderboard["forecast"][estimator_name] = mape


class TestExtraModel(unittest.TestCase):
    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_rf_spark(self):
        tasks = ["classification", "regression"]
        for task in tasks:
            _test_spark_models("rf_spark", task)

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_nb_spark(self):
        _test_spark_models("nb_spark", "classification")

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_glr(self):
        _test_spark_models("glr_spark", "regression")

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_lr(self):
        _test_spark_models("lr_spark", "regression")

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_svc_spark(self):
        _test_spark_models("svc_spark", "binary")

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_gbt_spark(self):
        tasks = ["binary", "regression"]
        for task in tasks:
            _test_spark_models("gbt_spark", task)

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_aft(self):
        _test_spark_models("aft_spark", "regression")

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_default_spark(self):
        _test_spark_models(None, "classification")

    def test_svc(self):
        _test_regular_models("svc", "classification")
        _test_sparse_matrix_classification("svc")

    def test_sgd(self):
        tasks = ["classification", "regression"]
        for task in tasks:
            _test_regular_models("sgd", task)
        _test_sparse_matrix_classification("sgd")

    def test_enet(self):
        _test_regular_models("enet", "regression")

    def test_lassolars(self):
        _test_regular_models("lassolars", "regression")
        _test_forecast("lassolars")

    def test_seasonal_naive(self):
        _test_forecast("snaive")

    def test_naive(self):
        _test_forecast("naive")

    def test_seasonal_avg(self):
        _test_forecast("savg")

    def test_avg(self):
        _test_forecast("avg")

    @unittest.skipIf(skip_spark, reason="Skip on Mac or Windows")
    def test_tcn(self):
        _test_forecast("tcn")


if __name__ == "__main__":
    unittest.main()
    print(leaderboard)
@@ -1,4 +1,5 @@
import datetime
+import os
import sys

import numpy as np
@@ -95,6 +96,7 @@ def test_forecast_automl(budget=10, estimators_when_no_prophet=["arima", "sarima
    )


+@pytest.mark.skipif(sys.platform == "darwin" or "nt" in os.name, reason="skip on mac or windows")
def test_models(budget=3):
    n = 200
    X = pd.DataFrame(
@@ -151,6 +153,10 @@ def test_numpy():
    print(automl.predict(12))


+@pytest.mark.skipif(
+    sys.platform in ["darwin"],
+    reason="do not run on mac os",
+)
def test_numpy_large():
    import numpy as np
    import pandas as pd
@@ -471,7 +477,10 @@ def test_forecast_classification(budget=5):
def get_stalliion_data():
    from pytorch_forecasting.data.examples import get_stallion_data

-    data = get_stallion_data()
+    # data = get_stallion_data()
+    data = pd.read_parquet(
+        "https://raw.githubusercontent.com/sktime/pytorch-forecasting/refs/heads/main/examples/data/stallion.parquet"
+    )
    # add time index - For datasets with no missing values, FLAML will automate this process
    data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
    data["time_idx"] -= data["time_idx"].min()
@@ -567,7 +576,7 @@ def test_forecast_panel(budget=5):
    print(f"Training duration of best run: {automl.best_config_train_time}s")
    print(automl.model.estimator)
    """ pickle and save the automl object """
-    import pickle
+    import dill as pickle

    with open("automl.pkl", "wb") as f:
        pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
51
test/automl/test_max_iter_1.py
Normal file
51
test/automl/test_max_iter_1.py
Normal file
@@ -0,0 +1,51 @@
|
||||
import mlflow
import numpy as np
import pandas as pd

from flaml import AutoML


def test_max_iter_1():
    date_rng = pd.date_range(start="2024-01-01", periods=100, freq="H")
    X = pd.DataFrame({"ds": date_rng})
    y_train_24h = np.random.rand(len(X)) * 100

    # AutoML
    settings = {
        "max_iter": 1,
        "estimator_list": ["xgboost", "lgbm"],
        "starting_points": {"xgboost": {}, "lgbm": {}},
        "task": "ts_forecast",
        "log_file_name": "test_max_iter_1.log",
        "seed": 41,
        "mlflow_exp_name": "TestExp-max_iter-1",
        "use_spark": False,
        "n_concurrent_trials": 1,
        "verbose": 1,
        "featurization": "off",
        "metric": "rmse",
        "mlflow_logging": True,
    }
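    # max_iter=1 is the edge case under test: the whole search gets a single trial,
    # and the assertions below check that FLAML still returns a model and records
    # an MLflow run.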

    automl = AutoML(**settings)

    with mlflow.start_run(run_name="AutoMLModel-XGBoost-and-LGBM-max_iter_1"):
        automl.fit(
            X_train=X,
            y_train=y_train_24h,
            period=24,
            X_val=X,
            y_val=y_train_24h,
            split_ratio=0,
            force_cancel=False,
        )

    assert automl.model is not None, "AutoML failed to return a model"
    assert automl.best_run_id is not None, "Best run ID should not be None with mlflow logging"

    print("Best model:", automl.model)
    print("Best run ID:", automl.best_run_id)


if __name__ == "__main__":
    test_max_iter_1()
@@ -1,3 +1,5 @@
import pickle

import mlflow
import mlflow.entities
import pytest
@@ -8,58 +10,113 @@ from flaml import AutoML


class TestMLFlowLoggingParam:
    def test_update_and_install_requirements(self):
        import mlflow
        from sklearn import tree

        from flaml.fabric.mlflow import update_and_install_requirements

        with mlflow.start_run(run_name="test") as run:
            sk_model = tree.DecisionTreeClassifier()
            mlflow.sklearn.log_model(sk_model, "model", registered_model_name="test")

        update_and_install_requirements(run_id=run.info.run_id)

    def test_should_start_new_run_by_default(self, automl_settings):
        with mlflow.start_run():
            parent = mlflow.last_active_run()
        with mlflow.start_run() as parent_run:
            automl = AutoML()
            X_train, y_train = load_iris(return_X_y=True)
            automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
            try:
                self._check_mlflow_parameters(automl, parent_run.info)
            except FileNotFoundError:
                print("[WARNING]: No file found")

        children = self._get_child_runs(parent)
        assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
        children = self._get_child_runs(parent_run)
        assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"

    def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_init(self, automl_settings):
        with mlflow.start_run():
            parent = mlflow.last_active_run()
        with mlflow.start_run() as parent_run:
            automl = AutoML(mlflow_logging=False)
            X_train, y_train = load_iris(return_X_y=True)
            automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
            try:
                self._check_mlflow_parameters(automl, parent_run.info)
            except FileNotFoundError:
                print("[WARNING]: No file found")

        children = self._get_child_runs(parent)
        assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
        children = self._get_child_runs(parent_run)
        assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"

    def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_fit(self, automl_settings):
        with mlflow.start_run():
            parent = mlflow.last_active_run()
        with mlflow.start_run() as parent_run:
            automl = AutoML()
            X_train, y_train = load_iris(return_X_y=True)
            automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=False, **automl_settings)
            try:
                self._check_mlflow_parameters(automl, parent_run.info)
            except FileNotFoundError:
                print("[WARNING]: No file found")

        children = self._get_child_runs(parent)
        assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
        children = self._get_child_runs(parent_run)
        assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"

    def test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(self, automl_settings):
        with mlflow.start_run():
            parent = mlflow.last_active_run()
        with mlflow.start_run() as parent_run:
            automl = AutoML(mlflow_logging=False)
            X_train, y_train = load_iris(return_X_y=True)
            automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=True, **automl_settings)
            try:
                self._check_mlflow_parameters(automl, parent_run.info)
            except FileNotFoundError:
                print("[WARNING]: No file found")

        children = self._get_child_runs(parent)
        assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
        children = self._get_child_runs(parent_run)
        assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"

    @staticmethod
    def _get_child_runs(parent_run: mlflow.entities.Run) -> DataFrame:
        experiment_id = parent_run.info.experiment_id
        return mlflow.search_runs(
            [experiment_id], filter_string="tags.mlflow.parentRunId = '{}'".format(parent_run.info.run_id)
            [experiment_id], filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
        )

    @staticmethod
    def _check_mlflow_parameters(automl: AutoML, run_info: mlflow.entities.RunInfo):
        with open(
            f"./mlruns/{run_info.experiment_id}/{run_info.run_id}/artifacts/automl_pipeline/model.pkl", "rb"
        ) as f:
            t = pickle.load(f)
        if __name__ == "__main__":
            print(t)
        if not hasattr(automl.model._model, "_get_param_names"):
            return
        for param in automl.model._model._get_param_names():
            assert eval("t._final_estimator._model" + f".{param}") == eval(
                "automl.model._model" + f".{param}"
            ), "The MLflow logging is not consistent with the automl model"
            if __name__ == "__main__":
                print(param, "\t", eval("automl.model._model" + f".{param}"))
        print("[INFO]: Successfully Logged")

    @pytest.fixture(scope="class")
    def automl_settings(self):
        mlflow.end_run()
        return {
            "time_budget": 2,  # in seconds
            "time_budget": 5,  # in seconds
            "metric": "accuracy",
            "task": "classification",
            "log_file_name": "iris.log",
        }


if __name__ == "__main__":
    s = TestMLFlowLoggingParam()
    automl_settings = {
        "time_budget": 5,  # in seconds
        "metric": "accuracy",
        "task": "classification",
        "log_file_name": "iris.log",
    }
    s.test_should_start_new_run_by_default(automl_settings)
    s.test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(automl_settings)
@@ -143,4 +143,5 @@ def test_prep():


if __name__ == "__main__":
    test_lrl2()
    test_prep()
@@ -187,7 +187,6 @@ class TestMultiClass(unittest.TestCase):
    def test_custom_metric(self):
        df, y = load_iris(return_X_y=True, as_frame=True)
        df["label"] = y
        automl = AutoML()
        settings = {
            "dataframe": df,
            "label": "label",
@@ -204,7 +203,8 @@ class TestMultiClass(unittest.TestCase):
            "pred_time_limit": 1e-5,
            "ensemble": True,
        }
        automl.fit(**settings)
        automl = AutoML(**settings)  # test safe_json_dumps
        automl.fit(dataframe=df, label="label")
        print(automl.classes_)
        print(automl.model)
        print(automl.config_history)
@@ -438,8 +438,8 @@ class TestMultiClass(unittest.TestCase):
        automl_val_accuracy = 1.0 - automl.best_loss
        print("Best ML learner:", automl.best_estimator)
        print("Best hyperparameter config:", automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
        print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
        print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
        print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")

        starting_points = automl.best_config_per_estimator
        print("starting_points", starting_points)
@@ -461,8 +461,8 @@ class TestMultiClass(unittest.TestCase):
        new_automl_val_accuracy = 1.0 - new_automl.best_loss
        print("Best ML learner:", new_automl.best_estimator)
        print("Best hyperparameter config:", new_automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
        print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
        print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")

    def test_fit_w_starting_point_2(self, as_frame=True):
        try:
@@ -493,8 +493,8 @@ class TestMultiClass(unittest.TestCase):
        automl_val_accuracy = 1.0 - automl.best_loss
        print("Best ML learner:", automl.best_estimator)
        print("Best hyperparameter config:", automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
        print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
        print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
        print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")

        starting_points = {}
        log_file_name = settings["log_file_name"]
@@ -508,7 +508,7 @@ class TestMultiClass(unittest.TestCase):
            if learner not in starting_points:
                starting_points[learner] = []
            starting_points[learner].append(config)
        max_iter = sum([len(s) for k, s in starting_points.items()])
        max_iter = sum(len(s) for k, s in starting_points.items())
        settings_resume = {
            "time_budget": 2,
            "metric": "accuracy",
@@ -528,7 +528,7 @@ class TestMultiClass(unittest.TestCase):
        new_automl_val_accuracy = 1.0 - new_automl.best_loss
        # print('Best ML learner:', new_automl.best_estimator)
        # print('Best hyperparameter config:', new_automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
        # print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))


@@ -65,8 +65,8 @@ def test_automl(budget=5, dataset_format="dataframe", hpo_method=None):
    """ retrieve best config and best learner """
    print("Best ML learner:", automl.best_estimator)
    print("Best hyperparameter config:", automl.best_config)
    print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
    print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
    print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
    print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
    print(automl.model.estimator)
    print(automl.best_config_per_estimator)
    print("time taken to find best model:", automl.time_to_find_best_model)

@@ -1,9 +1,12 @@
import unittest
from test.conftest import evaluate_cv_folds_with_underlying_model

import numpy as np
import pytest
import scipy.sparse
from sklearn.datasets import (
    fetch_california_housing,
    make_regression,
)

from flaml import AutoML
@@ -205,7 +208,6 @@ class TestRegression(unittest.TestCase):


def test_multioutput():
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import MultiOutputRegressor, RegressorChain

@@ -230,5 +232,210 @@ def test_multioutput():
    print(model.predict(X_test))


@pytest.mark.parametrize(
    "estimator",
    [
        "catboost",
        "enet",
        "extra_tree",
        "histgb",
        "kneighbor",
        "lgbm",
        "rf",
        "xgboost",
        "xgb_limitdepth",
    ],
)
def test_reproducibility_of_regression_models(estimator: str):
    """FLAML finds the best model for a given dataset, which it then provides to users.

    However, there are reported issues where FLAML was providing an incorrect model - see here:
    https://github.com/microsoft/FLAML/issues/1317
    In this test we take the best regression model which FLAML provided us, and then retrain and test it on the
    same folds, to verify that the result is reproducible.
    """
    automl = AutoML()
    automl_settings = {
        "max_iter": 2,
        "time_budget": -1,
        "task": "regression",
        "n_jobs": 1,
        "estimator_list": [estimator],
        "eval_method": "cv",
        "n_splits": 3,
        "metric": "r2",
        "keep_search_state": True,
        "skip_transform": True,
        "retrain_full": True,
    }
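    # keep_search_state retains automl._state (the training data and CV splitter),
    # which lets the evaluation below re-run exactly the same folds; skip_transform
    # keeps the raw features comparable between the two runs.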
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None
    config = best_model.get_params()
    val_loss_flaml = automl.best_result["val_loss"]

    # Take the best model, and see if we can reproduce the best result
    reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
        config=config,
        estimator=best_model,
        X_train_all=automl._state.X_train_all,
        y_train_all=automl._state.y_train_all,
        budget=None,
        kf=automl._state.kf,
        eval_metric="r2",
        best_val_loss=None,
        cv_score_agg_func=None,
        log_training_metric=False,
        fit_kwargs=None,
        free_mem_ratio=0,
    )
    assert pytest.approx(val_loss_flaml) == reproduced_val_loss


def test_reproducibility_of_catboost_regression_model():
    """FLAML finds the best model for a given dataset, which it then provides to users.

    However, there are reported issues around the catboost model - see here:
    https://github.com/microsoft/FLAML/issues/1317
    In this test we take the best catboost regression model which FLAML provided us, and then retrain and test it on the
    same folds, to verify that the result is reproducible.
    """
    automl = AutoML()
    automl_settings = {
        "time_budget": 7,
        "task": "regression",
        "n_jobs": 1,
        "estimator_list": ["catboost"],
        "eval_method": "cv",
        "n_splits": 10,
        "metric": "r2",
        "keep_search_state": True,
        "skip_transform": True,
        "retrain_full": True,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None
    config = best_model.get_params()
    val_loss_flaml = automl.best_result["val_loss"]

    # Take the best model, and see if we can reproduce the best result
    reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
        config=config,
        estimator=best_model,
        X_train_all=automl._state.X_train_all,
        y_train_all=automl._state.y_train_all,
        budget=None,
        kf=automl._state.kf,
        eval_metric="r2",
        best_val_loss=None,
        cv_score_agg_func=None,
        log_training_metric=False,
        fit_kwargs=None,
        free_mem_ratio=0,
    )
    assert pytest.approx(val_loss_flaml) == reproduced_val_loss


def test_reproducibility_of_lgbm_regression_model():
    """FLAML finds the best model for a given dataset, which it then provides to users.

    However, there are reported issues around LGBMs - see here:
    https://github.com/microsoft/FLAML/issues/1368
    In this test we take the best LGBM regression model which FLAML provided us, and then retrain and test it on the
    same folds, to verify that the result is reproducible.
    """
    automl = AutoML()
    automl_settings = {
        "time_budget": 3,
        "task": "regression",
        "n_jobs": 1,
        "estimator_list": ["lgbm"],
        "eval_method": "cv",
        "n_splits": 9,
        "metric": "r2",
        "keep_search_state": True,
        "skip_transform": True,
        "retrain_full": True,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None
    config = best_model.get_params()
    val_loss_flaml = automl.best_result["val_loss"]

    # Take the best model, and see if we can reproduce the best result
    reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
        config=config,
        estimator=best_model,
        X_train_all=automl._state.X_train_all,
        y_train_all=automl._state.y_train_all,
        budget=None,
        kf=automl._state.kf,
        eval_metric="r2",
        best_val_loss=None,
        cv_score_agg_func=None,
        log_training_metric=False,
        fit_kwargs=None,
        free_mem_ratio=0,
    )
    assert pytest.approx(val_loss_flaml) == reproduced_val_loss or val_loss_flaml > reproduced_val_loss
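    # The reproduced loss is also accepted when strictly lower than the recorded one;
    # a plausible cause is the time budget cutting the original LGBM fit short, while
    # the re-fit here runs to completion.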


@pytest.mark.parametrize(
    "estimator",
    [
        "catboost",
        "enet",
        "extra_tree",
        "histgb",
        "kneighbor",
        "lgbm",
        "rf",
        "xgboost",
        "xgb_limitdepth",
    ],
)
def test_reproducibility_of_underlying_regression_models(estimator: str):
    """FLAML finds the best model for a given dataset, which it then provides to users.

    However, there are reported issues where FLAML was providing an incorrect model - see here:
    https://github.com/microsoft/FLAML/issues/1317
    FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
    Ideally, FLAMLised models should perform identically to the underlying model, when fitted
    to the same data, with no budget. This verifies that this is the case for regression models.
    In this test we take the best model which FLAML provided us, extract the underlying model,
    before retraining and testing it on the same folds - to verify that the result is reproducible.
    """
    automl = AutoML()
    automl_settings = {
        "max_iter": 5,
        "time_budget": -1,
        "task": "regression",
        "n_jobs": 1,
        "estimator_list": [estimator],
        "eval_method": "cv",
        "n_splits": 10,
        "metric": "r2",
        "keep_search_state": True,
        "skip_transform": True,
        "retrain_full": False,
    }
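    # Unlike the tests above, retrain_full=False: automl.model is then the estimator
    # fitted during the search itself, and the fold-by-fold check below re-fits its
    # underlying model directly.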
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None
    val_loss_flaml = automl.best_result["val_loss"]
    reproduced_val_loss_underlying_model = np.mean(
        evaluate_cv_folds_with_underlying_model(
            automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "regression"
        )
    )
    assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model


if __name__ == "__main__":
    unittest.main()

@@ -195,7 +195,7 @@ class TestScore:
        automl_settings = {
            "time_budget": 2,
            "task": "rank",
            "log_file_name": "test/{}.log".format(dataset),
            "log_file_name": f"test/{dataset}.log",
            "model_history": True,
            "groups": np.array([0] * 200 + [1] * 200 + [2] * 100),  # group labels
            "learner_selector": "roundrobin",

@@ -1,4 +1,6 @@
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml, load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold, KFold, train_test_split

@@ -16,7 +18,7 @@ def _test(split_type):
        "time_budget": 2,
        # "metric": 'accuracy',
        "task": "classification",
        "log_file_name": "test/{}.log".format(dataset),
        "log_file_name": f"test/{dataset}.log",
        "model_history": True,
        "log_training_metric": True,
        "split_type": split_type,
@@ -48,7 +50,7 @@ def test_time():
    _test(split_type="time")


def test_groups():
def test_groups_for_classification_task():
    from sklearn.externals._arff import ArffException

    try:
@@ -58,17 +60,15 @@ def test_groups():

    X, y = load_wine(return_X_y=True)

    import numpy as np

    automl = AutoML()
    automl_settings = {
        "time_budget": 2,
        "task": "classification",
        "log_file_name": "test/{}.log".format(dataset),
        "log_file_name": f"test/{dataset}.log",
        "model_history": True,
        "eval_method": "cv",
        "groups": np.random.randint(low=0, high=10, size=len(y)),
        "estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
        "estimator_list": ["catboost", "lgbm", "rf", "xgboost", "kneighbor"],
        "learner_selector": "roundrobin",
    }
    automl.fit(X, y, **automl_settings)
@@ -88,6 +88,72 @@ def test_groups():
    automl.fit(X, y, **automl_settings)


def test_groups_for_regression_task():
    """Append random groups to the iris dataset and use it to test that GroupKFold works for regression tasks"""
    iris_dict_data = load_iris(as_frame=True)
    iris_data = iris_dict_data["frame"]  # pandas dataframe with data + target

    rng = np.random.default_rng(42)
    iris_data["cluster"] = rng.integers(
        low=0, high=5, size=iris_data.shape[0]
    )  # np.random.randint(0, 5, iris_data.shape[0])

    automl = AutoML()
    X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
    y = iris_data["petal width (cm)"]
    X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
        X, y, iris_data["cluster"], random_state=42
    )
    automl_settings = {
        "max_iter": 5,
        "time_budget": -1,
        "metric": "r2",
        "task": "regression",
        "estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
        "eval_method": "cv",
        "split_type": "uniform",
        "groups": groups_train,
    }
    automl.fit(X_train, y_train, **automl_settings)


def test_groups_with_sample_weights():
    """Verifies that sample weights can be used with group splits, i.e. that https://github.com/microsoft/FLAML/issues/1396 remains fixed"""
    iris_dict_data = load_iris(as_frame=True)
    iris_data = iris_dict_data["frame"]  # pandas dataframe with data + target
    iris_data["cluster"] = np.random.randint(0, 5, iris_data.shape[0])
    automl = AutoML()

    X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
    y = iris_data["petal width (cm)"]
    sample_weight = pd.Series(np.random.rand(X.shape[0]))
    (
        X_train,
        X_test,
        y_train,
        y_test,
        groups_train,
        groups_test,
        sample_weight_train,
        sample_weight_test,
    ) = train_test_split(X, y, iris_data["cluster"], sample_weight, random_state=42)
    automl_settings = {
        "max_iter": 5,
        "time_budget": -1,
        "metric": "r2",
        "task": "regression",
        "log_file_name": "error.log",
        "log_type": "all",
        "estimator_list": ["lgbm"],
        "eval_method": "cv",
        "split_type": "group",
        "groups": groups_train,
        "sample_weight": sample_weight_train,
    }
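    # split_type="group" routes the CV split through a group-aware splitter;
    # combining it with sample_weight is the exact pairing from issue #1396.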
    automl.fit(X_train, y_train, **automl_settings)
    assert automl.model is not None


def test_stratified_groupkfold():
    from minio.error import ServerError
    from sklearn.model_selection import StratifiedGroupKFold
@@ -108,6 +174,7 @@ def test_stratified_groupkfold():
        "split_type": splitter,
        "groups": X_train["Airline"],
        "estimator_list": [
            "catboost",
            "lgbm",
            "rf",
            "xgboost",
@@ -136,7 +203,7 @@ def test_rank():
    automl_settings = {
        "time_budget": 2,
        "task": "rank",
        "log_file_name": "test/{}.log".format(dataset),
        "log_file_name": f"test/{dataset}.log",
        "model_history": True,
        "eval_method": "cv",
        "groups": np.array([0] * 200 + [1] * 200 + [2] * 200 + [3] * 200 + [4] * 100 + [5] * 100),  # group labels
@@ -149,7 +216,7 @@ def test_rank():
        "time_budget": 2,
        "task": "rank",
        "metric": "ndcg@5",  # 5 can be replaced by any number
        "log_file_name": "test/{}.log".format(dataset),
        "log_file_name": f"test/{dataset}.log",
        "model_history": True,
        "groups": [200] * 4 + [100] * 2,  # alternative way: group counts
        # "estimator_list": ['lgbm', 'xgboost'],  # list of ML learners
@@ -188,7 +255,7 @@ def test_object():
    automl_settings = {
        "time_budget": 2,
        "task": "classification",
        "log_file_name": "test/{}.log".format(dataset),
        "log_file_name": f"test/{dataset}.log",
        "model_history": True,
        "log_training_metric": True,
        "split_type": TestKFold(5),
@@ -203,4 +270,4 @@ def test_object():


if __name__ == "__main__":
    test_groups()
    test_groups_for_classification_task()

@@ -29,8 +29,8 @@ class TestWarmStart(unittest.TestCase):
        automl_val_accuracy = 1.0 - automl.best_loss
        print("Best ML learner:", automl.best_estimator)
        print("Best hyperparameter config:", automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
        print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
        print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
        print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
        # 1. Get starting points from previous experiments.
        starting_points = automl.best_config_per_estimator
        print("starting_points", starting_points)
@@ -97,8 +97,8 @@ class TestWarmStart(unittest.TestCase):
        new_automl_val_accuracy = 1.0 - new_automl.best_loss
        print("Best ML learner:", new_automl.best_estimator)
        print("Best hyperparameter config:", new_automl.best_config)
        print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
        print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
        print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")

    def test_nobudget(self):
        automl = AutoML()

test/conftest.py  (new file, 42 lines)
@@ -0,0 +1,42 @@
from typing import Any, List

import numpy as np
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
from sklearn.metrics import f1_score, r2_score


def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model: Any, task: str) -> List[float]:
    """Mimic the FLAML CV process to calculate the metrics across each fold.

    :param X_train_all: X training data
    :param y_train_all: y training data
    :param kf: The splitter object to use to generate the folds
    :param model: The estimator to fit to the data during the CV process
    :param task: classification or regression
    :return: A list containing the metric for each fold
    """
    rng = np.random.RandomState(2020)
    all_fold_metrics: List[float] = []
    for train_index, val_index in kf.split(X_train_all, y_train_all):
        X_train_split, y_train_split = X_train_all, y_train_all
        train_index = rng.permutation(train_index)
        X_train = X_train_split.iloc[train_index]
        X_val = X_train_split.iloc[val_index]
        y_train, y_val = y_train_split[train_index], y_train_split[val_index]
        model_type = type(model)
        if model_type is not CatBoostClassifier and model_type is not CatBoostRegressor:
            model.fit(X_train, y_train)
        else:
            use_best_model = True
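            # Hold out the tail of the training split as a CatBoost eval set:
            # n keeps at least 90% of the rows and caps the eval set at 1000 rows,
            # mirroring how FLAML itself fits CatBoost with use_best_model.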
            n = max(int(len(y_train) * 0.9), len(y_train) - 1000) if use_best_model else len(y_train)
            X_tr, y_tr = X_train[:n], y_train[:n]
            eval_set = Pool(data=X_train[n:], label=y_train[n:], cat_features=[]) if use_best_model else None
            model.fit(X_tr, y_tr, eval_set=eval_set, use_best_model=True)
        y_pred_classes = model.predict(X_val)
        if task == "classification":
            reproduced_metric = 1 - f1_score(y_val, y_pred_classes)
        else:
            reproduced_metric = 1 - r2_score(y_val, y_pred_classes)
        all_fold_metrics.append(reproduced_metric)
    return all_fold_metrics
@@ -30,7 +30,7 @@ def test_hf_data():

    import json

    with open("seqclass.log", "r") as fin:
    with open("seqclass.log") as fin:
        for line in fin:
            each_log = json.loads(line.strip("\n"))
            if "validation_loss" in each_log:

@@ -24,6 +24,8 @@ model_path_list = [
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
    pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)

pytestmark = pytest.mark.spark  # set to spark as parallel testing raised RuntimeError


def test_switch_1_1():
    data_idx, model_path_idx = 0, 0

@@ -5,6 +5,8 @@ import sys
import pytest
from utils import get_automl_settings, get_toy_data_seqclassification

pytestmark = pytest.mark.spark  # set to spark as parallel testing raised MlflowException of changing parameter


@pytest.mark.skipif(sys.platform in ["darwin", "win32"], reason="do not run on mac os or windows")
def test_cv():

@@ -44,7 +44,7 @@ def test_tokenclassification_idlabel():
    # perf test
    import json

    with open("seqclass.log", "r") as fin:
    with open("seqclass.log") as fin:
        for line in fin:
            each_log = json.loads(line.strip("\n"))
            if "validation_loss" in each_log:
@@ -86,7 +86,7 @@ def test_tokenclassification_tokenlabel():
    # perf test
    import json

    with open("seqclass.log", "r") as fin:
    with open("seqclass.log") as fin:
        for line in fin:
            each_log = json.loads(line.strip("\n"))
            if "validation_loss" in each_log:

@@ -10,6 +10,10 @@ from flaml.default import portfolio
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
    pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)

pytestmark = (
    pytest.mark.spark
)  # set to spark as parallel testing raised ValueError: Feature NonExisting not implemented.


def pop_args(fit_kwargs):
    fit_kwargs.pop("max_iter", None)

@@ -25,7 +25,7 @@ logger = logging.getLogger("mnist_AutoML")


class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, hidden_size)

@@ -3,10 +3,13 @@ import sys
import warnings

import mlflow
import numpy as np
import pytest
import sklearn.datasets as skds
from packaging.version import Version

from flaml import AutoML
from flaml.automl.data import auto_convert_dtypes_pandas, auto_convert_dtypes_spark, get_random_dataframe
from flaml.tune.spark.utils import check_spark

warnings.simplefilter(action="ignore")
@@ -20,23 +23,26 @@ else:

    from flaml.automl.spark.utils import to_pandas_on_spark

    postfix_version = "-spark3.3," if pyspark.__version__ > "3.2" else ","
    spark = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .master("local[2]")
        .config(
            "spark.jars.packages",
            (
                f"com.microsoft.azure:synapseml_2.12:0.11.3{postfix_version}"
                "com.microsoft.azure:synapseml_2.12:1.0.4,"
                "org.apache.hadoop:hadoop-azure:3.3.5,"
                "com.microsoft.azure:azure-storage:8.6.6,"
                f"org.mlflow:mlflow-spark:2.6.0"
                f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
                if Version(mlflow.__version__) >= Version("2.9.0")
                else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
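                # mlflow-spark was republished as mlflow-spark_2.12 on Maven from
                # MLflow 2.9.0 onward, which is what the version switch above selects.
                # Note: implicit string concatenation binds tighter than the
                # conditional, so on the else branch only the mlflow artifact string
                # is passed.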
            ),
        )
        .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
        .config("spark.sql.debug.maxToStringFields", "100")
        .config("spark.driver.extraJavaOptions", "-Xss1m")
        .config("spark.executor.extraJavaOptions", "-Xss1m")
        # .config("spark.executor.memory", "48G")
        # .config("spark.driver.memory", "48G")
        .getOrCreate()
    )
    spark.sparkContext._conf.set(
@@ -49,8 +55,12 @@ else:
    except ImportError:
        skip_spark = True

if sys.version_info >= (3, 11):
    skip_py311 = True
else:
    skip_py311 = False

pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


def _test_spark_synapseml_lightgbm(spark=None, task="classification"):
@@ -159,10 +169,11 @@ def test_spark_input_df():
    settings = {
        "time_budget": 30,  # total running time in seconds
        "metric": "roc_auc",
        "estimator_list": ["lgbm_spark"],  # list of ML learners; we tune lightgbm in this example
        # "estimator_list": ["lgbm_spark"],  # list of ML learners; we tune lightgbm in this example
        "task": "classification",  # task type
        "log_file_name": "flaml_experiment.log",  # flaml log file
        "seed": 7654321,  # random seed
        "eval_method": "holdout",
    }
    df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))

@@ -176,17 +187,17 @@ def test_spark_input_df():
    try:
        model = automl.model.estimator
        predictions = model.transform(test_data)
        predictions.show()

        # from synapse.ml.train import ComputeModelStatistics

        # metrics = ComputeModelStatistics(
        #     evaluationMetric="classification",
        #     labelCol="Bankrupt?",
        #     scoredLabelsCol="prediction",
        # ).transform(predictions)
        # metrics.show()
        from synapse.ml.train import ComputeModelStatistics

        if not skip_py311:
            # ComputeModelStatistics doesn't support python 3.11
            metrics = ComputeModelStatistics(
                evaluationMetric="classification",
                labelCol="Bankrupt?",
                scoredLabelsCol="prediction",
            ).transform(predictions)
            metrics.show()
    except AttributeError:
        print("No fitted model because of too short training time.")

@@ -207,16 +218,173 @@ def test_spark_input_df():
    assert "No estimator is left." in str(excinfo.value)


def _test_spark_large_df():
    """Test with large dataframe, should not run in pipeline."""
    import os
    import time

    import pandas as pd
    from pyspark.sql import functions as F

    import flaml

    os.environ["FLAML_MAX_CONCURRENT"] = "8"
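    # FLAML_MAX_CONCURRENT caps the number of concurrent Spark trials (the other
    # spark tests in this suite set it to "2"); 8 suits this manually-run benchmark.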
    start_time = time.time()

    def load_higgs():
        # 11M rows, 29 columns, 1.1GB
        df = (
            spark.read.format("csv")
            .option("header", False)
            .option("inferSchema", True)
            .load("/datadrive/datasets/HIGGS.csv")
            .withColumnRenamed("_c0", "target")
            .withColumn("target", F.col("target").cast("integer"))
            .limit(1000000)
            .fillna(0)
            .na.drop(how="any")
            .repartition(64)
            .cache()
        )
        print("Number of rows in data: ", df.count())
        return df

    def load_bosch():
        # 1.184M rows, 969 cols, 1.5GB
        df = (
            spark.read.format("csv")
            .option("header", True)
            .option("inferSchema", True)
            .load("/datadrive/datasets/train_numeric.csv")
            .withColumnRenamed("Response", "target")
            .withColumn("target", F.col("target").cast("integer"))
            .limit(1000000)
            .fillna(0)
            .drop("Id")
            .repartition(64)
            .cache()
        )
        print("Number of rows in data: ", df.count())
        return df

    def prepare_data(dataset_name="higgs"):
        df = load_higgs() if dataset_name == "higgs" else load_bosch()
        train, test = df.randomSplit([0.75, 0.25], seed=7654321)
        feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
        final_cols = ["target", "features"]
        featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
        train_data = featurizer.transform(train)[final_cols]
        test_data = featurizer.transform(test)[final_cols]
        train_data = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
        return train_data, test_data

    train_data, test_data = prepare_data("higgs")
    end_time = time.time()
    print("time cost in minutes for prepare data: ", (end_time - start_time) / 60)
    automl = flaml.AutoML()
    automl_settings = {
        "max_iter": 3,
        "time_budget": 7200,
        "metric": "accuracy",
        "task": "classification",
        "seed": 1234,
        "eval_method": "holdout",
    }
    automl.fit(dataframe=train_data, label="target", ensemble=False, **automl_settings)
    model = automl.model.estimator
    predictions = model.transform(test_data)
    predictions.show(5)
    end_time = time.time()
    print("time cost in minutes: ", (end_time - start_time) / 60)


def test_get_random_dataframe():
    # Test with explicit parameters
    df = get_random_dataframe(n_rows=50, ratio_none=0.2, seed=123)
    assert df.shape == (50, 14)  # n_rows=50 overrides the 200-row default; 14 columns

    # Test column types
    assert "timestamp" in df.columns and np.issubdtype(df["timestamp"].dtype, np.datetime64)
    assert "id" in df.columns and np.issubdtype(df["id"].dtype, np.integer)
    assert "score" in df.columns and np.issubdtype(df["score"].dtype, np.floating)
    assert "category" in df.columns and df["category"].dtype.name == "category"


def test_auto_convert_dtypes_pandas():
    # Create a test DataFrame with various types
    import pandas as pd

    test_df = pd.DataFrame(
        {
            "int_col": ["1", "2", "3", "4", "5", "6", "6"],
            "float_col": ["1.1", "2.2", "3.3", "NULL", "5.5", "6.6", "6.6"],
            "date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01", "2021-06-01", "2021-06-01"],
            "cat_col": ["A", "B", "A", "A", "B", "A", "B"],
            "string_col": ["text1", "text2", "text3", "text4", "text5", "text6", "text7"],
        }
    )
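    # The "NULL" and "NA" strings are presumably coerced to missing values during
    # conversion, which is why those columns can still be inferred as double and
    # timestamp below.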

    # Convert dtypes
    converted_df, schema = auto_convert_dtypes_pandas(test_df)

    # Check conversions
    assert schema["int_col"] == "int"
    assert schema["float_col"] == "double"
    assert schema["date_col"] == "timestamp"
    assert schema["cat_col"] == "category"
    assert schema["string_col"] == "string"


def test_auto_convert_dtypes_spark():
    """Test auto_convert_dtypes_spark function with various data types."""
    import pandas as pd

    # Create a test DataFrame with various types
    test_pdf = pd.DataFrame(
        {
            "int_col": ["1", "2", "3", "4", "NA"],
            "float_col": ["1.1", "2.2", "3.3", "NULL", "5.5"],
            "date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01"],
            "cat_col": ["A", "B", "A", "C", "B"],
            "string_col": ["text1", "text2", "text3", "text4", "text5"],
        }
    )

    # Convert pandas DataFrame to Spark DataFrame
    test_df = spark.createDataFrame(test_pdf)

    # Convert dtypes
    converted_df, schema = auto_convert_dtypes_spark(test_df)

    # Check conversions
    assert schema["int_col"] == "int"
    assert schema["float_col"] == "double"
    assert schema["date_col"] == "timestamp"
    assert schema["cat_col"] == "string"  # Conceptual category in schema
    assert schema["string_col"] == "string"

    # Verify the actual data types from the Spark DataFrame
    spark_dtypes = dict(converted_df.dtypes)
    assert spark_dtypes["int_col"] == "int"
    assert spark_dtypes["float_col"] == "double"
    assert spark_dtypes["date_col"] == "timestamp"
    assert spark_dtypes["cat_col"] == "string"  # In Spark, categories are still strings
    assert spark_dtypes["string_col"] == "string"


if __name__ == "__main__":
|
||||
test_spark_synapseml_classification()
|
||||
test_spark_synapseml_regression()
|
||||
test_spark_synapseml_rank()
|
||||
test_spark_input_df()
|
||||
test_get_random_dataframe()
|
||||
test_auto_convert_dtypes_pandas()
|
||||
test_auto_convert_dtypes_spark()
|
||||
|
||||
# import cProfile
|
||||
# import pstats
|
||||
# from pstats import SortKey
|
||||
|
||||
# cProfile.run("test_spark_input_df()", "test_spark_input_df.profile")
|
||||
# p = pstats.Stats("test_spark_input_df.profile")
|
||||
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats("utils.py")
|
||||
# cProfile.run("_test_spark_large_df()", "_test_spark_large_df.profile")
|
||||
# p = pstats.Stats("_test_spark_large_df.profile")
|
||||
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(50)
|
||||
|
||||
@@ -25,7 +25,7 @@ os.environ["FLAML_MAX_CONCURRENT"] = "2"
spark_available, _ = check_spark()
skip_spark = not spark_available

pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


def test_parallel_xgboost(hpo_method=None, data_size=1000):

@@ -1,6 +1,7 @@
import os
import unittest

import pytest
from sklearn.datasets import load_wine

from flaml import AutoML
@@ -24,6 +25,8 @@ if os.path.exists(os.path.join(os.getcwd(), "test", "spark", "custom_mylearner.p
else:
    skip_my_learner = True

pytestmark = pytest.mark.spark


class TestEnsemble(unittest.TestCase):
    def setUp(self) -> None:

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available

pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

os.environ["FLAML_MAX_CONCURRENT"] = "2"

@@ -41,8 +41,8 @@ def base_automl(n_concurrent_trials=1, use_ray=False, use_spark=False, verbose=0

    print("Best ML learner:", automl.best_estimator)
    print("Best hyperparameter config:", automl.best_config)
    print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
    print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
    print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
    print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")


def test_both_ray_spark():

test/spark/test_mlflow.py  (new file, 343 lines)
@@ -0,0 +1,343 @@
import importlib
import os
import sys
import time
import warnings

import mlflow
import pytest
from packaging.version import Version
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

import flaml
from flaml.automl.spark.utils import to_pandas_on_spark

try:
    import pyspark
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
except ImportError:
    pass
pytestmark = pytest.mark.spark
warnings.filterwarnings("ignore")

skip_spark = importlib.util.find_spec("pyspark") is None
client = mlflow.tracking.MlflowClient()

if (sys.platform.startswith("darwin") or sys.platform.startswith("win")) and (
    sys.version_info[0] == 3 and sys.version_info[1] >= 10
):
    # TODO: remove this block when tests are stable
    # Below tests will fail, but the functions run without error if run individually.
    # test_tune_autolog_parentrun_nonparallel()
    # test_tune_autolog_noparentrun_nonparallel()
    # test_tune_noautolog_parentrun_nonparallel()
    # test_tune_noautolog_noparentrun_nonparallel()
    pytest.skip("skipping MacOS and Windows for python 3.10 and 3.11", allow_module_level=True)

"""
The spark session used in the tests below is expected to be initialized in test_0sparkml.py when run with pytest.
"""


def _sklearn_tune(config):
    is_autolog = config.pop("is_autolog")
    is_parent_run = config.pop("is_parent_run")
    is_parallel = config.pop("is_parallel")
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
    rf = RandomForestRegressor(**config)
    rf.fit(train_x, train_y)
    pred = rf.predict(test_x)
    r2 = r2_score(test_y, pred)
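    # In the one combination with autolog off, no parent run, and no parallelism,
    # nothing else records this trial, so it logs its own nested run below;
    # presumably the other combinations are covered by autolog or the tune driver.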
    if not is_autolog and not is_parent_run and not is_parallel:
        with mlflow.start_run(nested=True):
            mlflow.log_metric("r2", r2)
    return {"r2": r2}


def _test_tune(is_autolog, is_parent_run, is_parallel):
    mlflow.end_run()
    mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
    mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
    params = {
        "n_estimators": flaml.tune.randint(100, 1000),
        "min_samples_leaf": flaml.tune.randint(1, 10),
        "is_autolog": is_autolog,
        "is_parent_run": is_parent_run,
        "is_parallel": is_parallel,
    }
    if is_autolog:
        mlflow.autolog()
    else:
        mlflow.autolog(disable=True)
    if is_parent_run:
        mlflow.start_run(run_name=f"tune_autolog_{is_autolog}_sparktrial_{is_parallel}")
    flaml.tune.run(
        _sklearn_tune,
        params,
        metric="r2",
        mode="max",
        num_samples=3,
        use_spark=is_parallel,
        n_concurrent_trials=2 if is_parallel else 1,
        mlflow_exp_name=mlflow_exp_name,
    )
    mlflow.end_run()  # end current run
    mlflow.autolog(disable=True)
    return mlflow_experiment.experiment_id


def _check_mlflow_logging(possible_num_runs, metric, is_parent_run, experiment_id, is_automl=False, skip_tags=False):
    if isinstance(possible_num_runs, int):
        possible_num_runs = [possible_num_runs]
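    # A list means more than one run count is acceptable, e.g. [4, 3] in the
    # parallel cases, where the backend may or may not log one extra run.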
    if is_parent_run:
        parent_run = mlflow.last_active_run()
        child_runs = client.search_runs(
            experiment_ids=[experiment_id],
            filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'",
        )
    else:
        child_runs = client.search_runs(experiment_ids=[experiment_id])
    experiment_name = client.get_experiment(experiment_id).name
    metrics = [metric in run.data.metrics for run in child_runs]
    tags = ["flaml.version" in run.data.tags for run in child_runs]
    params = ["learner" in run.data.params for run in child_runs]
    assert (
        len(child_runs) in possible_num_runs
    ), f"The number of child runs is not correct on experiment {experiment_name}."
    if possible_num_runs[0] > 0:
        assert all(metrics), f"The metrics are not logged correctly on experiment {experiment_name}."
        assert (
            all(tags) if not skip_tags else True
        ), f"The tags are not logged correctly on experiment {experiment_name}."
        assert (
            all(params) if is_automl else True
        ), f"The params are not logged correctly on experiment {experiment_name}."
    # mlflow.delete_experiment(experiment_id)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_parentrun_parallel():
    experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=True)
    _check_mlflow_logging([4, 3], "r2", True, experiment_id)


def test_tune_autolog_parentrun_nonparallel():
    experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=False)
    _check_mlflow_logging(3, "r2", True, experiment_id)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_noparentrun_parallel():
    experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=True)
    _check_mlflow_logging([4, 3], "r2", False, experiment_id)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_parentrun_parallel():
    experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=True)
    _check_mlflow_logging([4, 3], "r2", True, experiment_id)


def test_tune_autolog_noparentrun_nonparallel():
    experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=False)
    _check_mlflow_logging(3, "r2", False, experiment_id)


def test_tune_noautolog_parentrun_nonparallel():
    experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=False)
    _check_mlflow_logging(3, "r2", True, experiment_id)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_noparentrun_parallel():
    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=True)
    _check_mlflow_logging(0, "r2", False, experiment_id)


def test_tune_noautolog_noparentrun_nonparallel():
    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=False)
    _check_mlflow_logging(3, "r2", False, experiment_id, skip_tags=True)


def _test_automl_sparkdata(is_autolog, is_parent_run):
    mlflow.end_run()
    mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
    mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
    if is_autolog:
        mlflow.autolog()
    else:
        mlflow.autolog(disable=True)
    if is_parent_run:
        mlflow.start_run(run_name=f"automl_sparkdata_autolog_{is_autolog}")
    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    pd_df = load_diabetes(as_frame=True).frame
    df = spark.createDataFrame(pd_df)
    df = df.repartition(4).cache()
    train, test = df.randomSplit([0.8, 0.2], seed=1)
    feature_cols = df.columns[:-1]
    featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train_data = featurizer.transform(train)["target", "features"]
    featurizer.transform(test)["target", "features"]
    automl = flaml.AutoML()
    settings = {
        "max_iter": 3,
        "metric": "mse",
        "task": "regression",  # task type
        "log_file_name": "flaml_experiment.log",  # flaml log file
        "mlflow_exp_name": mlflow_exp_name,
        "log_type": "all",
        "n_splits": 2,
        "model_history": True,
    }
    df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
    automl.fit(
        dataframe=df,
        label="target",
        **settings,
    )
    mlflow.end_run()  # end current run
    mlflow.autolog(disable=True)
    return mlflow_experiment.experiment_id


def _test_automl_nonsparkdata(is_autolog, is_parent_run):
    mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
    mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
    if is_autolog:
        mlflow.autolog()
    else:
        mlflow.autolog(disable=True)
    if is_parent_run:
        mlflow.start_run(run_name=f"automl_nonsparkdata_autolog_{is_autolog}")
    automl_experiment = flaml.AutoML()
    automl_settings = {
        "max_iter": 3,
        "metric": "r2",
        "task": "regression",
        "n_concurrent_trials": 2,
        "use_spark": True,
        "mlflow_exp_name": None if is_parent_run else mlflow_exp_name,
        "log_type": "all",
        "n_splits": 2,
        "model_history": True,
    }
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
    automl_experiment.fit(X_train=train_x, y_train=train_y, **automl_settings)
    mlflow.end_run()  # end current run
    mlflow.autolog(disable=True)
    return mlflow_experiment.experiment_id


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_parentrun():
    experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=True)
    _check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_noparentrun():
    experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=False)
    _check_mlflow_logging(3, "mse", False, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_parentrun():
    experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=True)
    _check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_noparentrun():
    experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=False)
    _check_mlflow_logging(0, "mse", False, experiment_id, is_automl=True)  # no logging


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_parentrun():
    experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=True)
    _check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_noparentrun():
    experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=False)
    _check_mlflow_logging([4, 3], "r2", False, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_parentrun():
    experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=True)
    _check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)


@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_noparentrun():
    experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=False)
    _check_mlflow_logging(0, "r2", False, experiment_id, is_automl=True)  # no logging
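These eight near-identical tests could equivalently be generated with pytest's parametrization; a minimal sketch for the Spark-data family, assuming the same helpers as above:

    import pytest

    @pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    @pytest.mark.parametrize("is_autolog", [True, False])
    @pytest.mark.parametrize("is_parent_run", [True, False])
    def test_automl_sparkdata_matrix(is_autolog, is_parent_run):
        experiment_id = _test_automl_sparkdata(is_autolog=is_autolog, is_parent_run=is_parent_run)
        # only the no-autolog, no-parent-run combination is expected to log nothing
        expected = 0 if not (is_autolog or is_parent_run) else 3
        _check_mlflow_logging(expected, "mse", is_parent_run, experiment_id, is_automl=True)

Keeping the combinations as separately named tests, as the diff does, makes individual failures easier to spot in CI logs; the parametrized form trades that for brevity.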
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_exit_pyspark_autolog():
    import pyspark

    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    spark.sparkContext._gateway.shutdown_callback_server()  # avoid the process getting stuck on exit
    mlflow.autolog(disable=True)
def _init_spark_for_main():
    import pyspark

    spark = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .master("local[2]")
        .config(
            "spark.jars.packages",
            (
                "com.microsoft.azure:synapseml_2.12:1.0.4,"
                "org.apache.hadoop:hadoop-azure:3.3.5,"
                "com.microsoft.azure:azure-storage:8.6.6,"
                f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
                if Version(mlflow.__version__) >= Version("2.9.0")
                else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
            ),
        )
        .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
        .config("spark.sql.debug.maxToStringFields", "100")
        .config("spark.driver.extraJavaOptions", "-Xss1m")
        .config("spark.executor.extraJavaOptions", "-Xss1m")
        .getOrCreate()
    )
    spark.sparkContext._conf.set(
        "spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
        "https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
    )
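Note that Python concatenates the adjacent string literals before the conditional is evaluated, so for MLflow versions below 2.9.0 the whole spark.jars.packages value collapses to the single mlflow-spark coordinate. The 2.9.0 cutoff reflects the mlflow-spark Maven artifact gaining a Scala-version suffix, as implied by the code above; a standalone sketch of just the version switch (illustrative only):

    import mlflow
    from packaging.version import Version

    # pick the artifact name matching the installed MLflow version
    suffix = "_2.12" if Version(mlflow.__version__) >= Version("2.9.0") else ""
    print(f"org.mlflow:mlflow-spark{suffix}:{mlflow.__version__}")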
if __name__ == "__main__":
    _init_spark_for_main()

    # test_tune_autolog_parentrun_parallel()
    # test_tune_autolog_parentrun_nonparallel()
    test_tune_autolog_noparentrun_parallel()  # TODO: runs not removed
    # test_tune_noautolog_parentrun_parallel()
    # test_tune_autolog_noparentrun_nonparallel()
    # test_tune_noautolog_parentrun_nonparallel()
    # test_tune_noautolog_noparentrun_parallel()
    # test_tune_noautolog_noparentrun_nonparallel()
    # test_automl_sparkdata_autolog_parentrun()
    # test_automl_sparkdata_autolog_noparentrun()
    # test_automl_sparkdata_noautolog_parentrun()
    # test_automl_sparkdata_noautolog_noparentrun()
    # test_automl_nonsparkdata_autolog_parentrun()
    # test_automl_nonsparkdata_autolog_noparentrun()  # TODO: runs not removed
    # test_automl_nonsparkdata_noautolog_parentrun()
    # test_automl_nonsparkdata_noautolog_noparentrun()

    test_exit_pyspark_autolog()
@@ -2,6 +2,7 @@ import os
 import unittest

 import numpy as np
+import pytest
 import scipy.sparse
 from sklearn.datasets import load_iris, load_wine

@@ -12,6 +13,7 @@ from flaml.tune.spark.utils import check_spark

 spark_available, _ = check_spark()
 skip_spark = not spark_available
+pytestmark = pytest.mark.spark

 os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -344,8 +346,8 @@ class TestMultiClass(unittest.TestCase):
         automl_val_accuracy = 1.0 - automl_experiment.best_loss
         print("Best ML leaner:", automl_experiment.best_estimator)
         print("Best hyperparmeter config:", automl_experiment.best_config)
-        print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
-        print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
+        print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
+        print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")

         starting_points = automl_experiment.best_config_per_estimator
         print("starting_points", starting_points)

@@ -369,8 +371,8 @@ class TestMultiClass(unittest.TestCase):
         new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
         print("Best ML leaner:", new_automl_experiment.best_estimator)
         print("Best hyperparmeter config:", new_automl_experiment.best_config)
-        print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
-        print("Training duration of best run: {0:.4g} s".format(new_automl_experiment.best_config_train_time))
+        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
+        print(f"Training duration of best run: {new_automl_experiment.best_config_train_time:.4g} s")

     def test_fit_w_starting_points_list(self, as_frame=True):
         automl_experiment = AutoML()
@@ -394,8 +396,8 @@ class TestMultiClass(unittest.TestCase):
         automl_val_accuracy = 1.0 - automl_experiment.best_loss
         print("Best ML leaner:", automl_experiment.best_estimator)
         print("Best hyperparmeter config:", automl_experiment.best_config)
-        print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
-        print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
+        print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
+        print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")

         starting_points = {}
         log_file_name = automl_settings["log_file_name"]

@@ -409,7 +411,7 @@ class TestMultiClass(unittest.TestCase):
             if learner not in starting_points:
                 starting_points[learner] = []
             starting_points[learner].append(config)
-        max_iter = sum([len(s) for k, s in starting_points.items()])
+        max_iter = sum(len(s) for k, s in starting_points.items())
         automl_settings_resume = {
             "time_budget": 2,
             "metric": "accuracy",

@@ -431,7 +433,7 @@ class TestMultiClass(unittest.TestCase):
         new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
         # print('Best ML leaner:', new_automl_experiment.best_estimator)
         # print('Best hyperparmeter config:', new_automl_experiment.best_config)
-        print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
+        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
         # print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))
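Two micro-refactors recur in the hunks above: str.format calls with a positional index become f-strings carrying the same format spec, and sum over a list comprehension becomes sum over a generator expression, which avoids materializing the intermediate list. Illustrative equivalence checks:

    acc = 0.987654321
    assert "{0:.4g}".format(acc) == f"{acc:.4g}" == "0.9877"

    starting_points = {"lgbm": [1, 2], "xgboost": [3]}  # toy data
    assert sum(len(s) for s in starting_points.values()) == 3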
@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 here = os.path.abspath(os.path.dirname(__file__))
 os.environ["FLAML_MAX_CONCURRENT"] = "2"

@@ -25,7 +25,7 @@ try:
 except ImportError:
     skip_spark = True

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


 def test_overtime():
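pytestmark accepts either a single mark or a list, so each of these modules keeps its skipif condition while also gaining the custom spark mark; the marked tests can then be selected with `pytest -m spark` or excluded with `pytest -m "not spark"`. To avoid unknown-marker warnings, the marker would typically be registered, for instance in a conftest.py (illustrative sketch; the repo may register it elsewhere):

    # conftest.py
    def pytest_configure(config):
        config.addinivalue_line("markers", "spark: tests that require Spark")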
@@ -55,7 +55,7 @@ def test_overtime():
     start_time = time.time()
     automl_experiment.fit(**automl_settings)
     elapsed_time = time.time() - start_time
-    print("time budget: {:.2f}s, actual elapsed time: {:.2f}s".format(time_budget, elapsed_time))
+    print(f"time budget: {time_budget:.2f}s, actual elapsed time: {elapsed_time:.2f}s")
     # assert abs(elapsed_time - time_budget) < 5  # cancel assertion because github VM sometimes is super slow, causing the test to fail
     print(automl_experiment.predict(df))
     print(automl_experiment.model)
@@ -11,7 +11,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 os.environ["FLAML_MAX_CONCURRENT"] = "2"

@@ -75,8 +75,8 @@ def run_automl(budget=3, dataset_format="dataframe", hpo_method=None):
     """ retrieve best config and best learner """
     print("Best ML leaner:", automl.best_estimator)
     print("Best hyperparmeter config:", automl.best_config)
-    print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
-    print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
+    print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
+    print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
     print(automl.model.estimator)
     print(automl.best_config_per_estimator)
     print("time taken to find best model:", automl.time_to_find_best_model)
@@ -14,7 +14,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 os.environ["FLAML_MAX_CONCURRENT"] = "2"
 X, y = load_breast_cancer(return_X_y=True)

@@ -36,7 +36,7 @@ except ImportError:
     print("Spark is not installed. Skip all spark tests.")
     skip_spark = True

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


 def test_with_parameters_spark():

@@ -167,7 +167,7 @@ def test_len_labels():
     assert len_labels(y1) == 4
     ll, la = len_labels(y2, return_labels=True)
     assert ll == 4
-    assert set(la.to_numpy()) == set([1, 2, 5, 4])
+    assert set(la.to_numpy()) == {1, 2, 5, 4}


 def test_unique_value_first_index():
@@ -50,11 +50,11 @@ def oml_to_vw_w_grouping(X, y, ds_dir, fname, orginal_dim, group_num, grouping_m
     for i in range(len(X)):
         NS_content = []
         for zz in range(len(group_indexes)):
-            ns_features = " ".join("{}:{:.6f}".format(ind, X[i][ind]) for ind in group_indexes[zz])
+            ns_features = " ".join(f"{ind}:{X[i][ind]:.6f}" for ind in group_indexes[zz])
             NS_content.append(ns_features)
         ns_line = "{} |{}".format(
             str(y[i]),
-            "|".join("{} {}".format(NS_LIST[j], NS_content[j]) for j in range(len(group_indexes))),
+            "|".join(f"{NS_LIST[j]} {NS_content[j]}" for j in range(len(group_indexes))),
         )
         f.write(ns_line)
         f.write("\n")

@@ -67,7 +67,7 @@ def save_vw_dataset_w_ns(X, y, did, ds_dir, max_ns_num, is_regression):
     """convert openml dataset to vw example and save to file"""
     print("is_regression", is_regression)
     if is_regression:
-        fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
+        fname = f"ds_{did}_{max_ns_num}_{0}.vw"
     print("dataset size", X.shape[0], X.shape[1])
     print("saving data", did, ds_dir, fname)
     dim = X.shape[1]

@@ -131,7 +131,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):

     if is_regression:
         # the second field specifies the largest number of namespaces using.
-        fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
+        fname = f"ds_{did}_{max_ns_num}_{0}.vw"
         vw_dataset_file = os.path.join(ds_dir, fname)
         # if file does not exist, generate and save the datasets
         if not os.path.exists(vw_dataset_file) or os.stat(vw_dataset_file).st_size < 1000:

@@ -139,7 +139,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):
             print(ds_dir, vw_dataset_file)
             if not os.path.exists(ds_dir):
                 os.makedirs(ds_dir)
-            with open(os.path.join(ds_dir, fname), "r") as f:
+            with open(os.path.join(ds_dir, fname)) as f:
                 vw_content = f.read().splitlines()
             print(type(vw_content), len(vw_content))
             return vw_content
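The last hunk relies on open() defaulting to text-read mode, so dropping the explicit "r" argument is purely cosmetic. The default can be confirmed from the builtin's signature:

    import inspect

    assert inspect.signature(open).parameters["mode"].default == "r"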
@@ -59,6 +59,17 @@ def _test_hf_data():
     except requests.exceptions.ConnectionError:
         return

+    # Tests will only run if there is a GPU available
+    try:
+        import ray
+
+        pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
+
+        if not pg.wait(timeout_seconds=10):  # Wait 10 seconds for resources
+            raise RuntimeError("No available node types can fulfill resource request!")
+    except RuntimeError:
+        return
+
     custom_sent_keys = ["sentence1", "sentence2"]
     label_key = "label"
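ray.util.placement_group reserves a bundle of resources up front; the added block requests one CPU and one GPU and returns early if the reservation cannot be fulfilled within ten seconds, effectively gating the Hugging Face tests on GPU availability. A minimal standalone sketch of the same gating pattern (assuming Ray is installed):

    import ray
    from ray.util import placement_group

    ray.init(ignore_reinit_error=True)
    pg = placement_group([{"CPU": 1, "GPU": 1}])
    if not pg.wait(timeout_seconds=10):  # reservation not fulfilled in time
        print("no node with a free GPU; skipping")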
@@ -75,10 +75,10 @@ def test_lexiflow():
     layers = []
     in_features = 28 * 28
     for i in range(n_layers):
-        out_features = configuration["n_units_l{}".format(i)]
+        out_features = configuration[f"n_units_l{i}"]
         layers.append(nn.Linear(in_features, out_features))
         layers.append(nn.ReLU())
-        p = configuration["dropout_{}".format(i)]
+        p = configuration[f"dropout_{i}"]
         layers.append(nn.Dropout(p))
         in_features = out_features
     layers.append(nn.Linear(in_features, 10))
@@ -24,7 +24,7 @@ try:
     # __net_begin__
     class Net(nn.Module):
         def __init__(self, l1=120, l2=84):
-            super(Net, self).__init__()
+            super().__init__()
             self.conv1 = nn.Conv2d(3, 6, 5)
             self.pool = nn.MaxPool2d(2, 2)
             self.conv2 = nn.Conv2d(6, 16, 5)

@@ -277,7 +277,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
     logger.info(f"#trials={len(result.trials)}")
     logger.info(f"time={time.time()-start_time}")
     best_trial = result.get_best_trial("loss", "min", "all")
-    logger.info("Best trial config: {}".format(best_trial.config))
+    logger.info(f"Best trial config: {best_trial.config}")
     logger.info("Best trial final validation loss: {}".format(best_trial.metric_analysis["loss"]["min"]))
     logger.info("Best trial final validation accuracy: {}".format(best_trial.metric_analysis["accuracy"]["max"]))

@@ -296,7 +296,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
     best_trained_model.load_state_dict(model_state)

     test_acc = _test_accuracy(best_trained_model, device)
-    logger.info("Best trial test set accuracy: {}".format(test_acc))
+    logger.info(f"Best trial test set accuracy: {test_acc}")


 # __main_end__
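The super(Net, self) call rewritten above is the Python 3 zero-argument form: inside a method body the compiler supplies the class and instance implicitly. A small equivalence check:

    class Base:
        def __init__(self):
            self.ok = True

    class Child(Base):
        def __init__(self):
            super().__init__()  # same as super(Child, self).__init__()

    assert Child().ok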
tutorials/Automl2024DemoAutoMLTask.ipynb (new file, 5795 lines): diff suppressed because one or more lines are too long
tutorials/Automl2024DemoTuneLLM.ipynb (new file, 2894 lines): diff suppressed because it is too large
@@ -1,5 +1,6 @@
 Please find tutorials on FLAML below:

+- [AutoML 2024](flaml-tutorial-automl-24.md)
 - [PyData Seattle 2023](flaml-tutorial-pydata-23.md)
 - [A hands-on tutorial on FLAML presented at KDD 2022](flaml-tutorial-kdd-22.md)
 - [A lab forum on FLAML at AAAI 2023](flaml-tutorial-aaai-23.md)