Mirror of https://github.com/microsoft/FLAML.git (synced 2026-02-11 19:29:17 +08:00)

Compare commits — 76 commits
bc1e4dc5ea, 158ff7d99e, a5021152d2, fc4efe3510, cd0e9fb0d2, a9c0a9e30a, a05b669de3, 6e59103e86,
d9e74031e0, 7ec1414e9b, 9233a52736, 7ac076d544, 3d489f1aaa, c64eeb5e8d, bf35f98a24, 1687ca9a94,
7a597adcc9, 4ea9650f99, fa1a32afb6, 5eb7d623b0, 22dcfcd3c0, d7208b32d0, 5f1aa2dda8, 67bdcde4d5,
46a406edd4, f1817ea7b1, f6a5163e6a, e64b486528, a74354f7a9, ced1d6f331, bb213e7ebd, d241e8de90,
0b138d9193, 1c9835dc0a, 1285700d7a, 7f42bece89, e19107407b, f5d6693253, d4e43c50a2, 13aec414ea,
bb16dcde93, be81a76da9, 2d16089529, 01c3c83653, 9b66103f7c, 48dfd72e64, dec92e5b02, 22911ea1ef,
12183e5f73, c2b25310fc, 0f9420590d, 5107c506b4, 9e219ef8dc, 6e4083743b, 17e95edd9e, 468bc62d27,
437c239c11, 8e753f1092, a3b57e11d4, a80dcf9925, 7157af44e0, 1798c4591e, dd26263330, 2ba5f8bed1,
d0a11958a5, 0ef9b00a75, 840f76e5e5, d8b7d25b80, 6d53929803, c038fbca07, 6a99202492, 42d1dcfa0e,
b83c8a7d3b, b9194cdcf2, 9a1f6b0291, 07f4413aae
````diff
@@ -1,5 +1,7 @@
 [run]
 branch = True
-source = flaml
+source =
+    flaml
 omit =
-    *test*
+    */test/*
+    */flaml/autogen/*
````
`.github/copilot-instructions.md` (vendored, new file, +243 lines):
# GitHub Copilot Instructions for FLAML

## Project Overview

FLAML (Fast Library for Automated Machine Learning & Tuning) is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflows based on large language models, machine learning models, etc., and optimizes their performance.

**Key Components:**

- `flaml/automl/`: AutoML functionality for classification and regression
- `flaml/tune/`: Generic hyperparameter tuning
- `flaml/default/`: Zero-shot AutoML with default configurations
- `flaml/autogen/`: Legacy autogen code (note: AutoGen has moved to a separate repository)
- `flaml/fabric/`: Microsoft Fabric integration
- `test/`: Comprehensive test suite
## Build and Test Commands

### Installation

```bash
# Basic installation
pip install -e .

# Install with test dependencies
pip install -e .[test]

# Install with automl dependencies
pip install -e .[automl]

# Install with forecast dependencies (Linux only)
pip install -e .[forecast]
```

### Running Tests

```bash
# Run all tests (excluding autogen)
pytest test/ --ignore=test/autogen --reruns 2 --reruns-delay 10

# Run tests with coverage
coverage run -a -m pytest test --ignore=test/autogen --reruns 2 --reruns-delay 10
coverage xml

# Check dependencies
python test/check_dependency.py
```

### Linting and Formatting

```bash
# Run pre-commit hooks
pre-commit run --all-files

# Format with black (line length: 120)
black . --line-length 120

# Run ruff for linting and auto-fix
ruff check . --fix
```
## Code Style and Formatting

### Python Style

- **Line length:** 120 characters (configured in both Black and Ruff)
- **Formatter:** Black (v23.3.0+)
- **Linter:** Ruff with Pyflakes and pycodestyle rules
- **Import sorting:** Use isort (via Ruff)
- **Python version:** Supports Python >= 3.10 (full support for 3.10, 3.11, 3.12 and 3.13)

### Code Quality Rules

- Follow Black formatting conventions
- Keep imports sorted and organized
- Avoid unused imports (F401) - these are flagged but not auto-fixed
- Avoid wildcard imports (F403) where possible
- Complexity: Max McCabe complexity of 10
- Use type hints where appropriate
- Write clear docstrings for public APIs

### Pre-commit Hooks

The repository uses pre-commit hooks for:

- Checking for large files, AST syntax, YAML/TOML/JSON validity
- Detecting merge conflicts and private keys
- Trailing whitespace and end-of-file fixes
- pyupgrade for Python 3.8+ syntax
- Black formatting
- Markdown formatting (mdformat with GFM and frontmatter support)
- Ruff linting with auto-fix
## Testing Strategy

### Test Organization

- Tests are in the `test/` directory, organized by module
- `test/automl/`: AutoML feature tests
- `test/tune/`: Hyperparameter tuning tests
- `test/default/`: Zero-shot AutoML tests
- `test/nlp/`: NLP-related tests
- `test/spark/`: Spark integration tests

### Test Requirements

- Write tests for new functionality
- Ensure tests pass on multiple Python versions (3.10, 3.11, 3.12 and 3.13)
- Tests should work on both Ubuntu and Windows
- Use pytest markers for platform-specific tests (e.g., `@pytest.mark.spark`)
- Tests should be idempotent and not depend on external state
- Use `--reruns 2 --reruns-delay 10` for flaky tests
### Coverage

- Aim for good test coverage on new code
- Coverage reports are generated for Python 3.11 builds
- Coverage reports are uploaded to Codecov

## Git Workflow and Best Practices

### Branching

- Main branch: `main`
- Create feature branches from `main`
- PR reviews are required before merging

### Commit Messages

- Use clear, descriptive commit messages
- Reference issue numbers when applicable
- ALWAYS run `pre-commit run --all-files` before each commit to avoid formatting issues

### Pull Requests

- Ensure all tests pass before requesting review
- Update documentation if adding new features
- Follow the PR template in `.github/PULL_REQUEST_TEMPLATE.md`
- ALWAYS run `pre-commit run --all-files` before each commit to avoid formatting issues

## Project Structure

```
flaml/
├── automl/        # AutoML functionality
├── tune/          # Hyperparameter tuning
├── default/       # Zero-shot AutoML
├── autogen/       # Legacy autogen (deprecated, moved to separate repo)
├── fabric/        # Microsoft Fabric integration
├── onlineml/      # Online learning
└── version.py     # Version information

test/              # Test suite
├── automl/
├── tune/
├── default/
├── nlp/
└── spark/

notebook/          # Example notebooks
website/           # Documentation website
```

## Dependencies and Package Management

### Core Dependencies

- NumPy >= 1.17
- Python >= 3.10 (officially supported: 3.10, 3.11, 3.12 and 3.13)

### Optional Dependencies

- `[automl]`: lightgbm, xgboost, scipy, pandas, scikit-learn
- `[test]`: Full test suite dependencies
- `[spark]`: PySpark and joblib dependencies
- `[forecast]`: holidays, prophet, statsmodels, hcrystalball, pytorch-forecasting, pytorch-lightning, tensorboardX
- `[hf]`: Hugging Face transformers and datasets
- See `setup.py` for complete list

### Version Constraints

- Be mindful of Python version-specific dependencies (check setup.py)
- XGBoost versions differ based on Python version
- NumPy 2.0+ only for Python >= 3.13
- Some features (like vowpalwabbit) only work with older Python versions

## Boundaries and Restrictions

### Do NOT Modify

- `.git/` directory and Git configuration
- `LICENSE` file
- Version information in `flaml/version.py` (unless explicitly updating version)
- GitHub Actions workflows without careful consideration
- Existing test files unless fixing bugs or adding coverage

### Be Cautious With

- `setup.py`: Changes to dependencies should be carefully reviewed
- `pyproject.toml`: Linting and testing configuration
- `.pre-commit-config.yaml`: Pre-commit hook configuration
- Backward compatibility: FLAML is a library with external users

### Security Considerations

- Never commit secrets or API keys
- Be careful with external data sources in tests
- Validate user inputs in public APIs
- Follow secure coding practices for ML operations

## Special Notes

### AutoGen Migration

- AutoGen has moved to a separate repository: https://github.com/microsoft/autogen
- The `flaml/autogen/` directory contains legacy code
- Tests in `test/autogen/` are ignored in the main test suite
- Direct users to the new AutoGen repository for AutoGen-related issues

### Platform-Specific Considerations

- Some tests only run on Linux (e.g., forecast tests with prophet)
- Windows and Ubuntu are the primary supported platforms
- macOS support exists but requires special libomp setup for lgbm/xgboost

### Performance

- FLAML focuses on efficient automation and tuning
- Consider computational cost when adding new features
- Optimize for low resource usage where possible

## Documentation

- Main documentation: https://microsoft.github.io/FLAML/
- Update documentation when adding new features
- Provide clear examples in docstrings
- Add notebook examples for significant new features

## Contributing

- Follow the contributing guide: https://microsoft.github.io/FLAML/docs/Contribute
- Sign the Microsoft CLA when making your first contribution
- Be respectful and follow the Microsoft Open Source Code of Conduct
- Join the Discord community for discussions: https://discord.gg/Cppx2vSPVP
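The pytest-marker convention mentioned above is the standard one; as a minimal, hypothetical sketch (the test name and body are invented for illustration, not taken from the repository):

```python
import pytest


@pytest.mark.spark  # platform/feature marker; deselect with `pytest -m "not spark"`
def test_spark_smoke():
    pyspark = pytest.importorskip("pyspark")  # skips cleanly when pyspark is absent
    assert pyspark.__version__
```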
`.github/workflows/CD.yml` (vendored, 21 lines changed):

````diff
@@ -12,26 +12,17 @@ jobs:
   deploy:
     strategy:
       matrix:
-        os: ['ubuntu-latest']
-        python-version: [3.8]
+        os: ["ubuntu-latest"]
+        python-version: ["3.12"]
     runs-on: ${{ matrix.os }}
     environment: package
     steps:
       - name: Checkout
-        uses: actions/checkout@v3
-      - name: Cache conda
-        uses: actions/cache@v3
+        uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
         with:
-          path: ~/conda_pkgs_dir
-          key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
-      - name: Setup Miniconda
-        uses: conda-incubator/setup-miniconda@v2
-        with:
-          auto-update-conda: true
-          auto-activate-base: false
-          activate-environment: hcrystalball
-          python-version: ${{ matrix.python-version }}
-          use-only-tar-bz2: true
+          python-version: ${{ matrix.python-version }}
       - name: Install from source
         # This is required for the pre-commit tests
         shell: pwsh
@@ -42,7 +33,7 @@ jobs:
       - name: Build
         shell: pwsh
         run: |
-          pip install twine
+          pip install twine wheel setuptools
          python setup.py sdist bdist_wheel
      - name: Publish to PyPI
        env:
````
`.github/workflows/deploy-website.yml` (vendored, 8 lines changed):

````diff
@@ -37,11 +37,11 @@ jobs:
       - name: setup python
         uses: actions/setup-python@v4
         with:
-          python-version: "3.8"
+          python-version: "3.12"
       - name: pydoc-markdown install
         run: |
           python -m pip install --upgrade pip
-          pip install pydoc-markdown==4.5.0
+          pip install pydoc-markdown==4.7.0 setuptools
       - name: pydoc-markdown run
         run: |
           pydoc-markdown
@@ -73,11 +73,11 @@ jobs:
       - name: setup python
         uses: actions/setup-python@v4
         with:
-          python-version: "3.8"
+          python-version: "3.12"
       - name: pydoc-markdown install
         run: |
           python -m pip install --upgrade pip
-          pip install pydoc-markdown==4.5.0
+          pip install pydoc-markdown==4.7.0 setuptools
       - name: pydoc-markdown run
         run: |
           pydoc-markdown
````
`.github/workflows/openai.yml` (vendored, 17 lines changed):

````diff
@@ -4,14 +4,15 @@
 name: OpenAI

 on:
-  pull_request:
-    branches: ['main']
-    paths:
-      - 'flaml/autogen/**'
-      - 'test/autogen/**'
-      - 'notebook/autogen_openai_completion.ipynb'
-      - 'notebook/autogen_chatgpt_gpt4.ipynb'
-      - '.github/workflows/openai.yml'
+  workflow_dispatch:
+  # pull_request:
+  #   branches: ['main']
+  #   paths:
+  #     - 'flaml/autogen/**'
+  #     - 'test/autogen/**'
+  #     - 'notebook/autogen_openai_completion.ipynb'
+  #     - 'notebook/autogen_chatgpt_gpt4.ipynb'
+  #     - '.github/workflows/openai.yml'

 permissions: {}
````
`.github/workflows/pre-commit.yml` (vendored, 4 lines changed):

````diff
@@ -1,9 +1,7 @@
 name: Code formatting

-# see: https://help.github.com/en/actions/reference/events-that-trigger-workflows
-on:  # Trigger the workflow on push or pull request, but only for the main branch
-  push:
-    branches: [main]
+on:
   pull_request: {}

 defaults:
````
`.github/workflows/python-package.yml` (vendored, 118 lines changed):

````diff
@@ -14,10 +14,20 @@ on:
       - 'setup.py'
   pull_request:
     branches: ['main']
     paths:
       - 'flaml/**'
       - 'test/**'
       - 'notebook/**'
       - '.github/workflows/python-package.yml'
       - 'setup.py'
+  merge_group:
+    types: [checks_requested]
+  schedule:
+    # Every other day at 02:00 UTC
+    - cron: '0 2 */2 * *'
+
-permissions: {}
+permissions:
+  contents: write
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref }}
+  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
@@ -29,8 +39,8 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-latest, macos-latest, windows-2019]
-        python-version: ["3.8", "3.9", "3.10", "3.11"]
+        os: [ubuntu-latest, windows-latest]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
@@ -38,7 +48,7 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}
       - name: On mac, install libomp to facilitate lgbm and xgboost install
-        if: matrix.os == 'macOS-latest'
+        if: matrix.os == 'macos-latest'
         run: |
           brew update
           brew install libomp
@@ -50,76 +60,82 @@ jobs:
           export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
       - name: Install packages and dependencies
         run: |
-          python -m pip install --upgrade pip wheel
+          python -m pip install --upgrade pip wheel setuptools
           pip install -e .
           python -c "import flaml"
           pip install -e .[test]
       - name: On Ubuntu python 3.10, install pyspark 3.4.1
         if: matrix.python-version == '3.10' && matrix.os == 'ubuntu-latest'
         run: |
           pip install pyspark==3.4.1
           pip list | grep "pyspark"
       - name: On Ubuntu python 3.11, install pyspark 3.5.1
         if: matrix.python-version == '3.11' && matrix.os == 'ubuntu-latest'
         run: |
           pip install pyspark==3.5.1
           pip list | grep "pyspark"
-      - name: If linux and python<3.11, install ray 2
-        if: matrix.os == 'ubuntu-latest' && matrix.python-version != '3.11'
-        run: |
-          pip install "ray[tune]<2.5.0"
-      - name: If mac and python 3.10, install ray and xgboost 1
-        if: matrix.os == 'macOS-latest' && matrix.python-version == '3.10'
-        run: |
-          pip install -e .[ray]
-          # use macOS to test xgboost 1, but macOS also supports xgboost 2
-          pip install "xgboost<2"
-      - name: If linux, install prophet on python < 3.9
-        if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.8'
+      - name: On Ubuntu python 3.12, install pyspark 4.0.1
+        if: matrix.python-version == '3.12' && matrix.os == 'ubuntu-latest'
+        run: |
+          pip install pyspark==4.0.1
+          pip list | grep "pyspark"
+      - name: On Ubuntu python 3.13, install pyspark 4.1.0
+        if: matrix.python-version == '3.13' && matrix.os == 'ubuntu-latest'
+        run: |
+          pip install pyspark==4.1.0
+          pip list | grep "pyspark"
+      # # TODO: support ray
+      # - name: If linux and python<3.11, install ray 2
+      #   if: matrix.os == 'ubuntu-latest' && matrix.python-version < '3.11'
+      #   run: |
+      #     pip install "ray[tune]<2.5.0"
+      - name: Install prophet when on linux
+        if: matrix.os == 'ubuntu-latest'
         run: |
           pip install -e .[forecast]
-      - name: Install vw on python < 3.10
-        if: matrix.python-version == '3.8' || matrix.python-version == '3.9'
+      # TODO: support vw for python 3.10+
+      - name: If linux and python<3.10, install vw
+        if: matrix.os == 'ubuntu-latest' && matrix.python-version < '3.10'
         run: |
           pip install -e .[vw]
-      - name: Test with pytest
-        if: matrix.python-version != '3.10'
+      - name: Pip freeze
         run: |
-          pytest test
+          pip freeze
+      - name: Check dependencies
+        run: |
+          python test/check_dependency.py
+      - name: Clear pip cache
+        run: |
+          pip cache purge
+      - name: Test with pytest
+        timeout-minutes: 120
+        if: matrix.python-version != '3.11'
+        run: |
+          pytest test/ --ignore=test/autogen --reruns 2 --reruns-delay 10
       - name: Coverage
-        if: matrix.python-version == '3.10'
+        timeout-minutes: 120
+        if: matrix.python-version == '3.11'
         run: |
           pip install coverage
-          coverage run -a -m pytest test
+          coverage run -a -m pytest test --ignore=test/autogen --reruns 2 --reruns-delay 10
           coverage xml
       - name: Upload coverage to Codecov
-        if: matrix.python-version == '3.10'
+        if: matrix.python-version == '3.11'
         uses: codecov/codecov-action@v3
         with:
           file: ./coverage.xml
           flags: unittests
+      - name: Save dependencies
+        if: github.ref == 'refs/heads/main'
+        shell: bash
+        run: |
+          git config --global user.name 'github-actions[bot]'
+          git config --global user.email 'github-actions[bot]@users.noreply.github.com'
+          git config advice.addIgnoredFile false
+          BRANCH=unit-tests-installed-dependencies
+          git fetch origin
+          git checkout -B "$BRANCH" "origin/$BRANCH"
+          pip freeze > installed_all_dependencies_${{ matrix.python-version }}_${{ matrix.os }}.txt
+          python test/check_dependency.py > installed_first_tier_dependencies_${{ matrix.python-version }}_${{ matrix.os }}.txt
+          git add installed_*dependencies*.txt
+          mv coverage.xml ./coverage_${{ matrix.python-version }}_${{ matrix.os }}.xml || true
+          git add -f ./coverage_${{ matrix.python-version }}_${{ matrix.os }}.xml || true
+          git commit -m "Update installed dependencies for Python ${{ matrix.python-version }} on ${{ matrix.os }}" || exit 0
+          git push origin "$BRANCH" --force
-
-# docs:
-
-#   runs-on: ubuntu-latest
-
-#   steps:
-#     - uses: actions/checkout@v3
-#     - name: Setup Python
-#       uses: actions/setup-python@v4
-#       with:
-#         python-version: '3.8'
-#     - name: Compile documentation
-#       run: |
-#         pip install -e .
-#         python -m pip install sphinx sphinx_rtd_theme
-#         cd docs
-#         make html
-#     - name: Deploy to GitHub pages
-#       if: ${{ github.ref == 'refs/heads/main' }}
-#       uses: JamesIves/github-pages-deploy-action@3.6.2
-#       with:
-#         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-#         BRANCH: gh-pages
-#         FOLDER: docs/_build/html
-#         CLEAN: true
````
`.gitignore` (vendored, 7 lines changed):

````diff
@@ -60,6 +60,7 @@ coverage.xml
 .hypothesis/
 .pytest_cache/
 cover/
+junit

 # Translations
 *.mo
@@ -172,7 +173,7 @@ test/default
 test/housing.json
 test/nlp/default/transformer_ms/seq-classification.json

-flaml/fabric/fanova/_fanova.c
+flaml/fabric/fanova/*fanova.c
 # local config files
 *.config.local
@@ -184,3 +185,7 @@ notebook/lightning_logs/
 lightning_logs/
 flaml/autogen/extensions/tmp/
 test/autogen/my_tmp/
+catboost_*
+
+# Internal configs
+.pypirc
````
````diff
@@ -36,7 +36,7 @@ repos:
       - id: black

   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.17
+    rev: 0.7.22
     hooks:
       - id: mdformat
         additional_dependencies:
````
````diff
@@ -1,5 +1,5 @@
 # basic setup
-FROM mcr.microsoft.com/devcontainers/python:3.8
+FROM mcr.microsoft.com/devcontainers/python:3.10
 RUN apt-get update && apt-get -y update
 RUN apt-get install -y sudo git npm
````
````diff
@@ -4,8 +4,8 @@ This repository incorporates material as listed below or described in the code.

 ## Component. Ray.

-Code in tune/\[analysis.py, sample.py, trial.py, result.py\],
-searcher/\[suggestion.py, variant_generator.py\], and scheduler/trial_scheduler.py is adapted from
+Code in tune/[analysis.py, sample.py, trial.py, result.py],
+searcher/[suggestion.py, variant_generator.py], and scheduler/trial_scheduler.py is adapted from
 https://github.com/ray-project/ray/blob/master/python/ray/tune/

 ## Open Source License/Copyright Notice.
````
`README.md` (56 lines changed):

````diff
@@ -14,15 +14,9 @@
 <br>
 </p>

-:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 support, along with a range of new estimators, and comprehensive integration with MLflow—thanks to contributions from the Microsoft Fabric product team.
+:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 and 3.12 support, along with a range of new estimators, and comprehensive integration with MLflow—thanks to contributions from the Microsoft Fabric product team.

-:fire: Heads-up: We have migrated [AutoGen](https://microsoft.github.io/autogen/) into a dedicated [github repository](https://github.com/microsoft/autogen). Alongside this move, we have also launched a dedicated [Discord](https://discord.gg/pAbnFJrkgZ) server and a [website](https://microsoft.github.io/autogen/) for comprehensive documentation.
-
-:fire: The automated multi-agent chat framework in [AutoGen](https://microsoft.github.io/autogen/) is in preview from v2.0.0.
-
-:fire: FLAML is highlighted in OpenAI's [cookbook](https://github.com/openai/openai-cookbook#related-resources-from-around-the-web).
-
-:fire: [autogen](https://microsoft.github.io/autogen/) is released with support for ChatGPT and GPT-4, based on [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673).
+:fire: Heads-up: [AutoGen](https://microsoft.github.io/autogen/) has moved to a dedicated [GitHub repository](https://github.com/microsoft/autogen). FLAML no longer includes the `autogen` module—please use AutoGen directly.

 ## What is FLAML

@@ -30,7 +24,7 @@ FLAML is a lightweight Python library for efficient automation of machine
 learning and AI operations. It automates workflow based on large language models, machine learning models, etc.
 and optimizes their performance.

-- FLAML enables building next-gen GPT-X applications based on multi-agent conversations with minimal effort. It simplifies the orchestration, automation and optimization of a complex GPT-X workflow. It maximizes the performance of GPT-X models and augments their weakness.
+- FLAML enables economical automation and tuning for ML/AI workflows, including model selection and hyperparameter optimization under resource constraints.
 - For common machine learning tasks like classification and regression, it quickly finds quality models for user-provided data with low computational resources. It is easy to customize or extend. Users can find their desired customizability from a smooth range.
 - It supports fast and economical automatic tuning (e.g., inference hyperparameters for foundation models, configurations in MLOps/LMOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations), capable of handling large search space with heterogeneous evaluation cost and complex constraints/guidance/early stopping.

@@ -40,16 +34,16 @@ FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source,

 ## Installation

-FLAML requires **Python version >= 3.8**. It can be installed from pip:
+The latest version of FLAML requires **Python >= 3.10 and < 3.14**. While other Python versions may work for core components, full model support is not guaranteed. FLAML can be installed via `pip`:

 ```bash
 pip install flaml
 ```

-Minimal dependencies are installed without extra options. You can install extra options based on the feature you need. For example, use the following to install the dependencies needed by the [`autogen`](https://microsoft.github.io/autogen/) package.
+Minimal dependencies are installed without extra options. You can install extra options based on the feature you need. For example, use the following to install the dependencies needed by the [`automl`](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML) module.

 ```bash
-pip install "flaml[autogen]"
+pip install "flaml[automl]"
 ```

 Find more options in [Installation](https://microsoft.github.io/FLAML/docs/Installation).
@@ -57,39 +51,6 @@ Each of the [`notebook examples`](https://github.com/microsoft/FLAML/tree/main/n

 ## Quickstart

-- (New) The [autogen](https://microsoft.github.io/autogen/) package enables the next-gen GPT-X applications with a generic multi-agent conversation framework.
-  It offers customizable and conversable agents which integrate LLMs, tools and human.
-  By automating chat among multiple capable agents, one can easily make them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code. For example,
-
-```python
-from flaml import autogen
-
-assistant = autogen.AssistantAgent("assistant")
-user_proxy = autogen.UserProxyAgent("user_proxy")
-user_proxy.initiate_chat(
-    assistant,
-    message="Show me the YTD gain of 10 largest technology companies as of today.",
-)
-# This initiates an automated chat between the two agents to solve the task
-```
-
-Autogen also helps maximize the utility out of the expensive LLMs such as ChatGPT and GPT-4. It offers a drop-in replacement of `openai.Completion` or `openai.ChatCompletion` with powerful functionalites like tuning, caching, templating, filtering. For example, you can optimize generations by LLM with your own tuning data, success metrics and budgets.
-
-```python
-# perform tuning
-config, analysis = autogen.Completion.tune(
-    data=tune_data,
-    metric="success",
-    mode="max",
-    eval_func=eval_func,
-    inference_budget=0.05,
-    optimization_budget=3,
-    num_samples=-1,
-)
-# perform inference for a test instance
-response = autogen.Completion.create(context=test_instance, **config)
-```
-
 - With three lines of code, you can start using this economical and fast
   AutoML engine as a [scikit-learn style estimator](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML).

@@ -111,7 +72,10 @@ automl.fit(X_train, y_train, task="classification", estimator_list=["lgbm"])

 ```python
 from flaml import tune
-tune.run(evaluation_function, config={…}, low_cost_partial_config={…}, time_budget_s=3600)
+
+tune.run(
+    evaluation_function, config={…}, low_cost_partial_config={…}, time_budget_s=3600
+)
 ```

 - [Zero-shot AutoML](https://microsoft.github.io/FLAML/docs/Use-Cases/Zero-Shot-AutoML) allows using the existing training API from lightgbm, xgboost etc. while getting the benefit of AutoML in choosing high-performance hyperparameter configurations per task.
````
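The README snippet above deliberately elides the search space (`config={…}`). Purely as an illustrative sketch of what a filled-in call can look like — the toy objective, space, and budget are invented; `tune.uniform`, `tune.randint`, and `low_cost_partial_config` are documented `flaml.tune` knobs:

```python
from flaml import tune


def evaluation_function(config):
    # toy objective: a quadratic bowl; replace with a real training run
    return {"score": (config["x"] - 3) ** 2 + config["y"]}


analysis = tune.run(
    evaluation_function,
    config={"x": tune.uniform(-10, 10), "y": tune.randint(1, 10)},
    low_cost_partial_config={"y": 1},  # cheap corner of the space to try first
    metric="score",
    mode="min",
    time_budget_s=10,
)
print(analysis.best_config)
```

`low_cost_partial_config` is what lets FLAML's cost-frugal search start from inexpensive configurations before spending budget on costly ones.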
The next hunk's changed line is visually identical before and after; the edit is apparently whitespace-only (e.g., a trailing space removed):

````diff
@@ -12,7 +12,7 @@ If you believe you have found a security vulnerability in any Microsoft-owned re

 Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).

-If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
+If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

 You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
````
````diff
@@ -1,4 +1,5 @@
 import logging
+import warnings

 try:
     from flaml.automl import AutoML, logger_formatter
@@ -12,7 +13,8 @@ from flaml.version import __version__

 # Set the root logger.
 logger = logging.getLogger(__name__)
-logger.setLevel(logging.INFO)
+if logger.level == logging.NOTSET:
+    logger.setLevel(logging.INFO)

 if not has_automl:
-    logger.warning("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")
+    warnings.warn("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")
````
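The `NOTSET` guard means a level configured by the application before import is now respected rather than clobbered. A small sketch of the resulting behavior (assuming this change is applied):

```python
import logging

logging.getLogger("flaml").setLevel(logging.DEBUG)

import flaml  # noqa: E402,F401  # previously this reset the level to INFO; now DEBUG survives

print(logging.getLogger("flaml").level)  # 10, i.e. DEBUG
```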
````diff
@@ -1,3 +1,12 @@
+import warnings
+
 from .agentchat import *
 from .code_utils import DEFAULT_MODEL, FAST_MODEL
 from .oai import *
+
+warnings.warn(
+    "The `flaml.autogen` module is deprecated and will be removed in a future release. "
+    "Please refer to `https://github.com/microsoft/autogen` for latest usage.",
+    DeprecationWarning,
+    stacklevel=2,
+)
````
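Worth remembering that `DeprecationWarning` is hidden by default outside `__main__`, so callers may need a filter to see it. A hypothetical check, valid only in a fresh interpreter (the first import is what fires the warning):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    import flaml.autogen  # noqa: F401  # triggers the warning added above

print(any(issubclass(w.category, DeprecationWarning) for w in caught))  # True
```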
````diff
@@ -156,7 +156,7 @@ class MathUserProxyAgent(UserProxyAgent):
             when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
             default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
             max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
-            **kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
+            **kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
         """
         super().__init__(
             name=name,
@@ -123,7 +123,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
                 can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
                 fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
             - customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
-            **kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
+            **kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
         """
         super().__init__(
             name=name,
````
(One file's diff is suppressed because it is too large.)
````diff
@@ -1,7 +1,7 @@
 try:
     from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
-except ImportError:
-    pass
+except ImportError as e:
+    print(f"scikit-learn is required for HistGradientBoostingEstimator. Please install it; error: {e}")

 from flaml import tune
 from flaml.automl.model import SKLearnEstimator
````
````diff
@@ -2,13 +2,18 @@
 # * Copyright (c) Microsoft Corporation. All rights reserved.
 # * Licensed under the MIT License. See LICENSE file in the
 # * project root for license information.
+import json
 import os
-from datetime import datetime
+import random
+import re
+import uuid
+from datetime import datetime, timedelta
+from decimal import ROUND_HALF_UP, Decimal
 from typing import TYPE_CHECKING, Union

 import numpy as np

-from flaml.automl.spark import DataFrame, Series, pd, ps, psDataFrame, psSeries
+from flaml.automl.spark import DataFrame, F, Series, T, pd, ps, psDataFrame, psSeries
 from flaml.automl.training_log import training_log_reader

 try:
@@ -19,6 +24,7 @@ except ImportError:
 if TYPE_CHECKING:
     from flaml.automl.task import Task

+
 TS_TIMESTAMP_COL = "ds"
 TS_VALUE_COL = "y"
````
````diff
@@ -45,7 +51,10 @@ def load_openml_dataset(dataset_id, data_dir=None, random_state=0, dataset_forma
     """
     import pickle

-    import openml
+    try:
+        import openml
+    except ImportError:
+        openml = None
     from sklearn.model_selection import train_test_split

     filename = "openml_ds" + str(dataset_id) + ".pkl"
@@ -56,15 +65,15 @@ def load_openml_dataset(dataset_id, data_dir=None, random_state=0, dataset_forma
             dataset = pickle.load(f)
     else:
         print("download dataset from openml")
-        dataset = openml.datasets.get_dataset(dataset_id)
+        dataset = openml.datasets.get_dataset(dataset_id) if openml else None
         if not os.path.exists(data_dir):
             os.makedirs(data_dir)
         with open(filepath, "wb") as f:
             pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
-    print("Dataset name:", dataset.name)
+    print("Dataset name:", dataset.name) if dataset else None
     try:
         X, y, *__ = dataset.get_data(target=dataset.default_target_attribute, dataset_format=dataset_format)
-    except ValueError:
+    except (ValueError, AttributeError, TypeError):
         from sklearn.datasets import fetch_openml

         X, y = fetch_openml(data_id=dataset_id, return_X_y=True)
````
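For context, a typical call looks like the sketch below (the dataset id is illustrative, and the module path is assumed from the hunk's imports). Note how the patched code now degrades gracefully: when the `openml` package is missing, `dataset` stays `None`, the `AttributeError` is caught, and the loader falls back to sklearn's `fetch_openml`:

```python
from flaml.automl.data import load_openml_dataset

# downloads the dataset (or loads a cached pickle) and returns a train/test split
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")
print(X_train.shape)
```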
The next hunk keeps `group_counts` as context and appends three new helpers:

````diff
@@ -445,3 +454,343 @@ class DataTransformer:
 def group_counts(groups):
     _, i, c = np.unique(groups, return_counts=True, return_index=True)
     return c[np.argsort(i)]
````

The first addition, `get_random_dataframe`:

```python
def get_random_dataframe(n_rows: int = 200, ratio_none: float = 0.1, seed: int = 42) -> DataFrame:
    """Generate a random pandas DataFrame with various data types for testing.

    This function creates a DataFrame with multiple column types including:

    - Timestamps
    - Integers
    - Floats
    - Categorical values
    - Booleans
    - Lists (tags)
    - Decimal strings
    - UUIDs
    - Binary data (as hex strings)
    - JSON blobs
    - Nullable text fields

    Parameters
    ----------
    n_rows : int, default=200
        Number of rows in the generated DataFrame
    ratio_none : float, default=0.1
        Probability of generating None values in applicable columns
    seed : int, default=42
        Random seed for reproducibility

    Returns
    -------
    pd.DataFrame
        A DataFrame with 14 columns of various data types

    Examples
    --------
    >>> df = get_random_dataframe(100, 0.05, 123)
    >>> df.shape
    (100, 14)
    >>> df.dtypes
    timestamp        datetime64[ns]
    id                        int64
    score                   float64
    status                   object
    flag                     object
    count                    object
    value                    object
    tags                     object
    rating                   object
    uuid                     object
    binary                   object
    json_blob                object
    category               category
    nullable_text            object
    dtype: object
    """

    np.random.seed(seed)
    random.seed(seed)

    def random_tags():
        tags = ["AI", "ML", "data", "robotics", "vision"]
        return random.sample(tags, k=random.randint(1, 3)) if random.random() > ratio_none else None

    def random_decimal():
        return (
            str(Decimal(random.uniform(1, 5)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
            if random.random() > ratio_none
            else None
        )

    def random_json_blob():
        blob = {"a": random.randint(1, 10), "b": random.random()}
        return json.dumps(blob) if random.random() > ratio_none else None

    def random_binary():
        return bytes(random.randint(0, 255) for _ in range(4)).hex() if random.random() > ratio_none else None

    data = {
        "timestamp": [
            datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1000)) if np.random.rand() > ratio_none else None
            for _ in range(n_rows)
        ],
        "id": range(1, n_rows + 1),
        "score": np.random.uniform(0, 100, n_rows),
        "status": np.random.choice(
            ["active", "inactive", "pending", None],
            size=n_rows,
            p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
        ),
        "flag": np.random.choice(
            [True, False, None], size=n_rows, p=[(1 - ratio_none) / 2, (1 - ratio_none) / 2, ratio_none]
        ),
        "count": [np.random.randint(0, 100) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "value": [round(np.random.normal(50, 15), 2) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "tags": [random_tags() for _ in range(n_rows)],
        "rating": [random_decimal() for _ in range(n_rows)],
        "uuid": [str(uuid.uuid4()) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
        "binary": [random_binary() for _ in range(n_rows)],
        "json_blob": [random_json_blob() for _ in range(n_rows)],
        "category": pd.Categorical(
            np.random.choice(
                ["A", "B", "C", None],
                size=n_rows,
                p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
            )
        ),
        "nullable_text": [random.choice(["Good", "Bad", "Average", None]) for _ in range(n_rows)],
    }

    return pd.DataFrame(data)
```
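A usage sketch for the new helper, matching its docstring:

```python
from flaml.automl.data import get_random_dataframe  # module path assumed from the hunk

df = get_random_dataframe(n_rows=100, ratio_none=0.05, seed=123)
print(df.shape)               # (100, 14)
print(df.dtypes["category"])  # category
```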
The second addition, `auto_convert_dtypes_spark`:

```python
def auto_convert_dtypes_spark(
    df: psDataFrame,
    na_values: list = None,
    category_threshold: float = 0.3,
    convert_threshold: float = 0.6,
    sample_ratio: float = 0.1,
) -> tuple[psDataFrame, dict]:
    """Automatically convert data types in a PySpark DataFrame using heuristics.

    This function analyzes a sample of the DataFrame to infer appropriate data types
    and applies the conversions. It handles timestamps, numeric values, booleans,
    and categorical fields.

    Args:
        df: A PySpark DataFrame to convert.
        na_values: List of strings to be considered as NA/NaN. Defaults to
            ['NA', 'na', 'NULL', 'null', ''].
        category_threshold: Maximum ratio of unique values to total values
            to consider a column categorical. Defaults to 0.3.
        convert_threshold: Minimum ratio of successfully converted values required
            to apply a type conversion. Defaults to 0.6.
        sample_ratio: Fraction of data to sample for type inference. Defaults to 0.1.

    Returns:
        tuple: (The DataFrame with converted types, A dictionary mapping column names to
            their inferred types as strings)

    Note:
        - 'category' in the schema dict is conceptual as PySpark doesn't have a true
          category type like pandas
        - The function uses sampling for efficiency with large datasets
    """
    n_rows = df.count()
    if na_values is None:
        na_values = ["NA", "na", "NULL", "null", ""]

    # Normalize NA-like values
    for colname, coltype in df.dtypes:
        if coltype == "string":
            df = df.withColumn(
                colname,
                F.when(F.trim(F.lower(F.col(colname))).isin([v.lower() for v in na_values]), None).otherwise(
                    F.col(colname)
                ),
            )

    schema = {}
    for colname in df.columns:
        # Sample once at an appropriate ratio
        sample_ratio_to_use = min(1.0, sample_ratio if n_rows * sample_ratio > 100 else 100 / n_rows)
        col_sample = df.select(colname).sample(withReplacement=False, fraction=sample_ratio_to_use).dropna()
        sample_count = col_sample.count()

        inferred_type = "string"  # Default

        if col_sample.dtypes[0][1] != "string":
            schema[colname] = col_sample.dtypes[0][1]
            continue

        if sample_count == 0:
            schema[colname] = "string"
            continue

        # Check if timestamp
        ts_col = col_sample.withColumn("parsed", F.to_timestamp(F.col(colname)))

        # Check numeric
        if (
            col_sample.withColumn("n", F.col(colname).cast("double")).filter("n is not null").count()
            >= sample_count * convert_threshold
        ):
            # All whole numbers?
            all_whole = (
                col_sample.withColumn("n", F.col(colname).cast("double"))
                .filter("n is not null")
                .withColumn("frac", F.abs(F.col("n") % 1))
                .filter("frac > 0.000001")
                .count()
                == 0
            )
            inferred_type = "int" if all_whole else "double"

        # Check low-cardinality (category-like)
        elif (
            sample_count > 0
            and col_sample.select(F.countDistinct(F.col(colname))).collect()[0][0] / sample_count <= category_threshold
        ):
            inferred_type = "category"  # Will just be string, but marked as such

        # Check if timestamp
        elif ts_col.filter(F.col("parsed").isNotNull()).count() >= sample_count * convert_threshold:
            inferred_type = "timestamp"

        schema[colname] = inferred_type

    # Apply inferred schema
    for colname, inferred_type in schema.items():
        if inferred_type == "int":
            df = df.withColumn(colname, F.col(colname).cast(T.IntegerType()))
        elif inferred_type == "double":
            df = df.withColumn(colname, F.col(colname).cast(T.DoubleType()))
        elif inferred_type == "boolean":
            df = df.withColumn(
                colname,
                F.when(F.lower(F.col(colname)).isin("true", "yes", "1"), True)
                .when(F.lower(F.col(colname)).isin("false", "no", "0"), False)
                .otherwise(None),
            )
        elif inferred_type == "timestamp":
            df = df.withColumn(colname, F.to_timestamp(F.col(colname)))
        elif inferred_type == "category":
            df = df.withColumn(colname, F.col(colname).cast(T.StringType()))  # Marked conceptually

        # otherwise keep as string (or original type)

    return df, schema
```
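One thing worth noting: the annotation says `psDataFrame` (pandas-on-Spark), but the body uses `withColumn`/`F.col`, i.e. the Spark SQL `DataFrame` API, so a plain Spark DataFrame appears to be the intended input. A hypothetical usage sketch under that assumption (data invented for illustration):

```python
from pyspark.sql import SparkSession

from flaml.automl.data import auto_convert_dtypes_spark  # module path assumed

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame(
    [("1", "2021-01-01"), ("2", "2021-01-02"), ("3", "NA")],  # "NA" is normalized to null
    ["n", "ts"],
)
converted, schema = auto_convert_dtypes_spark(sdf, sample_ratio=1.0)
print(schema)  # e.g. {'n': 'int', 'ts': 'timestamp'}
```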
The third addition, `auto_convert_dtypes_pandas`:

```python
def auto_convert_dtypes_pandas(
    df: DataFrame,
    na_values: list = None,
    category_threshold: float = 0.3,
    convert_threshold: float = 0.6,
    sample_ratio: float = 1.0,
) -> tuple[DataFrame, dict]:
    """Automatically convert data types in a pandas DataFrame using heuristics.

    This function analyzes the DataFrame to infer appropriate data types
    and applies the conversions. It handles timestamps, timedeltas, numeric values,
    and categorical fields.

    Args:
        df: A pandas DataFrame to convert.
        na_values: List of strings to be considered as NA/NaN. Defaults to
            ['NA', 'na', 'NULL', 'null', ''].
        category_threshold: Maximum ratio of unique values to total values
            to consider a column categorical. Defaults to 0.3.
        convert_threshold: Minimum ratio of successfully converted values required
            to apply a type conversion. Defaults to 0.6.
        sample_ratio: Fraction of data to sample for type inference. Not used in pandas version
            but included for API compatibility. Defaults to 1.0.

    Returns:
        tuple: (The DataFrame with converted types, A dictionary mapping column names to
            their inferred types as strings)
    """
    if na_values is None:
        na_values = {"NA", "na", "NULL", "null", ""}
    # Remove the empty string separately (handled by the regex `^\s*$`)
    vals = [re.escape(v) for v in na_values if v != ""]
    # Build inner alternation group
    inner = "|".join(vals) if vals else ""
    if inner:
        pattern = re.compile(rf"^\s*(?:{inner})?\s*$")
    else:
        pattern = re.compile(r"^\s*$")

    df_converted = df.convert_dtypes()
    schema = {}

    # Sample if needed (for API compatibility)
    if sample_ratio < 1.0:
        df = df.sample(frac=sample_ratio)

    n_rows = len(df)

    for col in df.columns:
        series = df[col]
        # Replace NA-like values if string
        if series.dtype == object:
            mask = series.astype(str).str.match(pattern)
            series_cleaned = series.where(~mask, np.nan)
        else:
            series_cleaned = series

        # Skip conversion if already non-object data type, except bool which can potentially be categorical
        if (
            not isinstance(series_cleaned.dtype, pd.BooleanDtype)
            and not isinstance(series_cleaned.dtype, pd.StringDtype)
            and series_cleaned.dtype != "object"
        ):
            # Keep the original data type for non-object dtypes
            df_converted[col] = series
            schema[col] = str(series_cleaned.dtype)
            continue

        # print(f"type: {series_cleaned.dtype}, column: {series_cleaned.name}")

        if not isinstance(series_cleaned.dtype, pd.BooleanDtype):
            # Try numeric (int or float)
            numeric = pd.to_numeric(series_cleaned, errors="coerce")
            if numeric.notna().sum() >= n_rows * convert_threshold:
                if (numeric.dropna() % 1 == 0).all():
                    try:
                        df_converted[col] = numeric.astype("int")  # Nullable integer
                        schema[col] = "int"
                        continue
                    except Exception:
                        pass
                df_converted[col] = numeric.astype("double")
                schema[col] = "double"
                continue

            # Try datetime
            datetime_converted = pd.to_datetime(series_cleaned, errors="coerce")
            if datetime_converted.notna().sum() >= n_rows * convert_threshold:
                df_converted[col] = datetime_converted
                schema[col] = "timestamp"
                continue

            # Try timedelta
            try:
                timedelta_converted = pd.to_timedelta(series_cleaned, errors="coerce")
                if timedelta_converted.notna().sum() >= n_rows * convert_threshold:
                    df_converted[col] = timedelta_converted
                    schema[col] = "timedelta"
                    continue
            except TypeError:
                pass

        # Try category
        try:
            unique_ratio = series_cleaned.nunique(dropna=True) / n_rows if n_rows > 0 else 1.0
            if unique_ratio <= category_threshold:
                df_converted[col] = series_cleaned.astype("category")
                schema[col] = "category"
                continue
        except Exception:
            pass
        df_converted[col] = series_cleaned.astype("string")
        schema[col] = "string"

    return df_converted, schema
```
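And the pandas twin, with an invented all-string frame that exercises three of the inference branches:

```python
import pandas as pd

from flaml.automl.data import auto_convert_dtypes_pandas  # module path assumed

raw = pd.DataFrame(
    {
        "n": [str(i) for i in range(10)],                         # all whole numbers -> int
        "ts": ["2021-01-%02d" % (i + 1) for i in range(10)],      # parseable dates  -> timestamp
        "flag": ["yes", "no"] * 5,                                # 2/10 unique      -> category
    }
)
converted, schema = auto_convert_dtypes_pandas(raw)
print(schema)  # {'n': 'int', 'ts': 'timestamp', 'flag': 'category'}
```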
````diff
@@ -1,7 +1,37 @@
 import logging
+import os
+
+
+class ColoredFormatter(logging.Formatter):
+    # ANSI escape codes for colors
+    COLORS = {
+        # logging.DEBUG: "\033[36m",  # Cyan
+        # logging.INFO: "\033[32m",  # Green
+        logging.WARNING: "\033[33m",  # Yellow
+        logging.ERROR: "\033[31m",  # Red
+        logging.CRITICAL: "\033[1;31m",  # Bright Red
+    }
+    RESET = "\033[0m"  # Reset to default
+
+    def __init__(self, fmt, datefmt, use_color=True):
+        super().__init__(fmt, datefmt)
+        self.use_color = use_color
+
+    def format(self, record):
+        formatted = super().format(record)
+        if self.use_color:
+            color = self.COLORS.get(record.levelno, "")
+            if color:
+                return f"{color}{formatted}{self.RESET}"
+        return formatted
+

 logger = logging.getLogger(__name__)
-logger_formatter = logging.Formatter(
-    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
+use_color = True
+if os.getenv("FLAML_LOG_NO_COLOR"):
+    use_color = False
+
+logger_formatter = ColoredFormatter(
+    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
 )
 logger.propagate = False
````
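The formatter drops in anywhere a `logging.Formatter` does; for instance (the handler wiring is invented for illustration, the module path is assumed from the hunk, and per the diff the `FLAML_LOG_NO_COLOR` environment variable disables the ANSI codes):

```python
import logging

from flaml.automl.logger import ColoredFormatter  # module path assumed

handler = logging.StreamHandler()
handler.setFormatter(
    ColoredFormatter("[%(name)s] %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color=True)
)
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("budget nearly exhausted")  # rendered in yellow on ANSI-capable terminals
```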
````diff
@@ -127,9 +127,21 @@ def metric_loss_score(
         import datasets

         datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
-        metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
         metric_mode = huggingface_metric_to_mode[datasets_metric_name]

+        # datasets>=3 removed load_metric; prefer evaluate if available
+        try:
+            import evaluate
+
+            metric = evaluate.load(datasets_metric_name, trust_remote_code=True)
+        except Exception:
+            if hasattr(datasets, "load_metric"):
+                metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
+            else:
+                from datasets import load_metric as _load_metric  # older datasets
+
+                metric = _load_metric(datasets_metric_name, trust_remote_code=True)
+
         if metric_name.startswith("seqeval"):
             y_processed_true = [[labels[tr] for tr in each_list] for each_list in y_processed_true]
         elif metric in ("pearsonr", "spearmanr"):
````
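The fallback chain exists because `datasets` 3.x dropped `load_metric` in favor of the standalone `evaluate` package. The same pattern in isolation (metric name illustrative):

```python
# prefer the evaluate package; fall back to datasets.load_metric on old installs
try:
    import evaluate

    metric = evaluate.load("accuracy")
except Exception:
    import datasets

    metric = datasets.load_metric("accuracy")  # only exists on older datasets releases

print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))  # {'accuracy': 0.666...}
```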
````diff
@@ -299,14 +311,14 @@ def get_y_pred(estimator, X, eval_metric, task: Task):
     else:
         y_pred = estimator.predict(X)

-    if isinstance(y_pred, Series) or isinstance(y_pred, DataFrame):
+    if isinstance(y_pred, (Series, DataFrame)):
         y_pred = y_pred.values

     return y_pred


 def to_numpy(x):
-    if isinstance(x, Series or isinstance(x, DataFrame)):
+    if isinstance(x, (Series, DataFrame)):
         x = x.values
     else:
         x = np.ndarray(x)
@@ -574,7 +586,7 @@ def _eval_estimator(

     # TODO: why are integer labels being cast to str in the first place?

-    if isinstance(val_pred_y, Series) or isinstance(val_pred_y, DataFrame) or isinstance(val_pred_y, np.ndarray):
+    if isinstance(val_pred_y, (Series, DataFrame, np.ndarray)):
         test = val_pred_y if isinstance(val_pred_y, np.ndarray) else val_pred_y.values
         if not np.issubdtype(test.dtype, np.number):
             # some NLP models return a list
@@ -604,7 +616,12 @@ def _eval_estimator(
             logger.warning(f"ValueError {e} happened in `metric_loss_score`, set `val_loss` to `np.inf`")
     metric_for_logging = {"pred_time": pred_time}
     if log_training_metric:
-        train_pred_y = get_y_pred(estimator, X_train, eval_metric, task)
+        # For time series forecasting, X_train may be a sampled dataset whose
+        # test partition can be empty. Use the training partition from X_val
+        # (which is the dataset used to define y_train above) to keep shapes
+        # aligned and avoid empty prediction inputs.
+        X_train_for_metric = X_val.X_train if isinstance(X_val, TimeSeriesDataset) else X_train
+        train_pred_y = get_y_pred(estimator, X_train_for_metric, eval_metric, task)
         metric_for_logging["train_loss"] = metric_loss_score(
             eval_metric,
             train_pred_y,
````
````diff
@@ -9,6 +9,7 @@ import os
 import shutil
 import signal
 import sys
+import threading
 import time
 import warnings
 from contextlib import contextmanager
@@ -25,6 +26,13 @@ from sklearn.preprocessing import Normalizer
 from sklearn.svm import LinearSVC
 from xgboost import __version__ as xgboost_version

+try:
+    from sklearn.utils._tags import ClassifierTags, RegressorTags
+
+    SKLEARN_TAGS_AVAILABLE = True
+except ImportError:
+    SKLEARN_TAGS_AVAILABLE = False
+
 from flaml import tune
 from flaml.automl.data import group_counts
 from flaml.automl.spark import ERROR as SPARK_ERROR
@@ -89,24 +97,28 @@ def limit_resource(memory_limit, time_limit):
         except ValueError:
             # According to https://bugs.python.org/issue40518, it's a mac-specific error.
             pass
-    main_thread = False
-    if time_limit is not None:
+    alarm_set = False
+    if time_limit is not None and threading.current_thread() is threading.main_thread():
         try:
             signal.signal(signal.SIGALRM, TimeoutHandler)
             signal.alarm(int(time_limit) or 1)
-            main_thread = True
+            alarm_set = True
         except ValueError:
             pass

     try:
         yield
     finally:
-        if main_thread:
+        if alarm_set:
             signal.alarm(0)
         if memory_limit > 0:
-            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
+            try:
+                resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
+            except ValueError:
+                pass


-class BaseEstimator:
+class BaseEstimator(sklearn.base.ClassifierMixin, sklearn.base.BaseEstimator):
     """The abstract class for all learners.

     Typical examples:
````
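The `threading.main_thread()` guard matters because CPython only allows `signal.signal`/`signal.alarm` in the main thread (and `SIGALRM` only exists on Unix). A standalone demonstration of the failure mode being avoided:

```python
import signal
import threading


def try_set_alarm():
    try:
        signal.signal(signal.SIGALRM, lambda *args: None)  # only legal in the main thread
        signal.alarm(0)
        print("alarm armed")
    except ValueError as e:
        print(f"skipped: {e}")  # "signal only works in main thread of the main interpreter"


try_set_alarm()                           # main thread: arms fine
worker = threading.Thread(target=try_set_alarm)
worker.start()                            # worker thread: raises ValueError, now sidestepped
worker.join()
```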
@@ -130,7 +142,8 @@ class BaseEstimator:
|
||||
self._task = task if isinstance(task, Task) else task_factory(task, None, None)
|
||||
self.params = self.config2params(config)
|
||||
self.estimator_class = self._model = None
|
||||
if "_estimator_type" in config:
|
||||
self.estimator_baseclass = "sklearn"
|
||||
if "_estimator_type" in self.params:
|
||||
self._estimator_type = self.params.pop("_estimator_type")
|
||||
else:
|
||||
self._estimator_type = "classifier" if self._task.is_classification() else "regressor"
|
||||
@@ -142,6 +155,25 @@ class BaseEstimator:
|
||||
params["_estimator_type"] = self._estimator_type
|
||||
return params
|
||||
|
||||
def __sklearn_tags__(self):
|
||||
"""Override sklearn tags to respect the _estimator_type attribute.
|
||||
|
||||
This is needed for sklearn 1.7+ which uses get_tags() instead of
|
||||
checking _estimator_type directly. Since BaseEstimator inherits from
|
||||
ClassifierMixin, it would otherwise always be tagged as a classifier.
|
||||
"""
|
||||
tags = super().__sklearn_tags__()
|
||||
if hasattr(self, "_estimator_type") and SKLEARN_TAGS_AVAILABLE:
|
||||
if self._estimator_type == "regressor":
|
||||
tags.estimator_type = "regressor"
|
||||
tags.regressor_tags = RegressorTags()
|
||||
tags.classifier_tags = None
|
||||
elif self._estimator_type == "classifier":
|
||||
tags.estimator_type = "classifier"
|
||||
tags.classifier_tags = ClassifierTags()
|
||||
tags.regressor_tags = None
|
||||
return tags
|
||||
|
||||
@property
|
||||
def classes_(self):
|
||||
return self._model.classes_
|
||||
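To see the override in effect outside of FLAML, here is a minimal self-contained sketch of the same pattern — a ClassifierMixin subclass re-tagged as a regressor. It assumes scikit-learn >= 1.6, where `__sklearn_tags__` and `sklearn.utils.get_tags` exist; `Retagged` is a hypothetical stand-in, not FLAML code:

```python
import sklearn.base

try:
    from sklearn.utils._tags import ClassifierTags, RegressorTags  # scikit-learn >= 1.6

    TAGS_AVAILABLE = True
except ImportError:
    TAGS_AVAILABLE = False


class Retagged(sklearn.base.ClassifierMixin, sklearn.base.BaseEstimator):
    _estimator_type = "regressor"  # what we want get_tags() to report

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        if TAGS_AVAILABLE:
            tags.estimator_type = "regressor"
            tags.regressor_tags = RegressorTags()
            tags.classifier_tags = None
        return tags


if TAGS_AVAILABLE:
    from sklearn.utils import get_tags

    print(get_tags(Retagged()).estimator_type)  # -> "regressor"
```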
@@ -289,6 +321,35 @@ class BaseEstimator:
        train_time = self._fit(X_train, y_train, **kwargs)
        return train_time

    def preprocess(self, X):
        """Preprocess data using estimator-level preprocessing.

        This method applies estimator-specific preprocessing transformations to the input data.
        This is the second level of preprocessing that should be applied after task-level
        preprocessing (automl.preprocess()). Different estimator types may apply different
        preprocessing steps (e.g., sparse matrix conversion, dataframe handling).

        Args:
            X: A numpy array or a dataframe of featurized instances, shape n*m.

        Returns:
            Preprocessed data ready for the estimator's predict/fit methods.

        Example:
        ```python
        automl = AutoML()
        automl.fit(X_train, y_train, task="classification")

        # First apply task-level preprocessing
        X_test_task = automl.preprocess(X_test)

        # Then apply estimator-level preprocessing
        estimator = automl.model
        X_test_estimator = estimator.preprocess(X_test_task)
        ```
        """
        return self._preprocess(X)

    def predict(self, X, **kwargs):
        """Predict label from features.
@@ -434,6 +495,7 @@ class SparkEstimator(BaseEstimator):
            raise SPARK_ERROR
        super().__init__(task, **config)
        self.df_train = None
        self.estimator_baseclass = "spark"

    def _preprocess(
        self,
@@ -969,7 +1031,7 @@ class TransformersEstimator(BaseEstimator):
        from .nlp.huggingface.utils import tokenize_text
        from .nlp.utils import is_a_list_of_str

        is_str = str(X.dtypes[0]) in ("string", "str")
        is_str = str(X.dtypes.iloc[0]) in ("string", "str")
        is_list_of_str = is_a_list_of_str(X[list(X.keys())[0]].to_list()[0])

        if is_str or is_list_of_str:
@@ -1134,16 +1196,31 @@ class TransformersEstimator(BaseEstimator):
            control.should_save = True
            control.should_evaluate = True

        self._trainer = TrainerForAuto(
            args=self._training_args,
            model_init=self._model_init,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=self.tokenizer,
            data_collator=self.data_collator,
            compute_metrics=self._compute_metrics_by_dataset_name,
            callbacks=[EarlyStoppingCallbackForAuto],
        )
        # Use processing_class for transformers >= 4.44.0, tokenizer for older versions
        trainer_kwargs = {
            "args": self._training_args,
            "model_init": self._model_init,
            "train_dataset": train_dataset,
            "eval_dataset": eval_dataset,
            "data_collator": self.data_collator,
            "compute_metrics": self._compute_metrics_by_dataset_name,
            "callbacks": [EarlyStoppingCallbackForAuto],
        }

        # Check if processing_class parameter is supported (transformers >= 4.44.0)
        try:
            import transformers
            from packaging import version

            if version.parse(transformers.__version__) >= version.parse("4.44.0"):
                trainer_kwargs["processing_class"] = self.tokenizer
            else:
                trainer_kwargs["tokenizer"] = self.tokenizer
        except (ImportError, AttributeError, ValueError):
            # Fallback to tokenizer if version check fails
            trainer_kwargs["tokenizer"] = self.tokenizer

        self._trainer = TrainerForAuto(**trainer_kwargs)

        if self._task in NLG_TASKS:
            setattr(self._trainer, "_is_seq2seq", True)
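The version probe above can be read as a small standalone helper — a sketch, not FLAML API; `tokenizer_kwarg` is a hypothetical name, and it assumes `transformers` and `packaging` are importable:

```python
from packaging import version


def tokenizer_kwarg(tokenizer):
    """Return the Trainer keyword for the tokenizer, based on the installed transformers version."""
    try:
        import transformers

        if version.parse(transformers.__version__) >= version.parse("4.44.0"):
            return {"processing_class": tokenizer}  # new keyword
    except (ImportError, AttributeError, ValueError):
        pass  # fall through to the legacy keyword on any failure
    return {"tokenizer": tokenizer}  # legacy keyword
```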
@@ -1691,7 +1768,7 @@ class XGBoostEstimator(SKLearnEstimator):
        # use_label_encoder is deprecated in 1.7.
        if xgboost_version < "1.7.0":
            params["use_label_encoder"] = params.get("use_label_encoder", False)
        if "n_jobs" in config:
        if "n_jobs" in params:
            params["nthread"] = params.pop("n_jobs")
        return params

@@ -1891,7 +1968,7 @@ class RandomForestEstimator(SKLearnEstimator, LGBMEstimator):
        params = super().config2params(config)
        if "max_leaves" in params:
            params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))
        if not self._task.is_classification() and "criterion" in config:
        if not self._task.is_classification() and "criterion" in params:
            params.pop("criterion")
        if "random_state" not in params:
            params["random_state"] = 12032022
@@ -2066,8 +2143,8 @@ class CatBoostEstimator(BaseEstimator):
            self.estimator_class = CatBoostRegressor

    def fit(self, X_train, y_train, budget=None, free_mem_ratio=0, **kwargs):
        if "is_retrain" in kwargs:
            kwargs.pop("is_retrain")
        kwargs.pop("is_retrain", None)
        kwargs.pop("groups", None)
        start_time = time.time()
        deadline = start_time + budget if budget else np.inf
        train_dir = f"catboost_{str(start_time)}"
@@ -2342,9 +2419,12 @@ class SGDEstimator(SKLearnEstimator):
        params = super().config2params(config)
        params["tol"] = params.get("tol", 0.0001)
        params["loss"] = params.get("loss", None)
        if params["loss"] is None and self._task.is_classification():
            params["loss"] = "log_loss" if SKLEARN_VERSION >= "1.1" else "log"
        if not self._task.is_classification():
        if params["loss"] is None:
            if self._task.is_classification():
                params["loss"] = "log_loss" if SKLEARN_VERSION >= "1.1" else "log"
            else:
                params["loss"] = "squared_error"
        if not self._task.is_classification() and "n_jobs" in params:
            params.pop("n_jobs")

        if params.get("penalty") != "elasticnet":
@@ -2815,7 +2895,7 @@ class suppress_stdout_stderr:
        # Open a pair of null files
        self.null_fds = [os.open(os.devnull, os.O_RDWR) for x in range(2)]
        # Save the actual stdout (1) and stderr (2) file descriptors.
        self.save_fds = (os.dup(1), os.dup(2))
        self.save_fds = [os.dup(1), os.dup(2)]

    def __enter__(self):
        # Assign the null pointers to stdout and stderr.
@@ -2827,5 +2907,5 @@ class suppress_stdout_stderr:
        os.dup2(self.save_fds[0], 1)
        os.dup2(self.save_fds[1], 2)
        # Close the null files
        os.close(self.null_fds[0])
        os.close(self.null_fds[1])
        for fd in self.null_fds + self.save_fds:
            os.close(fd)

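The fix above closes the saved duplicates as well as the null descriptors, avoiding a file-descriptor leak. A minimal standalone sketch of the pattern (plain `os` calls, no FLAML involved):

```python
import os

null_fds = [os.open(os.devnull, os.O_RDWR) for _ in range(2)]
save_fds = [os.dup(1), os.dup(2)]  # keep handles to the real stdout/stderr
try:
    os.dup2(null_fds[0], 1)  # silence stdout
    os.dup2(null_fds[1], 2)  # silence stderr
    print("this goes nowhere")
finally:
    os.dup2(save_fds[0], 1)  # restore stdout
    os.dup2(save_fds[1], 2)  # restore stderr
    for fd in null_fds + save_fds:  # close every duplicated descriptor, not just the null ones
        os.close(fd)
print("visible again")
```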
@@ -5,7 +5,7 @@ from typing import List, Optional
from flaml.automl.task.task import NLG_TASKS

try:
    from transformers import TrainingArguments
    from transformers import Seq2SeqTrainingArguments as TrainingArguments
except ImportError:
    TrainingArguments = object

@@ -77,6 +77,14 @@ class TrainingArgumentsForAuto(TrainingArguments):

    logging_steps: int = field(default=500, metadata={"help": "Log every X updates steps."})

    # Newer versions of HuggingFace Transformers may access `TrainingArguments.generation_config`
    # (e.g., in generation-aware trainers/callbacks). Keep this attribute to remain compatible
    # while defaulting to None for non-generation tasks.
    generation_config: Optional[object] = field(
        default=None,
        metadata={"help": "Optional generation config (or path) used by generation-aware trainers."},
    )

    @staticmethod
    def load_args_from_console():
        from dataclasses import fields

@@ -211,29 +211,28 @@ def tokenize_onedataframe(
    hf_args=None,
    prefix_str=None,
):
    with tokenizer.as_target_tokenizer():
        _, tokenized_column_names = tokenize_row(
            dict(X.iloc[0]),
    _, tokenized_column_names = tokenize_row(
        dict(X.iloc[0]),
        tokenizer,
        prefix=(prefix_str,) if task is SUMMARIZATION else None,
        task=task,
        hf_args=hf_args,
        return_column_name=True,
    )
    d = X.apply(
        lambda x: tokenize_row(
            x,
            tokenizer,
            prefix=(prefix_str,) if task is SUMMARIZATION else None,
            task=task,
            hf_args=hf_args,
            return_column_name=True,
        )
        d = X.apply(
            lambda x: tokenize_row(
                x,
                tokenizer,
                prefix=(prefix_str,) if task is SUMMARIZATION else None,
                task=task,
                hf_args=hf_args,
            ),
            axis=1,
            result_type="expand",
        )
        X_tokenized = pd.DataFrame(columns=tokenized_column_names)
        X_tokenized[tokenized_column_names] = d
        return X_tokenized
        ),
        axis=1,
        result_type="expand",
    )
    X_tokenized = pd.DataFrame(columns=tokenized_column_names)
    X_tokenized[tokenized_column_names] = d
    return X_tokenized


def tokenize_row(
@@ -396,7 +395,7 @@ def load_model(checkpoint_path, task, num_labels=None):

    if task in (SEQCLASSIFICATION, SEQREGRESSION):
        return AutoModelForSequenceClassification.from_pretrained(
            checkpoint_path, config=model_config, ignore_mismatched_sizes=True
            checkpoint_path, config=model_config, ignore_mismatched_sizes=True, trust_remote_code=True
        )
    elif task == TOKENCLASSIFICATION:
        return AutoModelForTokenClassification.from_pretrained(checkpoint_path, config=model_config)

@@ -25,9 +25,7 @@ def load_default_huggingface_metric_for_task(task):


def is_a_list_of_str(this_obj):
    return (isinstance(this_obj, list) or isinstance(this_obj, np.ndarray)) and all(
        isinstance(x, str) for x in this_obj
    )
    return isinstance(this_obj, (list, np.ndarray)) and all(isinstance(x, str) for x in this_obj)


def _clean_value(value: Any) -> str:

@@ -1,3 +1,5 @@
import atexit
import logging
import os

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
@@ -10,13 +12,14 @@ try:
    from pyspark.pandas import Series as psSeries
    from pyspark.pandas import set_option
    from pyspark.sql import DataFrame as sparkDataFrame
    from pyspark.sql import SparkSession
    from pyspark.util import VersionUtils
except ImportError:

    class psDataFrame:
        pass

    F = T = ps = sparkDataFrame = psSeries = psDataFrame
    F = T = ps = sparkDataFrame = SparkSession = psSeries = psDataFrame
    _spark_major_minor_version = set_option = None
    ERROR = ImportError(
        """Please run pip install flaml[spark]
@@ -32,3 +35,60 @@ try:
    from pandas import DataFrame, Series
except ImportError:
    DataFrame = Series = pd = None


logger = logging.getLogger(__name__)


def disable_spark_ansi_mode():
    """Disable Spark ANSI mode if it is enabled."""
    spark = SparkSession.getActiveSession() if hasattr(SparkSession, "getActiveSession") else None
    adjusted = False
    try:
        ps_conf = ps.get_option("compute.fail_on_ansi_mode")
    except Exception:
        ps_conf = None
    ansi_conf = [None, ps_conf]  # ansi_conf and ps_conf original values
    # Spark may store the config as string 'true'/'false' (or boolean in some contexts)
    if spark is not None:
        ansi_conf[0] = spark.conf.get("spark.sql.ansi.enabled")
        ansi_enabled = (
            (isinstance(ansi_conf[0], str) and ansi_conf[0].lower() == "true")
            or (isinstance(ansi_conf[0], bool) and ansi_conf[0] is True)
            or ansi_conf[0] is None
        )
        try:
            if ansi_enabled:
                logger.debug("Adjusting spark.sql.ansi.enabled to false")
                spark.conf.set("spark.sql.ansi.enabled", "false")
                adjusted = True
        except Exception:
            # If reading/setting options fail for some reason, keep going and let
            # pandas-on-Spark raise a meaningful error later.
            logger.exception("Failed to set spark.sql.ansi.enabled")

    if ansi_conf[1]:
        logger.debug("Adjusting pandas-on-Spark compute.fail_on_ansi_mode to False")
        ps.set_option("compute.fail_on_ansi_mode", False)
        adjusted = True

    return spark, ansi_conf, adjusted


def restore_spark_ansi_mode(spark, ansi_conf, adjusted):
    """Restore Spark ANSI mode to its original setting."""
    # Restore the original spark.sql.ansi.enabled to avoid persistent side-effects.
    if adjusted and spark and ansi_conf[0] is not None:
        try:
            logger.debug(f"Restoring spark.sql.ansi.enabled to {ansi_conf[0]}")
            spark.conf.set("spark.sql.ansi.enabled", ansi_conf[0])
        except Exception:
            logger.exception("Failed to restore spark.sql.ansi.enabled")

    if adjusted and ansi_conf[1]:
        logger.debug(f"Restoring pandas-on-Spark compute.fail_on_ansi_mode to {ansi_conf[1]}")
        ps.set_option("compute.fail_on_ansi_mode", ansi_conf[1])


spark, ansi_conf, adjusted = disable_spark_ansi_mode()
atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
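The module applies this disable/restore pair once at import time and undoes it at interpreter exit. For a scoped variant, the same two functions compose naturally into a context manager — a hypothetical helper, not part of the module:

```python
from contextlib import contextmanager


@contextmanager
def ansi_mode_disabled():
    """Disable Spark ANSI mode for the duration of a block, then restore it."""
    spark, ansi_conf, adjusted = disable_spark_ansi_mode()
    try:
        yield spark
    finally:
        restore_spark_ansi_mode(spark, ansi_conf, adjusted)


# with ansi_mode_disabled():
#     ...  # pandas-on-Spark work that would fail under ANSI mode
```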
@@ -59,17 +59,29 @@ def to_pandas_on_spark(
    ```
    """
    set_option("compute.default_index_type", default_index_type)
    if isinstance(df, (DataFrame, Series)):
        return ps.from_pandas(df)
    elif isinstance(df, sparkDataFrame):
        if _spark_major_minor_version[0] == 3 and _spark_major_minor_version[1] < 3:
            return df.to_pandas_on_spark(index_col=index_col)
    try:
        orig_ps_conf = ps.get_option("compute.fail_on_ansi_mode")
    except Exception:
        orig_ps_conf = None
    if orig_ps_conf:
        ps.set_option("compute.fail_on_ansi_mode", False)

    try:
        if isinstance(df, (DataFrame, Series)):
            return ps.from_pandas(df)
        elif isinstance(df, sparkDataFrame):
            if _spark_major_minor_version[0] == 3 and _spark_major_minor_version[1] < 3:
                return df.to_pandas_on_spark(index_col=index_col)
            else:
                return df.pandas_api(index_col=index_col)
        elif isinstance(df, (psDataFrame, psSeries)):
            return df
        else:
            return df.pandas_api(index_col=index_col)
    elif isinstance(df, (psDataFrame, psSeries)):
        return df
    else:
        raise TypeError(f"{type(df)} is not one of pandas.DataFrame, pandas.Series and pyspark.sql.DataFrame")
            raise TypeError(f"{type(df)} is not one of pandas.DataFrame, pandas.Series and pyspark.sql.DataFrame")
    finally:
        # Restore original config
        if orig_ps_conf:
            ps.set_option("compute.fail_on_ansi_mode", orig_ps_conf)


def train_test_split_pyspark(

@@ -37,10 +37,9 @@ class SearchState:
        if isinstance(domain_one_dim, sample.Domain):
            renamed_type = list(inspect.signature(domain_one_dim.is_valid).parameters.values())[0].annotation
            type_match = (
                renamed_type == Any
                renamed_type is Any
                or isinstance(value_one_dim, renamed_type)
                or isinstance(value_one_dim, int)
                and renamed_type is float
                or (renamed_type is float and isinstance(value_one_dim, int))
            )
            if not (type_match and domain_one_dim.is_valid(value_one_dim)):
                return False

@@ -365,6 +365,465 @@ class GenericTask(Task):
            X_train, X_val, y_train, y_val = GenericTask._split_pyspark(state, X, y, split_ratio, stratify)
        return X_train, X_val, y_train, y_val

    def _handle_missing_labels_fast(
        self,
        state,
        X_train,
        X_val,
        y_train,
        y_val,
        X_train_all,
        y_train_all,
        is_spark_dataframe,
        data_is_df,
    ):
        """Handle missing labels by adding first instance to the set with missing label.

        This is the faster version that may create some overlap but ensures all labels
        are present in both sets. If a label is missing from train, it adds the first
        instance to train. If a label is missing from val, it adds the first instance to val.
        If no labels are missing, no instances are duplicated.

        Args:
            state: The state object containing fit parameters
            X_train, X_val: Training and validation features
            y_train, y_val: Training and validation labels
            X_train_all, y_train_all: Complete dataset
            is_spark_dataframe: Whether data is pandas_on_spark
            data_is_df: Whether data is DataFrame/Series

        Returns:
            Tuple of (X_train, X_val, y_train, y_val) with missing labels added
        """
        # Check which labels are present in train and val sets
        if is_spark_dataframe:
            label_set_train, _ = unique_pandas_on_spark(y_train)
            label_set_val, _ = unique_pandas_on_spark(y_val)
            label_set_all, first = unique_value_first_index(y_train_all)
        else:
            label_set_all, first = unique_value_first_index(y_train_all)
            label_set_train = np.unique(y_train)
            label_set_val = np.unique(y_val)

        # Find missing labels
        missing_in_train = np.setdiff1d(label_set_all, label_set_train)
        missing_in_val = np.setdiff1d(label_set_all, label_set_val)

        # Add first instance of missing labels to train set
        if len(missing_in_train) > 0:
            missing_train_indices = []
            for label in missing_in_train:
                label_matches = np.where(label_set_all == label)[0]
                if len(label_matches) > 0 and label_matches[0] < len(first):
                    missing_train_indices.append(first[label_matches[0]])

            if len(missing_train_indices) > 0:
                X_missing_train = (
                    iloc_pandas_on_spark(X_train_all, missing_train_indices)
                    if is_spark_dataframe
                    else X_train_all.iloc[missing_train_indices]
                    if data_is_df
                    else X_train_all[missing_train_indices]
                )
                y_missing_train = (
                    iloc_pandas_on_spark(y_train_all, missing_train_indices)
                    if is_spark_dataframe
                    else y_train_all.iloc[missing_train_indices]
                    if isinstance(y_train_all, (pd.Series, psSeries))
                    else y_train_all[missing_train_indices]
                )
                X_train = concat(X_missing_train, X_train)
                y_train = concat(y_missing_train, y_train) if data_is_df else np.concatenate([y_missing_train, y_train])

                # Handle sample_weight if present
                if "sample_weight" in state.fit_kwargs:
                    sample_weight_source = (
                        state.sample_weight_all
                        if hasattr(state, "sample_weight_all")
                        else state.fit_kwargs.get("sample_weight")
                    )
                    if sample_weight_source is not None and max(missing_train_indices) < len(sample_weight_source):
                        missing_weights = (
                            sample_weight_source[missing_train_indices]
                            if isinstance(sample_weight_source, np.ndarray)
                            else sample_weight_source.iloc[missing_train_indices]
                        )
                        state.fit_kwargs["sample_weight"] = concat(missing_weights, state.fit_kwargs["sample_weight"])

        # Add first instance of missing labels to val set
        if len(missing_in_val) > 0:
            missing_val_indices = []
            for label in missing_in_val:
                label_matches = np.where(label_set_all == label)[0]
                if len(label_matches) > 0 and label_matches[0] < len(first):
                    missing_val_indices.append(first[label_matches[0]])

            if len(missing_val_indices) > 0:
                X_missing_val = (
                    iloc_pandas_on_spark(X_train_all, missing_val_indices)
                    if is_spark_dataframe
                    else X_train_all.iloc[missing_val_indices]
                    if data_is_df
                    else X_train_all[missing_val_indices]
                )
                y_missing_val = (
                    iloc_pandas_on_spark(y_train_all, missing_val_indices)
                    if is_spark_dataframe
                    else y_train_all.iloc[missing_val_indices]
                    if isinstance(y_train_all, (pd.Series, psSeries))
                    else y_train_all[missing_val_indices]
                )
                X_val = concat(X_missing_val, X_val)
                y_val = concat(y_missing_val, y_val) if data_is_df else np.concatenate([y_missing_val, y_val])

                # Handle sample_weight if present
                if (
                    "sample_weight" in state.fit_kwargs
                    and hasattr(state, "weight_val")
                    and state.weight_val is not None
                ):
                    sample_weight_source = (
                        state.sample_weight_all
                        if hasattr(state, "sample_weight_all")
                        else state.fit_kwargs.get("sample_weight")
                    )
                    if sample_weight_source is not None and max(missing_val_indices) < len(sample_weight_source):
                        missing_weights = (
                            sample_weight_source[missing_val_indices]
                            if isinstance(sample_weight_source, np.ndarray)
                            else sample_weight_source.iloc[missing_val_indices]
                        )
                        state.weight_val = concat(missing_weights, state.weight_val)

        return X_train, X_val, y_train, y_val

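A toy illustration of the label bookkeeping used above (plain NumPy with hypothetical arrays), showing how `np.setdiff1d` identifies labels that a split dropped:

```python
import numpy as np

y_all = np.array([0, 1, 2, 2, 1, 0])
y_train, y_val = np.array([0, 1, 0]), np.array([1, 2, 2])  # class 2 never reached train

missing_in_train = np.setdiff1d(np.unique(y_all), np.unique(y_train))
missing_in_val = np.setdiff1d(np.unique(y_all), np.unique(y_val))
print(missing_in_train)  # [2] -> copy its first occurrence into the training set
print(missing_in_val)    # [0] -> copy its first occurrence into the validation set
```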
    def _handle_missing_labels_no_overlap(
        self,
        state,
        X_train,
        X_val,
        y_train,
        y_val,
        X_train_all,
        y_train_all,
        is_spark_dataframe,
        data_is_df,
        split_ratio,
    ):
        """Handle missing labels intelligently to avoid overlap when possible.

        This is the slower but more precise version that:
        - For single-instance classes: Adds to both sets (unavoidable overlap)
        - For multi-instance classes: Re-splits them properly to avoid overlap

        Args:
            state: The state object containing fit parameters
            X_train, X_val: Training and validation features
            y_train, y_val: Training and validation labels
            X_train_all, y_train_all: Complete dataset
            is_spark_dataframe: Whether data is pandas_on_spark
            data_is_df: Whether data is DataFrame/Series
            split_ratio: The ratio for splitting

        Returns:
            Tuple of (X_train, X_val, y_train, y_val) with missing labels handled
        """
        # Check which labels are present in train and val sets
        if is_spark_dataframe:
            label_set_train, _ = unique_pandas_on_spark(y_train)
            label_set_val, _ = unique_pandas_on_spark(y_val)
            label_set_all, first = unique_value_first_index(y_train_all)
        else:
            label_set_all, first = unique_value_first_index(y_train_all)
            label_set_train = np.unique(y_train)
            label_set_val = np.unique(y_val)

        # Find missing labels
        missing_in_train = np.setdiff1d(label_set_all, label_set_train)
        missing_in_val = np.setdiff1d(label_set_all, label_set_val)

        # Handle missing labels intelligently
        # For classes with only 1 instance: add to both sets (unavoidable overlap)
        # For classes with multiple instances: move/split them properly to avoid overlap

        if len(missing_in_train) > 0:
            # Process missing labels in training set
            for label in missing_in_train:
                # Find all indices for this label in the original data
                if is_spark_dataframe:
                    label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist()
                else:
                    label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist()

                num_instances = len(label_indices)

                if num_instances == 1:
                    # Single instance: must add to both train and val (unavoidable overlap)
                    X_single = (
                        iloc_pandas_on_spark(X_train_all, label_indices)
                        if is_spark_dataframe
                        else X_train_all.iloc[label_indices]
                        if data_is_df
                        else X_train_all[label_indices]
                    )
                    y_single = (
                        iloc_pandas_on_spark(y_train_all, label_indices)
                        if is_spark_dataframe
                        else y_train_all.iloc[label_indices]
                        if isinstance(y_train_all, (pd.Series, psSeries))
                        else y_train_all[label_indices]
                    )
                    X_train = concat(X_single, X_train)
                    y_train = concat(y_single, y_train) if data_is_df else np.concatenate([y_single, y_train])

                    # Handle sample_weight
                    if "sample_weight" in state.fit_kwargs:
                        sample_weight_source = (
                            state.sample_weight_all
                            if hasattr(state, "sample_weight_all")
                            else state.fit_kwargs.get("sample_weight")
                        )
                        if sample_weight_source is not None and label_indices[0] < len(sample_weight_source):
                            single_weight = (
                                sample_weight_source[label_indices]
                                if isinstance(sample_weight_source, np.ndarray)
                                else sample_weight_source.iloc[label_indices]
                            )
                            state.fit_kwargs["sample_weight"] = concat(single_weight, state.fit_kwargs["sample_weight"])
                else:
                    # Multiple instances: move some from val to train (no overlap needed)
                    # Calculate how many to move to train (leave at least 1 in val)
                    num_to_train = max(1, min(num_instances - 1, int(num_instances * (1 - split_ratio))))
                    indices_to_move = label_indices[:num_to_train]

                    X_to_move = (
                        iloc_pandas_on_spark(X_train_all, indices_to_move)
                        if is_spark_dataframe
                        else X_train_all.iloc[indices_to_move]
                        if data_is_df
                        else X_train_all[indices_to_move]
                    )
                    y_to_move = (
                        iloc_pandas_on_spark(y_train_all, indices_to_move)
                        if is_spark_dataframe
                        else y_train_all.iloc[indices_to_move]
                        if isinstance(y_train_all, (pd.Series, psSeries))
                        else y_train_all[indices_to_move]
                    )

                    # Add to train
                    X_train = concat(X_to_move, X_train)
                    y_train = concat(y_to_move, y_train) if data_is_df else np.concatenate([y_to_move, y_train])

                    # Remove from val (they are currently all in val)
                    if is_spark_dataframe:
                        val_mask = ~y_val.isin([label])
                        X_val = X_val[val_mask]
                        y_val = y_val[val_mask]
                    else:
                        val_mask = np.asarray(y_val) != label
                        if data_is_df:
                            X_val = X_val[val_mask]
                            y_val = y_val[val_mask]
                        else:
                            X_val = X_val[val_mask]
                            y_val = y_val[val_mask]

                    # Add remaining instances back to val
                    remaining_indices = label_indices[num_to_train:]
                    if len(remaining_indices) > 0:
                        X_remaining = (
                            iloc_pandas_on_spark(X_train_all, remaining_indices)
                            if is_spark_dataframe
                            else X_train_all.iloc[remaining_indices]
                            if data_is_df
                            else X_train_all[remaining_indices]
                        )
                        y_remaining = (
                            iloc_pandas_on_spark(y_train_all, remaining_indices)
                            if is_spark_dataframe
                            else y_train_all.iloc[remaining_indices]
                            if isinstance(y_train_all, (pd.Series, psSeries))
                            else y_train_all[remaining_indices]
                        )
                        X_val = concat(X_remaining, X_val)
                        y_val = concat(y_remaining, y_val) if data_is_df else np.concatenate([y_remaining, y_val])

                    # Handle sample_weight
                    if "sample_weight" in state.fit_kwargs:
                        sample_weight_source = (
                            state.sample_weight_all
                            if hasattr(state, "sample_weight_all")
                            else state.fit_kwargs.get("sample_weight")
                        )
                        if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source):
                            weights_to_move = (
                                sample_weight_source[indices_to_move]
                                if isinstance(sample_weight_source, np.ndarray)
                                else sample_weight_source.iloc[indices_to_move]
                            )
                            state.fit_kwargs["sample_weight"] = concat(
                                weights_to_move, state.fit_kwargs["sample_weight"]
                            )

                        if (
                            len(remaining_indices) > 0
                            and hasattr(state, "weight_val")
                            and state.weight_val is not None
                        ):
                            # Remove and re-add weights for val
                            if isinstance(state.weight_val, np.ndarray):
                                state.weight_val = state.weight_val[val_mask]
                            else:
                                state.weight_val = state.weight_val[val_mask]

                            if max(remaining_indices) < len(sample_weight_source):
                                remaining_weights = (
                                    sample_weight_source[remaining_indices]
                                    if isinstance(sample_weight_source, np.ndarray)
                                    else sample_weight_source.iloc[remaining_indices]
                                )
                                state.weight_val = concat(remaining_weights, state.weight_val)

        if len(missing_in_val) > 0:
            # Process missing labels in validation set
            for label in missing_in_val:
                # Find all indices for this label in the original data
                if is_spark_dataframe:
                    label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist()
                else:
                    label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist()

                num_instances = len(label_indices)

                if num_instances == 1:
                    # Single instance: must add to both train and val (unavoidable overlap)
                    X_single = (
                        iloc_pandas_on_spark(X_train_all, label_indices)
                        if is_spark_dataframe
                        else X_train_all.iloc[label_indices]
                        if data_is_df
                        else X_train_all[label_indices]
                    )
                    y_single = (
                        iloc_pandas_on_spark(y_train_all, label_indices)
                        if is_spark_dataframe
                        else y_train_all.iloc[label_indices]
                        if isinstance(y_train_all, (pd.Series, psSeries))
                        else y_train_all[label_indices]
                    )
                    X_val = concat(X_single, X_val)
                    y_val = concat(y_single, y_val) if data_is_df else np.concatenate([y_single, y_val])

                    # Handle sample_weight
                    if "sample_weight" in state.fit_kwargs and hasattr(state, "weight_val"):
                        sample_weight_source = (
                            state.sample_weight_all
                            if hasattr(state, "sample_weight_all")
                            else state.fit_kwargs.get("sample_weight")
                        )
                        if sample_weight_source is not None and label_indices[0] < len(sample_weight_source):
                            single_weight = (
                                sample_weight_source[label_indices]
                                if isinstance(sample_weight_source, np.ndarray)
                                else sample_weight_source.iloc[label_indices]
                            )
                            if state.weight_val is not None:
                                state.weight_val = concat(single_weight, state.weight_val)
                else:
                    # Multiple instances: move some from train to val (no overlap needed)
                    # Calculate how many to move to val (leave at least 1 in train)
                    num_to_val = max(1, min(num_instances - 1, int(num_instances * split_ratio)))
                    indices_to_move = label_indices[:num_to_val]

                    X_to_move = (
                        iloc_pandas_on_spark(X_train_all, indices_to_move)
                        if is_spark_dataframe
                        else X_train_all.iloc[indices_to_move]
                        if data_is_df
                        else X_train_all[indices_to_move]
                    )
                    y_to_move = (
                        iloc_pandas_on_spark(y_train_all, indices_to_move)
                        if is_spark_dataframe
                        else y_train_all.iloc[indices_to_move]
                        if isinstance(y_train_all, (pd.Series, psSeries))
                        else y_train_all[indices_to_move]
                    )

                    # Add to val
                    X_val = concat(X_to_move, X_val)
                    y_val = concat(y_to_move, y_val) if data_is_df else np.concatenate([y_to_move, y_val])

                    # Remove from train (they are currently all in train)
                    if is_spark_dataframe:
                        train_mask = ~y_train.isin([label])
                        X_train = X_train[train_mask]
                        y_train = y_train[train_mask]
                    else:
                        train_mask = np.asarray(y_train) != label
                        if data_is_df:
                            X_train = X_train[train_mask]
                            y_train = y_train[train_mask]
                        else:
                            X_train = X_train[train_mask]
                            y_train = y_train[train_mask]

                    # Add remaining instances back to train
                    remaining_indices = label_indices[num_to_val:]
                    if len(remaining_indices) > 0:
                        X_remaining = (
                            iloc_pandas_on_spark(X_train_all, remaining_indices)
                            if is_spark_dataframe
                            else X_train_all.iloc[remaining_indices]
                            if data_is_df
                            else X_train_all[remaining_indices]
                        )
                        y_remaining = (
                            iloc_pandas_on_spark(y_train_all, remaining_indices)
                            if is_spark_dataframe
                            else y_train_all.iloc[remaining_indices]
                            if isinstance(y_train_all, (pd.Series, psSeries))
                            else y_train_all[remaining_indices]
                        )
                        X_train = concat(X_remaining, X_train)
                        y_train = concat(y_remaining, y_train) if data_is_df else np.concatenate([y_remaining, y_train])

                    # Handle sample_weight
                    if "sample_weight" in state.fit_kwargs:
                        sample_weight_source = (
                            state.sample_weight_all
                            if hasattr(state, "sample_weight_all")
                            else state.fit_kwargs.get("sample_weight")
                        )
                        if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source):
                            weights_to_move = (
                                sample_weight_source[indices_to_move]
                                if isinstance(sample_weight_source, np.ndarray)
                                else sample_weight_source.iloc[indices_to_move]
                            )
                            if hasattr(state, "weight_val") and state.weight_val is not None:
                                state.weight_val = concat(weights_to_move, state.weight_val)

                        if len(remaining_indices) > 0:
                            # Remove and re-add weights for train
                            if isinstance(state.fit_kwargs["sample_weight"], np.ndarray):
                                state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask]
                            else:
                                state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask]

                            if max(remaining_indices) < len(sample_weight_source):
                                remaining_weights = (
                                    sample_weight_source[remaining_indices]
                                    if isinstance(sample_weight_source, np.ndarray)
                                    else sample_weight_source.iloc[remaining_indices]
                                )
                                state.fit_kwargs["sample_weight"] = concat(
                                    remaining_weights, state.fit_kwargs["sample_weight"]
                                )

        return X_train, X_val, y_train, y_val

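A worked instance of the re-split size used in the multi-instance branches above (hypothetical numbers):

```python
# With 5 instances of a class and split_ratio = 0.2, the no-overlap strategy keeps
# 4 instances in train and leaves 1 for validation:
num_instances, split_ratio = 5, 0.2
num_to_train = max(1, min(num_instances - 1, int(num_instances * (1 - split_ratio))))
print(num_to_train)  # 4
```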
    def prepare_data(
        self,
        state,
@@ -377,6 +836,7 @@ class GenericTask(Task):
        n_splits,
        data_is_df,
        sample_weight_full,
        allow_label_overlap=True,
    ) -> int:
        X_val, y_val = state.X_val, state.y_val
        if issparse(X_val):
@@ -442,8 +902,8 @@ class GenericTask(Task):
            X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
            if data_is_df:
                X_train_all.reset_index(drop=True, inplace=True)
                if isinstance(y_train_all, pd.Series):
                    y_train_all.reset_index(drop=True, inplace=True)
            if isinstance(y_train_all, pd.Series):
                y_train_all.reset_index(drop=True, inplace=True)

        X_train, y_train = X_train_all, y_train_all
        state.groups_all = state.groups
@@ -505,59 +965,46 @@ class GenericTask(Task):
        elif self.is_classification():
            # for classification, make sure the labels are complete in both
            # training and validation data
            label_set, first = unique_value_first_index(y_train_all)
            rest = []
            last = 0
            first.sort()
            for i in range(len(first)):
                rest.extend(range(last, first[i]))
                last = first[i] + 1
            rest.extend(range(last, len(y_train_all)))
            X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
            if len(first) < len(y_train_all) / 2:
                # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
                X_rest = (
                    np.delete(X_train_all, first, axis=0)
                    if isinstance(X_train_all, np.ndarray)
                    else X_train_all.drop(first.tolist())
                    if data_is_df
                    else X_train_all[rest]
                )
                y_rest = (
                    np.delete(y_train_all, first, axis=0)
                    if isinstance(y_train_all, np.ndarray)
                    else y_train_all.drop(first.tolist())
                    if data_is_df
                    else y_train_all[rest]
            stratify = y_train_all if split_type == "stratified" else None
            X_train, X_val, y_train, y_val = self._train_test_split(
                state, X_train_all, y_train_all, split_ratio=split_ratio, stratify=stratify
            )

            # Handle missing labels using the appropriate strategy
            if allow_label_overlap:
                # Fast version: adds first instance to set with missing label (may create overlap)
                X_train, X_val, y_train, y_val = self._handle_missing_labels_fast(
                    state,
                    X_train,
                    X_val,
                    y_train,
                    y_val,
                    X_train_all,
                    y_train_all,
                    is_spark_dataframe,
                    data_is_df,
                )
            else:
                X_rest = (
                    iloc_pandas_on_spark(X_train_all, rest)
                    if is_spark_dataframe
                    else X_train_all.iloc[rest]
                    if data_is_df
                    else X_train_all[rest]
                # Precise version: avoids overlap when possible (slower)
                X_train, X_val, y_train, y_val = self._handle_missing_labels_no_overlap(
                    state,
                    X_train,
                    X_val,
                    y_train,
                    y_val,
                    X_train_all,
                    y_train_all,
                    is_spark_dataframe,
                    data_is_df,
                    split_ratio,
                )
                y_rest = (
                    iloc_pandas_on_spark(y_train_all, rest)
                    if is_spark_dataframe
                    else y_train_all.iloc[rest]
                    if data_is_df
                    else y_train_all[rest]
                )
                stratify = y_rest if split_type == "stratified" else None
                X_train, X_val, y_train, y_val = self._train_test_split(
                    state, X_rest, y_rest, first, rest, split_ratio, stratify
                )
                X_train = concat(X_first, X_train)
                y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
                X_val = concat(X_first, X_val)
                y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])

            if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
                y_train = y_train[y_train.columns[0]]
                y_val = y_val[y_val.columns[0]]
                y_train.name = y_val.name = y_rest.name
                # Only set name if y_train_all is a Series (not a DataFrame)
                if isinstance(y_train_all, (pd.Series, psSeries)):
                    y_train.name = y_val.name = y_train_all.name

        elif self.is_regression():
            X_train, X_val, y_train, y_val = self._train_test_split(
@@ -746,7 +1193,10 @@ class GenericTask(Task):
        elif isinstance(kf, TimeSeriesSplit):
            kf = kf.split(X_train_split, y_train_split)
        else:
            kf = kf.split(X_train_split)
            try:
                kf = kf.split(X_train_split)
            except TypeError:
                kf = kf.split(X_train_split, y_train_split)

        for train_index, val_index in kf:
            if shuffle:
@@ -769,10 +1219,10 @@ class GenericTask(Task):
            if not is_spark_dataframe:
                y_train, y_val = y_train_split[train_index], y_train_split[val_index]
                if weight is not None:
                    fit_kwargs["sample_weight"], weight_val = (
                        weight[train_index],
                        weight[val_index],
                    fit_kwargs["sample_weight"] = (
                        weight[train_index] if isinstance(weight, np.ndarray) else weight.iloc[train_index]
                    )
                    weight_val = weight[val_index] if isinstance(weight, np.ndarray) else weight.iloc[val_index]
                if groups is not None:
                    fit_kwargs["groups"] = (
                        groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]

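A sketch of the splitter-compatibility issue the new try/except handles (assumes scikit-learn is installed): `KFold.split` needs only `X`, while `StratifiedKFold.split` requires `y` and raises `TypeError` without it.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X, y = np.arange(20).reshape(10, 2), np.array([0, 1] * 5)
for splitter in (KFold(n_splits=2), StratifiedKFold(n_splits=2)):
    try:
        folds = list(splitter.split(X))  # works for KFold
    except TypeError:
        folds = list(splitter.split(X, y))  # StratifiedKFold needs labels
    print(type(splitter).__name__, len(folds))
```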
@@ -192,7 +192,7 @@ class Task(ABC):
                * Valid str options depend on different tasks.
                For classification tasks, valid choices are
                ["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
                For regression tasks, valid choices are ["auto", 'uniform', 'time'].
                For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
                "auto" -> uniform.
                For time series forecast tasks, must be "auto" or 'time'.
                For ranking task, must be "auto" or 'group'.

@@ -151,7 +151,7 @@ class TimeSeriesTask(Task):
            raise ValueError("Must supply either X_train_all and y_train_all, or dataframe and label")

        try:
            dataframe[self.time_col] = pd.to_datetime(dataframe[self.time_col])
            dataframe.loc[:, self.time_col] = pd.to_datetime(dataframe[self.time_col])
        except Exception:
            raise ValueError(
                f"For '{TS_FORECAST}' task, time column {self.time_col} must contain timestamp values."
@@ -386,9 +386,8 @@ class TimeSeriesTask(Task):
        return X

    def preprocess(self, X, transformer=None):
        if isinstance(X, pd.DataFrame) or isinstance(X, np.ndarray) or isinstance(X, pd.Series):
            X = X.copy()
            X = normalize_ts_data(X, self.target_names, self.time_col)
        if isinstance(X, (pd.DataFrame, np.ndarray, pd.Series)):
            X = normalize_ts_data(X.copy(), self.target_names, self.time_col)
            return self._preprocess(X, transformer)
        elif isinstance(X, int):
            return X
@@ -529,7 +528,7 @@ def remove_ts_duplicates(
    duplicates = X.duplicated()

    if any(duplicates):
        logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, X][time_col]}")
        logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, time_col]}")
        X = X.drop_duplicates()
        logger.warning("Removed duplicate rows based on all columns")
    assert (

@@ -17,24 +17,30 @@ from sklearn.preprocessing import StandardScaler


def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int):
    """Transform input data X, y into autoregressive form - shift
    them appropriately based on horizon and create `lags` columns.
    """Transform input data X, y into autoregressive form by creating `lags` columns.

    This function is called automatically by FLAML during the training process
    to convert time series data into a format suitable for sklearn-based regression
    models (e.g., lgbm, rf, xgboost). Users do NOT need to manually call this function
    or create lagged features themselves.

    Parameters
    ----------
    X : pandas.DataFrame
        Input features.
        Input feature DataFrame, which may contain temporal features and/or exogenous variables.

    y : array_like, (1d)
        Target vector.
        Target vector (time series values to forecast).

    horizon : int
        length of X for `predict` method
    lags : int
        Number of lagged time steps to use as features.

    Returns
    -------
    pandas.DataFrame
        shifted dataframe with `lags` columns
        Shifted dataframe with `lags` columns for each original feature.
        The target variable y is also lagged to prevent data leakage
        (i.e., we use y(t-1), y(t-2), ..., y(t-lags) to predict y(t)).
    """
    lag_features = []

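To make the autoregressive shape concrete, here is a toy illustration of the lagging idea in plain pandas — not FLAML's exact implementation or column naming:

```python
import pandas as pd

y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="y")
lags = 2
# Each row t keeps y(t-1) and y(t-2) as features; the first `lags` rows are dropped.
lagged = pd.concat({f"y_lag{k}": y.shift(k) for k in range(1, lags + 1)}, axis=1).dropna()
print(lagged)
#    y_lag1  y_lag2
# 2    11.0    10.0
# 3    12.0    11.0
# 4    13.0    12.0
```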
@@ -55,6 +61,17 @@ def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int):


class SklearnWrapper:
    """Wrapper class for using sklearn-based models for time series forecasting.

    This wrapper automatically handles the transformation of time series data into
    a supervised learning format by creating lagged features. It trains separate
    models for each step in the forecast horizon.

    Users typically don't interact with this class directly - it's used internally
    by FLAML when sklearn-based estimators (lgbm, rf, xgboost, etc.) are selected
    for time series forecasting tasks.
    """

    def __init__(
        self,
        model_class: type,
@@ -76,6 +93,8 @@ class SklearnWrapper:
        self.pca = None

    def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
        if "is_retrain" in kwargs:
            kwargs.pop("is_retrain")
        self._X = X
        self._y = y

@@ -92,7 +111,14 @@ class SklearnWrapper:

        for i, model in enumerate(self.models):
            offset = i + self.lags
            model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
            if len(X) - offset > 2:
                # A series of effective length 2 can hit "All features are either constant or ignored."
                # TODO: see why the non-constant features are ignored. Selector?
                model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
            elif len(X) > offset and "catboost" not in str(model).lower():
                model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
            else:
                print("[INFO]: Length of data should be longer than period + lags.")
        return self

    def predict(self, X, X_train=None, y_train=None):

@@ -264,7 +264,8 @@ class TCNEstimator(TimeSeriesEstimator):
    def predict(self, X):
        X = self.enrich(X)
        if isinstance(X, TimeSeriesDataset):
            df = X.X_val
            # Use X_train if X_val is empty (e.g., when computing training metrics)
            df = X.X_val if len(X.test_data) > 0 else X.X_train
        else:
            df = X
        dataset = DataframeDataset(

@@ -1,3 +1,4 @@
import inspect
import time

try:
@@ -106,12 +107,17 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
    def fit(self, X_train, y_train, budget=None, **kwargs):
        import warnings

        import pytorch_lightning as pl
        try:
            import lightning.pytorch as pl
            from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
            from lightning.pytorch.loggers import TensorBoardLogger
        except ImportError:
            import pytorch_lightning as pl
            from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
            from pytorch_lightning.loggers import TensorBoardLogger
        import torch
        from pytorch_forecasting import TemporalFusionTransformer
        from pytorch_forecasting.metrics import QuantileLoss
        from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
        from pytorch_lightning.loggers import TensorBoardLogger

        # a bit of monkey patching to fix the MacOS test
        # all the log_prediction method appears to do is plot stuff, which seems to break GitHub tests
@@ -132,12 +138,26 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
        lr_logger = LearningRateMonitor()  # log the learning rate
        logger = TensorBoardLogger(kwargs.get("log_dir", "lightning_logs"))  # logging results to a tensorboard
        default_trainer_kwargs = dict(
            gpus=self._kwargs.get("gpu_per_trial", [0]) if torch.cuda.is_available() else None,
            max_epochs=max_epochs,
            gradient_clip_val=gradient_clip_val,
            callbacks=[lr_logger, early_stop_callback],
            logger=logger,
        )

        # PyTorch Lightning >=2.0 replaced `gpus` with `accelerator`/`devices`.
        # Also, passing `gpus=None` is not accepted on newer versions.
        trainer_sig_params = inspect.signature(pl.Trainer.__init__).parameters
        if torch.cuda.is_available() and "gpus" in trainer_sig_params:
            gpus = self._kwargs.get("gpu_per_trial", None)
            if gpus is not None:
                default_trainer_kwargs["gpus"] = gpus
        elif torch.cuda.is_available() and "devices" in trainer_sig_params:
            devices = self._kwargs.get("gpu_per_trial", None)
            if devices == -1:
                devices = "auto"
            if devices is not None:
                default_trainer_kwargs["accelerator"] = "gpu"
                default_trainer_kwargs["devices"] = devices
        trainer = pl.Trainer(
            **default_trainer_kwargs,
        )
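The signature-probing pattern above generalizes beyond Lightning. A self-contained sketch with hypothetical stand-in classes (no Lightning required to run it):

```python
import inspect


class LegacyTrainer:
    def __init__(self, gpus=None, max_epochs=10): ...


class ModernTrainer:
    def __init__(self, accelerator="auto", devices="auto", max_epochs=10): ...


def gpu_kwargs(trainer_cls, n_gpus):
    params = inspect.signature(trainer_cls.__init__).parameters
    if "gpus" in params:  # Lightning < 2.0 style
        return {"gpus": n_gpus}
    return {"accelerator": "gpu", "devices": n_gpus}  # Lightning >= 2.0 style


print(gpu_kwargs(LegacyTrainer, 1))  # {'gpus': 1}
print(gpu_kwargs(ModernTrainer, 1))  # {'accelerator': 'gpu', 'devices': 1}
```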
@@ -157,7 +177,14 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
            val_dataloaders=val_dataloader,
        )
        best_model_path = trainer.checkpoint_callback.best_model_path
        best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
        # PyTorch 2.6 changed `torch.load` default `weights_only` from False -> True.
        # Some Lightning checkpoints (including those produced here) can require full unpickling.
        # This path is generated locally during training, so it's trusted.
        load_sig_params = inspect.signature(TemporalFusionTransformer.load_from_checkpoint).parameters
        if "weights_only" in load_sig_params:
            best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path, weights_only=False)
        else:
            best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
        train_time = time.time() - current_time
        self._model = best_tft
        return train_time
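The same guard works for a bare `torch.load` — a sketch, assuming `torch` is installed and the checkpoint file is locally produced and therefore trusted:

```python
import inspect

import torch


def load_trusted_checkpoint(path):
    # torch >= 2.6 defaults weights_only=True; pass False explicitly when the
    # checkpoint needs full unpickling and comes from a trusted local source.
    if "weights_only" in inspect.signature(torch.load).parameters:
        return torch.load(path, weights_only=False)
    return torch.load(path)
```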
@@ -170,7 +197,11 @@ class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
        last_data_cols = self.group_ids.copy()
        last_data_cols.append(self.target_names[0])
        last_data = self.data[lambda x: x.time_idx == x.time_idx.max()][last_data_cols]
        decoder_data = X.X_val if isinstance(X, TimeSeriesDataset) else X
        # Use X_train if test_data is empty (e.g., when computing training metrics)
        if isinstance(X, TimeSeriesDataset):
            decoder_data = X.X_val if len(X.test_data) > 0 else X.X_train
        else:
            decoder_data = X
        if "time_idx" not in decoder_data:
            decoder_data = add_time_idx_col(decoder_data)
        decoder_data["time_idx"] += encoder_data["time_idx"].max() + 1 - decoder_data["time_idx"].min()

@@ -9,6 +9,7 @@ import numpy as np
try:
    import pandas as pd
    from pandas import DataFrame, Series, to_datetime
    from pandas.api.types import is_datetime64_any_dtype
    from scipy.sparse import issparse
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
@@ -120,7 +121,12 @@ class TimeSeriesDataset:

    @property
    def X_all(self) -> pd.DataFrame:
        return pd.concat([self.X_train, self.X_val], axis=0)
        # Remove empty or all-NA columns before concatenation
        X_train_filtered = self.X_train.dropna(axis=1, how="all")
        X_val_filtered = self.X_val.dropna(axis=1, how="all")

        # Concatenate the filtered DataFrames
        return pd.concat([X_train_filtered, X_val_filtered], axis=0)

    @property
    def y_train(self) -> pd.DataFrame:
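A toy demonstration of the filter-then-concat idea with hypothetical frames (plain pandas; dropping all-NA columns first keeps the concatenation free of the dtype-inference warnings newer pandas emits for empty/all-NA entries):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "dead": [None, None]})
b = pd.DataFrame({"x": [3], "dead": [None]})
out = pd.concat([a.dropna(axis=1, how="all"), b.dropna(axis=1, how="all")], axis=0)
print(out)
#    x
# 0  1
# 1  2
# 0  3
```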
@@ -392,6 +398,15 @@ class DataTransformerTS:
        assert len(self.num_columns) == 0, "Trying to call fit() twice, something is wrong"

        for column in X.columns:
            # Never treat the time column as a feature for sklearn preprocessing
            if column == self.time_col:
                continue

            # Robust datetime detection (covers datetime64[ms/us/ns], tz-aware, etc.)
            if is_datetime64_any_dtype(X[column]):
                self.datetime_columns.append(column)
                continue

            # sklearn/utils/validation.py needs int/float values
            if X[column].dtype.name in ("object", "category", "string"):
                if (
@@ -462,7 +477,7 @@ class DataTransformerTS:
            if "__NAN__" not in X[col].cat.categories:
                X[col] = X[col].cat.add_categories("__NAN__").fillna("__NAN__")
        else:
            X[col] = X[col].fillna("__NAN__")
            X[col] = X[col].fillna("__NAN__").infer_objects(copy=False)
            X[col] = X[col].astype("category")

        for column in self.num_columns:
@@ -531,14 +546,12 @@ def normalize_ts_data(X_train_all, target_names, time_col, y_train_all=None):


def validate_data_basic(X_train_all, y_train_all):
    assert isinstance(X_train_all, np.ndarray) or issparse(X_train_all) or isinstance(X_train_all, pd.DataFrame), (
        "X_train_all must be a numpy array, a pandas dataframe, " "or Scipy sparse matrix."
    )
    assert isinstance(X_train_all, (np.ndarray, DataFrame)) or issparse(
        X_train_all
    ), "X_train_all must be a numpy array, a pandas dataframe, or Scipy sparse matrix."

    assert (
        isinstance(y_train_all, np.ndarray)
        or isinstance(y_train_all, pd.Series)
        or isinstance(y_train_all, pd.DataFrame)
    assert isinstance(
        y_train_all, (np.ndarray, pd.Series, pd.DataFrame)
    ), "y_train_all must be a numpy array or a pandas series or DataFrame."

    assert X_train_all.size != 0 and y_train_all.size != 0, "Input data must not be empty, use None if no data"

@@ -194,7 +194,13 @@ class Orbit(TimeSeriesEstimator):

    elif isinstance(X, TimeSeriesDataset):
        data = X
        X = data.test_data[[self.time_col] + X.regressors]
        # By default we predict on the dataset's test partition.
        # Some internal call paths (e.g., training-metric logging) may pass a
        # dataset whose test partition is empty; fall back to train partition.
        if data.test_data is not None and len(data.test_data):
            X = data.test_data[data.regressors + [data.time_col]]
        else:
            X = data.train_data[data.regressors + [data.time_col]]

    if self._model is not None:
        forecast = self._model.predict(X, **kwargs)
@@ -301,7 +307,13 @@ class Prophet(TimeSeriesEstimator):

        if isinstance(X, TimeSeriesDataset):
            data = X
            X = data.test_data[data.regressors + [data.time_col]]
            # By default we predict on the dataset's test partition.
            # Some internal call paths (e.g., training-metric logging) may pass a
            # dataset whose test partition is empty; fall back to train partition.
            if data.test_data is not None and len(data.test_data):
                X = data.test_data[data.regressors + [data.time_col]]
            else:
                X = data.train_data[data.regressors + [data.time_col]]

        X = X.rename(columns={self.time_col: "ds"})
        if self._model is not None:
@@ -327,11 +339,19 @@ class StatsModelsEstimator(TimeSeriesEstimator):

        if isinstance(X, TimeSeriesDataset):
            data = X
            X = data.test_data[data.regressors + [data.time_col]]
            # By default we predict on the dataset's test partition.
            # Some internal call paths (e.g., training-metric logging) may pass a
            # dataset whose test partition is empty; fall back to train partition.
            if data.test_data is not None and len(data.test_data):
                X = data.test_data[data.regressors + [data.time_col]]
            else:
                X = data.train_data[data.regressors + [data.time_col]]
        else:
            X = X[self.regressors + [self.time_col]]

        if isinstance(X, DataFrame):
            if X.shape[0] == 0:
                return pd.Series([], name=self.target_names[0], dtype=float)
            start = X[self.time_col].iloc[0]
            end = X[self.time_col].iloc[-1]
            if len(self.regressors):
@@ -829,6 +849,13 @@ class TS_SKLearn(TimeSeriesEstimator):
        if isinstance(X, TimeSeriesDataset):
            data = X
            X = data.test_data
            # By default we predict on the dataset's test partition.
            # Some internal call paths (e.g., training-metric logging) may pass a
            # dataset whose test partition is empty; fall back to train partition.
            if data.test_data is not None and len(data.test_data):
                X = data.test_data
            else:
                X = data.train_data

        if self._model is not None:
            X = X[self.regressors]

@@ -95,6 +95,27 @@ def flamlize_estimator(super_class, name: str, task: str, alternatives=None):
        def fit(self, X, y, *args, **params):
            hyperparams, estimator_name, X, y_transformed = self.suggest_hyperparams(X, y)
            self.set_params(**hyperparams)

            # Transform eval_set if present
            if "eval_set" in params and params["eval_set"] is not None:
                transformed_eval_set = []
                for eval_X, eval_y in params["eval_set"]:
                    # Transform features
                    eval_X_transformed = self._feature_transformer.transform(eval_X)
                    # Transform labels if applicable
                    if self._label_transformer and estimator_name in [
                        "rf",
                        "extra_tree",
                        "xgboost",
                        "xgb_limitdepth",
                        "choose_xgb",
                    ]:
                        eval_y_transformed = self._label_transformer.transform(eval_y)
                        transformed_eval_set.append((eval_X_transformed, eval_y_transformed))
                    else:
                        transformed_eval_set.append((eval_X_transformed, eval_y))
                params["eval_set"] = transformed_eval_set

            if self._label_transformer and estimator_name in [
                "rf",
                "extra_tree",

@@ -32,6 +32,7 @@ def construct_portfolio(regret_matrix, meta_features, regret_bound):
    if meta_features is not None:
        scaler = RobustScaler()
        meta_features = meta_features.loc[tasks]
        meta_features = meta_features.astype(float)
        meta_features.loc[:, :] = scaler.fit_transform(meta_features)
    nearest_task = {}
    for t in tasks:

@@ -26,6 +26,7 @@ def config_predictor_tuple(tasks, configs, meta_features, regret_matrix):
    # pre-processing
    scaler = RobustScaler()
    meta_features_norm = meta_features.loc[tasks]  # this makes a copy
    meta_features_norm = meta_features_norm.astype(float)
    meta_features_norm.loc[:, :] = scaler.fit_transform(meta_features_norm)

    proc = {

@@ -1,10 +1,14 @@
import atexit
import functools
import json
import logging
import os
import pickle
import random
import sys
import tempfile
import time
import warnings
from concurrent.futures import ThreadPoolExecutor, wait
from typing import MutableMapping

import mlflow
@@ -12,14 +16,15 @@ import pandas as pd
from mlflow.entities import Metric, Param, RunTag
from mlflow.exceptions import MlflowException
from mlflow.utils.autologging_utils import AUTOLOGGING_INTEGRATIONS, autologging_is_disabled
from packaging.requirements import Requirement
from scipy.sparse import issparse
from sklearn import tree

try:
    from pyspark.ml import Pipeline as SparkPipeline
    from pyspark.ml import PipelineModel as SparkPipelineModel
except ImportError:

    class SparkPipeline:
    class SparkPipelineModel:
        pass


@@ -32,6 +37,84 @@ from flaml.version import __version__

SEARCH_MAX_RESULTS = 5000  # Each train should not have more than 5000 trials
IS_RENAME_CHILD_RUN = os.environ.get("FLAML_IS_RENAME_CHILD_RUN", "false").lower() == "true"
REMOVE_REQUIREMENT_LIST = [
    "synapseml-cognitive",
    "synapseml-core",
    "synapseml-deep-learning",
    "synapseml-internal",
    "synapseml-mlflow",
    "synapseml-opencv",
    "synapseml-vw",
    "synapseml-lightgbm",
    "synapseml-utils",
    "nni",
    "optuna",
]
OPTIONAL_REMOVE_REQUIREMENT_LIST = ["pytorch-lightning", "transformers"]

os.environ["MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR"] = os.environ.get("MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR", "false")

MLFLOW_NUM_WORKERS = int(os.environ.get("FLAML_MLFLOW_NUM_WORKERS", os.cpu_count() * 4 if os.cpu_count() else 2))
executor = ThreadPoolExecutor(max_workers=MLFLOW_NUM_WORKERS)
atexit.register(lambda: executor.shutdown(wait=True))

IS_CLEAN_LOGS = os.environ.get("FLAML_IS_CLEAN_LOGS", "1")
if IS_CLEAN_LOGS == "1":
    logging.getLogger("synapse.ml").setLevel(logging.CRITICAL)
    logging.getLogger("mlflow.utils").setLevel(logging.CRITICAL)
    logging.getLogger("mlflow.utils.environment").setLevel(logging.CRITICAL)
    logging.getLogger("mlflow.models.model").setLevel(logging.CRITICAL)
    warnings.simplefilter("ignore", category=FutureWarning)
    warnings.simplefilter("ignore", category=UserWarning)


def convert_requirement(requirement_list: list[str]):
    ret = (
        [Requirement(s.strip().lower()) for s in requirement_list]
        if mlflow.__version__ <= "2.17.0"
        else requirement_list
    )
    return ret


def time_it(func_or_code=None):
    """
    Decorator or function that measures execution time.

    Can be used in three ways:
    1. As a decorator with no arguments: @time_it
    2. As a decorator with arguments: @time_it()
    3. As a function call with a string of code to execute and time: time_it("some_code()")

    Args:
        func_or_code (callable or str, optional): Either a function to decorate or
            a string of code to execute and time.

    Returns:
        callable or None: Returns a decorated function if used as a decorator,
            or None if used to execute a string of code.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            logger.debug(f"Execution of {func.__name__} took {end_time - start_time:.4f} seconds")
            return result

        return wrapper

    if callable(func_or_code):
        return decorator(func_or_code)
    elif func_or_code is None:
        return decorator
    else:
        start_time = time.time()
        exec(func_or_code)
        end_time = time.time()
        logger.debug(f"Execution\n```\n{func_or_code}\n```\ntook {end_time - start_time:.4f} seconds")

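To make the three call patterns above concrete, a minimal usage sketch (the decorated functions and the timed expression are hypothetical; timing is emitted at DEBUG level):

```python
@time_it
def load_data():  # pattern 1: bare decorator
    return list(range(1000))

@time_it()
def train_model():  # pattern 2: decorator with parentheses
    return sum(range(10**6))

load_data()
train_model()
time_it("sum(range(10**6))")  # pattern 3: execute a code string via exec() and time it
```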
def flatten_dict(d: MutableMapping, sep: str = ".") -> MutableMapping:
@@ -49,23 +132,28 @@ def is_autolog_enabled():
    return not all(autologging_is_disabled(k) for k in AUTOLOGGING_INTEGRATIONS.keys())


def get_mlflow_log_latency(model_history=False):
def get_mlflow_log_latency(model_history=False, delete_run=True):
    try:
        FLAML_MLFLOW_LOG_LATENCY = float(os.getenv("FLAML_MLFLOW_LOG_LATENCY", 0))
    except ValueError:
        FLAML_MLFLOW_LOG_LATENCY = 0
    if FLAML_MLFLOW_LOG_LATENCY >= 0.1:
        return FLAML_MLFLOW_LOG_LATENCY
    st = time.time()
    with mlflow.start_run(nested=True, run_name="get_mlflow_log_latency") as run:
        if model_history:
            sk_model = tree.DecisionTreeClassifier()
            mlflow.sklearn.log_model(sk_model, "sk_models")
            mlflow.sklearn.log_model(Pipeline([("estimator", sk_model)]), "sk_pipeline")
            mlflow.sklearn.log_model(sk_model, "model")
        with tempfile.TemporaryDirectory() as tmpdir:
            pickle_fpath = os.path.join(tmpdir, f"tmp_{int(time.time()*1000)}")
            pickle_fpath = os.path.join(tmpdir, f"tmp_{int(time.time() * 1000)}")
            with open(pickle_fpath, "wb") as f:
                pickle.dump(sk_model, f)
            mlflow.log_artifact(pickle_fpath, "sk_model1")
            mlflow.log_artifact(pickle_fpath, "sk_model2")
            mlflow.log_artifact(pickle_fpath, "sk_model")
        mlflow.set_tag("synapseml.ui.visible", "false")  # not shown inline in fabric
        mlflow.delete_run(run.info.run_id)
        if delete_run:
            mlflow.delete_run(run.info.run_id)
    et = time.time()
    return et - st
    return 3 * (et - st)


def infer_signature(X_train=None, y_train=None, dataframe=None, label=None):
@@ -98,12 +186,76 @@ def infer_signature(X_train=None, y_train=None, dataframe=None, label=None):
    )


def update_and_install_requirements(
    run_id=None,
    model_name=None,
    model_version=None,
    remove_list=None,
    artifact_path="model",
    dst_path=None,
    install_with_ipython=False,
):
    if not (run_id or (model_name and model_version)):
        raise ValueError(
            "Please provide `run_id` or both `model_name` and `model_version`. If all three are provided, `run_id` will be used."
        )

    if install_with_ipython:
        from IPython import get_ipython

    if not remove_list:
        remove_list = [
            "synapseml-cognitive",
            "synapseml-core",
            "synapseml-deep-learning",
            "synapseml-internal",
            "synapseml-mlflow",
            "synapseml-opencv",
            "synapseml-vw",
            "synapseml-lightgbm",
            "synapseml-utils",
            "flaml",  # flaml is needed for AutoML models, should be pre-installed in the runtime
            "pyspark",  # fabric internal pyspark should be pre-installed in the runtime
        ]

    # Download model artifacts
    client = mlflow.MlflowClient()
    if not run_id:
        run_id = client.get_model_version(model_name, model_version).run_id
    if not dst_path:
        dst_path = os.path.join(tempfile.gettempdir(), "model_artifacts")
    os.makedirs(dst_path, exist_ok=True)
    client.download_artifacts(run_id, artifact_path, dst_path)
    requirements_path = os.path.join(dst_path, artifact_path, "requirements.txt")
    with open(requirements_path) as f:
        reqs = f.read().splitlines()
    old_reqs = [Requirement(req) for req in reqs if req]
    old_reqs_dict = {req.name: str(req) for req in old_reqs}
    for req in remove_list:
        req = Requirement(req)
        if req.name in old_reqs_dict:
            old_reqs_dict.pop(req.name, None)
    new_reqs_list = list(old_reqs_dict.values())

    with open(requirements_path, "w") as f:
        f.write("\n".join(new_reqs_list))

    if install_with_ipython:
        get_ipython().run_line_magic("pip", f"install -r {requirements_path} -q")
    else:
        logger.info(f"You can run `pip install -r {requirements_path}` to install dependencies.")
    return requirements_path

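A hedged usage sketch of the helper above (the run id is a placeholder; in a Fabric/Jupyter notebook, passing `install_with_ipython=True` runs `%pip install` on the cleaned file directly):

```python
# Download a logged model's requirements.txt, drop runtime-provided packages
# (synapseml-*, flaml, pyspark, ...), and rewrite the file in place.
requirements_path = update_and_install_requirements(
    run_id="0123456789abcdef",  # placeholder MLflow run id
    artifact_path="model",
)
print(f"pip install -r {requirements_path}")
```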
def _mlflow_wrapper(evaluation_func, mlflow_exp_id, mlflow_config=None, extra_tags=None, autolog=False):
    def wrapped(*args, **kwargs):
        if mlflow_config is not None:
            from synapse.ml.mlflow import set_mlflow_env_config
            try:
                from synapse.ml.mlflow import set_mlflow_env_config

                set_mlflow_env_config(mlflow_config)
                set_mlflow_env_config(mlflow_config)
            except Exception:
                pass
        import mlflow

        if mlflow_exp_id is not None:
@@ -124,7 +276,20 @@ def _mlflow_wrapper(evaluation_func, mlflow_exp_id, mlflow_config=None, extra_ta


def _get_notebook_name():
    return None
    try:
        import re

        from synapse.ml.mlflow import get_mlflow_env_config
        from synapse.ml.mlflow.shared_platform_utils import get_artifact

        notebook_id = get_mlflow_env_config(False).artifact_id
        current_notebook = get_artifact(notebook_id)
        notebook_name = re.sub("\\W+", "-", current_notebook.displayName).strip()

        return notebook_name
    except Exception as e:
        logger.debug(f"Failed to get notebook name: {e}")
        return None

def safe_json_dumps(obj):
@@ -163,6 +328,8 @@ class MLflowIntegration:
        self.has_model = False
        self.only_history = False
        self._do_log_model = True
        self.futures = {}
        self.futures_log_model = {}

        self.extra_tag = (
            extra_tag
@@ -170,6 +337,9 @@ class MLflowIntegration:
            else {"extra_tag.sid": f"flaml_{__version__}_{int(time.time())}_{random.randint(1001, 9999)}"}
        )
        self.start_time = time.time()
        self.experiment_type = experiment_type
        self.update_autolog_state()

        self.mlflow_client = mlflow.tracking.MlflowClient()
        parent_run_info = mlflow.active_run().info if mlflow.active_run() is not None else None
        if parent_run_info:
@@ -188,8 +358,6 @@ class MLflowIntegration:
        mlflow.set_experiment(experiment_name=mlflow_exp_name)
        self.experiment_id = mlflow.tracking.fluent._active_experiment_id
        self.experiment_name = mlflow.get_experiment(self.experiment_id).name
        self.experiment_type = experiment_type
        self.update_autolog_state()

        if self.autolog:
            # only end user created parent run in autolog scenario
@@ -197,9 +365,12 @@ class MLflowIntegration:

    def set_mlflow_config(self):
        if self.driver_mlflow_env_config is not None:
            from synapse.ml.mlflow import set_mlflow_env_config
            try:
                from synapse.ml.mlflow import set_mlflow_env_config

                set_mlflow_env_config(self.driver_mlflow_env_config)
                set_mlflow_env_config(self.driver_mlflow_env_config)
            except Exception:
                pass

    def wrap_evaluation_function(self, evaluation_function):
        wrapped_evaluation_function = _mlflow_wrapper(
@@ -267,6 +438,7 @@ class MLflowIntegration:
        else:
            _tags = []
        self.mlflow_client.log_batch(run_id=target_id, metrics=_metrics, params=[], tags=_tags)
        return f"Successfully copy_mlflow_run run_id {src_id} to run_id {target_id}"

    def record_trial(self, result, trial, metric):
        if isinstance(result, dict):
@@ -334,12 +506,31 @@ class MLflowIntegration:
            self.copy_mlflow_run(best_mlflow_run_id, self.parent_run_id)
            self.has_summary = True

    def log_model(self, model, estimator, signature=None):
    def log_model(self, model, estimator, signature=None, run_id=None):
        if not self._do_log_model:
            return
        logger.debug(f"logging model {estimator}")
        ret_message = f"Successfully log_model {estimator} to run_id {run_id}"
        optional_remove_list = (
            [] if estimator in ["transformer", "transformer_ms", "tcn", "tft"] else OPTIONAL_REMOVE_REQUIREMENT_LIST
        )
        run = mlflow.active_run()
        if run and run.info.run_id == self.parent_run_id:
            logger.debug(
                f"Current active run_id {run.info.run_id} == parent_run_id {self.parent_run_id}, Starting run_id {run_id}"
            )
            mlflow.start_run(run_id=run_id, nested=True)
        elif run and run.info.run_id != run_id:
            ret_message = (
                f"Error: Should log_model {estimator} to run_id {run_id}, but logged to run_id {run.info.run_id}"
            )
            logger.error(ret_message)
        else:
            logger.debug(f"No active run, start run_id {run_id}")
            mlflow.start_run(run_id=run_id)
        logger.debug(f"logged model {estimator} to run_id {mlflow.active_run().info.run_id}")
        if estimator.endswith("_spark"):
            mlflow.spark.log_model(model, estimator, signature=signature)
            # mlflow.spark.log_model(model, estimator, signature=signature)
            mlflow.spark.log_model(model, "model", signature=signature)
        elif estimator in ["lgbm"]:
            mlflow.lightgbm.log_model(model, estimator, signature=signature)
@@ -352,42 +543,93 @@ class MLflowIntegration:
        elif estimator in ["prophet"]:
            mlflow.prophet.log_model(model, estimator, signature=signature)
        elif estimator in ["orbit"]:
            pass
            logger.warning(f"Unsupported model: {estimator}. No model logged.")
        else:
            mlflow.sklearn.log_model(model, estimator, signature=signature)
        future = executor.submit(
            lambda: mlflow.models.model.update_model_requirements(
                model_uri=f"runs:/{run_id}/{'model' if estimator.endswith('_spark') else estimator}",
                operation="remove",
                requirement_list=convert_requirement(REMOVE_REQUIREMENT_LIST + optional_remove_list),
            )
        )
        self.futures[future] = f"run_{run_id}_requirements_updated"
        if not run or run.info.run_id == self.parent_run_id:
            logger.debug(f"Ending current run_id {mlflow.active_run().info.run_id}")
            mlflow.end_run()
        return ret_message

    def _pickle_and_log_artifact(self, obj, artifact_name, pickle_fname="temp_.pkl"):
    def _pickle_and_log_artifact(self, obj, artifact_name, pickle_fname="temp_.pkl", run_id=None):
        if not self._do_log_model:
            return
            return True
        with tempfile.TemporaryDirectory() as tmpdir:
            pickle_fpath = os.path.join(tmpdir, pickle_fname)
            try:
                with open(pickle_fpath, "wb") as f:
                    pickle.dump(obj, f)
                mlflow.log_artifact(pickle_fpath, artifact_name)
                self.mlflow_client.log_artifact(run_id, pickle_fpath, artifact_name)
                return True
            except Exception as e:
                logger.debug(f"Failed to pickle and log artifact {artifact_name}, error: {e}")
                logger.debug(f"Failed to pickle and log {artifact_name}, error: {e}")
                return False

    def pickle_and_log_automl_artifacts(self, automl, model, estimator, signature=None):
    def _log_pipeline(self, pipeline, flavor_name, pipeline_name, signature, run_id, estimator=None):
        logger.debug(f"logging pipeline {flavor_name}:{pipeline_name}:{estimator}")
        ret_message = f"Successfully _log_pipeline {flavor_name}:{pipeline_name}:{estimator} to run_id {run_id}"
        optional_remove_list = (
            [] if estimator in ["transformer", "transformer_ms", "tcn", "tft"] else OPTIONAL_REMOVE_REQUIREMENT_LIST
        )
        run = mlflow.active_run()
        if run and run.info.run_id == self.parent_run_id:
            logger.debug(
                f"Current active run_id {run.info.run_id} == parent_run_id {self.parent_run_id}, Starting run_id {run_id}"
            )
            mlflow.start_run(run_id=run_id, nested=True)
        elif run and run.info.run_id != run_id:
            ret_message = f"Error: Should _log_pipeline {flavor_name}:{pipeline_name}:{estimator} model to run_id {run_id}, but logged to run_id {run.info.run_id}"
            logger.error(ret_message)
        else:
            logger.debug(f"No active run, start run_id {run_id}")
            mlflow.start_run(run_id=run_id)
        logger.debug(
            f"logging pipeline {flavor_name}:{pipeline_name}:{estimator} to run_id {mlflow.active_run().info.run_id}"
        )
        if flavor_name == "sklearn":
            mlflow.sklearn.log_model(pipeline, pipeline_name, signature=signature)
        elif flavor_name == "spark":
            mlflow.spark.log_model(pipeline, pipeline_name, signature=signature)
        else:
            logger.warning(f"Unsupported pipeline flavor: {flavor_name}. No model logged.")
        future = executor.submit(
            lambda: mlflow.models.model.update_model_requirements(
                model_uri=f"runs:/{run_id}/{pipeline_name}",
                operation="remove",
                requirement_list=convert_requirement(REMOVE_REQUIREMENT_LIST + optional_remove_list),
            )
        )
        self.futures[future] = f"run_{run_id}_requirements_updated"
        if not run or run.info.run_id == self.parent_run_id:
            logger.debug(f"Ending current run_id {mlflow.active_run().info.run_id}")
            mlflow.end_run()
        return ret_message

    def pickle_and_log_automl_artifacts(self, automl, model, estimator, signature=None, run_id=None):
        """log automl artifacts to mlflow
        load back with `automl = mlflow.pyfunc.load_model(model_run_id_or_uri)`, then do prediction with `automl.predict(X)`
        """
        logger.debug(f"logging automl artifacts {estimator}")
        self._pickle_and_log_artifact(automl.feature_transformer, "feature_transformer", "feature_transformer.pkl")
        self._pickle_and_log_artifact(automl.label_transformer, "label_transformer", "label_transformer.pkl")
        # Test test_mlflow 1 and 4 will get error: TypeError: cannot pickle '_io.TextIOWrapper' object
        # try:
        #     self._pickle_and_log_artifact(automl, "automl", "automl.pkl")
        # except TypeError:
        #     pass
        logger.debug(f"logging automl estimator {estimator}")
        # self._pickle_and_log_artifact(
        #     automl.feature_transformer, "feature_transformer", "feature_transformer.pkl", run_id
        # )
        # self._pickle_and_log_artifact(automl.label_transformer, "label_transformer", "label_transformer.pkl", run_id)
        if estimator.endswith("_spark"):
            # spark pipeline is not supported yet
            return
        feature_transformer = automl.feature_transformer
        if isinstance(feature_transformer, Pipeline):
        if isinstance(feature_transformer, Pipeline) and not estimator.endswith("_spark"):
            pipeline = feature_transformer
            pipeline.steps.append(("estimator", model))
        elif isinstance(feature_transformer, SparkPipeline):
        elif isinstance(feature_transformer, SparkPipelineModel) and estimator.endswith("_spark"):
            pipeline = feature_transformer
            pipeline.stages.append(model)
        elif not estimator.endswith("_spark"):
@@ -395,24 +637,26 @@ class MLflowIntegration:
            steps.append(("estimator", model))
            pipeline = Pipeline(steps)
        else:
            stages = [feature_transformer]
            stages = []
            if feature_transformer is not None:
                stages.append(feature_transformer)
            stages.append(model)
            pipeline = SparkPipeline(stages=stages)
        if isinstance(pipeline, SparkPipeline):
            pipeline = SparkPipelineModel(stages=stages)
        if isinstance(pipeline, SparkPipelineModel):
            logger.debug(f"logging spark pipeline {estimator}")
            mlflow.spark.log_model(pipeline, "automl_pipeline", signature=signature)
            self._log_pipeline(pipeline, "spark", "model", signature, run_id, estimator)
        else:
            # Add a log named "model" to fit default settings
            logger.debug(f"logging sklearn pipeline {estimator}")
            mlflow.sklearn.log_model(pipeline, "automl_pipeline", signature=signature)
            mlflow.sklearn.log_model(pipeline, "model", signature=signature)
            self._log_pipeline(pipeline, "sklearn", "model", signature, run_id, estimator)
        return f"Successfully pickle_and_log_automl_artifacts {estimator} to run_id {run_id}"

    def record_state(self, automl, search_state, estimator):
    @time_it
    def record_state(self, automl, search_state, estimator, is_log_model=True):
        _st = time.time()
        automl_metric_name = (
            automl._state.metric if isinstance(automl._state.metric, str) else automl._state.error_metric
        )

        if automl._state.error_metric.startswith("1-"):
            automl_metric_value = 1 - search_state.val_loss
        elif automl._state.error_metric.startswith("-"):
@@ -425,6 +669,8 @@ class MLflowIntegration:
        else:
            config = search_state.config

        self.automl_user_configurations = safe_json_dumps(automl._automl_user_configurations)

        info = {
            "metrics": {
                "iter_counter": automl._track_iter,
@@ -445,7 +691,7 @@ class MLflowIntegration:
                "flaml.meric": automl_metric_name,
                "flaml.run_source": "flaml-automl",
                "flaml.log_type": self.log_type,
                "flaml.automl_user_configurations": safe_json_dumps(automl._automl_user_configurations),
                "flaml.automl_user_configurations": self.automl_user_configurations,
            },
            "params": {
                "sample_size": search_state.sample_size,
@@ -472,33 +718,70 @@ class MLflowIntegration:
            run_name = f"{self.parent_run_name}_child_{self.child_counter}"
        else:
            run_name = None
        _t1 = time.time()
        wait(self.futures_log_model)
        _t2 = time.time() - _t1
        logger.debug(f"wait futures_log_model in record_state took {_t2} seconds")
        with mlflow.start_run(nested=True, run_name=run_name) as child_run:
            self._log_info_to_run(info, child_run.info.run_id, log_params=True)
            if automl._state.model_history:
                self.log_model(
                    search_state.trained_estimator._model, estimator, signature=automl.estimator_signature
                )
                self.pickle_and_log_automl_artifacts(
                    automl, search_state.trained_estimator, estimator, signature=automl.pipeline_signature
                )
            future = executor.submit(lambda: self._log_info_to_run(info, child_run.info.run_id, log_params=True))
            self.futures[future] = f"iter_{automl._track_iter}_log_info_to_run"
            future = executor.submit(lambda: self._log_automl_configurations(child_run.info.run_id))
            self.futures[future] = f"iter_{automl._track_iter}_log_automl_configurations"
            if automl._state.model_history and is_log_model:
                if estimator.endswith("_spark"):
                    future = executor.submit(
                        lambda: self.log_model(
                            search_state.trained_estimator._model,
                            estimator,
                            automl.estimator_signature,
                            child_run.info.run_id,
                        )
                    )
                    self.futures_log_model[future] = f"record_state-log_model_{estimator}"
                else:
                    future = executor.submit(
                        lambda: self.pickle_and_log_automl_artifacts(
                            automl,
                            search_state.trained_estimator,
                            estimator,
                            automl.pipeline_signature,
                            child_run.info.run_id,
                        )
                    )
                    self.futures_log_model[future] = f"record_state-pickle_and_log_automl_artifacts_{estimator}"
            self.manual_run_ids.append(child_run.info.run_id)
        self.child_counter += 1
        return f"Successfully record_state iteration {automl._track_iter}"

    @time_it
    def log_automl(self, automl):
        self.set_best_iter(automl)
        if self.autolog:
            if self.parent_run_id is not None:
                mlflow.start_run(run_id=self.parent_run_id, experiment_id=self.experiment_id)
                mlflow.log_metric("best_validation_loss", automl._state.best_loss)
                mlflow.log_metric("best_iteration", automl._best_iteration)
                mlflow.log_metric("num_child_runs", len(self.infos))
                if automl._trained_estimator is not None and not self.has_model:
                    self.log_model(
                        automl._trained_estimator._model, automl.best_estimator, signature=automl.estimator_signature
                    )
                    self.pickle_and_log_automl_artifacts(
                        automl, automl.model, automl.best_estimator, signature=automl.pipeline_signature
                    )
                mlflow.log_metrics(
                    {
                        "best_validation_loss": automl._state.best_loss,
                        "best_iteration": automl._best_iteration,
                        "num_child_runs": len(self.infos),
                    }
                )
                if (
                    automl._trained_estimator is not None
                    and not self.has_model
                    and automl._trained_estimator._model is not None
                ):
                    if automl.best_estimator.endswith("_spark"):
                        self.log_model(
                            automl._trained_estimator._model,
                            automl.best_estimator,
                            automl.estimator_signature,
                            self.parent_run_id,
                        )
                    else:
                        self.pickle_and_log_automl_artifacts(
                            automl, automl.model, automl.best_estimator, automl.pipeline_signature, self.parent_run_id
                        )
                    self.has_model = True

            self.adopt_children(automl)
@@ -514,31 +797,68 @@ class MLflowIntegration:
            conf = automl._config_history[automl._best_iteration][1].copy()
            if "ml" in conf.keys():
                conf = conf["ml"]

            mlflow.log_params(conf)
            mlflow.log_param("best_learner", automl._best_estimator)
            params_arr = [
                Param(key, str(value)) for key, value in {**conf, "best_learner": automl._best_estimator}.items()
            ]
            self.mlflow_client.log_batch(run_id=self.parent_run_id, metrics=[], params=params_arr, tags=[])
            if not self.has_summary:
                logger.info(f"logging best model {automl.best_estimator}")
                self.copy_mlflow_run(best_mlflow_run_id, self.parent_run_id)
                future = executor.submit(lambda: self.copy_mlflow_run(best_mlflow_run_id, self.parent_run_id))
                self.futures[future] = "log_automl_copy_mlflow_run"
                future = executor.submit(lambda: self._log_automl_configurations(self.parent_run_id))
                self.futures[future] = "log_automl_log_automl_configurations"
                self.has_summary = True
                if automl._trained_estimator is not None and not self.has_model:
                    self.log_model(
                        automl._trained_estimator._model,
                        automl.best_estimator,
                        signature=automl.estimator_signature,
                    )
                    self.pickle_and_log_automl_artifacts(
                        automl, automl.model, automl.best_estimator, signature=automl.pipeline_signature
                    )
                _t1 = time.time()
                wait(self.futures_log_model)
                _t2 = time.time() - _t1
                logger.debug(f"wait futures_log_model in log_automl took {_t2} seconds")
                if (
                    automl._trained_estimator is not None
                    and not self.has_model
                    and automl._trained_estimator._model is not None
                ):
                    if automl.best_estimator.endswith("_spark"):
                        future = executor.submit(
                            lambda: self.log_model(
                                automl._trained_estimator._model,
                                automl.best_estimator,
                                signature=automl.estimator_signature,
                                run_id=self.parent_run_id,
                            )
                        )
                        self.futures_log_model[future] = f"log_automl-log_model_{automl.best_estimator}"
                    else:
                        future = executor.submit(
                            lambda: self.pickle_and_log_automl_artifacts(
                                automl,
                                automl.model,
                                automl.best_estimator,
                                signature=automl.pipeline_signature,
                                run_id=self.parent_run_id,
                            )
                        )
                        self.futures_log_model[future] = f"log_automl-pickle_and_log_automl_artifacts_{automl.best_estimator}"
                    self.has_model = True

    def resume_mlflow(self):
        if len(self.resume_params) > 0:
            mlflow.autolog(**self.resume_params)

    def _log_automl_configurations(self, run_id):
        self.mlflow_client.log_text(
            run_id=run_id,
            text=self.automl_user_configurations,
            artifact_file="automl_configurations/automl_user_configurations.json",
        )
        return f"Successfully _log_automl_configurations to run_id {run_id}"

    def _log_info_to_run(self, info, run_id, log_params=False):
        _metrics = [Metric(key, value, int(time.time() * 1000), 0) for key, value in info["metrics"].items()]
        _tags = [RunTag(key, str(value)) for key, value in info["tags"].items()]
        _tags = [
            RunTag(key, str(value)[:5000]) for key, value in info["tags"].items()
        ]  # AML will raise error if value length > 5000
        _params = [
            Param(key, str(value))
            for key, value in info["params"].items()
@@ -554,6 +874,7 @@ class MLflowIntegration:
            _tags = [RunTag("mlflow.parentRunId", run_id)]
            self.mlflow_client.log_batch(run_id=run.info.run_id, metrics=_metrics, params=[], tags=_tags)
        del info["submetrics"]["values"]
        return f"Successfully _log_info_to_run to run_id {run_id}"

    def adopt_children(self, result=None):
        """
@@ -575,6 +896,7 @@ class MLflowIntegration:
                ),
            )
            self.child_counter = 0
            num_infos = len(self.infos)

            # From latest to earliest, remove duplicate cross-validation runs
            _exist_child_run_params = []  # for deduplication of cross-validation child runs
@@ -639,22 +961,37 @@ class MLflowIntegration:
                    )
                    self.mlflow_client.set_tag(child_run_id, "flaml.child_counter", self.child_counter)

                # merge autolog child run and corresponding manual run
                flaml_info = self.infos[self.child_counter]
                child_run = self.mlflow_client.get_run(child_run_id)
                self._log_info_to_run(flaml_info, child_run_id, log_params=False)
                # Merge autolog child run and corresponding FLAML trial info (if available).
                # In nested scenarios (e.g., Tune -> AutoML -> MLflow autolog), MLflow can create
                # more child runs than the number of FLAML trials recorded in self.infos.
                # TODO: need more tests in nested scenarios.
                flaml_info = None
                child_run = None
                if self.child_counter < num_infos:
                    flaml_info = self.infos[self.child_counter]
                    child_run = self.mlflow_client.get_run(child_run_id)
                    self._log_info_to_run(flaml_info, child_run_id, log_params=False)

                if self.experiment_type == "automl":
                    if "learner" not in child_run.data.params:
                        self.mlflow_client.log_param(child_run_id, "learner", flaml_info["params"]["learner"])
                    if "sample_size" not in child_run.data.params:
                        self.mlflow_client.log_param(
                            child_run_id, "sample_size", flaml_info["params"]["sample_size"]
                        )
                    if self.experiment_type == "automl":
                        if "learner" not in child_run.data.params:
                            self.mlflow_client.log_param(child_run_id, "learner", flaml_info["params"]["learner"])
                        if "sample_size" not in child_run.data.params:
                            self.mlflow_client.log_param(
                                child_run_id, "sample_size", flaml_info["params"]["sample_size"]
                            )
                else:
                    logger.debug(
                        "No corresponding FLAML info for MLflow child run %s (child_counter=%s, infos=%s); skipping merge.",
                        child_run_id,
                        self.child_counter,
                        num_infos,
                    )

                if self.child_counter == best_iteration:
                if flaml_info is not None and self.child_counter == best_iteration:
                    self.mlflow_client.set_tag(child_run_id, "flaml.best_run", True)
                    if result is not None:
                        if child_run is None:
                            child_run = self.mlflow_client.get_run(child_run_id)
                        result.best_run_id = child_run_id
                        result.best_run_name = child_run.info.run_name
                    self.best_run_id = child_run_id
@@ -678,7 +1015,7 @@ class MLflowIntegration:
        self.resume_mlflow()


def register_automl_pipeline(automl, model_name=None, signature=None):
def register_automl_pipeline(automl, model_name=None, signature=None, artifact_path="model"):
    pipeline = automl.automl_pipeline
    if pipeline is None:
        logger.warning("pipeline not found, cannot register it")
@@ -688,7 +1025,7 @@ def register_automl_pipeline(automl, model_name=None, signature=None):
    if automl.best_run_id is None:
        mlflow.sklearn.log_model(
            pipeline,
            "automl_pipeline",
            artifact_path,
            registered_model_name=model_name,
            signature=automl.pipeline_signature if signature is None else signature,
        )
@@ -698,5 +1035,5 @@ def register_automl_pipeline(automl, model_name=None, signature=None):
        return mvs[0]
    else:
        best_run = mlflow.get_run(automl.best_run_id)
        model_uri = f"runs:/{best_run.info.run_id}/automl_pipeline"
        model_uri = f"runs:/{best_run.info.run_id}/{artifact_path}"
        return mlflow.register_model(model_uri, model_name)

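A minimal usage sketch of the function above (assumes `automl` has been fitted so that `automl.automl_pipeline` exists; the model name is a placeholder):

```python
# Registers the best pipeline under the given name and returns an MLflow ModelVersion.
mv = register_automl_pipeline(automl, model_name="my_automl_pipeline")
if mv is not None:
    print(mv.name, mv.version)
```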
@@ -1,6 +1,6 @@
# ChaCha for Online AutoML

FLAML includes *ChaCha* which is an automatic hyperparameter tuning solution for online machine learning. Online machine learning has the following properties: (1) data comes in sequential order; and (2) the performance of the machine learning model is evaluated online, i.e., at every iteration. *ChaCha* performs online AutoML respecting the aforementioned properties of online learning, and at the same time respecting the following constraints: (1) only a small constant number of 'live' models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a 'live' model with a new one, the replaced model can no longer be retrieved.

For more technical details about *ChaCha*, please check our paper.

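As an illustration of the online setting ChaCha targets, a minimal sketch of FLAML's online AutoML interface (class and argument names follow the FLAML documentation; the data stream is hypothetical and Vowpal Wabbit must be installed):

```python
from flaml import AutoVW
from flaml.tune import loguniform

# Keep at most 5 'live' Vowpal Wabbit models learning online at once.
autovw = AutoVW(
    max_live_model_num=5,
    search_space={"learning_rate": loguniform(lower=2e-10, upper=1.0)},
)

for data_sample in stream:  # hypothetical iterator of VW-format examples
    prediction = autovw.predict(data_sample)  # predict before the label is revealed
    autovw.learn(data_sample)                 # then learn from the example online
```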
flaml/tune/logger.py (new file, 37 lines)
@@ -0,0 +1,37 @@
import logging
import os


class ColoredFormatter(logging.Formatter):
    # ANSI escape codes for colors
    COLORS = {
        # logging.DEBUG: "\033[36m",  # Cyan
        # logging.INFO: "\033[32m",  # Green
        logging.WARNING: "\033[33m",  # Yellow
        logging.ERROR: "\033[31m",  # Red
        logging.CRITICAL: "\033[1;31m",  # Bright Red
    }
    RESET = "\033[0m"  # Reset to default

    def __init__(self, fmt, datefmt, use_color=True):
        super().__init__(fmt, datefmt)
        self.use_color = use_color

    def format(self, record):
        formatted = super().format(record)
        if self.use_color:
            color = self.COLORS.get(record.levelno, "")
            if color:
                return f"{color}{formatted}{self.RESET}"
        return formatted


logger = logging.getLogger(__name__)
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
    use_color = False

logger_formatter = ColoredFormatter(
    "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
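A minimal sketch of wiring this formatter to a console handler (the handler setup is an assumption; the new module itself only defines the logger and formatter):

```python
import logging
import sys

_handler = logging.StreamHandler(stream=sys.stdout)
_handler.setFormatter(logger_formatter)  # colorize WARNING and above
logger.addHandler(_handler)
logger.warning("shown in yellow when FLAML_LOG_NO_COLOR is unset")
```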
@@ -217,7 +217,24 @@ class BlendSearch(Searcher):
        if global_search_alg is not None:
            self._gs = global_search_alg
        elif getattr(self, "__name__", None) != "CFO":
            if space and self._ls.hierarchical:
            # Use define-by-run for OptunaSearch when needed:
            # - Hierarchical/conditional spaces are best supported via define-by-run.
            # - Ray Tune domain/grid specs can trigger an "unresolved search space" warning
            #   unless we switch to define-by-run.
            use_define_by_run = bool(getattr(self._ls, "hierarchical", False))
            if (not use_define_by_run) and isinstance(space, dict) and space:
                try:
                    from .variant_generator import parse_spec_vars

                    _, domain_vars, grid_vars = parse_spec_vars(space)
                    use_define_by_run = bool(domain_vars or grid_vars)
                except Exception:
                    # Be conservative: if we can't determine whether the space is
                    # unresolved, fall back to the original behavior.
                    use_define_by_run = False

            self._use_define_by_run = use_define_by_run
            if use_define_by_run:
                from functools import partial

                gs_space = partial(define_by_run_func, space=space)
@@ -244,13 +261,32 @@ class BlendSearch(Searcher):
                evaluated_rewards=evaluated_rewards,
            )
        except (AssertionError, ValueError):
            self._gs = GlobalSearch(
                space=gs_space,
                metric=metric,
                mode=mode,
                seed=gs_seed,
                sampler=sampler,
            )
            try:
                self._gs = GlobalSearch(
                    space=gs_space,
                    metric=metric,
                    mode=mode,
                    seed=gs_seed,
                    sampler=sampler,
                )
            except ValueError:
                # Ray Tune's OptunaSearch converts Tune domains into Optuna
                # distributions. Optuna disallows integer log distributions
                # with step != 1 (e.g., qlograndint with q>1), which can
                # raise here. Fall back to FLAML's OptunaSearch wrapper,
                # which handles these spaces more permissively.
                if getattr(GlobalSearch, "__module__", "").startswith("ray.tune"):
                    from .suggestion import OptunaSearch as _FallbackOptunaSearch

                    self._gs = _FallbackOptunaSearch(
                        space=gs_space,
                        metric=metric,
                        mode=mode,
                        seed=gs_seed,
                        sampler=sampler,
                    )
                else:
                    raise
        self._gs.space = space
        else:
            self._gs = None
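For context, a sketch of the kind of hierarchical (conditional) search space that makes `use_define_by_run` true; the parameter names are illustrative:

```python
from flaml import tune

# A choice over dicts creates a conditional subspace per branch, which FLAML
# treats as hierarchical and hands to Optuna via a define-by-run function.
space = {
    "model": tune.choice(
        [
            {"name": "lgbm", "n_estimators": tune.lograndint(4, 1000)},
            {"name": "xgboost", "max_depth": tune.randint(1, 9)},
        ]
    ),
}
```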
@@ -468,7 +504,7 @@ class BlendSearch(Searcher):
                self._ls_bound_max,
                self._subspace.get(trial_id, self._ls.space),
            )
        if self._gs is not None and self._experimental and (not self._ls.hierarchical):
        if self._gs is not None and self._experimental and (not getattr(self, "_use_define_by_run", False)):
            self._gs.add_evaluated_point(flatten_dict(config), objective)
            # TODO: recover when supported
            # converted = convert_key(config, self._gs.space)

@@ -641,8 +641,10 @@ class FLOW2(Searcher):
            else:
                # key must be in space
                domain = space[key]
                if self.hierarchical and not (
                    domain is None or type(domain) in (str, int, float) or isinstance(domain, sample.Domain)
                if (
                    self.hierarchical
                    and domain is not None
                    and not isinstance(domain, (str, int, float, sample.Domain))
                ):
                    # not domain or hashable
                    # get rid of list type for hierarchical search space.

@@ -207,7 +207,7 @@ class ChampionFrontierSearcher(BaseSearcher):
                hyperparameter_config_groups.append(partial_new_configs)
                # does not have searcher_trial_ids
                searcher_trial_ids_groups.append([])
            elif isinstance(config_domain, Float) or isinstance(config_domain, Categorical):
            elif isinstance(config_domain, (Float, Categorical)):
                # otherwise we need to deal with them in group
                nonpoly_config[k] = v
                if k not in self._space_of_nonpoly_hp:

@@ -25,6 +25,31 @@ from .flow2 import FLOW2
logger = logging.getLogger(__name__)


def _recursive_dict_update(target: Dict, source: Dict) -> None:
    """Recursively update target dictionary with source dictionary.

    Unlike dict.update(), this function merges nested dictionaries instead of
    replacing them entirely. This is crucial for configurations with nested
    structures (e.g., XGBoost params).

    Args:
        target: The dictionary to be updated (modified in place).
        source: The dictionary containing values to merge into target.

    Example:
        >>> target = {'params': {'eta': 0.1, 'max_depth': 3}}
        >>> source = {'params': {'verbosity': 0}}
        >>> _recursive_dict_update(target, source)
        >>> target
        {'params': {'eta': 0.1, 'max_depth': 3, 'verbosity': 0}}
    """
    for key, value in source.items():
        if isinstance(value, dict) and key in target and isinstance(target[key], dict):
            _recursive_dict_update(target[key], value)
        else:
            target[key] = value


class SearchThread:
    """Class of global or local search thread."""

@@ -65,7 +90,7 @@ class SearchThread:
        try:
            config = self._search_alg.suggest(trial_id)
            if isinstance(self._search_alg._space, dict):
                config.update(self._const)
                _recursive_dict_update(config, self._const)
            else:
                # define by run
                config, self.space = unflatten_hierarchical(config, self._space)

@@ -35,6 +35,73 @@ from ..sample import (
    Quantized,
    Uniform,
)

# If Ray is installed, flaml.tune may re-export Ray Tune sampling functions.
# In that case, the search space contains Ray Tune Domain/Sampler objects,
# which should be accepted by our Optuna search-space conversion.
try:
    from ray import __version__ as _ray_version  # type: ignore

    if str(_ray_version).startswith("1."):
        from ray.tune.sample import Categorical as _RayCategorical  # type: ignore
        from ray.tune.sample import Domain as _RayDomain
        from ray.tune.sample import Float as _RayFloat
        from ray.tune.sample import Integer as _RayInteger
        from ray.tune.sample import LogUniform as _RayLogUniform
        from ray.tune.sample import Quantized as _RayQuantized
        from ray.tune.sample import Uniform as _RayUniform
    else:
        from ray.tune.search.sample import Categorical as _RayCategorical  # type: ignore
        from ray.tune.search.sample import Domain as _RayDomain
        from ray.tune.search.sample import Float as _RayFloat
        from ray.tune.search.sample import Integer as _RayInteger
        from ray.tune.search.sample import LogUniform as _RayLogUniform
        from ray.tune.search.sample import Quantized as _RayQuantized
        from ray.tune.search.sample import Uniform as _RayUniform

    _FLOAT_TYPES = (Float, _RayFloat)
    _INTEGER_TYPES = (Integer, _RayInteger)
    _CATEGORICAL_TYPES = (Categorical, _RayCategorical)
    _DOMAIN_TYPES = (Domain, _RayDomain)
    _QUANTIZED_TYPES = (Quantized, _RayQuantized)
    _UNIFORM_TYPES = (Uniform, _RayUniform)
    _LOGUNIFORM_TYPES = (LogUniform, _RayLogUniform)
except Exception:  # pragma: no cover
    _FLOAT_TYPES = (Float,)
    _INTEGER_TYPES = (Integer,)
    _CATEGORICAL_TYPES = (Categorical,)
    _DOMAIN_TYPES = (Domain,)
    _QUANTIZED_TYPES = (Quantized,)
    _UNIFORM_TYPES = (Uniform,)
    _LOGUNIFORM_TYPES = (LogUniform,)

from ..trial import flatten_dict, unflatten_dict
from .variant_generator import parse_spec_vars

@@ -850,19 +917,22 @@ class OptunaSearch(Searcher):
        def resolve_value(domain: Domain) -> ot.distributions.BaseDistribution:
            quantize = None

            sampler = domain.get_sampler()
            if isinstance(sampler, Quantized):
            # Ray Tune Domains and FLAML Domains both provide get_sampler(), but
            # fall back to the .sampler attribute for robustness.
            sampler = domain.get_sampler() if hasattr(domain, "get_sampler") else getattr(domain, "sampler", None)

            if isinstance(sampler, _QUANTIZED_TYPES) or type(sampler).__name__ == "Quantized":
                quantize = sampler.q
                sampler = sampler.sampler
                if isinstance(sampler, LogUniform):
                sampler = getattr(sampler, "sampler", None) or sampler.get_sampler()
                if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
                    logger.warning(
                        "Optuna does not handle quantization in loguniform "
                        "sampling. The parameter will be passed but it will "
                        "probably be ignored."
                    )

            if isinstance(domain, Float):
                if isinstance(sampler, LogUniform):
            if isinstance(domain, _FLOAT_TYPES) or type(domain).__name__ == "Float":
                if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
                    if quantize:
                        logger.warning(
                            "Optuna does not support both quantization and "
@@ -870,17 +940,17 @@ class OptunaSearch(Searcher):
                        )
                    return ot.distributions.LogUniformDistribution(domain.lower, domain.upper)

                elif isinstance(sampler, Uniform):
                elif isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
                    if quantize:
                        return ot.distributions.DiscreteUniformDistribution(domain.lower, domain.upper, quantize)
                    return ot.distributions.UniformDistribution(domain.lower, domain.upper)

            elif isinstance(domain, Integer):
                if isinstance(sampler, LogUniform):
            elif isinstance(domain, _INTEGER_TYPES) or type(domain).__name__ == "Integer":
                if isinstance(sampler, _LOGUNIFORM_TYPES) or type(sampler).__name__ == "LogUniform":
                    # ``step`` argument Deprecated in v2.0.0. ``step`` argument should be 1 in Log Distribution
                    # The removal of this feature is currently scheduled for v4.0.0,
                    return ot.distributions.IntLogUniformDistribution(domain.lower, domain.upper - 1, step=1)
                elif isinstance(sampler, Uniform):
                elif isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
                    # Upper bound should be inclusive for quantization and
                    # exclusive otherwise
                    return ot.distributions.IntUniformDistribution(
@@ -888,16 +958,16 @@ class OptunaSearch(Searcher):
                        domain.upper - int(bool(not quantize)),
                        step=quantize or 1,
                    )
            elif isinstance(domain, Categorical):
                if isinstance(sampler, Uniform):
            elif isinstance(domain, _CATEGORICAL_TYPES) or type(domain).__name__ == "Categorical":
                if isinstance(sampler, _UNIFORM_TYPES) or type(sampler).__name__ == "Uniform":
                    return ot.distributions.CategoricalDistribution(domain.categories)

            raise ValueError(
                "Optuna search does not support parameters of type "
                "`{}` with samplers of type `{}`".format(type(domain).__name__, type(domain.sampler).__name__)
                "`{}` with samplers of type `{}`".format(type(domain).__name__, type(sampler).__name__)
            )

        # Parameter name is e.g. "a/b/c" for nested dicts
        values = {"/".join(path): resolve_value(domain) for path, domain in domain_vars}

        return values
        return values

@@ -261,7 +261,7 @@ def add_cost_to_space(space: Dict, low_cost_point: Dict, choice_cost: Dict):
                low_cost[i] = point
        if len(low_cost) > len(domain.categories):
            if domain.ordered:
                low_cost[-1] = int(np.where(ind == low_cost[-1])[0])
                low_cost[-1] = int(np.where(ind == low_cost[-1])[0].item())
            domain.low_cost_point = low_cost[-1]
            return
    if low_cost:

@@ -162,6 +162,10 @@ def broadcast_code(custom_code="", file_name="mylearner"):
    assert isinstance(MyLargeLGBM(), LGBMEstimator)
    ```
    """
    # Check if Spark is available
    spark_available, _ = check_spark()

    # Write to local driver file system
    flaml_path = os.path.dirname(os.path.abspath(__file__))
    custom_code = textwrap.dedent(custom_code)
    custom_path = os.path.join(flaml_path, file_name + ".py")
@@ -169,6 +173,24 @@ def broadcast_code(custom_code="", file_name="mylearner"):
    with open(custom_path, "w") as f:
        f.write(custom_code)

    # If using Spark, broadcast the code content to executors
    if spark_available:
        spark = SparkSession.builder.getOrCreate()
        bc_code = spark.sparkContext.broadcast(custom_code)

        # Execute a job to ensure the code is distributed to all executors
        def _write_code(bc):
            code = bc.value
            import os

            module_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), file_name + ".py")
            os.makedirs(os.path.dirname(module_path), exist_ok=True)
            with open(module_path, "w") as f:
                f.write(code)
            return True

        spark.sparkContext.parallelize(range(1)).map(lambda _: _write_code(bc_code)).collect()

    return custom_path

@@ -21,11 +21,11 @@ except (ImportError, AssertionError):
    from .analysis import ExperimentAnalysis as EA
else:
    ray_available = True

import logging

from flaml.tune.spark.utils import PySparkOvertimeMonitor, check_spark

from .logger import logger, logger_formatter
from .result import DEFAULT_METRIC
from .trial import Trial

@@ -41,8 +41,6 @@ except ImportError:
    internal_mlflow = False


logger = logging.getLogger(__name__)
logger.propagate = False
_use_ray = True
_runner = None
_verbose = 0
@@ -197,9 +195,16 @@ def report(_metric=None, **kwargs):
    global _training_iteration
    if _use_ray:
        try:
            from ray import tune
            from ray import __version__ as ray_version

            return tune.report(_metric, **kwargs)
            if ray_version.startswith("1."):
                from ray import tune

                return tune.report(_metric, **kwargs)
            else:  # ray>=2
                from ray.air import session

                return session.report(metrics={"metric": _metric, **kwargs})
        except ImportError:
            # calling tune.report() outside tune.run()
            return
@@ -260,6 +265,8 @@ def run(
    mlflow_exp_name: Optional[str] = None,
    automl_info: Optional[Tuple[float]] = None,
    extra_tag: Optional[dict] = None,
    cost_attr: Optional[str] = "auto",
    cost_budget: Optional[float] = None,
    **ray_args,
):
    """The function-based way of performing HPO.

@@ -462,6 +469,12 @@ def run(
            overwritten by the value of `n_concurrent_trials` in AutoML. When <= 0, the concurrent trials
            will be set to the number of executors.
        extra_tag: dict, default=None | Extra tags to be added to the mlflow runs created by autologging.
        cost_attr: None or str to specify the attribute to evaluate the cost of different trials.
            Default is "auto", which means that we will automatically choose the cost attribute to use (depending
            on the nature of the resource budget). When cost_attr is set to None, cost differences between different trials will be omitted
            in our search algorithm. When cost_attr is set to a str different from "auto" and "time_total_s",
            this cost_attr must be available in the result dict of the trial.
        cost_budget: A float of the cost budget. Only valid when cost_attr is a str different from "auto" and "time_total_s".
        **ray_args: keyword arguments to pass to ray.tune.run().
            Only valid when use_ray=True.
    """
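A sketch of the new cost-aware arguments, based on the docstring above (the evaluation function and cost values are hypothetical):

```python
from flaml import tune

def evaluate_config(config):
    # Hypothetical objective: higher n_estimators scores better but costs more.
    return {"score": 1 - 1 / config["n_estimators"], "eval_cost": config["n_estimators"] / 100}

analysis = tune.run(
    evaluate_config,
    config={"n_estimators": tune.randint(4, 1000)},
    metric="score",
    mode="max",
    cost_attr="eval_cost",  # must be present in each trial's result dict
    cost_budget=50.0,       # budget on the chosen cost attribute
    num_samples=-1,
)
print(analysis.best_config)
```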
@@ -506,10 +519,6 @@ def run(
    elif not logger.hasHandlers():
        # Add the console handler.
        _ch = logging.StreamHandler(stream=sys.stdout)
        logger_formatter = logging.Formatter(
            "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
            "%m-%d %H:%M:%S",
        )
        _ch.setFormatter(logger_formatter)
        logger.addHandler(_ch)
    if verbose <= 2:
@@ -600,6 +609,8 @@ def run(
            metric_constraints=metric_constraints,
            use_incumbent_result_in_evaluation=use_incumbent_result_in_evaluation,
            lexico_objectives=lexico_objectives,
            cost_attr=cost_attr,
            cost_budget=cost_budget,
        )
    else:
        if metric is None or mode is None:

@@ -735,10 +746,16 @@ def run(
            max_concurrent = max(1, search_alg.max_concurrent)
        else:
            max_concurrent = max(1, max_spark_parallelism)
        passed_in_n_concurrent_trials = max(n_concurrent_trials, max_concurrent)
        n_concurrent_trials = min(
            n_concurrent_trials if n_concurrent_trials > 0 else num_executors,
            max_concurrent,
        )
        if n_concurrent_trials < passed_in_n_concurrent_trials:
            logger.warning(
                f"The actual concurrent trials is {n_concurrent_trials}. You can set the environment "
                f"variable `FLAML_MAX_CONCURRENT` to '{passed_in_n_concurrent_trials}' to override the detected num of executors."
            )
        with parallel_backend("spark"):
            with Parallel(n_jobs=n_concurrent_trials, verbose=max(0, (verbose - 1) * 50)) as parallel:
                try:
@@ -759,8 +776,8 @@ def run(
                        and (num_samples < 0 or num_trials < num_samples)
                        and num_failures < upperbound_num_failures
                    ):
                        if automl_info and automl_info[0] > 0 and time_budget_s < np.inf:
                            time_budget_s -= automl_info[0]
                        if automl_info and automl_info[1] == "all" and automl_info[0] > 0 and time_budget_s < np.inf:
                            time_budget_s -= automl_info[0] * n_concurrent_trials
                            logger.debug(f"Remaining time budget with mlflow log latency: {time_budget_s} seconds.")
                        while len(_runner.running_trials) < n_concurrent_trials:
                            # suggest trials for spark
@@ -785,9 +802,17 @@ def run(
                        )
                        results = None
                        with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel, parallel=parallel):
                            results = parallel(
                                delayed(evaluation_function)(trial_to_run.config) for trial_to_run in trials_to_run
                            )
                            try:
                                results = parallel(
                                    delayed(evaluation_function)(trial_to_run.config) for trial_to_run in trials_to_run
                                )
                            except RuntimeError as e:
                                logger.warning(f"RuntimeError: {e}")
                                results = None
                                logger.info(
                                    "Encountered RuntimeError. Waiting 10 seconds for Spark cluster to recover before retrying."
                                )
                                time.sleep(10)
                        # results = [evaluation_function(trial_to_run.config) for trial_to_run in trials_to_run]
                        while results:
                            result = results.pop(0)

@@ -1 +1 @@
__version__ = "2.3.2"
__version__ = "2.5.0"

@@ -2,7 +2,6 @@
license_file = "LICENSE"
description-file = "README.md"


[tool.pytest.ini_options]
addopts = '-m "not conda"'
markers = [

pytest.ini (new file, 3 lines)
@@ -0,0 +1,3 @@
[pytest]
markers =
    spark: mark a test as requiring Spark
78  setup.py
@@ -51,60 +51,59 @@ setuptools.setup(
            "joblib<=1.3.2",
        ],
        "test": [
            "numpy>=1.17,<2.0.0; python_version<'3.13'",
            "numpy>=1.17; python_version>='3.13'",
            "jupyter",
            "lightgbm>=2.3.1",
            "xgboost>=0.90,<2.0.0",
            "xgboost>=0.90,<2.0.0; python_version<'3.11'",
            "xgboost>=2.0.0; python_version>='3.11'",
            "scipy>=1.4.1",
            "pandas>=1.1.4,<2.0.0; python_version<'3.10'",
            "pandas>=1.1.4; python_version>='3.10'",
            "scikit-learn>=1.0.0",
            "scikit-learn>=1.2.0",
            "thop",
            "pytest>=6.1.1",
            "pytest-rerunfailures>=13.0",
            "coverage>=5.3",
            "pre-commit",
            "torch",
            "torchvision",
            "catboost>=0.26,<1.2; python_version<'3.11'",
            "catboost>=0.26; python_version>='3.11'",
            "catboost>=0.26",
            "rgf-python",
            "optuna>=2.8.0,<=3.6.1",
            "openml",
            "statsmodels>=0.12.2",
            "psutil==5.8.0",
            "psutil",
            "dataclasses",
            "transformers[torch]==4.26",
            "transformers[torch]",
            "datasets",
            "nltk<=3.8.1",  # 3.8.2 doesn't work with mlflow
            "evaluate",
            "nltk!=3.8.2",  # 3.8.2 doesn't work with mlflow
            "rouge_score",
            "hcrystalball==0.1.10",
            "hcrystalball",
            "seqeval",
            "pytorch-forecasting>=0.9.0,<=0.10.1; python_version<'3.11'",
            # "pytorch-forecasting==0.10.1; python_version=='3.11'",
            "mlflow==2.15.1",
            "pytorch-forecasting",
            "mlflow-skinny<=2.22.1",  # Refer to https://mvnrepository.com/artifact/org.mlflow/mlflow-spark
            "joblibspark>=0.5.0",
            "joblib<=1.3.2",
            "nbconvert",
            "nbformat",
            "ipykernel",
            "pytorch-lightning<1.9.1",  # test_forecast_panel
            "tensorboardX==2.6",  # test_forecast_panel
            "requests<2.29.0",  # https://github.com/docker/docker-py/issues/3113
            "pytorch-lightning",  # test_forecast_panel
            "tensorboardX",  # test_forecast_panel
            "requests",  # https://github.com/docker/docker-py/issues/3113
            "packaging",
            "pydantic==1.10.9",
            "sympy",
            "wolframalpha",
            "dill",  # a drop in replacement of pickle
        ],
        "catboost": [
            "catboost>=0.26,<1.2; python_version<'3.11'",
            "catboost>=0.26,<=1.2.5; python_version>='3.11'",
            "catboost>=0.26",
        ],
        "blendsearch": [
            "optuna>=2.8.0,<=3.6.1",
            "packaging",
        ],
        "ray": [
            "ray[tune]~=1.13",
            "ray[tune]>=1.13,<2.5.0",
        ],
        "azureml": [
            "azureml-mlflow",
@@ -117,47 +116,35 @@ setuptools.setup(
            "scikit-learn",
        ],
        "hf": [
            "transformers[torch]==4.26",
            "transformers[torch]>=4.26",
            "datasets",
            "nltk<=3.8.1",
            "rouge_score",
            "seqeval",
        ],
        "nlp": [  # for backward compatibility; hf is the new option name
            "transformers[torch]==4.26",
            "transformers[torch]>=4.26",
            "datasets",
            "nltk<=3.8.1",
            "rouge_score",
            "seqeval",
        ],
        "ts_forecast": [
            "holidays<0.14",  # to prevent installation error for prophet
            "prophet>=1.0.1",
            "holidays",
            "prophet>=1.1.5",
            "statsmodels>=0.12.2",
            "hcrystalball==0.1.10",
            "hcrystalball>=0.1.10",
        ],
        "forecast": [
            "holidays<0.14",  # to prevent installation error for prophet
            "prophet>=1.0.1",
            "holidays",
            "prophet>=1.1.5",
            "statsmodels>=0.12.2",
            "hcrystalball==0.1.10",
            "pytorch-forecasting>=0.9.0; python_version<'3.11'",
            # "pytorch-forecasting==0.10.1; python_version=='3.11'",
            "pytorch-lightning==1.9.0",
            "tensorboardX==2.6",
            "hcrystalball>=0.1.10",
            "pytorch-forecasting>=0.10.4",
            "pytorch-lightning>=1.9.0",
            "tensorboardX>=2.6",
        ],
        "benchmark": ["catboost>=0.26", "psutil==5.8.0", "xgboost==1.3.3", "pandas==1.1.4"],
        "openai": ["openai==0.27.8", "diskcache"],
        "autogen": ["openai==0.27.8", "diskcache", "termcolor"],
        "mathchat": ["openai==0.27.8", "diskcache", "termcolor", "sympy", "pydantic==1.10.9", "wolframalpha"],
        "retrievechat": [
            "openai==0.27.8",
            "diskcache",
            "termcolor",
            "chromadb",
            "tiktoken",
            "sentence_transformers",
        ],
        "synapse": [
            "joblibspark>=0.5.0",
            "optuna>=2.8.0,<=3.6.1",
@@ -170,10 +157,9 @@ setuptools.setup(
        "Operating System :: OS Independent",
        # Specify the Python versions you support here.
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
    ],
    python_requires=">=3.8",
    python_requires=">=3.10",
)
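Many of the updated pins above use PEP 508 environment markers (for example `; python_version>='3.11'`) so a single extras list can vary by interpreter version. A quick sketch of how such a marker is evaluated, using the `packaging` library that is already listed as a test dependency:

```python
# Sketch: evaluating a PEP 508 environment marker like the pins above.
from packaging.markers import Marker

marker = Marker("python_version >= '3.11'")
print(marker.evaluate())  # True on Python 3.11+, False otherwise
```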
@@ -4,8 +4,17 @@ import pytest

from flaml import AutoML, tune

try:
    import transformers

@pytest.mark.skipif(sys.platform == "darwin", reason="do not run on mac os")
    _transformers_installed = True
except ImportError:
    _transformers_installed = False


@pytest.mark.skipif(
    sys.platform == "darwin" or not _transformers_installed, reason="do not run on mac os or transformers not installed"
)
def test_custom_hp_nlp():
    from test.nlp.utils import get_automl_settings, get_toy_data_seqclassification

@@ -63,5 +72,39 @@ def test_custom_hp():
    print(automl.best_config_per_estimator)


def test_lgbm_objective():
    """Test that objective parameter can be set via custom_hp for LGBMEstimator"""
    import numpy as np

    # Create a simple regression dataset
    np.random.seed(42)
    X_train = np.random.rand(100, 5)
    y_train = np.random.rand(100) * 100  # Scale to avoid division issues with MAPE

    automl = AutoML()
    settings = {
        "time_budget": 3,
        "metric": "mape",
        "task": "regression",
        "estimator_list": ["lgbm"],
        "verbose": 0,
        "custom_hp": {"lgbm": {"objective": {"domain": "mape"}}},  # Fixed value, not tuned
    }

    automl.fit(X_train, y_train, **settings)

    # Verify that objective was set correctly
    assert "objective" in automl.best_config, "objective should be in best_config"
    assert automl.best_config["objective"] == "mape", "objective should be 'mape'"

    # Verify the model has the correct objective
    if hasattr(automl.model, "estimator") and hasattr(automl.model.estimator, "get_params"):
        model_params = automl.model.estimator.get_params()
        assert model_params.get("objective") == "mape", "Model should use 'mape' objective"

    print("Test passed: objective parameter works correctly with LGBMEstimator")


if __name__ == "__main__":
    test_custom_hp()
    test_lgbm_objective()
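The new test exercises `custom_hp` with a constant `domain`, which pins a hyperparameter instead of tuning it. For contrast, a hedged sketch showing a pinned value next to a tuned search space (the fit call is commented out; the settings values are illustrative):

```python
# Sketch: custom_hp with a fixed value vs. a tuned search space.
from flaml import AutoML, tune

custom_hp = {
    "lgbm": {
        "objective": {"domain": "mape"},  # constant -> used as-is, never tuned
        "num_leaves": {"domain": tune.choice([15, 31, 63])},  # tuned
    }
}
automl = AutoML()
# automl.fit(X_train, y_train, task="regression", metric="mape",
#            estimator_list=["lgbm"], custom_hp=custom_hp, time_budget=3)
```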
@@ -1,3 +1,4 @@
import atexit
import os
import sys
import unittest

@@ -15,8 +16,18 @@ from sklearn.model_selection import train_test_split

from flaml import AutoML
from flaml.automl.ml import sklearn_metric_loss_score
from flaml.automl.spark import disable_spark_ansi_mode, restore_spark_ansi_mode
from flaml.tune.spark.utils import check_spark

try:
    import pytorch_lightning

    _pl_installed = True
except ImportError:
    _pl_installed = False

pytestmark = pytest.mark.spark

leaderboard = defaultdict(dict)

warnings.simplefilter(action="ignore")

@@ -37,7 +48,7 @@ else:
        .config(
            "spark.jars.packages",
            (
                "com.microsoft.azure:synapseml_2.12:1.0.2,"
                "com.microsoft.azure:synapseml_2.12:1.1.0,"
                "org.apache.hadoop:hadoop-azure:3.3.5,"
                "com.microsoft.azure:azure-storage:8.6.6,"
                f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"

@@ -61,6 +72,9 @@ else:
except ImportError:
    skip_spark = True

spark, ansi_conf, adjusted = disable_spark_ansi_mode()
atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)


def _test_regular_models(estimator_list, task):
    if isinstance(estimator_list, str):

@@ -174,7 +188,11 @@ def _test_sparse_matrix_classification(estimator):
        "n_jobs": 1,
        "model_history": True,
    }
    X_train = scipy.sparse.random(1554, 21, dtype=int)
    # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably
    # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger
    # integer sampling paths which raise "low is out of bounds for int32".
    # A float sparse matrix is sufficient to validate sparse-input support.
    X_train = scipy.sparse.random(1554, 21, dtype=np.float32)
    y_train = np.random.randint(3, size=1554)
    automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)

@@ -269,7 +287,11 @@ class TestExtraModel(unittest.TestCase):

    @unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
    def test_default_spark(self):
        _test_spark_models(None, "classification")
        # TODO: remove the estimator assignment once SynapseML supports spark 4+.
        from flaml.automl.spark.utils import _spark_major_minor_version

        estimator_list = ["rf_spark"] if _spark_major_minor_version[0] >= 4 else None
        _test_spark_models(estimator_list, "classification")

    def test_svc(self):
        _test_regular_models("svc", "classification")

@@ -300,7 +322,7 @@ class TestExtraModel(unittest.TestCase):
    def test_avg(self):
        _test_forecast("avg")

    @unittest.skipIf(skip_spark, reason="Skip on Mac or Windows")
    @unittest.skipIf(skip_spark or not _pl_installed, reason="Skip on Mac or Windows or no pytorch_lightning.")
    def test_tcn(self):
        _test_forecast("tcn")
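The ANSI-mode change above follows a disable-now, restore-at-exit pattern: flip a session-wide setting for the tests and register the undo with `atexit` so it runs even on abnormal teardown. A generic sketch of the same idea — the dict-based config is illustrative; the real helpers operate on the Spark session:

```python
# Generic sketch of the disable/restore-at-exit pattern used above.
import atexit


def disable_flag(config, key):
    previous = config.get(key)
    config[key] = False
    return previous


def restore_flag(config, key, previous):
    config[key] = previous


config = {"ansi_mode": True}
prev = disable_flag(config, "ansi_mode")
atexit.register(restore_flag, config, "ansi_mode", prev)  # runs at interpreter exit
```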
@@ -10,7 +10,7 @@ from flaml import AutoML
from flaml.automl.task.time_series_task import TimeSeriesTask


def test_forecast_automl(budget=10, estimators_when_no_prophet=["arima", "sarimax", "holt-winters"]):
def test_forecast_automl(budget=20, estimators_when_no_prophet=["arima", "sarimax", "holt-winters"]):
    # using dataframe
    import statsmodels.api as sm

@@ -477,7 +477,10 @@ def test_forecast_classification(budget=5):
def get_stalliion_data():
    from pytorch_forecasting.data.examples import get_stallion_data

    data = get_stallion_data()
    # data = get_stallion_data()
    data = pd.read_parquet(
        "https://raw.githubusercontent.com/sktime/pytorch-forecasting/refs/heads/main/examples/data/stallion.parquet"
    )
    # add time index - For datasets with no missing values, FLAML will automate this process
    data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
    data["time_idx"] -= data["time_idx"].min()

@@ -507,8 +510,12 @@ def get_stalliion_data():
    "3.11" in sys.version,
    reason="do not run on py 3.11",
)
def test_forecast_panel(budget=5):
    data, special_days = get_stalliion_data()
def test_forecast_panel(budget=30):
    try:
        data, special_days = get_stalliion_data()
    except ImportError:
        print("pytorch_forecasting not installed")
        return
    time_horizon = 6  # predict six months
    training_cutoff = data["time_idx"].max() - time_horizon
    data["time_idx"] = data["time_idx"].astype("int")

@@ -674,11 +681,55 @@ def test_cv_step():
    print("yahoo!")


def test_log_training_metric_ts_models():
    """Test that log_training_metric=True works with time series models (arima, sarimax, holt-winters)."""
    import statsmodels.api as sm

    from flaml.automl.task.time_series_task import TimeSeriesTask

    estimators_all = TimeSeriesTask("forecast").estimators.keys()
    estimators_to_test = ["xgboost", "arima", "lassolars", "tcn", "snaive", "prophet", "orbit"]
    estimators = [
        est for est in estimators_to_test if est in estimators_all
    ]  # not all estimators available in current python env
    print(f"Testing estimators: {estimators}")

    # Prepare data
    data = sm.datasets.co2.load_pandas().data["co2"]
    data = data.resample("MS").mean()
    data = data.bfill().ffill()
    data = data.to_frame().reset_index()
    data = data.rename(columns={"index": "ds", "co2": "y"})
    num_samples = data.shape[0]
    time_horizon = 12
    split_idx = num_samples - time_horizon
    df = data[:split_idx]

    # Test each time series model with log_training_metric=True
    for estimator in estimators:
        print(f"\nTesting {estimator} with log_training_metric=True")
        automl = AutoML()
        settings = {
            "time_budget": 3,
            "metric": "mape",
            "task": "forecast",
            "eval_method": "holdout",
            "label": "y",
            "log_training_metric": True,  # This should not cause errors
            "estimator_list": [estimator],
        }
        automl.fit(dataframe=df, **settings, period=time_horizon, force_cancel=True)
        print(f"  ✅ {estimator} SUCCESS with log_training_metric=True")
        if automl.best_estimator:
            assert automl.best_estimator == estimator


if __name__ == "__main__":
    # test_forecast_automl(60)
    # test_multivariate_forecast_num(5)
    # test_multivariate_forecast_cat(5)
    test_numpy()
    # test_numpy()
    # test_forecast_classification(5)
    # test_forecast_panel(5)
    # test_cv_step()
    test_log_training_metric_ts_models()
51  test/automl/test_max_iter_1.py  Normal file
@@ -0,0 +1,51 @@
import mlflow
import numpy as np
import pandas as pd

from flaml import AutoML


def test_max_iter_1():
    date_rng = pd.date_range(start="2024-01-01", periods=100, freq="H")
    X = pd.DataFrame({"ds": date_rng})
    y_train_24h = np.random.rand(len(X)) * 100

    # AutoML
    settings = {
        "max_iter": 1,
        "estimator_list": ["xgboost", "lgbm"],
        "starting_points": {"xgboost": {}, "lgbm": {}},
        "task": "ts_forecast",
        "log_file_name": "test_max_iter_1.log",
        "seed": 41,
        "mlflow_exp_name": "TestExp-max_iter-1",
        "use_spark": False,
        "n_concurrent_trials": 1,
        "verbose": 1,
        "featurization": "off",
        "metric": "rmse",
        "mlflow_logging": True,
    }

    automl = AutoML(**settings)

    with mlflow.start_run(run_name="AutoMLModel-XGBoost-and-LGBM-max_iter_1"):
        automl.fit(
            X_train=X,
            y_train=y_train_24h,
            period=24,
            X_val=X,
            y_val=y_train_24h,
            split_ratio=0,
            force_cancel=False,
        )

    assert automl.model is not None, "AutoML failed to return a model"
    assert automl.best_run_id is not None, "Best run ID should not be None with mlflow logging"

    print("Best model:", automl.model)
    print("Best run ID:", automl.best_run_id)


if __name__ == "__main__":
    test_max_iter_1()
@@ -10,6 +10,18 @@ from flaml import AutoML


class TestMLFlowLoggingParam:
    def test_update_and_install_requirements(self):
        import mlflow
        from sklearn import tree

        from flaml.fabric.mlflow import update_and_install_requirements

        with mlflow.start_run(run_name="test") as run:
            sk_model = tree.DecisionTreeClassifier()
            mlflow.sklearn.log_model(sk_model, "model", registered_model_name="test")

        update_and_install_requirements(run_id=run.info.run_id)

    def test_should_start_new_run_by_default(self, automl_settings):
        with mlflow.start_run() as parent_run:
            automl = AutoML()
@@ -143,4 +143,5 @@ def test_prep():


if __name__ == "__main__":
    test_lrl2()
    test_prep()
@@ -181,6 +181,49 @@ class TestMultiClass(unittest.TestCase):
        }
        automl.fit(X_train=X_train, y_train=y_train, **settings)

    def test_ensemble_final_estimator_params_not_tuned(self):
        """Test that final_estimator parameters in ensemble are not automatically tuned.

        This test verifies that when a custom final_estimator is provided with specific
        parameters, those parameters are used as-is without any hyperparameter tuning.
        """
        from sklearn.linear_model import LogisticRegression

        automl = AutoML()
        X_train, y_train = load_wine(return_X_y=True)

        # Create a LogisticRegression with specific non-default parameters
        custom_params = {
            "C": 0.5,  # Non-default value
            "max_iter": 50,  # Non-default value
            "random_state": 42,
        }
        final_est = LogisticRegression(**custom_params)

        settings = {
            "time_budget": 5,
            "estimator_list": ["rf", "lgbm"],
            "task": "classification",
            "ensemble": {
                "final_estimator": final_est,
                "passthrough": False,
            },
            "n_jobs": 1,
        }
        automl.fit(X_train=X_train, y_train=y_train, **settings)

        # Verify that the final estimator in the stacker uses the exact parameters we specified
        if hasattr(automl.model, "final_estimator_"):
            # The model is a StackingClassifier
            fitted_final_estimator = automl.model.final_estimator_
            assert (
                abs(fitted_final_estimator.C - custom_params["C"]) < 1e-9
            ), f"Expected C={custom_params['C']}, but got {fitted_final_estimator.C}"
            assert (
                fitted_final_estimator.max_iter == custom_params["max_iter"]
            ), f"Expected max_iter={custom_params['max_iter']}, but got {fitted_final_estimator.max_iter}"
            print("✓ Final estimator parameters were preserved (not tuned)")

    def test_dataframe(self):
        self.test_classification(True)

@@ -235,6 +278,34 @@ class TestMultiClass(unittest.TestCase):
        except ImportError:
            pass

    def test_invalid_custom_metric(self):
        """Test that proper error is raised when custom_metric is called instead of passed."""
        from sklearn.datasets import load_iris

        X_train, y_train = load_iris(return_X_y=True)

        # Test with non-callable metric in __init__
        with self.assertRaises(ValueError) as context:
            automl = AutoML(metric=123)  # passing an int instead of function
        self.assertIn("must be either a string or a callable function", str(context.exception))
        self.assertIn("but got int", str(context.exception))

        # Test with non-callable metric in fit
        automl = AutoML()
        with self.assertRaises(ValueError) as context:
            automl.fit(X_train=X_train, y_train=y_train, metric=[], task="classification", time_budget=1)
        self.assertIn("must be either a string or a callable function", str(context.exception))
        self.assertIn("but got list", str(context.exception))

        # Test with tuple (simulating result of calling a function that returns tuple)
        with self.assertRaises(ValueError) as context:
            automl = AutoML()
            automl.fit(
                X_train=X_train, y_train=y_train, metric=(0.5, {"loss": 0.5}), task="classification", time_budget=1
            )
        self.assertIn("must be either a string or a callable function", str(context.exception))
        self.assertIn("but got tuple", str(context.exception))

    def test_classification(self, as_frame=False):
        automl_experiment = AutoML()
        automl_settings = {

@@ -368,7 +439,11 @@ class TestMultiClass(unittest.TestCase):
            "n_jobs": 1,
            "model_history": True,
        }
        X_train = scipy.sparse.random(1554, 21, dtype=int)
        # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably
        # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger
        # integer sampling paths which raise "low is out of bounds for int32".
        # A float sparse matrix is sufficient to validate sparse-input support.
        X_train = scipy.sparse.random(1554, 21, dtype=np.float32)
        y_train = np.random.randint(3, size=1554)
        automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)
        print(automl_experiment.classes_)

@@ -531,6 +606,32 @@ class TestMultiClass(unittest.TestCase):
        print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
        # print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))

    def test_starting_points_should_improve_performance(self):
        N = 10000  # a large N is needed to see the improvement
        X_train, y_train = load_iris(return_X_y=True)
        X_train = np.concatenate([X_train + 0.1 * i for i in range(N)], axis=0)
        y_train = np.concatenate([y_train] * N, axis=0)

        am1 = AutoML()
        am1.fit(X_train, y_train, estimator_list=["lgbm"], time_budget=3, seed=11)

        am2 = AutoML()
        am2.fit(
            X_train,
            y_train,
            estimator_list=["lgbm"],
            time_budget=2,
            seed=11,
            starting_points=am1.best_config_per_estimator,
        )

        print(f"am1.best_loss: {am1.best_loss:.4f}")
        print(f"am2.best_loss: {am2.best_loss:.4f}")

        assert np.round(am2.best_loss, 4) <= np.round(
            am1.best_loss, 4
        ), "Starting points should help improve the performance!"


if __name__ == "__main__":
    unittest.main()
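The ensemble test above relies on scikit-learn's stacking behavior: the `final_estimator` passed in is cloned and fitted with exactly the parameters it was constructed with. A small sketch of that behavior in plain scikit-learn (estimator choices are illustrative):

```python
# Sketch: a StackingClassifier keeps the final_estimator's own parameters.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
final_est = LogisticRegression(C=0.5, max_iter=50, random_state=42)
stacker = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=10, random_state=0))],
    final_estimator=final_est,
    passthrough=False,
)
stacker.fit(X, y)
# The fitted clone preserves the constructor parameters -- nothing tuned them.
assert stacker.final_estimator_.C == 0.5 and stacker.final_estimator_.max_iter == 50
```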
272  test/automl/test_no_overlap.py  Normal file
@@ -0,0 +1,272 @@
"""Test to ensure correct label overlap handling for classification tasks"""
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris, make_classification

from flaml import AutoML


def test_allow_label_overlap_true():
    """Test with allow_label_overlap=True (fast mode, default)"""
    # Load iris dataset
    dic_data = load_iris(as_frame=True)
    iris_data = dic_data["frame"]

    # Prepare data
    x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy()
    y_train = iris_data["target"]

    # Train with fast mode (default)
    automl = AutoML()
    automl_settings = {
        "max_iter": 5,
        "metric": "accuracy",
        "task": "classification",
        "estimator_list": ["lgbm"],
        "eval_method": "holdout",
        "split_type": "stratified",
        "keep_search_state": True,
        "retrain_full": False,
        "auto_augment": False,
        "verbose": 0,
        "allow_label_overlap": True,  # Fast mode
    }
    automl.fit(x_train, y_train, **automl_settings)

    # Check results
    input_size = len(x_train)
    train_size = len(automl._state.X_train)
    val_size = len(automl._state.X_val)

    # With stratified split on balanced data, fast mode may have no overlap
    assert (
        train_size + val_size >= input_size
    ), f"Inconsistent sizes. Input: {input_size}, Train: {train_size}, Val: {val_size}"

    # Verify all classes are represented in both sets
    train_labels = set(np.unique(automl._state.y_train))
    val_labels = set(np.unique(automl._state.y_val))
    all_labels = set(np.unique(y_train))

    assert train_labels == all_labels, f"Not all labels in train. All: {all_labels}, Train: {train_labels}"
    assert val_labels == all_labels, f"Not all labels in val. All: {all_labels}, Val: {val_labels}"

    print(
        f"✓ Test passed (fast mode): Input: {input_size}, Train: {train_size}, Val: {val_size}, "
        f"Overlap: {train_size + val_size - input_size}"
    )


def test_allow_label_overlap_false():
    """Test with allow_label_overlap=False (precise mode)"""
    # Load iris dataset
    dic_data = load_iris(as_frame=True)
    iris_data = dic_data["frame"]

    # Prepare data
    x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy()
    y_train = iris_data["target"]

    # Train with precise mode
    automl = AutoML()
    automl_settings = {
        "max_iter": 5,
        "metric": "accuracy",
        "task": "classification",
        "estimator_list": ["lgbm"],
        "eval_method": "holdout",
        "split_type": "stratified",
        "keep_search_state": True,
        "retrain_full": False,
        "auto_augment": False,
        "verbose": 0,
        "allow_label_overlap": False,  # Precise mode
    }
    automl.fit(x_train, y_train, **automl_settings)

    # Check that there's no overlap (or minimal overlap for single-instance classes)
    input_size = len(x_train)
    train_size = len(automl._state.X_train)
    val_size = len(automl._state.X_val)

    # Verify all classes are represented
    all_labels = set(np.unique(y_train))

    # Should have no overlap or minimal overlap
    overlap = train_size + val_size - input_size
    assert overlap <= len(all_labels), f"Excessive overlap: {overlap}"

    # Verify all classes are represented
    train_labels = set(np.unique(automl._state.y_train))
    val_labels = set(np.unique(automl._state.y_val))

    combined_labels = train_labels.union(val_labels)
    assert combined_labels == all_labels, f"Not all labels present. All: {all_labels}, Combined: {combined_labels}"

    print(
        f"✓ Test passed (precise mode): Input: {input_size}, Train: {train_size}, Val: {val_size}, "
        f"Overlap: {overlap}"
    )


def test_uniform_split_with_overlap_control():
    """Test with uniform split and both overlap modes"""
    # Load iris dataset
    dic_data = load_iris(as_frame=True)
    iris_data = dic_data["frame"]

    # Prepare data
    x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy()
    y_train = iris_data["target"]

    # Test precise mode with uniform split
    automl = AutoML()
    automl_settings = {
        "max_iter": 5,
        "metric": "accuracy",
        "task": "classification",
        "estimator_list": ["lgbm"],
        "eval_method": "holdout",
        "split_type": "uniform",
        "keep_search_state": True,
        "retrain_full": False,
        "auto_augment": False,
        "verbose": 0,
        "allow_label_overlap": False,  # Precise mode
    }
    automl.fit(x_train, y_train, **automl_settings)

    input_size = len(x_train)
    train_size = len(automl._state.X_train)
    val_size = len(automl._state.X_val)

    # Verify all classes are represented
    train_labels = set(np.unique(automl._state.y_train))
    val_labels = set(np.unique(automl._state.y_val))
    all_labels = set(np.unique(y_train))

    combined_labels = train_labels.union(val_labels)
    assert combined_labels == all_labels, "Not all labels present with uniform split"

    print(f"✓ Test passed (uniform split): Input: {input_size}, Train: {train_size}, Val: {val_size}")


def test_with_sample_weights():
    """Test label overlap handling with sample weights"""
    # Create a simple dataset
    X, y = make_classification(
        n_samples=200,
        n_features=10,
        n_informative=5,
        n_redundant=2,
        n_classes=3,
        n_clusters_per_class=1,
        random_state=42,
    )

    # Create sample weights (giving more weight to some samples)
    sample_weight = np.random.uniform(0.5, 2.0, size=len(y))

    # Test fast mode with sample weights
    automl_fast = AutoML()
    automl_fast.fit(
        X,
        y,
        task="classification",
        metric="accuracy",
        estimator_list=["lgbm"],
        eval_method="holdout",
        split_type="stratified",
        max_iter=3,
        keep_search_state=True,
        retrain_full=False,
        auto_augment=False,
        verbose=0,
        allow_label_overlap=True,  # Fast mode
        sample_weight=sample_weight,
    )

    # Verify all labels present
    train_labels_fast = set(np.unique(automl_fast._state.y_train))
    val_labels_fast = set(np.unique(automl_fast._state.y_val))
    all_labels = set(np.unique(y))

    assert train_labels_fast == all_labels, "Not all labels in train (fast mode with weights)"
    assert val_labels_fast == all_labels, "Not all labels in val (fast mode with weights)"

    # Test precise mode with sample weights
    automl_precise = AutoML()
    automl_precise.fit(
        X,
        y,
        task="classification",
        metric="accuracy",
        estimator_list=["lgbm"],
        eval_method="holdout",
        split_type="stratified",
        max_iter=3,
        keep_search_state=True,
        retrain_full=False,
        auto_augment=False,
        verbose=0,
        allow_label_overlap=False,  # Precise mode
        sample_weight=sample_weight,
    )

    # Verify all labels present
    train_labels_precise = set(np.unique(automl_precise._state.y_train))
    val_labels_precise = set(np.unique(automl_precise._state.y_val))

    combined_labels = train_labels_precise.union(val_labels_precise)
    assert combined_labels == all_labels, "Not all labels present (precise mode with weights)"

    print("✓ Test passed with sample weights (fast and precise modes)")


def test_single_instance_class():
    """Test handling of single-instance classes"""
    # Create imbalanced dataset where one class has only 1 instance
    X = np.random.randn(50, 4)
    y = np.array([0] * 40 + [1] * 9 + [2] * 1)  # Class 2 has only 1 instance

    # Test precise mode - should add single instance to both sets
    automl = AutoML()
    automl.fit(
        X,
        y,
        task="classification",
        metric="accuracy",
        estimator_list=["lgbm"],
        eval_method="holdout",
        split_type="uniform",
        max_iter=3,
        keep_search_state=True,
        retrain_full=False,
        auto_augment=False,
        verbose=0,
        allow_label_overlap=False,  # Precise mode
    )

    # Verify all labels present
    train_labels = set(np.unique(automl._state.y_train))
    val_labels = set(np.unique(automl._state.y_val))
    all_labels = set(np.unique(y))

    # Single-instance class should be in both sets
    combined_labels = train_labels.union(val_labels)
    assert combined_labels == all_labels, "Not all labels present with single-instance class"

    # Check that single-instance class (label 2) is in both sets
    assert 2 in train_labels, "Single-instance class not in train"
    assert 2 in val_labels, "Single-instance class not in val"

    print("✓ Test passed with single-instance class")


if __name__ == "__main__":
    test_allow_label_overlap_true()
    test_allow_label_overlap_false()
    test_uniform_split_with_overlap_control()
    test_with_sample_weights()
    test_single_instance_class()
    print("\n✓ All tests passed!")
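The invariant these tests check — every class present in both partitions — is what a stratified holdout guarantees whenever each class has at least two samples. A toy sketch of the same check with plain scikit-learn (FLAML's internal splitter is what actually enforces `allow_label_overlap`):

```python
# Sketch: a stratified split keeps every class in both partitions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(150, 4)
y = np.repeat([0, 1, 2], 50)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
assert set(np.unique(y_tr)) == set(np.unique(y_val)) == set(np.unique(y))
```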
@@ -1,8 +1,23 @@
import sys

import pytest
from minio.error import ServerError
from openml.exceptions import OpenMLServerException

try:
    from minio.error import ServerError
except ImportError:

    class ServerError(Exception):
        pass


try:
    from openml.exceptions import OpenMLServerException
except ImportError:

    class OpenMLServerException(Exception):
        pass


from requests.exceptions import ChunkedEncodingError, SSLError

@@ -64,6 +79,9 @@ def test_automl(budget=5, dataset_format="dataframe", hpo_method=None):
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    """ retrieve best config and best learner """
    print("Best ML learner:", automl.best_estimator)
    if not automl.best_estimator:
        print("Training budget is not sufficient")
        return
    print("Best hyperparameter config:", automl.best_config)
    print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
    print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
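The hunk above makes `minio` and `openml` optional by defining stand-in exception classes when the import fails, so later `except ServerError` clauses still resolve. The generic shape of that idiom (the module name is a hypothetical placeholder):

```python
# Generic optional-import fallback; "some_optional_dep" is a placeholder.
try:
    from some_optional_dep.errors import RemoteError
except ImportError:

    class RemoteError(Exception):
        """Stand-in so `except RemoteError` still compiles without the dep."""
```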
236  test/automl/test_preprocess_api.py  Normal file
@@ -0,0 +1,236 @@
"""Tests for the public preprocessor APIs."""
import unittest

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_diabetes

from flaml import AutoML


class TestPreprocessAPI(unittest.TestCase):
    """Test cases for the public preprocess() API methods."""

    def test_automl_preprocess_before_fit(self):
        """Test that calling preprocess before fit raises an error."""
        automl = AutoML()
        X_test = np.array([[1, 2, 3], [4, 5, 6]])

        with self.assertRaises(AttributeError) as context:
            automl.preprocess(X_test)
        # Check that an error is raised about not being fitted
        self.assertIn("fit()", str(context.exception))

    def test_automl_preprocess_classification(self):
        """Test task-level preprocessing for classification."""
        # Load dataset
        X, y = load_breast_cancer(return_X_y=True)
        X_train, y_train = X[:400], y[:400]
        X_test = X[400:450]

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "classification",
            "metric": "accuracy",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Test task-level preprocessing
        X_preprocessed = automl.preprocess(X_test)

        # Verify the output is not None and has the right shape
        self.assertIsNotNone(X_preprocessed)
        self.assertEqual(X_preprocessed.shape[0], X_test.shape[0])

    def test_automl_preprocess_regression(self):
        """Test task-level preprocessing for regression."""
        # Load dataset
        X, y = load_diabetes(return_X_y=True)
        X_train, y_train = X[:300], y[:300]
        X_test = X[300:350]

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "regression",
            "metric": "r2",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Test task-level preprocessing
        X_preprocessed = automl.preprocess(X_test)

        # Verify the output
        self.assertIsNotNone(X_preprocessed)
        self.assertEqual(X_preprocessed.shape[0], X_test.shape[0])

    def test_automl_preprocess_with_dataframe(self):
        """Test task-level preprocessing with pandas DataFrame."""
        # Create a simple dataset
        X_train = pd.DataFrame(
            {
                "feature1": [1, 2, 3, 4, 5] * 20,
                "feature2": [5, 4, 3, 2, 1] * 20,
                "category": ["a", "b", "a", "b", "a"] * 20,
            }
        )
        y_train = pd.Series([0, 1, 0, 1, 0] * 20)

        X_test = pd.DataFrame(
            {
                "feature1": [6, 7, 8],
                "feature2": [1, 2, 3],
                "category": ["a", "b", "a"],
            }
        )

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "classification",
            "metric": "accuracy",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Test preprocessing
        X_preprocessed = automl.preprocess(X_test)

        # Verify the output - check the number of rows matches
        self.assertIsNotNone(X_preprocessed)
        preprocessed_len = len(X_preprocessed) if hasattr(X_preprocessed, "__len__") else X_preprocessed.shape[0]
        self.assertEqual(preprocessed_len, len(X_test))

    def test_estimator_preprocess(self):
        """Test estimator-level preprocessing."""
        # Load dataset
        X, y = load_breast_cancer(return_X_y=True)
        X_train, y_train = X[:400], y[:400]
        X_test = X[400:450]

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "classification",
            "metric": "accuracy",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Get the trained estimator
        estimator = automl.model
        self.assertIsNotNone(estimator)

        # First apply task-level preprocessing
        X_task_preprocessed = automl.preprocess(X_test)

        # Then apply estimator-level preprocessing
        X_estimator_preprocessed = estimator.preprocess(X_task_preprocessed)

        # Verify the output
        self.assertIsNotNone(X_estimator_preprocessed)
        self.assertEqual(X_estimator_preprocessed.shape[0], X_test.shape[0])

    def test_preprocess_pipeline(self):
        """Test the complete preprocessing pipeline (task-level then estimator-level)."""
        # Load dataset
        X, y = load_breast_cancer(return_X_y=True)
        X_train, y_train = X[:400], y[:400]
        X_test = X[400:450]

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "classification",
            "metric": "accuracy",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Apply the complete preprocessing pipeline
        X_task_preprocessed = automl.preprocess(X_test)
        X_final = automl.model.preprocess(X_task_preprocessed)

        # Verify predictions work with preprocessed data
        # The internal predict already does this preprocessing,
        # but we verify our manual preprocessing gives consistent results
        y_pred_manual = automl.model._model.predict(X_final)
        y_pred_auto = automl.predict(X_test)

        # Both should give the same predictions
        np.testing.assert_array_equal(y_pred_manual, y_pred_auto)

    def test_preprocess_with_mixed_types(self):
        """Test preprocessing with mixed data types."""
        # Create dataset with mixed types
        X_train = pd.DataFrame(
            {
                "numeric1": np.random.rand(100),
                "numeric2": np.random.randint(0, 100, 100),
                "categorical": np.random.choice(["cat", "dog", "bird"], 100),
                "boolean": np.random.choice([True, False], 100),
            }
        )
        y_train = pd.Series(np.random.randint(0, 2, 100))

        X_test = pd.DataFrame(
            {
                "numeric1": np.random.rand(10),
                "numeric2": np.random.randint(0, 100, 10),
                "categorical": np.random.choice(["cat", "dog", "bird"], 10),
                "boolean": np.random.choice([True, False], 10),
            }
        )

        # Train AutoML
        automl = AutoML()
        automl_settings = {
            "max_iter": 5,
            "task": "classification",
            "metric": "accuracy",
            "estimator_list": ["lgbm"],
            "verbose": 0,
        }
        automl.fit(X_train, y_train, **automl_settings)

        # Test preprocessing
        X_preprocessed = automl.preprocess(X_test)

        # Verify the output
        self.assertIsNotNone(X_preprocessed)

    def test_estimator_preprocess_without_automl(self):
        """Test that estimator.preprocess() can be used independently."""
        from flaml.automl.model import LGBMEstimator

        # Create a simple estimator
        X_train = np.random.rand(100, 5)
        y_train = np.random.randint(0, 2, 100)

        estimator = LGBMEstimator(task="classification")
        estimator.fit(X_train, y_train)

        # Test preprocessing
        X_test = np.random.rand(10, 5)
        X_preprocessed = estimator.preprocess(X_test)

        # Verify the output
        self.assertIsNotNone(X_preprocessed)
        self.assertEqual(X_preprocessed.shape, X_test.shape)


if __name__ == "__main__":
    unittest.main()
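Condensing the pipeline test above: task-level preprocessing (`automl.preprocess`) runs first, then the chosen estimator's own preprocessing. A short usage sketch with random data — shapes and fit settings are illustrative:

```python
# Sketch: the two-stage preprocessing exercised by the tests above.
import numpy as np

from flaml import AutoML

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)
automl = AutoML()
automl.fit(X, y, task="classification", estimator_list=["lgbm"], max_iter=3, verbose=0)

X_new = np.random.rand(5, 4)
X_task = automl.preprocess(X_new)  # task-level transforms
X_final = automl.model.preprocess(X_task)  # estimator-level transforms
```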
@@ -38,7 +38,7 @@ class TestLogging(unittest.TestCase):
            "keep_search_state": True,
            "learner_selector": "roundrobin",
        }
        X_train, y_train = fetch_california_housing(return_X_y=True)
        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
        n = len(y_train) >> 1
        print(automl.model, automl.classes_, automl.predict(X_train))
        automl.fit(
@@ -47,7 +47,7 @@ class TestRegression(unittest.TestCase):
            "n_jobs": 1,
            "model_history": True,
        }
        X_train, y_train = fetch_california_housing(return_X_y=True)
        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
        n = int(len(y_train) * 9 // 10)
        automl.fit(X_train=X_train[:n], y_train=y_train[:n], X_val=X_train[n:], y_val=y_train[n:], **automl_settings)
        assert automl._state.eval_method == "holdout"

@@ -130,7 +130,7 @@ class TestRegression(unittest.TestCase):
        )
        automl.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **settings)

    def test_parallel(self, hpo_method=None):
    def test_parallel_and_pickle(self, hpo_method=None):
        automl_experiment = AutoML()
        automl_settings = {
            "time_budget": 10,

@@ -141,7 +141,7 @@ class TestRegression(unittest.TestCase):
            "n_concurrent_trials": 10,
            "hpo_method": hpo_method,
        }
        X_train, y_train = fetch_california_housing(return_X_y=True)
        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
        try:
            automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)
            print(automl_experiment.predict(X_train))

@@ -153,6 +153,18 @@ class TestRegression(unittest.TestCase):
        except ImportError:
            return

        # test pickle and load_pickle, should work for prediction
        automl_experiment.pickle("automl_xgboost_spark.pkl")
        automl_loaded = AutoML().load_pickle("automl_xgboost_spark.pkl")
        assert automl_loaded.best_estimator == automl_experiment.best_estimator
        assert automl_loaded.best_loss == automl_experiment.best_loss
        automl_loaded.predict(X_train)

        import shutil

        shutil.rmtree("automl_xgboost_spark.pkl", ignore_errors=True)
        shutil.rmtree("automl_xgboost_spark.pkl.flaml_artifacts", ignore_errors=True)

    def test_sparse_matrix_regression_holdout(self):
        X_train = scipy.sparse.random(8, 100)
        y_train = np.random.uniform(size=8)

@@ -268,7 +280,7 @@ def test_reproducibility_of_regression_models(estimator: str):
        "skip_transform": True,
        "retrain_full": True,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X, y = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None

@@ -314,7 +326,7 @@ def test_reproducibility_of_catboost_regression_model():
        "skip_transform": True,
        "retrain_full": True,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X, y = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None

@@ -360,7 +372,7 @@ def test_reproducibility_of_lgbm_regression_model():
        "skip_transform": True,
        "retrain_full": True,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X, y = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None

@@ -424,7 +436,7 @@ def test_reproducibility_of_underlying_regression_models(estimator: str):
        "skip_transform": True,
        "retrain_full": False,
    }
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X, y = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
    automl.fit(X_train=X, y_train=y, **automl_settings)
    best_model = automl.model
    assert best_model is not None
@@ -142,7 +142,7 @@ class TestScore:
    def test_regression(self):
        automl_experiment = AutoML()

        X_train, y_train = fetch_california_housing(return_X_y=True)
        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
        n = int(len(y_train) * 9 // 10)

        for each_estimator in [
89  test/automl/test_sklearn_17_compat.py  Normal file
@@ -0,0 +1,89 @@
"""Test sklearn 1.7+ compatibility for estimator type detection.

This test ensures that FLAML estimators are properly recognized as
regressors or classifiers by sklearn's is_regressor() and is_classifier()
functions, which is required for sklearn 1.7+ ensemble methods.
"""

import pytest
from sklearn.base import is_classifier, is_regressor

from flaml.automl.model import (
    ExtraTreesEstimator,
    LGBMEstimator,
    RandomForestEstimator,
    XGBoostSklearnEstimator,
)


def test_extra_trees_regressor_type():
    """Test that ExtraTreesEstimator with regression task is recognized as regressor."""
    est = ExtraTreesEstimator(task="regression")
    assert is_regressor(est), "ExtraTreesEstimator(task='regression') should be recognized as a regressor"
    assert not is_classifier(est), "ExtraTreesEstimator(task='regression') should not be recognized as a classifier"


def test_extra_trees_classifier_type():
    """Test that ExtraTreesEstimator with classification task is recognized as classifier."""
    est = ExtraTreesEstimator(task="binary")
    assert is_classifier(est), "ExtraTreesEstimator(task='binary') should be recognized as a classifier"
    assert not is_regressor(est), "ExtraTreesEstimator(task='binary') should not be recognized as a regressor"

    est = ExtraTreesEstimator(task="multiclass")
    assert is_classifier(est), "ExtraTreesEstimator(task='multiclass') should be recognized as a classifier"
    assert not is_regressor(est), "ExtraTreesEstimator(task='multiclass') should not be recognized as a regressor"


def test_random_forest_regressor_type():
    """Test that RandomForestEstimator with regression task is recognized as regressor."""
    est = RandomForestEstimator(task="regression")
    assert is_regressor(est), "RandomForestEstimator(task='regression') should be recognized as a regressor"
    assert not is_classifier(est), "RandomForestEstimator(task='regression') should not be recognized as a classifier"


def test_random_forest_classifier_type():
    """Test that RandomForestEstimator with classification task is recognized as classifier."""
    est = RandomForestEstimator(task="binary")
    assert is_classifier(est), "RandomForestEstimator(task='binary') should be recognized as a classifier"
    assert not is_regressor(est), "RandomForestEstimator(task='binary') should not be recognized as a regressor"


def test_lgbm_regressor_type():
    """Test that LGBMEstimator with regression task is recognized as regressor."""
    est = LGBMEstimator(task="regression")
    assert is_regressor(est), "LGBMEstimator(task='regression') should be recognized as a regressor"
    assert not is_classifier(est), "LGBMEstimator(task='regression') should not be recognized as a classifier"


def test_lgbm_classifier_type():
    """Test that LGBMEstimator with classification task is recognized as classifier."""
    est = LGBMEstimator(task="binary")
    assert is_classifier(est), "LGBMEstimator(task='binary') should be recognized as a classifier"
    assert not is_regressor(est), "LGBMEstimator(task='binary') should not be recognized as a regressor"


def test_xgboost_regressor_type():
    """Test that XGBoostSklearnEstimator with regression task is recognized as regressor."""
    est = XGBoostSklearnEstimator(task="regression")
    assert is_regressor(est), "XGBoostSklearnEstimator(task='regression') should be recognized as a regressor"
    assert not is_classifier(est), "XGBoostSklearnEstimator(task='regression') should not be recognized as a classifier"


def test_xgboost_classifier_type():
    """Test that XGBoostSklearnEstimator with classification task is recognized as classifier."""
    est = XGBoostSklearnEstimator(task="binary")
    assert is_classifier(est), "XGBoostSklearnEstimator(task='binary') should be recognized as a classifier"
    assert not is_regressor(est), "XGBoostSklearnEstimator(task='binary') should not be recognized as a regressor"


if __name__ == "__main__":
    # Run all tests
    test_extra_trees_regressor_type()
    test_extra_trees_classifier_type()
    test_random_forest_regressor_type()
    test_random_forest_classifier_type()
    test_lgbm_regressor_type()
    test_lgbm_classifier_type()
    test_xgboost_regressor_type()
    test_xgboost_classifier_type()
    print("All sklearn 1.7+ compatibility tests passed!")
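`is_regressor`/`is_classifier` decide the type from the estimator's tags, which `RegressorMixin`/`ClassifierMixin` contribute in recent scikit-learn. A minimal sketch of an estimator that passes the check — not FLAML's actual implementation:

```python
# Sketch: the mixin is what makes is_regressor() return True.
from sklearn.base import BaseEstimator, RegressorMixin, is_regressor


class ToyRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y):
        return self


print(is_regressor(ToyRegressor()))  # True
```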
@@ -1,4 +1,6 @@
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml, load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold, KFold, train_test_split

@@ -48,7 +50,7 @@ def test_time():
    _test(split_type="time")


def test_groups():
def test_groups_for_classification_task():
    from sklearn.externals._arff import ArffException

    try:

@@ -58,8 +60,6 @@ def test_groups():

    X, y = load_wine(return_X_y=True)

    import numpy as np

    automl = AutoML()
    automl_settings = {
        "time_budget": 2,

@@ -68,7 +68,7 @@ def test_groups():
        "model_history": True,
        "eval_method": "cv",
        "groups": np.random.randint(low=0, high=10, size=len(y)),
        "estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
        "estimator_list": ["catboost", "lgbm", "rf", "xgboost", "kneighbor"],
        "learner_selector": "roundrobin",
    }
    automl.fit(X, y, **automl_settings)

@@ -88,6 +88,72 @@ def test_groups():
    automl.fit(X, y, **automl_settings)


def test_groups_for_regression_task():
    """Append nonsensical groups to iris dataset and use it to test that GroupKFold works for regression tasks"""
    iris_dict_data = load_iris(as_frame=True)  # numpy arrays
    iris_data = iris_dict_data["frame"]  # pandas dataframe data + target

    rng = np.random.default_rng(42)
    iris_data["cluster"] = rng.integers(
        low=0, high=5, size=iris_data.shape[0]
    )  # np.random.randint(0, 5, iris_data.shape[0])

    automl = AutoML()
    X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
    y = iris_data["petal width (cm)"]
    X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
        X, y, iris_data["cluster"], random_state=42
    )
    automl_settings = {
        "max_iter": 5,
        "time_budget": -1,
        "metric": "r2",
        "task": "regression",
        "estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
        "eval_method": "cv",
        "split_type": "uniform",
        "groups": groups_train,
    }
    automl.fit(X_train, y_train, **automl_settings)


def test_groups_with_sample_weights():
    """Verifies that sample weights can be used with group splits i.e. that https://github.com/microsoft/FLAML/issues/1396 remains fixed"""
    iris_dict_data = load_iris(as_frame=True)  # numpy arrays
    iris_data = iris_dict_data["frame"]  # pandas dataframe data + target
    iris_data["cluster"] = np.random.randint(0, 5, iris_data.shape[0])
    automl = AutoML()

    X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
    y = iris_data["petal width (cm)"]
    sample_weight = pd.Series(np.random.rand(X.shape[0]))
    (
        X_train,
        X_test,
        y_train,
        y_test,
        groups_train,
        groups_test,
        sample_weight_train,
        sample_weight_test,
    ) = train_test_split(X, y, iris_data["cluster"], sample_weight, random_state=42)
    automl_settings = {
        "max_iter": 5,
        "time_budget": -1,
        "metric": "r2",
        "task": "regression",
        "log_file_name": "error.log",
        "log_type": "all",
        "estimator_list": ["lgbm"],
        "eval_method": "cv",
        "split_type": "group",
        "groups": groups_train,
        "sample_weight": sample_weight_train,
    }
    automl.fit(X_train, y_train, **automl_settings)
    assert automl.model is not None


def test_stratified_groupkfold():
    from minio.error import ServerError
    from sklearn.model_selection import StratifiedGroupKFold

@@ -108,6 +174,7 @@ def test_stratified_groupkfold():
        "split_type": splitter,
        "groups": X_train["Airline"],
        "estimator_list": [
            "catboost",
            "lgbm",
            "rf",
            "xgboost",

@@ -203,4 +270,4 @@ def test_object():


if __name__ == "__main__":
    test_groups()
    test_groups_for_classification_task()
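The group-split tests hinge on scikit-learn's `GroupKFold`: rows sharing a group id never straddle a train/validation boundary. A toy sketch of that guarantee:

```python
# Sketch: GroupKFold keeps each group entirely on one side of the split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```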
@@ -30,7 +30,7 @@ class TestTrainingLog(unittest.TestCase):
            "keep_search_state": True,
            "estimator_list": estimator_list,
        }
        X_train, y_train = fetch_california_housing(return_X_y=True)
        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
        automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
        # Check if the training log file is populated.
        self.assertTrue(os.path.exists(filename))
@@ -108,7 +108,14 @@ class TestWarmStart(unittest.TestCase):

    def test_FLAML_sample_size_in_starting_points(self):
        from minio.error import ServerError
        from openml.exceptions import OpenMLServerException

        try:
            from openml.exceptions import OpenMLServerException
        except ImportError:

            class OpenMLServerException(Exception):
                pass

        from requests.exceptions import ChunkedEncodingError, SSLError

        from flaml import AutoML
BIN  test/cal_housing_py3.pkz  Normal file
Binary file not shown.
60  test/check_dependency.py  Normal file
@@ -0,0 +1,60 @@
import subprocess
from importlib.metadata import distributions

installed_libs = sorted(f"{dist.metadata['Name']}=={dist.version}" for dist in distributions())

first_tier_dependencies = [
    "numpy",
    "jupyter",
    "lightgbm",
    "xgboost",
    "scipy",
    "pandas",
    "scikit-learn",
    "thop",
    "pytest",
    "pytest-rerunfailures",
    "coverage",
    "pre-commit",
    "torch",
    "torchvision",
    "catboost",
    "rgf-python",
    "optuna",
    "openml",
    "statsmodels",
    "psutil",
    "dataclasses",
    "transformers[torch]",
    "transformers",
    "datasets",
    "evaluate",
    "nltk",
    "rouge_score",
    "hcrystalball",
    "seqeval",
    "pytorch-forecasting",
    "mlflow-skinny",
    "joblibspark",
    "joblib",
    "nbconvert",
    "nbformat",
    "ipykernel",
    "pytorch-lightning",
    "tensorboardX",
    "requests",
    "packaging",
    "dill",
    "ray",
    "prophet",
]


for lib in installed_libs:
    lib_name = lib.split("==")[0]
    if lib_name in first_tier_dependencies:
        print(lib)

# print current commit hash
commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("utf-8").strip()
print(f"Current commit hash: {commit_hash}")
@@ -2,11 +2,24 @@ from typing import Any, Dict, List, Union

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
import pytest
from sklearn.metrics import f1_score, r2_score

try:
    from catboost import CatBoostClassifier, CatBoostRegressor, Pool
except ImportError:  # pragma: no cover
    CatBoostClassifier = None
    CatBoostRegressor = None
    Pool = None

def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model: Any, task: str) -> pd.DataFrame:

def _is_catboost_model_type(model_type: type) -> bool:
    if CatBoostClassifier is not None and CatBoostRegressor is not None:
        return model_type is CatBoostClassifier or model_type is CatBoostRegressor
    return getattr(model_type, "__module__", "").startswith("catboost")


def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model: Any, task: str) -> List[float]:
    """Mimic the FLAML CV process to calculate the metrics across each fold.

    :param X_train_all: X training data

@@ -17,7 +30,7 @@ def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model:
    :return: An array containing the metrics
    """
    rng = np.random.RandomState(2020)
    all_fold_metrics: List[Dict[str, Union[int, float]]] = []
    all_fold_metrics: List[float] = []
    for train_index, val_index in kf.split(X_train_all, y_train_all):
        X_train_split, y_train_split = X_train_all, y_train_all
        train_index = rng.permutation(train_index)

@@ -25,9 +38,11 @@ def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model:
        X_val = X_train_split.iloc[val_index]
        y_train, y_val = y_train_split[train_index], y_train_split[val_index]
        model_type = type(model)
        if model_type is not CatBoostClassifier and model_type is not CatBoostRegressor:
        if not _is_catboost_model_type(model_type):
            model.fit(X_train, y_train)
        else:
            if Pool is None:
                pytest.skip("catboost is not installed")
            use_best_model = True
            n = max(int(len(y_train) * 0.9), len(y_train) - 1000) if use_best_model else len(y_train)
            X_tr, y_tr = (X_train)[:n], y_train[:n]

@@ -38,5 +53,5 @@ def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model:
            reproduced_metric = 1 - f1_score(y_val, y_pred_classes)
        else:
            reproduced_metric = 1 - r2_score(y_val, y_pred_classes)
        all_fold_metrics.append(reproduced_metric)
        all_fold_metrics.append(float(reproduced_metric))
    return all_fold_metrics
@@ -60,7 +60,7 @@ def test_housing(as_frame=True):
|
||||
"starting_points": "data",
|
||||
"max_iter": 0,
|
||||
}
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=as_frame)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=as_frame, data_home="test")
|
||||
automl.fit(X_train, y_train, **automl_settings)
|
||||
|
||||
|
||||
@@ -115,7 +115,7 @@ def test_suggest_classification():
|
||||
|
||||
def test_suggest_regression():
|
||||
location = "test/default"
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
|
||||
suggested = suggest_hyperparams("regression", X_train, y_train, "lgbm", location=location)
|
||||
print(suggested)
|
||||
suggested = preprocess_and_suggest_hyperparams("regression", X_train, y_train, "xgboost", location=location)
|
||||
@@ -137,7 +137,7 @@ def test_rf():
|
||||
print(rf)
|
||||
|
||||
location = "test/default"
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
|
||||
rf = RandomForestRegressor(default_location=location)
|
||||
rf.fit(X_train[:100], y_train[:100])
|
||||
rf.predict(X_train)
|
||||
@@ -155,7 +155,7 @@ def test_extratrees():
|
||||
print(classifier)
|
||||
|
||||
location = "test/default"
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
|
||||
regressor = ExtraTreesRegressor(default_location=location)
|
||||
regressor.fit(X_train[:100], y_train[:100])
|
||||
regressor.predict(X_train)
|
||||
@@ -175,7 +175,7 @@ def test_lgbm():
|
||||
print(classifier.classes_)
|
||||
|
||||
location = "test/default"
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
|
||||
regressor = LGBMRegressor(default_location=location)
|
||||
regressor.fit(X_train, y_train)
|
||||
regressor.predict(X_train)
|
||||
@@ -183,6 +183,8 @@ def test_lgbm():
|
||||
|
||||
|
||||
def test_xgboost():
|
||||
import numpy as np
|
||||
|
||||
from flaml.default import XGBClassifier, XGBRegressor
|
||||
|
||||
X_train, y_train = load_breast_cancer(return_X_y=True, as_frame=True)
|
||||
@@ -194,12 +196,71 @@ def test_xgboost():
|
||||
print(classifier.classes_)
|
||||
|
||||
location = "test/default"
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True)
|
||||
X_train, y_train = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
|
||||
regressor = XGBRegressor(default_location=location)
|
||||
regressor.fit(X_train[:100], y_train[:100])
|
||||
regressor.predict(X_train)
|
||||
print(regressor)
|
||||
|
||||
# Test eval_set with categorical features (Issue: eval_set not preprocessed)
|
||||
np.random.seed(42)
|
||||
n = 500
|
||||
df = pd.DataFrame(
|
||||
{
|
||||
"num1": np.random.randn(n),
|
||||
"num2": np.random.rand(n) * 10,
|
||||
"cat1": np.random.choice(["A", "B", "C"], size=n),
|
||||
"cat2": np.random.choice(["X", "Y"], size=n),
|
||||
"target": np.random.choice([0, 1], size=n),
|
||||
}
|
||||
)
|
||||
|
||||
X = df.drop(columns="target")
|
||||
y = df["target"]
|
||||
|
||||
X_train_cat, X_valid_cat, y_train_cat, y_valid_cat = train_test_split(X, y, test_size=0.2, random_state=0)
|
||||
|
||||
# Convert categorical columns to pandas 'category' dtype
|
||||
for col in X_train_cat.select_dtypes(include="object").columns:
|
||||
X_train_cat[col] = X_train_cat[col].astype("category")
|
||||
X_valid_cat[col] = X_valid_cat[col].astype("category")
|
||||
|
||||
# Test XGBClassifier with eval_set
|
||||
classifier_eval = XGBClassifier(
|
||||
tree_method="hist",
|
||||
enable_categorical=True,
|
||||
eval_metric="logloss",
|
||||
use_label_encoder=False,
|
||||
early_stopping_rounds=10,
|
||||
random_state=0,
|
||||
n_estimators=10,
|
||||
)
|
||||
classifier_eval.fit(X_train_cat, y_train_cat, eval_set=[(X_valid_cat, y_valid_cat)], verbose=False)
|
||||
y_pred = classifier_eval.predict(X_valid_cat)
|
||||
assert len(y_pred) == len(y_valid_cat)
|
||||
|
||||
# Test XGBRegressor with eval_set
|
||||
y_reg = df["num1"] # Use num1 as target for regression
|
||||
X_reg = df.drop(columns=["num1", "target"])
|
||||
|
||||
X_train_reg, X_valid_reg, y_train_reg, y_valid_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)
|
||||
|
||||
for col in X_train_reg.select_dtypes(include="object").columns:
|
||||
X_train_reg[col] = X_train_reg[col].astype("category")
|
||||
X_valid_reg[col] = X_valid_reg[col].astype("category")
|
||||
|
||||
regressor_eval = XGBRegressor(
|
||||
tree_method="hist",
|
||||
enable_categorical=True,
|
||||
eval_metric="rmse",
|
||||
early_stopping_rounds=10,
|
||||
random_state=0,
|
||||
n_estimators=10,
|
||||
)
|
||||
regressor_eval.fit(X_train_reg, y_train_reg, eval_set=[(X_valid_reg, y_valid_reg)], verbose=False)
|
||||
y_pred = regressor_eval.predict(X_valid_reg)
|
||||
assert len(y_pred) == len(y_valid_reg)
|
||||
|
||||
|
||||
def test_nobudget():
|
||||
X_train, y_train = load_breast_cancer(return_X_y=True, as_frame=True)
|
||||
|
||||
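Editor's note on what the new eval_set assertions pin down: per the inline issue comment, `flaml.default`'s XGBoost wrappers preprocess the training frame, so the frames inside `eval_set` must receive the same preprocessing before xgboost's early stopping sees them; raw pandas `category` columns previously broke this. A self-contained sketch of the guarded call pattern (assumes flaml with xgboost installed; the data here is synthetic and illustrative):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from flaml.default import XGBClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "num": rng.normal(size=200),
        "cat": pd.Categorical(rng.choice(["A", "B", "C"], size=200)),
        "target": rng.integers(0, 2, size=200),
    }
)
X_tr, X_va, y_tr, y_va = train_test_split(df.drop(columns="target"), df["target"], random_state=0)

clf = XGBClassifier(
    tree_method="hist",
    enable_categorical=True,
    early_stopping_rounds=10,
    n_estimators=10,
    random_state=0,
)
# The eval_set frame contains a "category" column; the wrapper is expected to
# transform it exactly like the training frame before xgboost sees it.
clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
assert len(clf.predict(X_va)) == len(y_va)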
@@ -3,6 +3,12 @@ import shutil
 import sys

 import pytest
+
+try:
+    import transformers
+except ImportError:
+    pytest.skip("transformers not installed", allow_module_level=True)
+
 from utils import (
     get_automl_settings,
     get_toy_data_binclassification,

@@ -24,6 +30,8 @@ model_path_list = [
 if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
     pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)

+pytestmark = pytest.mark.spark  # set to spark as parallel testing raised RuntimeError
+

 def test_switch_1_1():
     data_idx, model_path_idx = 0, 0

@@ -5,8 +5,20 @@ import sys
 import pytest
 from utils import get_automl_settings, get_toy_data_seqclassification

+try:
+    import transformers
+
+    _transformers_installed = True
+except ImportError:
+    _transformers_installed = False
+
+pytestmark = pytest.mark.spark  # set to spark as parallel testing raised MlflowException of changing parameter

-@pytest.mark.skipif(sys.platform in ["darwin", "win32"], reason="do not run on mac os or windows")
+
+@pytest.mark.skipif(
+    sys.platform in ["darwin", "win32"] or not _transformers_installed,
+    reason="do not run on mac os or windows or transformers not installed",
+)
 def test_cv():
     import requests

@@ -5,8 +5,18 @@ import sys
 import pytest
 from utils import get_automl_settings, get_toy_data_multiplechoiceclassification

+try:
+    import transformers
+
+    _transformers_installed = True
+except ImportError:
+    _transformers_installed = False
+

-@pytest.mark.skipif(sys.platform in ["darwin", "win32"], reason="do not run on mac os or windows")
+@pytest.mark.skipif(
+    sys.platform in ["darwin", "win32"] or not _transformers_installed,
+    reason="do not run on mac os or windows or transformers not installed",
+)
 def test_mcc():
     import requests
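The same guard recurs across these NLP test modules; a condensed sketch of the pattern, built only on pytest's documented `skipif` mark and module-level `skip`:

import sys
import pytest

try:
    import transformers  # noqa: F401

    _transformers_installed = True
except ImportError:
    _transformers_installed = False

# Either skip the whole module up front...
if not _transformers_installed:
    pytest.skip("transformers not installed", allow_module_level=True)

# ...or gate individual tests, optionally combined with platform checks:
@pytest.mark.skipif(
    sys.platform in ["darwin", "win32"] or not _transformers_installed,
    reason="do not run on mac os or windows or transformers not installed",
)
def test_example():
    pass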
@@ -7,8 +7,24 @@ from utils import get_automl_settings, get_toy_data_seqclassification

 from flaml.default import portfolio

-if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
-    pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)
+try:
+    import transformers
+
+    _transformers_installed = True
+except ImportError:
+    _transformers_installed = False
+
+if (
+    sys.platform.startswith("darwin")
+    and sys.version_info >= (3, 11)
+    or not _transformers_installed
+    or sys.platform == "win32"
+):
+    pytest.skip("skipping Python 3.11 on MacOS or without transformers or on Windows", allow_module_level=True)
+
+pytestmark = (
+    pytest.mark.spark
+)  # set to spark as parallel testing raised ValueError: Feature NonExisting not implemented.


 def pop_args(fit_kwargs):

@@ -24,23 +40,34 @@ def test_build_portfolio(path="./test/nlp/default", strategy="greedy"):
     portfolio.main()


 @pytest.mark.skipif(sys.platform == "win32", reason="do not run on windows")
 def test_starting_point_not_in_search_space():
-    from flaml import AutoML
+    """Regression test for invalid starting points and custom_hp.

-    """
-    test starting_points located outside of the search space, and custom_hp is not set
+    This test must not require network access to Hugging Face.
     """
+
+    from flaml.automl.state import SearchState
+    from flaml.automl.task.factory import task_factory
+
     this_estimator_name = "transformer"
-    X_train, y_train, X_val, y_val, _ = get_toy_data_seqclassification()
-
-    automl = AutoML()
-    automl_settings = get_automl_settings(estimator_name=this_estimator_name)
-
-    automl_settings["starting_points"] = {this_estimator_name: [{"learning_rate": 2e-3}]}
-
-    automl.fit(X_train, y_train, **automl_settings)
-    assert automl._search_states[this_estimator_name].init_config[0]["learning_rate"] != 2e-3
+    X_train, y_train, _, _, _ = get_toy_data_seqclassification()
+    task = task_factory("seq-classification", X_train, y_train)
+    estimator_class = task.estimator_class_from_str(this_estimator_name)
+    estimator_class.init()
+
+    # SearchState is where invalid starting points are filtered out when max_iter > 1.
+    search_state = SearchState(
+        learner_class=estimator_class,
+        data=X_train,
+        task=task,
+        starting_point={"learning_rate": 2e-3},
+        max_iter=3,
+        budget=10,
+    )
+    assert search_state.init_config and search_state.init_config[0].get("learning_rate") != 2e-3

     """
     test starting_points located outside of the search space, and custom_hp is set

@@ -48,39 +75,60 @@ def test_starting_point_not_in_search_space():

     from flaml import tune

-    X_train, y_train, X_val, y_val, _ = get_toy_data_seqclassification()
+    X_train, y_train, _, _, _ = get_toy_data_seqclassification()

     this_estimator_name = "transformer_ms"
-    automl = AutoML()
-    automl_settings = get_automl_settings(estimator_name=this_estimator_name)
+    task = task_factory("seq-classification", X_train, y_train)
+    estimator_class = task.estimator_class_from_str(this_estimator_name)
+    estimator_class.init()

-    automl_settings["custom_hp"] = {
-        this_estimator_name: {
-            "model_path": {
-                "domain": "albert-base-v2",
-            },
-            "learning_rate": {
-                "domain": tune.choice([1e-4, 1e-5]),
-            },
-            "per_device_train_batch_size": {
-                "domain": 2,
-            },
-        }
+    custom_hp = {
+        "model_path": {
+            "domain": "albert-base-v2",
+        },
+        "learning_rate": {
+            "domain": tune.choice([1e-4, 1e-5]),
+        },
+        "per_device_train_batch_size": {
+            "domain": 2,
+        },
     }
-    automl_settings["starting_points"] = "data:test/nlp/default/"

-    automl.fit(X_train, y_train, **automl_settings)
-    assert len(automl._search_states[this_estimator_name].init_config[0]) == len(
-        automl._search_states[this_estimator_name]._search_space_domain
-    ) - len(automl_settings["custom_hp"][this_estimator_name]), (
+    # Simulate a suggested starting point (e.g. from portfolio) which becomes invalid
+    # after custom_hp constrains the space.
+    invalid_starting_points = [
+        {
+            "learning_rate": 1e-5,
+            "num_train_epochs": 1.0,
+            "per_device_train_batch_size": 8,
+            "seed": 43,
+            "global_max_steps": 100,
+            "model_path": "google/electra-base-discriminator",
+        }
+    ]
+
+    search_state = SearchState(
+        learner_class=estimator_class,
+        data=X_train,
+        task=task,
+        starting_point=invalid_starting_points,
+        custom_hp=custom_hp,
+        max_iter=3,
+        budget=10,
+    )
+
+    assert search_state.init_config, "Expected a non-empty init_config list"
+    init_config0 = search_state.init_config[0]
+    assert init_config0 is not None
+    assert len(init_config0) == len(search_state._search_space_domain) - len(custom_hp), (
         "The search space is updated with the custom_hp on {} hyperparameters of "
         "the specified estimator without an initial value. Thus a valid init config "
         "should only contain the cardinality of the search space minus {}".format(
-            len(automl_settings["custom_hp"][this_estimator_name]),
-            len(automl_settings["custom_hp"][this_estimator_name]),
+            len(custom_hp),
+            len(custom_hp),
        )
    )
-    assert automl._search_states[this_estimator_name].search_space["model_path"] == "albert-base-v2"
+    assert search_state.search_space["model_path"] == "albert-base-v2"

     if os.path.exists("test/data/output/"):
         try:

@@ -89,7 +137,6 @@ def test_starting_point_not_in_search_space():
             print("PermissionError when deleting test/data/output/")


-@pytest.mark.skipif(sys.platform == "win32", reason="do not run on windows")
 def test_points_to_evaluate():
     from flaml import AutoML

@@ -102,7 +149,13 @@ def test_points_to_evaluate():

     automl_settings["custom_hp"] = {"transformer_ms": {"model_path": {"domain": "google/electra-small-discriminator"}}}

-    automl.fit(X_train, y_train, **automl_settings)
+    try:
+        automl.fit(X_train, y_train, **automl_settings)
+    except OSError as e:
+        message = str(e)
+        if "Too Many Requests" in message or "rate limit" in message.lower():
+            pytest.skip(f"Skipping HF model load/training: {message}")
+        raise

     if os.path.exists("test/data/output/"):
         try:

@@ -112,7 +165,6 @@ def test_points_to_evaluate():


 # TODO: implement _test_zero_shot_model
-@pytest.mark.skipif(sys.platform == "win32", reason="do not run on windows")
 def test_zero_shot_nomodel():
     from flaml.default import preprocess_and_suggest_hyperparams

@@ -137,7 +189,14 @@ def test_zero_shot_nomodel():
     fit_kwargs = automl_settings.pop("fit_kwargs_by_estimator", {}).get(estimator_name)
     fit_kwargs.update(automl_settings)
     pop_args(fit_kwargs)
-    model.fit(X_train, y_train, **fit_kwargs)
+
+    try:
+        model.fit(X_train, y_train, **fit_kwargs)
+    except OSError as e:
+        message = str(e)
+        if "Too Many Requests" in message or "rate limit" in message.lower():
+            pytest.skip(f"Skipping HF model load/training: {message}")
+        raise

     if os.path.exists("test/data/output/"):
         try:

@@ -7,7 +7,7 @@ from sklearn.model_selection import train_test_split
 from flaml import tune
 from flaml.automl.model import LGBMEstimator

-data = fetch_california_housing(return_X_y=False, as_frame=True)
+data = fetch_california_housing(return_X_y=False, as_frame=True, data_home="test")
 X, y = data.data, data.target
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
 X_train_ref = ray.put(X_train)

@@ -11,7 +11,7 @@ automl_settings = {
     "task": "regression",
     "log_file_name": "test/california.log",
 }
-X_train, y_train = fetch_california_housing(return_X_y=True)
+X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
 # Train with labeled input data
 automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
 print(automl.model)

@@ -1,13 +1,17 @@
+import atexit
 import os
 import sys
 import warnings

 import mlflow
+import numpy as np
 import pytest
 import sklearn.datasets as skds
 from packaging.version import Version

 from flaml import AutoML
+from flaml.automl.data import auto_convert_dtypes_pandas, auto_convert_dtypes_spark, get_random_dataframe
+from flaml.automl.spark import disable_spark_ansi_mode, restore_spark_ansi_mode
 from flaml.tune.spark.utils import check_spark

 warnings.simplefilter(action="ignore")

@@ -27,7 +31,7 @@ else:
         .config(
             "spark.jars.packages",
             (
-                "com.microsoft.azure:synapseml_2.12:1.0.4,"
+                "com.microsoft.azure:synapseml_2.12:1.1.0,"
                 "org.apache.hadoop:hadoop-azure:3.3.5,"
                 "com.microsoft.azure:azure-storage:8.6.6,"
                 f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"

@@ -53,15 +57,25 @@ else:
 except ImportError:
     skip_spark = True

+spark, ansi_conf, adjusted = disable_spark_ansi_mode()
+atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
+
 if sys.version_info >= (3, 11):
     skip_py311 = True
 else:
     skip_py311 = False

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


 def _test_spark_synapseml_lightgbm(spark=None, task="classification"):
+    # TODO: remove the estimator assignment once SynapseML supports spark 4+.
+    from flaml.automl.spark.utils import _spark_major_minor_version
+
+    if _spark_major_minor_version[0] >= 4:
+        # skip synapseml lightgbm test for spark 4+
+        return
+
     if task == "classification":
         metric = "accuracy"
         X_train, y_train = skds.load_iris(return_X_y=True, as_frame=True)

@@ -151,27 +165,32 @@ def test_spark_synapseml_rank():
     _test_spark_synapseml_lightgbm(spark, "rank")


-def test_spark_input_df():
-    df = (
-        spark.read.format("csv")
-        .option("header", True)
-        .option("inferSchema", True)
-        .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv")
-    )
+def test_spark_input_df_and_pickle():
+    import pandas as pd
+
+    file_url = "https://mmlspark.blob.core.windows.net/publicwasb/company_bankruptcy_prediction_data.csv"
+    df = pd.read_csv(file_url)
+    df = spark.createDataFrame(df)
     train, test = df.randomSplit([0.8, 0.2], seed=1)
     feature_cols = df.columns[1:]
     featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
     train_data = featurizer.transform(train)["Bankrupt?", "features"]
     test_data = featurizer.transform(test)["Bankrupt?", "features"]
     automl = AutoML()
+
+    # TODO: remove the estimator assignment once SynapseML supports spark 4+.
+    from flaml.automl.spark.utils import _spark_major_minor_version
+
+    estimator_list = ["rf_spark"] if _spark_major_minor_version[0] >= 4 else None
+
     settings = {
         "time_budget": 30,  # total running time in seconds
         "metric": "roc_auc",
         # "estimator_list": ["lgbm_spark"],  # list of ML learners; we tune lightgbm in this example
         "task": "classification",  # task type
         "log_file_name": "flaml_experiment.log",  # flaml log file
         "seed": 7654321,  # random seed
         "eval_method": "holdout",
+        "estimator_list": estimator_list,  # TODO: remove once SynapseML supports spark 4+
     }
     df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))

@@ -182,6 +201,22 @@ def test_spark_input_df():
         **settings,
     )

+    # test pickle and load_pickle, should work for prediction
+    automl.pickle("automl_spark.pkl")
+    automl_loaded = AutoML().load_pickle("automl_spark.pkl")
+    assert automl_loaded.best_estimator == automl.best_estimator
+    assert automl_loaded.best_loss == automl.best_loss
+    automl_loaded.predict(df)
+    automl_loaded.model.estimator.transform(test_data)
+
+    import shutil
+
+    shutil.rmtree("automl_spark.pkl", ignore_errors=True)
+    shutil.rmtree("automl_spark.pkl.flaml_artifacts", ignore_errors=True)
+
+    if estimator_list == ["rf_spark"]:
+        return
+
     try:
         model = automl.model.estimator
         predictions = model.transform(test_data)
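Editor's note: the pickle round trip added above is the general FLAML persistence pattern, and the same pair of helpers is exercised again in the XGBoost spark test further down. A self-contained sketch (assumes a small fit budget; `pickle`/`load_pickle` are the helpers used in this diff):

from sklearn.datasets import load_iris
from flaml import AutoML

X, y = load_iris(return_X_y=True)
automl = AutoML()
automl.fit(X_train=X, y_train=y, task="classification", time_budget=5)

automl.pickle("automl.pkl")  # also writes an "automl.pkl.flaml_artifacts" directory
loaded = AutoML().load_pickle("automl.pkl")
assert loaded.best_estimator == automl.best_estimator
assert loaded.best_loss == automl.best_loss
loaded.predict(X)  # the restored object is immediately usable for prediction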
@@ -296,11 +331,88 @@ def _test_spark_large_df():
     print("time cost in minutes: ", (end_time - start_time) / 60)


+def test_get_random_dataframe():
+    # Test with default parameters
+    df = get_random_dataframe(n_rows=50, ratio_none=0.2, seed=123)
+    assert df.shape == (50, 14)  # Default is 200 rows and 14 columns
+
+    # Test column types
+    assert "timestamp" in df.columns and np.issubdtype(df["timestamp"].dtype, np.datetime64)
+    assert "id" in df.columns and np.issubdtype(df["id"].dtype, np.integer)
+    assert "score" in df.columns and np.issubdtype(df["score"].dtype, np.floating)
+    assert "category" in df.columns and df["category"].dtype.name == "category"
+
+
+def test_auto_convert_dtypes_pandas():
+    # Create a test DataFrame with various types
+    import pandas as pd
+
+    test_df = pd.DataFrame(
+        {
+            "int_col": ["1", "2", "3", "4", "5", "6", "6"],
+            "float_col": ["1.1", "2.2", "3.3", "NULL", "5.5", "6.6", "6.6"],
+            "date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01", "2021-06-01", "2021-06-01"],
+            "cat_col": ["A", "B", "A", "A", "B", "A", "B"],
+            "string_col": ["text1", "text2", "text3", "text4", "text5", "text6", "text7"],
+        }
+    )
+
+    # Convert dtypes
+    converted_df, schema = auto_convert_dtypes_pandas(test_df)
+
+    # Check conversions
+    assert schema["int_col"] == "int"
+    assert schema["float_col"] == "double"
+    assert schema["date_col"] == "timestamp"
+    assert schema["cat_col"] == "category"
+    assert schema["string_col"] == "string"
+
+
+def test_auto_convert_dtypes_spark():
+    """Test auto_convert_dtypes_spark function with various data types."""
+    import pandas as pd
+
+    # Create a test DataFrame with various types
+    test_pdf = pd.DataFrame(
+        {
+            "int_col": ["1", "2", "3", "4", "NA"],
+            "float_col": ["1.1", "2.2", "3.3", "NULL", "5.5"],
+            "date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01"],
+            "cat_col": ["A", "B", "A", "C", "B"],
+            "string_col": ["text1", "text2", "text3", "text4", "text5"],
+        }
+    )
+
+    # Convert pandas DataFrame to Spark DataFrame
+    test_df = spark.createDataFrame(test_pdf)
+
+    # Convert dtypes
+    converted_df, schema = auto_convert_dtypes_spark(test_df)
+
+    # Check conversions
+    assert schema["int_col"] == "int"
+    assert schema["float_col"] == "double"
+    assert schema["date_col"] == "timestamp"
+    assert schema["cat_col"] == "string"  # Conceptual category in schema
+    assert schema["string_col"] == "string"
+
+    # Verify the actual data types from the Spark DataFrame
+    spark_dtypes = dict(converted_df.dtypes)
+    assert spark_dtypes["int_col"] == "int"
+    assert spark_dtypes["float_col"] == "double"
+    assert spark_dtypes["date_col"] == "timestamp"
+    assert spark_dtypes["cat_col"] == "string"  # In Spark, categories are still strings
+    assert spark_dtypes["string_col"] == "string"
+
+
 if __name__ == "__main__":
-    test_spark_synapseml_classification()
-    test_spark_synapseml_regression()
-    test_spark_synapseml_rank()
-    test_spark_input_df()
+    # test_spark_synapseml_classification()
+    # test_spark_synapseml_regression()
+    # test_spark_synapseml_rank()
+    test_spark_input_df_and_pickle()
+    # test_get_random_dataframe()
+    # test_auto_convert_dtypes_pandas()
+    # test_auto_convert_dtypes_spark()

     # import cProfile
     # import pstats

@@ -25,13 +25,13 @@ os.environ["FLAML_MAX_CONCURRENT"] = "2"
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


-def test_parallel_xgboost(hpo_method=None, data_size=1000):
+def test_parallel_xgboost_and_pickle(hpo_method=None, data_size=1000):
     automl_experiment = AutoML()
     automl_settings = {
-        "time_budget": 10,
+        "time_budget": 30,
         "metric": "ap",
         "task": "classification",
         "log_file_name": "test/sparse_classification.log",

@@ -53,15 +53,27 @@ def test_parallel_xgboost(hpo_method=None, data_size=1000):
     print(automl_experiment.best_iteration)
     print(automl_experiment.best_estimator)

+    # test pickle and load_pickle, should work for prediction
+    automl_experiment.pickle("automl_xgboost_spark.pkl")
+    automl_loaded = AutoML().load_pickle("automl_xgboost_spark.pkl")
+    assert automl_loaded.best_estimator == automl_experiment.best_estimator
+    assert automl_loaded.best_loss == automl_experiment.best_loss
+    automl_loaded.predict(X_train)
+
+    import shutil
+
+    shutil.rmtree("automl_xgboost_spark.pkl", ignore_errors=True)
+    shutil.rmtree("automl_xgboost_spark.pkl.flaml_artifacts", ignore_errors=True)
+

 def test_parallel_xgboost_others():
     # use random search as the hpo_method
-    test_parallel_xgboost(hpo_method="random")
+    test_parallel_xgboost_and_pickle(hpo_method="random")


 @pytest.mark.skip(reason="currently not supporting too large data, will support spark dataframe in the future")
 def test_large_dataset():
-    test_parallel_xgboost(data_size=90000000)
+    test_parallel_xgboost_and_pickle(data_size=90000000)


 @pytest.mark.skipif(

@@ -95,10 +107,10 @@ def test_custom_learner(data_size=1000):


 if __name__ == "__main__":
-    test_parallel_xgboost()
-    test_parallel_xgboost_others()
-    # test_large_dataset()
-    if skip_my_learner:
-        print("please run pytest in the root directory of FLAML, i.e., the directory that contains the setup.py file")
-    else:
-        test_custom_learner()
+    test_parallel_xgboost_and_pickle()
+    # test_parallel_xgboost_others()
+    # # test_large_dataset()
+    # if skip_my_learner:
+    #     print("please run pytest in the root directory of FLAML, i.e., the directory that contains the setup.py file")
+    # else:
+    #     test_custom_learner()

@@ -1,6 +1,7 @@
 import os
 import unittest

+import pytest
 from sklearn.datasets import load_wine

 from flaml import AutoML

@@ -24,6 +25,8 @@ if os.path.exists(os.path.join(os.getcwd(), "test", "spark", "custom_mylearner.p
 else:
     skip_my_learner = True

+pytestmark = pytest.mark.spark
+

 class TestEnsemble(unittest.TestCase):
     def setUp(self) -> None:

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 os.environ["FLAML_MAX_CONCURRENT"] = "2"
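The recurring `pytestmark` change above relies on pytest's documented module-level mark mechanism: `pytestmark` may be a single mark or a list of marks, so the list form keeps the conditional skip while also tagging every test in the module as `spark`. A minimal sketch:

import pytest

skip_spark = False  # placeholder; the real files compute this via check_spark()

# pytest applies module-level marks from `pytestmark`; it accepts either a
# single mark or a list, so a skip condition and a custom "spark" label can
# both be attached to every test in the module:
pytestmark = [
    pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."),
    pytest.mark.spark,
]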
@@ -22,7 +22,7 @@ def base_automl(n_concurrent_trials=1, use_ray=False, use_spark=False, verbose=0
     except (ServerError, Exception):
         from sklearn.datasets import fetch_california_housing

-        X_train, y_train = fetch_california_housing(return_X_y=True)
+        X_train, y_train = fetch_california_housing(return_X_y=True, data_home="test")
     automl = AutoML()
     settings = {
         "time_budget": 3,  # total running time in seconds

@@ -1,3 +1,4 @@
+import atexit
 import importlib
 import os
 import sys

@@ -13,6 +14,7 @@ from sklearn.metrics import r2_score
 from sklearn.model_selection import train_test_split

 import flaml
+from flaml.automl.spark import disable_spark_ansi_mode, restore_spark_ansi_mode
 from flaml.automl.spark.utils import to_pandas_on_spark

 try:

@@ -21,6 +23,7 @@ try:
     from pyspark.ml.feature import VectorAssembler
 except ImportError:
     pass
+pytestmark = pytest.mark.spark
 warnings.filterwarnings("ignore")

 skip_spark = importlib.util.find_spec("pyspark") is None

@@ -119,6 +122,29 @@ def _check_mlflow_logging(possible_num_runs, metric, is_parent_run, experiment_i
     # mlflow.delete_experiment(experiment_id)


+@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+def test_automl_nonsparkdata_noautolog_noparentrun():
+    experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=False)
+    _check_mlflow_logging(0, "r2", False, experiment_id, is_automl=True)  # no logging
+
+
+@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+def test_automl_sparkdata_noautolog_noparentrun():
+    experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=False)
+    _check_mlflow_logging(0, "mse", False, experiment_id, is_automl=True)  # no logging
+
+
+@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+def test_tune_noautolog_noparentrun_parallel():
+    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=True)
+    _check_mlflow_logging(0, "r2", False, experiment_id)
+
+
+def test_tune_noautolog_noparentrun_nonparallel():
+    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=False)
+    _check_mlflow_logging(3, "r2", False, experiment_id, skip_tags=True)
+
+
 @pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
 def test_tune_autolog_parentrun_parallel():
     experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=True)

@@ -130,6 +156,16 @@ def test_tune_autolog_parentrun_nonparallel():
     _check_mlflow_logging(3, "r2", True, experiment_id)


+def test_tune_autolog_noparentrun_nonparallel():
+    experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=False)
+    _check_mlflow_logging(3, "r2", False, experiment_id)
+
+
+def test_tune_noautolog_parentrun_nonparallel():
+    experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=False)
+    _check_mlflow_logging(3, "r2", True, experiment_id)
+
+
 @pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
 def test_tune_autolog_noparentrun_parallel():
     experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=True)

@@ -142,28 +178,12 @@ def test_tune_noautolog_parentrun_parallel():
     _check_mlflow_logging([4, 3], "r2", True, experiment_id)


-def test_tune_autolog_noparentrun_nonparallel():
-    experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=False)
-    _check_mlflow_logging(3, "r2", False, experiment_id)
-
-
-def test_tune_noautolog_parentrun_nonparallel():
-    experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=False)
-    _check_mlflow_logging(3, "r2", True, experiment_id)
-
-
-@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
-def test_tune_noautolog_noparentrun_parallel():
-    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=True)
-    _check_mlflow_logging(0, "r2", False, experiment_id)
-
-
-def test_tune_noautolog_noparentrun_nonparallel():
-    experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=False)
-    _check_mlflow_logging(3, "r2", False, experiment_id, skip_tags=True)
-
-
 def _test_automl_sparkdata(is_autolog, is_parent_run):
+    # TODO: remove the estimator assignment once SynapseML supports spark 4+.
+    from flaml.automl.spark.utils import _spark_major_minor_version
+
+    estimator_list = ["rf_spark"] if _spark_major_minor_version[0] >= 4 else None
+
     mlflow.end_run()
     mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
     mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)

@@ -174,6 +194,9 @@ def _test_automl_sparkdata(is_autolog, is_parent_run):
     if is_parent_run:
         mlflow.start_run(run_name=f"automl_sparkdata_autolog_{is_autolog}")
-    spark = pyspark.sql.SparkSession.builder.getOrCreate()
+    spark, ansi_conf, adjusted = disable_spark_ansi_mode()
+    atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
+
     pd_df = load_diabetes(as_frame=True).frame
     df = spark.createDataFrame(pd_df)
     df = df.repartition(4).cache()

@@ -192,6 +215,7 @@ def _test_automl_sparkdata(is_autolog, is_parent_run):
         "log_type": "all",
         "n_splits": 2,
         "model_history": True,
+        "estimator_list": estimator_list,
     }
     df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
     automl.fit(

@@ -251,12 +275,6 @@ def test_automl_sparkdata_noautolog_parentrun():
     _check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)


-@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
-def test_automl_sparkdata_noautolog_noparentrun():
-    experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=False)
-    _check_mlflow_logging(0, "mse", False, experiment_id, is_automl=True)  # no logging
-
-
 @pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
 def test_automl_nonsparkdata_autolog_parentrun():
     experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=True)

@@ -275,12 +293,6 @@ def test_automl_nonsparkdata_noautolog_parentrun():
     _check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)


-@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
-def test_automl_nonsparkdata_noautolog_noparentrun():
-    experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=False)
-    _check_mlflow_logging(0, "r2", False, experiment_id, is_automl=True)  # no logging
-
-
 @pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
 def test_exit_pyspark_autolog():
     import pyspark

@@ -318,6 +330,9 @@ def _init_spark_for_main():
         "https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
     )

+    spark, ansi_conf, adjusted = disable_spark_ansi_mode()
+    atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
+

 if __name__ == "__main__":
     _init_spark_for_main()

@@ -2,6 +2,7 @@ import os
 import unittest

 import numpy as np
+import pytest
 import scipy.sparse
 from sklearn.datasets import load_iris, load_wine

@@ -12,6 +13,7 @@ from flaml.tune.spark.utils import check_spark

 spark_available, _ = check_spark()
 skip_spark = not spark_available
+pytestmark = pytest.mark.spark

 os.environ["FLAML_MAX_CONCURRENT"] = "2"

@@ -260,7 +262,11 @@ class TestMultiClass(unittest.TestCase):
             "n_concurrent_trials": 2,
             "use_spark": True,
         }
-        X_train = scipy.sparse.random(1554, 21, dtype=int)
+        # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably
+        # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger
+        # integer sampling paths which raise "low is out of bounds for int32".
+        # A float sparse matrix is sufficient to validate sparse-input support.
+        X_train = scipy.sparse.random(1554, 21, dtype=np.float32)
         y_train = np.random.randint(3, size=1554)
         automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)
         print(automl_experiment.classes_)
@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 here = os.path.abspath(os.path.dirname(__file__))
 os.environ["FLAML_MAX_CONCURRENT"] = "2"

@@ -25,7 +25,7 @@ try:
 except ImportError:
     skip_spark = True

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]


 def test_overtime():

@@ -2,8 +2,23 @@ import os
 import sys

 import pytest
-from minio.error import ServerError
-from openml.exceptions import OpenMLServerException
+
+try:
+    from minio.error import ServerError
+except ImportError:
+
+    class ServerError(Exception):
+        pass
+
+
+try:
+    from openml.exceptions import OpenMLServerException
+except ImportError:
+
+    class OpenMLServerException(Exception):
+        pass
+
+
 from requests.exceptions import ChunkedEncodingError, SSLError

 from flaml.tune.spark.utils import check_spark

@@ -11,19 +26,19 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 os.environ["FLAML_MAX_CONCURRENT"] = "2"


-def run_automl(budget=3, dataset_format="dataframe", hpo_method=None):
+def run_automl(budget=30, dataset_format="dataframe", hpo_method=None):
     import urllib3

     from flaml.automl.data import load_openml_dataset

     performance_check_budget = 3600
     if sys.platform == "darwin" or "nt" in os.name or "3.10" not in sys.version:
-        budget = 3  # revise the buget if the platform is not linux + python 3.10
+        budget = 30  # revise the buget if the platform is not linux + python 3.10
     if budget >= performance_check_budget:
         max_iter = 60
         performance_check_budget = None

@@ -76,6 +91,11 @@ def run_automl(budget=3, dataset_format="dataframe", hpo_method=None):
     print("Best ML leaner:", automl.best_estimator)
     print("Best hyperparmeter config:", automl.best_config)
     print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
+    if performance_check_budget is not None and automl.best_estimator is None:
+        # skip the performance check if no model is trained
+        # this happens sometimes in github actions ubuntu python 3.12 environment
+        print("Warning: no model is trained, skip performance check")
+        return
     print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
     print(automl.model.estimator)
     print(automl.best_config_per_estimator)

@@ -14,7 +14,7 @@ from flaml.tune.spark.utils import check_spark
 spark_available, _ = check_spark()
 skip_spark = not spark_available

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]

 os.environ["FLAML_MAX_CONCURRENT"] = "2"
 X, y = load_breast_cancer(return_X_y=True)

@@ -1,3 +1,4 @@
+import atexit
 import os
 from functools import partial
 from timeit import timeit

@@ -14,6 +15,7 @@ try:
     from pyspark.sql import SparkSession

     from flaml.automl.ml import sklearn_metric_loss_score
+    from flaml.automl.spark import disable_spark_ansi_mode, restore_spark_ansi_mode
     from flaml.automl.spark.metrics import spark_metric_loss_score
     from flaml.automl.spark.utils import (
         iloc_pandas_on_spark,

@@ -24,6 +26,7 @@ try:
         unique_value_first_index,
     )
     from flaml.tune.spark.utils import (
+        _spark_major_minor_version,
         check_spark,
         get_broadcast_data,
         get_n_cpus,

@@ -35,8 +38,39 @@ try:
 except ImportError:
     print("Spark is not installed. Skip all spark tests.")
     skip_spark = True
+    _spark_major_minor_version = (0, 0)

-pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
+pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
+
+
+@pytest.mark.skipif(_spark_major_minor_version[0] < 4, reason="Requires Spark 4.0+")
+def test_to_pandas_on_spark_temp_override():
+    import pyspark.pandas as ps
+    from pyspark.sql import Row
+
+    from flaml.automl.spark.utils import to_pandas_on_spark
+
+    spark_session = SparkSession.builder.getOrCreate()
+    spark, ansi_conf, adjusted = disable_spark_ansi_mode()
+    atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)
+
+    # Ensure we can toggle options
+    orig = ps.get_option("compute.fail_on_ansi_mode")
+
+    try:
+        spark_session.conf.set("spark.sql.ansi.enabled", "true")
+        ps.set_option("compute.fail_on_ansi_mode", True)
+
+        # create tiny spark df
+        sdf = spark_session.createDataFrame([Row(a=1, b=2)])
+        # Should not raise as our function temporarily disables fail_on_ansi_mode
+        pds = to_pandas_on_spark(sdf)
+        assert "a" in pds.columns
+    finally:
+        # restore test environment
+        ps.set_option("compute.fail_on_ansi_mode", orig)
+        spark_session.conf.set("spark.sql.ansi.enabled", "false")


 def test_with_parameters_spark():
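The ANSI-mode helpers threaded through these spark test files all follow one pattern; a condensed sketch (the helpers come from `flaml.automl.spark`, exactly as imported above):

import atexit

from flaml.automl.spark import disable_spark_ansi_mode, restore_spark_ansi_mode

# Disable Spark's ANSI mode (which pandas-on-Spark can fail on), remember the
# prior setting, and restore it when the interpreter exits.
spark, ansi_conf, adjusted = disable_spark_ansi_mode()
atexit.register(restore_spark_ansi_mode, spark, ansi_conf, adjusted)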
@@ -5,17 +5,38 @@ import sys
 import unittest

 import numpy as np
-import openml
+
+try:
+    import openml
+except ImportError:
+    openml = None
 import pandas as pd
 import pytest
 import scipy.sparse
-from minio.error import ServerError
+
+try:
+    from minio.error import ServerError
+except ImportError:
+
+    class ServerError(Exception):
+        pass
+
+
 from requests.exceptions import SSLError
 from sklearn.metrics import mean_absolute_error, mean_squared_error

 from flaml import AutoVW
 from flaml.tune import loguniform, polynomial_expansion_set

+try:
+    from vowpalwabbit import pyvw
+except ImportError:
+    skip_vw_test = True
+else:
+    skip_vw_test = False
+
+pytest.skip("skipping if no openml", allow_module_level=True) if openml is None else None
+
 VW_DS_DIR = "test/data/"
 NS_LIST = list(string.ascii_lowercase) + list(string.ascii_uppercase)
 logger = logging.getLogger(__name__)

@@ -351,14 +372,9 @@ def get_vw_tuning_problem(tuning_hp="NamesapceInteraction"):
     return vw_oml_problem_args, vw_online_aml_problem


-@pytest.mark.skipif(
-    "3.10" in sys.version or "3.11" in sys.version,
-    reason="do not run on py >= 3.10",
-)
+@pytest.mark.skipif(skip_vw_test, reason="vowpalwabbit not installed")
 class TestAutoVW(unittest.TestCase):
     def test_vw_oml_problem_and_vanilla_vw(self):
-        from vowpalwabbit import pyvw
-
         try:
             vw_oml_problem_args, vw_online_aml_problem = get_vw_tuning_problem()
         except (SSLError, ServerError, Exception) as e:

@@ -59,6 +59,17 @@ def _test_hf_data():
     except requests.exceptions.ConnectionError:
         return

+    # Tests will only run if there is a GPU available
+    try:
+        import ray
+
+        pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
+
+        if not pg.wait(timeout_seconds=10):  # Wait 10 seconds for resources
+            raise RuntimeError("No available node types can fulfill resource request!")
+    except RuntimeError:
+        return
+
     custom_sent_keys = ["sentence1", "sentence2"]
     label_key = "label"

@@ -4,10 +4,17 @@ from collections import defaultdict

 import numpy as np
 import pytest
-import thop
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
+
+try:
+    import thop
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+except ImportError:
+    thop = None
+    torch = None
+    nn = None
+    F = None

 try:
     import torchvision

@@ -16,6 +23,11 @@ except ImportError:

 from flaml import tune

+if thop is None or torch is None or nn is None or F is None or torchvision is None:
+    pytest.skip(
+        "skipping test_lexiflow.py because torch, torchvision or thop is not installed.", allow_module_level=True
+    )
+
 DEVICE = torch.device("cpu")
 BATCHSIZE = 128
 N_TRAIN_EXAMPLES = BATCHSIZE * 30

99 test/tune/test_search_thread.py Normal file
|
||||
"""Tests for SearchThread nested dictionary update fix."""
|
||||
|
||||
import pytest
|
||||
|
||||
from flaml.tune.searcher.search_thread import _recursive_dict_update
|
||||
|
||||
|
||||
def test_recursive_dict_update_simple():
|
||||
"""Test simple non-nested dictionary update."""
|
||||
target = {"a": 1, "b": 2}
|
||||
source = {"c": 3}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 1, "b": 2, "c": 3}
|
||||
|
||||
|
||||
def test_recursive_dict_update_override():
|
||||
"""Test that source values override target values for non-dict values."""
|
||||
target = {"a": 1, "b": 2}
|
||||
source = {"b": 3}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 1, "b": 3}
|
||||
|
||||
|
||||
def test_recursive_dict_update_nested():
|
||||
"""Test nested dictionary merge (the main use case for XGBoost params)."""
|
||||
target = {
|
||||
"num_boost_round": 10,
|
||||
"params": {
|
||||
"max_depth": 12,
|
||||
"eta": 0.020168455186106736,
|
||||
"min_child_weight": 1.4504723523894132,
|
||||
"scale_pos_weight": 3.794258636185337,
|
||||
"gamma": 0.4985070123025904,
|
||||
},
|
||||
}
|
||||
source = {
|
||||
"params": {
|
||||
"verbosity": 3,
|
||||
"booster": "gbtree",
|
||||
"eval_metric": "auc",
|
||||
"tree_method": "hist",
|
||||
"objective": "binary:logistic",
|
||||
}
|
||||
}
|
||||
_recursive_dict_update(target, source)
|
||||
|
||||
# Check that sampled params are preserved
|
||||
assert target["params"]["max_depth"] == 12
|
||||
assert target["params"]["eta"] == 0.020168455186106736
|
||||
assert target["params"]["min_child_weight"] == 1.4504723523894132
|
||||
assert target["params"]["scale_pos_weight"] == 3.794258636185337
|
||||
assert target["params"]["gamma"] == 0.4985070123025904
|
||||
|
||||
# Check that const params are added
|
||||
assert target["params"]["verbosity"] == 3
|
||||
assert target["params"]["booster"] == "gbtree"
|
||||
assert target["params"]["eval_metric"] == "auc"
|
||||
assert target["params"]["tree_method"] == "hist"
|
||||
assert target["params"]["objective"] == "binary:logistic"
|
||||
|
||||
# Check top-level param is preserved
|
||||
assert target["num_boost_round"] == 10
|
||||
|
||||
|
||||
def test_recursive_dict_update_deeply_nested():
|
||||
"""Test deeply nested dictionary merge."""
|
||||
target = {"a": {"b": {"c": 1, "d": 2}}}
|
||||
source = {"a": {"b": {"e": 3}}}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": {"b": {"c": 1, "d": 2, "e": 3}}}
|
||||
|
||||
|
||||
def test_recursive_dict_update_mixed_types():
|
||||
"""Test that non-dict values in source replace dict values in target."""
|
||||
target = {"a": {"b": 1}}
|
||||
source = {"a": 2}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 2}
|
||||
|
||||
|
||||
def test_recursive_dict_update_empty_dicts():
|
||||
"""Test with empty dictionaries."""
|
||||
target = {}
|
||||
source = {"a": 1}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 1}
|
||||
|
||||
target = {"a": 1}
|
||||
source = {}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 1}
|
||||
|
||||
|
||||
def test_recursive_dict_update_none_values():
|
||||
"""Test that None values are properly handled."""
|
||||
target = {"a": 1, "b": None}
|
||||
source = {"b": 2, "c": None}
|
||||
_recursive_dict_update(target, source)
|
||||
assert target == {"a": 1, "b": 2, "c": None}
|
||||
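The implementation under test is not shown in this diff; a minimal sketch consistent with every assertion above (the real function lives in `flaml/tune/searcher/search_thread.py` and may differ in details):

def _recursive_dict_update(target: dict, source: dict) -> None:
    """Merge `source` into `target` in place, recursing into nested dicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _recursive_dict_update(target[key], value)  # merge, don't overwrite
        else:
            target[key] = value  # non-dict values (including None) replace

This merge-instead-of-replace behavior is what lets sampled XGBoost hyperparameters under the nested "params" key survive when constant params are merged in, rather than the whole nested dict being clobbered.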
@@ -324,3 +324,26 @@ def test_no_optuna():
     import flaml.tune.searcher.suggestion

     subprocess.check_call([sys.executable, "-m", "pip", "install", "optuna==2.8.0"])
+
+
+def test_unresolved_search_space(caplog):
+    import logging
+
+    from flaml import tune
+    from flaml.tune.searcher.blendsearch import BlendSearch
+
+    if caplog is not None:
+        caplog.set_level(logging.INFO)
+
+    BlendSearch(metric="loss", mode="min", space={"lr": tune.uniform(0.001, 0.1), "depth": tune.randint(1, 10)})
+    try:
+        text = caplog.text
+    except AttributeError:
+        text = ""
+    assert (
+        "unresolved search space" not in text and text
+    ), "BlendSearch should not produce warning about unresolved search space"
+
+
+if __name__ == "__main__":
+    test_unresolved_search_space(None)
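A small design note on the test above: `caplog` is pytest's built-in log-capture fixture, and the `caplog is not None` / `AttributeError` guards let the same function also run outside pytest as a plain script, which is why the `__main__` block calls `test_unresolved_search_space(None)`.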
@@ -53,6 +53,11 @@ def _easy_objective(config):


 def test_nested_run():
+    """
+    nested tuning example: Tune -> AutoML -> MLflow autolog
+    mlflow logging is complicated in nested tuning. It's better to turn off mlflow autologging to avoid
+    potential issues in FLAML's mlflow_integration.adopt_children() function.
+    """
     from flaml import AutoML, tune

     data, labels = sklearn.datasets.load_breast_cancer(return_X_y=True)

@@ -6,12 +6,12 @@ from sklearn.model_selection import train_test_split
 from flaml import tune
 from flaml.automl.model import LGBMEstimator

-data = fetch_california_housing(return_X_y=False, as_frame=True)
+data = fetch_california_housing(return_X_y=False, as_frame=True, data_home="test")
 df, X, y = data.frame, data.data, data.target
 df_train, _, X_train, X_test, _, y_test = train_test_split(df, X, y, test_size=0.33, random_state=42)
 csv_file_name = "test/housing.csv"
 df_train.to_csv(csv_file_name, index=False)
-# X, y = fetch_california_housing(return_X_y=True, as_frame=True)
+# X, y = fetch_california_housing(return_X_y=True, as_frame=True, data_home="test")
 # X_train, X_test, y_train, y_test = train_test_split(
 #     X, y, test_size=0.33, random_state=42
 # )
Some files were not shown because too many files have changed in this diff.