Mirror of https://github.com/microsoft/FLAML.git (synced 2026-02-09 02:09:16 +08:00)
Support spark dataframe as input dataset and spark models as estimators (#934)
* Add basic support for Spark dataframes; add support for the SynapseML LightGBM model; update to pyspark>=3.2.0 to leverage the pandas-on-Spark API
* Clean code, add TODOs
* Add sample_train_data for pyspark.pandas dataframes, fix bugs
* Improve some functions, fix bugs
* Fix "dict changed size during iteration" error
* Update model predict
* Update LightGBM model, update test
* Update SynapseML LightGBM params
* Update SynapseML and tests
* Update TODOs
* Added roc_auc support for Spark models
* Added score support for Spark estimators
* Added test for automl score of Spark estimators
* Added CV support for pyspark.pandas dataframes
* Update tests, fix bugs
* Added tests
* Updated docs and tests, added a notebook
* Fix bugs in non-Spark env
* Fix bugs and improve tests
* Fix pyspark uninstall step
* Fix test errors
* Fix java.lang.OutOfMemoryError: Java heap space
* Fix test_performance
* Rename test_sparkml to test_0sparkml to use the expected Spark conf
* Remove unnecessary widgets in notebook
* Fix iloc java.lang.StackOverflowError
* Fix pre-commit
* Added params check for Spark dataframes
* Refactor train_test_split code into a function
* Update train_test_split_pyspark
* Refactor if-else, remove unnecessary code
* Remove y from predict, remove mem control from n_iter compute
* Update workflow
* Improve _split_pyspark
* Fix test failure caused by too-short training time
* Fix typos, improve docstrings
* Fix index errors of pandas_on_spark, add Spark loss metric
* Fix typo in ndcgAtK
* Update NDCG metrics and tests
* Remove unhelpful logger
* Use cache and count to ensure consistent indexes
* Refactor for merging main
* Fix refactoring errors
* Updated SparkLightGBMEstimator and cache
* Updated config2params
* Remove unused import
* Fix unknown parameters
* Update default_estimator_list
* Add unit tests for Spark metrics
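The headline change lets flaml.AutoML consume a pandas-on-Spark dataframe and tune the SynapseML LightGBM model as an estimator. Below is a minimal sketch of the intended usage, assuming a working Spark session with SynapseML available, and assuming the Spark LightGBM wrapper is registered under the estimator name "lgbm_spark"; the data and column names are illustrative only.

    import pyspark.pandas as ps
    from flaml import AutoML

    # Build a pandas-on-Spark dataframe (the commit bumps the requirement to
    # pyspark>=3.2.0 precisely so this API is available).
    psdf = ps.DataFrame(
        {
            "x1": list(range(100)),
            "x2": [i % 7 for i in range(100)],
            "y": [i % 2 for i in range(100)],
        }
    )

    automl = AutoML()
    automl.fit(
        dataframe=psdf,  # Spark dataframe input added by this PR
        label="y",
        task="classification",
        metric="roc_auc",  # roc_auc support for Spark models, per the log above
        estimator_list=["lgbm_spark"],  # assumed name of the SynapseML LightGBM estimator
        time_budget=30,
    )
    # predict takes features only ("Remove y from predict" in the log above)
    print(automl.predict(psdf.drop(columns=["y"])))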
.github/workflows/python-package.yml (19 changed lines)
@@ -25,7 +25,6 @@ jobs:
       matrix:
         os: [ubuntu-latest, macos-latest, windows-2019]
         python-version: ["3.7", "3.8", "3.9", "3.10"]
-
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
@@ -45,21 +44,18 @@ jobs:
           export CFLAGS="$CFLAGS -I/usr/local/opt/libomp/include"
           export CXXFLAGS="$CXXFLAGS -I/usr/local/opt/libomp/include"
           export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
-      - name: On Linux, install Spark stand-alone cluster and PySpark
-        if: matrix.os == 'ubuntu-latest'
+      - name: On Linux + python 3.8, install pyspark 3.2.3
+        if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.8'
         run: |
-          sudo apt-get update && sudo apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends ca-certificates-java ca-certificates openjdk-17-jdk-headless && sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
-          wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz?action=download" -O - | tar -xzC /tmp; archive=$(basename "spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz") bash -c "sudo mv -v /tmp/\${archive/%.tgz/} /spark"
-          pip install --no-cache-dir pyspark>=3.0
-          export SPARK_HOME=/spark
-          export PYTHONPATH=/spark/python/lib/py4j-0.10.9.5-src.zip:/spark/python
-          export PATH=$PATH:$SPARK_HOME/bin
+          python -m pip install --upgrade pip wheel
+          pip install pyspark==3.2.3
       - name: Install packages and dependencies
         run: |
           python -m pip install --upgrade pip wheel
           pip install -e .
           python -c "import flaml"
           pip install -e .[test]
+          pip list | grep "pyspark"
       - name: If linux, install ray 2
         if: matrix.os == 'ubuntu-latest'
         run: |
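The hunk above replaces the hand-rolled Spark standalone install with a plain pip install, which works because the pyspark wheel bundles Spark itself: a local-mode session needs no SPARK_HOME, PYTHONPATH, or cluster setup (a JDK must still be present on the runner). A minimal smoke test of that assumption, illustrative and not part of the workflow:

    from pyspark.sql import SparkSession

    # Start Spark in local mode straight from the pip-installed package; no
    # SPARK_HOME export or standalone cluster download is required.
    spark = SparkSession.builder.master("local[1]").appName("ci-smoke-test").getOrCreate()
    assert spark.range(3).count() == 3  # trivial job to confirm the JVM side starts
    spark.stop()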
@@ -76,6 +72,11 @@ jobs:
         if: matrix.python-version != '3.10'
         run: |
           pip install -e .[vw]
+      - name: Uninstall pyspark on python 3.9
+        if: matrix.python-version == '3.9'
+        run: |
+          # Uninstall pyspark to test env without pyspark
+          pip uninstall -y pyspark
       - name: Lint with flake8
         run: |
           # stop the build if there are Python syntax errors or undefined names
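The new step uninstalls pyspark on Python 3.9 so the suite also covers environments without Spark. Code that takes Spark-dependent paths typically sits behind an import guard; the following is a sketch of that pattern under stated assumptions, with names that are illustrative rather than FLAML's actual internals:

    # Guard the optional pyspark dependency so the same code base runs with or
    # without Spark installed; this is the situation the 3.9 CI job exercises.
    try:
        import pyspark  # noqa: F401

        _HAVE_PYSPARK = True
    except ImportError:
        _HAVE_PYSPARK = False


    def default_backend() -> str:
        """Pick the dataframe backend: Spark when available, else plain pandas."""
        return "spark" if _HAVE_PYSPARK else "pandas"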