Support spark dataframe as input dataset and spark models as estimators (#934)

* add basic support to Spark dataframe; add support to SynapseML LightGBM model; update to pyspark>=3.2.0 to leverage pandas_on_Spark API
* clean code, add TODOs
* add sample_train_data for pyspark.pandas dataframe, fix bugs
* improve some functions, fix bugs
* fix dict changing size during iteration
* update model predict
* update LightGBM model, update test
* update SynapseML LightGBM params
* update synapseML and tests
* update TODOs
* Added support to roc_auc for spark models
* Added support to score of spark estimator
* Added test for automl score of spark estimator
* Added cv support to pyspark.pandas dataframe
* Update test, fix bugs
* Added tests
* Updated docs, tests, added a notebook
* Fix bugs in non-spark env
* Fix bugs and improve tests
* Fix uninstall pyspark
* Fix test errors
* Fix java.lang.OutOfMemoryError: Java heap space
* Fix test_performance
* Update test_sparkml to test_0sparkml to use the expected spark conf
* Remove unnecessary widgets in notebook
* Fix iloc java.lang.StackOverflowError
* fix pre-commit
* Added params check for spark dataframes
* Refactor code for train_test_split to a function
* Update train_test_split_pyspark
* Refactor if-else, remove unnecessary code
* Remove y from predict, remove mem control from n_iter compute
* Update workflow
* Improve _split_pyspark
* Fix test failure of too short training time
* Fix typos, improve docstrings
* Fix index errors of pandas_on_spark, add spark loss metric
* Fix typo of ndcgAtK
* Update NDCG metrics and tests
* Remove useless logger
* Use cache and count to ensure consistent indexes
* refactor for merge main
* fix errors of refactor
* Updated SparkLightGBMEstimator and cache
* Updated config2params
* Remove unused import
* Fix unknown parameters
* Update default_estimator_list
* Add unit tests for spark metrics
2026-02-09 02:09:16 +08:00 · 2023-03-26 03:59:46 +08:00
parent a3e770eac5
commit 50334f2c52
24 changed files with 3017 additions and 235 deletions
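The workflow change below installs pyspark on one matrix entry and uninstalls it on another, so the test suite runs both with and without the optional dependency. A minimal sketch of the kind of guard that makes this possible follows. The helper names `has_module` and `pick_backend` are hypothetical and are not taken from this commit; only `importlib.util.find_spec` is a real standard-library call.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module named `name` can be located.

    Hypothetical helper: it only checks that the module is discoverable
    and does not import it, so it is cheap to call at start-up.
    """
    return importlib.util.find_spec(name) is not None


def pick_backend() -> str:
    """Pick a dataframe backend label depending on whether pyspark is present.

    Assumption for illustration: the caller switches between a Spark code
    path and a plain pandas code path based on this label.
    """
    return "spark" if has_module("pyspark") else "pandas"


if __name__ == "__main__":
    # Prints "spark" when pyspark is installed, "pandas" otherwise.
    print(pick_backend())
```

Checking with `find_spec` rather than a bare `import pyspark` keeps the check free of import side effects; code that actually needs Spark would still import it inside the Spark-only branch.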
--- a/.github/workflows/python-package.yml
+++ b/.github/workflows/python-package.yml
@@ -25,7 +25,6 @@ jobs:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-2019]
        python-version: ["3.7", "3.8", "3.9", "3.10"]
-
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
@@ -45,21 +44,18 @@ jobs:
          export CFLAGS="$CFLAGS -I/usr/local/opt/libomp/include"
          export CXXFLAGS="$CXXFLAGS -I/usr/local/opt/libomp/include"
          export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
-      - name: On Linux, install Spark stand-alone cluster and PySpark
-        if: matrix.os == 'ubuntu-latest'
+      - name: On Linux + python 3.8, install pyspark 3.2.3
+        if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.8'
        run: |
-          sudo apt-get update && sudo apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends ca-certificates-java ca-certificates openjdk-17-jdk-headless && sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
-          wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz?action=download" -O - | tar -xzC /tmp; archive=$(basename "spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz") bash -c "sudo mv -v /tmp/\${archive/%.tgz/} /spark"
-          pip install --no-cache-dir pyspark>=3.0
-          export SPARK_HOME=/spark
-          export PYTHONPATH=/spark/python/lib/py4j-0.10.9.5-src.zip:/spark/python
-          export PATH=$PATH:$SPARK_HOME/bin
+          python -m pip install --upgrade pip wheel
+          pip install pyspark==3.2.3
      - name: Install packages and dependencies
        run: |
          python -m pip install --upgrade pip wheel
          pip install -e .
          python -c "import flaml"
          pip install -e .[test]
+          pip list | grep "pyspark"
      - name: If linux, install ray 2
        if: matrix.os == 'ubuntu-latest'
        run: |
@@ -76,6 +72,11 @@ jobs:
        if: matrix.python-version != '3.10'
        run: |
          pip install -e .[vw]
+      - name: Uninstall pyspark on python 3.9
+        if: matrix.python-version == '3.9'
+        run: |
+          # Uninstall pyspark to test env without pyspark
+          pip uninstall -y pyspark
      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names