Compare commits

...

69 Commits

Author SHA1 Message Date
dependabot[bot]
13aec414ea Bump brace-expansion from 1.1.11 to 1.1.12 in /website (#1453)
Bumps [brace-expansion](https://github.com/juliangruber/brace-expansion) from 1.1.11 to 1.1.12.
- [Release notes](https://github.com/juliangruber/brace-expansion/releases)
- [Commits](https://github.com/juliangruber/brace-expansion/compare/1.1.11...v1.1.12)

---
updated-dependencies:
- dependency-name: brace-expansion
  dependency-version: 1.1.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-08-14 10:50:51 +08:00
Li Jiang
bb16dcde93 Bump version to 2.3.6 (#1451) 2025-08-05 14:29:36 +08:00
Li Jiang
be81a76da9 Fix TypeError of customized kfold method which needs 'y' (#1450) 2025-08-02 08:05:50 +08:00
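For context on the fix above: FLAML accepts a custom cross-validation splitter via `split_type`, and #1450 addresses splitters whose `split()` needs the labels. A minimal sketch, assuming a scikit-learn-style splitter and an arbitrary dataset and budget:

```python
from flaml import AutoML
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

automl = AutoML()
automl.fit(
    X, y,
    task="classification",
    eval_method="cv",
    # a customized splitter whose split(X, y) needs the labels
    split_type=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    time_budget=30,
)
```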
Li Jiang
2d16089529 Improve FAQ docs (#1448)
* Fix settings usage error

* Add new code example
2025-07-09 18:33:10 +08:00
Li Jiang
01c3c83653 Install wheel and setuptools (#1443) 2025-05-28 12:56:48 +08:00
Li Jiang
9b66103f7c Fix typo, add quotes to python-version (#1442) 2025-05-28 12:24:00 +08:00
Li Jiang
48dfd72e64 Fix CD actions (#1441)
* Fix CD actions

* Skip Build if no relevant changes
2025-05-28 10:45:27 +08:00
Li Jiang
dec92e5b02 Upgrade python 3.8 to 3.10 in github actions (#1440) 2025-05-27 21:34:21 +08:00
Li Jiang
22911ea1ef Merged PR 1685054: Add more logs and function wait_futures for easier post analysis (#1438)
- Add function wait_futures for easier post analysis
- Use logger instead of print

----
#### AI description  (iteration 1)
#### PR Classification
A code enhancement for debugging asynchronous mlflow logging and improving post-run analysis.

#### PR Summary
This PR adds detailed debug logging to the mlflow integration and introduces a new `wait_futures` function to streamline the collection of asynchronous task results for improved analysis.
- `flaml/fabric/mlflow.py`: Added debug log statements around starting and ending mlflow runs to trace run IDs and execution flow.
- `flaml/automl/automl.py`: Implemented the `wait_futures` function to handle asynchronous task results and replaced a print call with `logger.info` for consistent logging.
<!-- GitOpsUserAgent=GitOps.Apps.Server.pullrequestcopilot -->

Related work items: #4029592
2025-05-27 15:32:56 +08:00
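`wait_futures` is internal to FLAML and its code isn't shown on this page; a minimal sketch of the idea the PR summary describes (gathering asynchronous results and logging instead of printing), assuming a thin wrapper over `concurrent.futures`, might look like:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def wait_futures(futures):
    """Gather results of submitted tasks as they finish (illustrative only)."""
    results = []
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception:
            logger.exception("a background task failed")
    return results


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(pow, i, 2) for i in range(8)]
    logger.info("collected results: %s", wait_futures(futures))
```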
murunlin
12183e5f73 Add the detailed info for parameter 'verbose' (#1435)
* explain-verbose-parameter

* concise-verbose-docstring

* explain-verbose-parameter

* explain-verbose-parameter

* test-ignore

* test-ignore

* sklearn-version-califonia

* submit-0526

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-27 10:01:01 +08:00
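The `verbose` levels documented in #1435 map to logger levels (3 = INFO by default, 2 = WARNING, and so on; see the automl.py docstring hunk later on this page). A small usage sketch with an arbitrary dataset and budget:

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
automl = AutoML()
# verbose=2 raises the logger threshold to WARNING, hiding per-iteration INFO messages
automl.fit(X, y, task="classification", time_budget=10, verbose=2)
print(automl.best_estimator, automl.best_loss)
```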
Li Jiang
c2b25310fc Sync Fabric till 2cd1c3da (#1433)
* Sync Fabric till 2cd1c3da

* Remove synapseml from tag names

* Fix 'NoneType' object has no attribute 'DataFrame'

* Deprecated 3.8 support

* Fix 'NoneType' object has no attribute 'DataFrame'

* Still use python 3.8 for pydoc

* Don't run tests in parallel

* Remove autofe and lowcode
2025-05-23 10:19:31 +08:00
murunlin
0f9420590d fix: best_model_for_estimator returns inconsistent feature_importances_ compared to automl.model (#1429)
* mrl-issue1422-0513

* fix version dependency

* fix datasets version

* test completion

---------

Co-authored-by: Runlin Mu (FESCO Adecco Human Resources) <v-runlinmu@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-05-15 09:37:34 +08:00
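#1429 makes `best_model_for_estimator` return the same fitted model as `automl.model` for the winning learner (see the automl.py hunk further down). A hedged consistency check, assuming a single-estimator run:

```python
from flaml import AutoML
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
automl = AutoML()
automl.fit(X, y, task="regression", estimator_list=["lgbm"], time_budget=20)

best = automl.best_model_for_estimator(automl.best_estimator)
# both handles should now expose identical feature importances
print(best.estimator.feature_importances_)
print(automl.model.estimator.feature_importances_)
```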
hexiang-x
5107c506b4 fix: When use_spark = True and mlflow_logging = True are set, logging the best model raises "'NoneType' object has no attribute 'save'" (#1432) 2025-05-14 19:34:06 +08:00
dependabot[bot]
9e219ef8dc Bump http-proxy-middleware from 2.0.7 to 2.0.9 in /website (#1425)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.7 to 2.0.9.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.9/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.7...v2.0.9)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-version: 2.0.9
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-04-23 14:22:12 +08:00
Li Jiang
6e4083743b Revert "Numpy 2.x is not supported yet. (#1424)" (#1426)
This reverts commit 17e95edd9e.
2025-04-22 21:31:44 +08:00
Li Jiang
17e95edd9e Numpy 2.x is not supported yet. (#1424) 2025-04-22 12:11:27 +08:00
Stickic-cyber
468bc62d27 Fix issue with "list index out of range" when max_iter=1 (#1419) 2025-04-09 21:54:17 +08:00
dependabot[bot]
437c239c11 Bump @babel/helpers from 7.20.1 to 7.26.10 in /website (#1413)
Bumps [@babel/helpers](https://github.com/babel/babel/tree/HEAD/packages/babel-helpers) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-helpers)

---
updated-dependencies:
- dependency-name: "@babel/helpers"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-14 15:51:06 +08:00
dependabot[bot]
8e753f1092 Bump @babel/runtime from 7.20.1 to 7.26.10 in /website (#1414)
Bumps [@babel/runtime](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime)

---
updated-dependencies:
- dependency-name: "@babel/runtime"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 21:34:02 +08:00
dependabot[bot]
a3b57e11d4 Bump prismjs from 1.29.0 to 1.30.0 in /website (#1411)
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.29.0 to 1.30.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.29.0...v1.30.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-13 14:06:41 +08:00
dependabot[bot]
a80dcf9925 Bump @babel/runtime-corejs3 from 7.20.1 to 7.26.10 in /website (#1412)
Bumps [@babel/runtime-corejs3](https://github.com/babel/babel/tree/HEAD/packages/babel-runtime-corejs3) from 7.20.1 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-runtime-corejs3)

---
updated-dependencies:
- dependency-name: "@babel/runtime-corejs3"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-13 10:04:03 +08:00
SkBlaz
7157af44e0 Improved error handling in case no scikit present (#1402)
* Improved error handling in case no scikit present

Currently there is no description for when this error is thrown. Being explicit seems of value.

* Update histgb.py

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-03-03 15:39:43 +08:00
Li Jiang
1798c4591e Upgrade setuptools (#1410) 2025-03-01 08:05:51 +08:00
Li Jiang
dd26263330 Bump version to 2.3.5 (#1409) 2025-02-17 22:26:59 +08:00
Li Jiang
2ba5f8bed1 Fix params pop error (#1408) 2025-02-17 15:06:05 +08:00
Daniel Grindrod
d0a11958a5 fix: Fixed bug where group folds and sample weights couldn't be used in the same automl instance (#1405) 2025-02-15 10:41:27 +08:00
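#1405 concerns supplying group folds and per-row sample weights to the same AutoML instance. A hedged sketch of such a call on synthetic data:

```python
import numpy as np
from flaml import AutoML

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + rng.normal(scale=0.1, size=400) > 0).astype(int)
groups = rng.integers(0, 20, size=400)           # group id per row
sample_weight = rng.uniform(0.5, 2.0, size=400)  # per-row weight

automl = AutoML()
automl.fit(
    X, y,
    task="classification",
    eval_method="cv",
    split_type="group",
    groups=groups,
    sample_weight=sample_weight,
    time_budget=30,
)
```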
dependabot[bot]
0ef9b00a75 Bump serialize-javascript from 6.0.0 to 6.0.2 in /website (#1407)
Bumps [serialize-javascript](https://github.com/yahoo/serialize-javascript) from 6.0.0 to 6.0.2.
- [Release notes](https://github.com/yahoo/serialize-javascript/releases)
- [Commits](https://github.com/yahoo/serialize-javascript/compare/v6.0.0...v6.0.2)

---
updated-dependencies:
- dependency-name: serialize-javascript
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2025-02-14 12:36:49 +08:00
Will Charles
840f76e5e5 Changed tune.report import for ray>=2 (#1392)
* Changed tune.report import for ray>=2

* env: Changed pydantic restriction in env

* Reverted Pydantic install conditions

* Reverted Pydantic install conditions

* test: Check if GPU is available

* tests: uncommented a line

* tests: Better fix for Ray GPU checking

* tests: Added timeout to dataset loading

* tests: Deleted _test_hf_data()

* test: Reduce lrl2 dataset size

* bug: timeout error

* bug: timeout error

* fix: Added threading check for timout issue

* Undo old commits

* Timeout fix from #1406

---------

Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
2025-02-14 09:38:33 +08:00
Li Jiang
d8b7d25b80 Fix test hang issue (#1406)
* Add try except to resource.setrlimit

* Set time limit only in main thread

* Check only test model

* Pytest debug

* Test separately

* Move test_model.py to automl folder
2025-02-13 19:50:35 +08:00
Li Jiang
6d53929803 Bump version to 2.3.4 (#1389) 2024-12-18 12:49:59 +08:00
Daniel Grindrod
c038fbca07 fix: KeyError no longer occurs when using group folds for regression tasks. (#1385)
* fix: Now resetting indexes for regression datasets when using group folds

* refactor: Simplified if statement to include all fold types

* docs: Updated docs to make it clear that group folds can be used for regression tasks

---------

Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-18 10:06:58 +08:00
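Per the docs change in #1385, 'group' is now listed as a valid split_type for regression tasks as well. A hedged sketch on synthetic data:

```python
import numpy as np
from flaml import AutoML

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=400)
groups = rng.integers(0, 25, size=400)

automl = AutoML()
automl.fit(X, y, task="regression", eval_method="cv",
           split_type="group", groups=groups, time_budget=30)
```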
dependabot[bot]
6a99202492 Bump nanoid from 3.3.6 to 3.3.8 in /website (#1387)
Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.3.6...3.3.8)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-12-17 19:26:34 +08:00
Daniel Grindrod
42d1dcfa0e fix: Fixed bug with catboost and groups (#1383)
Co-authored-by: Daniel Grindrod <daniel.grindrod@evotec.com>
2024-12-17 13:54:49 +08:00
EgorKraevTransferwise
b83c8a7d3b Pass cost_attr and cost_budget from flaml.tune.run() to the search algo (#1382) 2024-12-04 20:50:15 +08:00
dependabot[bot]
b9194cdcf2 Bump cross-spawn from 7.0.3 to 7.0.6 in /website (#1379)
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md)
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.6)

---
updated-dependencies:
- dependency-name: cross-spawn
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-20 15:48:39 +08:00
Li Jiang
9a1f6b0291 Bump version to 2.3.3 (#1378) 2024-11-13 11:44:34 +08:00
kernelmethod
07f4413aae Fix logging nuisances that can arise when importing flaml (#1377) 2024-11-13 07:49:55 +08:00
Daniel Grindrod
5a74227bc3 Flaml: fix lgbm reproducibility (#1369)
* fix: Fixed bug where every underlying LGBMRegressor or LGBMClassifier had n_estimators = 1

* test: Added test showing case where FLAMLised CatBoostModel result isn't reproducible

* fix: Fixing issue where callbacks cause LGBM results to not be reproducible

* Update test/automl/test_regression.py

Co-authored-by: Li Jiang <bnujli@gmail.com>

* fix: Adding back the LGBM EarlyStopping

* refactor: Fix tweaked to ensure other models aren't likely to be affected

* test: Fixed test to allow reproduced results to be better than the FLAML results, when LGBM earlystopping is involved

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-11-01 10:06:15 +08:00
Ranuga
7644958e21 Add documentation for automl.model.estimator usage (#1311)
* Added documentation for automl.model.estimator usage

Updated documentation across various examples and the model.py file to cover automl.model.estimator, giving users clear guidance on how to access the underlying estimator in their AutoML workflows.

* fix: Ran pre-commit hook on docs

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Daniel Grindrod <dannycg1996@gmail.com>
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 20:53:54 +08:00
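The documentation added in #1311 covers accessing the underlying estimator after a search. A short usage sketch (dataset, learner list, and budget are arbitrary):

```python
from flaml import AutoML
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
automl = AutoML()
automl.fit(X, y, task="regression", time_budget=20, estimator_list=["lgbm"])

print(automl.best_estimator)    # name of the winning learner, e.g. "lgbm"
print(automl.model)             # FLAML's wrapper around the trained model
print(automl.model.estimator)   # the underlying LGBMRegressor, usable directly
y_pred = automl.model.estimator.predict(X[:5])
```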
Daniel Grindrod
a316f84fe1 fix: LinearSVC results now reproducible (#1376)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-31 14:02:16 +08:00
Daniel Grindrod
72881d3a2b fix: Fixing the random state of ElasticNetClassifier by default, to ensure reproducibility. Also included elasticnet in reproducibility tests (#1374)
Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-29 14:21:43 +08:00
Li Jiang
69da685d1e Fix data transform issue, spark log_loss metric compute error and json dumps TypeError (Sync Fabric till 3c545e67) (#1371)
* Merged PR 1444697: Fix json dumps TypeError

Fix json dumps TypeError

----
Bug fix to address a `TypeError` in `json.dumps`.

This pull request fixes a `TypeError` encountered when using `json.dumps` on `automl._automl_user_configurations` by introducing a safe JSON serialization function.
- Added `safe_json_dumps` function in `flaml/fabric/mlflow.py` to handle non-serializable objects.
- Updated `MLflowIntegration` class in `flaml/fabric/mlflow.py` to use `safe_json_dumps` for JSON serialization.
- Modified `test/automl/test_multiclass.py` to test the new `safe_json_dumps` function.

Related work items: #3439408

* Fix data transform issue and spark log_loss metric compute error
2024-10-29 11:58:40 +08:00
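`safe_json_dumps` lives in flaml/fabric/mlflow.py and its exact implementation isn't shown here; the idea of falling back to a string representation for values `json.dumps` can't handle can be sketched as:

```python
import json

import numpy as np


def safe_json_dumps(obj, **kwargs):
    """Serialize obj, stringifying anything json.dumps cannot handle (illustrative only)."""
    return json.dumps(obj, default=str, **kwargs)


# user configurations holding numpy types or estimator classes no longer raise TypeError
print(safe_json_dumps({"time_budget": 60, "dtype": np.float64}))
```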
Li Jiang
c01c3910eb Update version.py (#1372) 2024-10-29 09:33:23 +08:00
dependabot[bot]
98d3fd2f48 Bump http-proxy-middleware from 2.0.6 to 2.0.7 in /website (#1370)
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware) from 2.0.6 to 2.0.7.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases)
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.7/CHANGELOG.md)
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.6...v2.0.7)

---
updated-dependencies:
- dependency-name: http-proxy-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 10:43:28 +08:00
Li Jiang
9724c626cc Remove outdated comment (#1366) 2024-10-24 12:17:21 +08:00
smty2018
0d92400200 Documented that retrain_full = True does not include the user-provided validation data. #1228 (#1245)
* Update Task-Oriented-AutoML.md

* Update Task-Oriented-AutoML.md

* Update marker

* Fix format

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-10-23 16:48:45 +08:00
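Per the docs change in #1245, with retrain_full=True the final refit uses only the training data, not the user-provided validation set. A hedged example of the holdout setup it refers to:

```python
from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

automl = AutoML()
automl.fit(
    X_train, y_train,
    X_val=X_val, y_val=y_val,   # used for model selection only
    eval_method="holdout",
    retrain_full=True,          # the final model is refit on X_train/y_train alone
    task="classification",
    time_budget=20,
)
```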
Daniel Grindrod
d224218ecf fix: FLAML catboost metrics aren't reproducible (#1364)
* fix: CatBoostRegressors metrics are now reproducible

* test: Made tests live, which ensure the reproducibility of catboost models

* fix: Added defunct line of code as a comment

* fix: Re-adding removed if statement, and test to show one issue that if statement can cause

* fix: Stopped ending CatBoost training early when time budget is running out

---------

Co-authored-by: Daniel Grindrod <Daniel.Grindrod@evotec.com>
2024-10-23 13:51:23 +08:00
Daniel Grindrod
a2a5e1abb9 test: Adding tests to verify model reproducibility (#1362) 2024-10-12 09:53:16 +08:00
Daniel Grindrod
5c0f18b7bc fix: Cross validation process isn't always run to completion (#1360) 2024-10-01 08:24:53 +08:00
dependabot[bot]
e5d95f5674 Bump express from 4.19.2 to 4.21.0 in /website (#1357) 2024-09-22 11:01:00 +08:00
Li Jiang
49ba962d47 Support logger_formatter without automl dependencies (#1356) 2024-09-21 20:04:46 +08:00
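#1356 makes `logger_formatter` usable without the automl extras (see the flaml/__init__.py hunk later on this page, which shows the `flaml.automl.logger` import path). A minimal sketch of attaching it to a handler:

```python
import logging

from flaml.automl.logger import logger_formatter  # import path shown in the diff on this page

handler = logging.StreamHandler()
handler.setFormatter(logger_formatter)
logging.getLogger("flaml").addHandler(handler)
logging.getLogger("flaml").setLevel(logging.INFO)
```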
Li Jiang
8e171bc402 Remove temporary pickle files (#1354)
* Remove temporary pickle files

* Update version to 2.3.1

* Use TemporaryDirectory for pickle and log_artifact

* Fix 'CatBoostClassifier' object has no attribute '_get_param_names'
2024-09-21 15:46:32 +08:00
dependabot[bot]
c90946f303 Bump webpack from 5.76.1 to 5.94.0 in /website (#1342)
Bumps [webpack](https://github.com/webpack/webpack) from 5.76.1 to 5.94.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.76.1...v5.94.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-06 11:56:42 +08:00
dependabot[bot]
64f30af603 Bump micromatch from 4.0.5 to 4.0.8 in /website (#1343)
Bumps [micromatch](https://github.com/micromatch/micromatch) from 4.0.5 to 4.0.8.
- [Release notes](https://github.com/micromatch/micromatch/releases)
- [Changelog](https://github.com/micromatch/micromatch/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/micromatch/compare/4.0.5...4.0.8)

---
updated-dependencies:
- dependency-name: micromatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-09-05 15:18:26 +08:00
Li Jiang
f45582d3c7 Add info of tutorial automl 2024 (#1344)
* Add info of tutorial automl 2024

* Add notebooks

* Fix links

* Update usage of built-in LLMs
2024-09-04 19:35:09 +08:00
Li Jiang
bf4bca2195 Add contributors wall (#1341)
* Add contributors wall

* code format
2024-08-30 22:33:44 +08:00
Li Jiang
efaba26d2e Update version and readme (#1338)
* Update version and readme

* Update pr template
2024-08-22 22:33:23 +00:00
Li Jiang
62194f321d Update issue templates (#1337) 2024-08-21 10:00:48 +00:00
Li Jiang
5bfa0b1cd3 Improve mlflow integration and add more models (#1331)
* Add more spark models and improved mlflow integration

* Update test_extra_models, setup and gitignore

* Remove autofe

* Remove autofe

* Remove autofe

* Sync changes in internal

* Fix test for env without pyspark

* Fix import errors

* Fix tests

* Fix typos

* Fix pytorch-forecasting version

* Remove internal funcs, rename _mlflow.py

* Fix import error

* Fix dependency

* Fix experiment name setting

* Fix dependency

* Update pandas version

* Update pytorch-forecasting version

* Add warning message for not has_automl

* Fix test errors with nltk 3.8.2

* Don't enable mlflow logging w/o an active run

* Fix pytorch-forecasting can't be pickled issue

* Update pyspark tests condition

* Update synapseml

* Update synapseml

* No parent run, no logging for OSS

* Log when autolog is enabled

* upgrade code

* Enable autolog for tune

* Increase time budget for test

* End run before start a new run

* Update parent run

* Fix import error

* clean up

* skip macos and win

* Update notes

* Update default value of model_history
2024-08-13 07:53:47 +00:00
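Several notes in #1331 concern when FLAML logs to MLflow (only with an active run or autologging enabled; trials go into nested runs). A hedged usage sketch, with a hypothetical experiment name:

```python
import mlflow
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("flaml-automl-demo")  # hypothetical experiment name
with mlflow.start_run():                    # FLAML logs only when a run is active (or autolog is on)
    automl = AutoML()
    automl.fit(X, y, task="classification", time_budget=20, mlflow_logging=True)
```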
dependabot[bot]
bd34b4e75a Bump express from 4.18.2 to 4.19.2 in /website (#1293)
Bumps [express](https://github.com/expressjs/express) from 4.18.2 to 4.19.2.
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.2...4.19.2)

---
updated-dependencies:
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:55:25 +00:00
dependabot[bot]
7670945298 Bump follow-redirects from 1.15.4 to 1.15.6 in /website (#1291)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.4 to 1.15.6.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.4...v1.15.6)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:52:11 +00:00
dependabot[bot]
43537cb539 Bump webpack-dev-middleware from 5.3.3 to 5.3.4 in /website (#1292)
Bumps [webpack-dev-middleware](https://github.com/webpack/webpack-dev-middleware) from 5.3.3 to 5.3.4.
- [Release notes](https://github.com/webpack/webpack-dev-middleware/releases)
- [Changelog](https://github.com/webpack/webpack-dev-middleware/blob/v5.3.4/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-dev-middleware/compare/v5.3.3...v5.3.4)

---
updated-dependencies:
- dependency-name: webpack-dev-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 12:50:17 +00:00
Gökhan Geyik
f913b79225 Fix(doc): Page Not Found (#1296)
- Fix the redirect link that received a page not found error.

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-08-12 12:01:46 +00:00
dependabot[bot]
a092a39b5e Bump braces from 3.0.2 to 3.0.3 in /website (#1336)
Bumps [braces](https://github.com/micromatch/braces) from 3.0.2 to 3.0.3.
- [Changelog](https://github.com/micromatch/braces/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/braces/compare/3.0.2...3.0.3)

---
updated-dependencies:
- dependency-name: braces
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 08:37:56 +00:00
Jirka Borovec
04bf1b8741 update py versions, sourced from PyPI (#1332)
* update py versions, sourced from PyPI

* lint

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 04:53:48 +00:00
Jirka Borovec
b348cb1136 configure & apply pyupgrade with py3.8+ (#1333)
* configure pyupgrade with `py3.8+`

* apply update

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:54:18 +00:00
Jirka Borovec
cd0e88e383 fix missing req. arg for new datasets package (#1334)
Co-authored-by: Li Jiang <bnujli@gmail.com>
2024-08-12 02:19:11 +00:00
Li Jiang
a17c6e392e Fix test errors of nltk and numpy (#1335)
* Fix test errors with nltk 3.8.2

* Fix test errors with numpy large

* Fix test errors with numpy large
2024-08-12 00:14:21 +00:00
Li Jiang
52627ff14b Add 3.11 icon (#1330) 2024-08-08 06:18:49 +00:00
113 changed files with 14447 additions and 889 deletions

.github/ISSUE_TEMPLATE.md

@@ -0,0 +1,73 @@
### Description
<!-- A clear and concise description of the issue or feature request. -->
### Environment
- FLAML version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
- Python version: <!-- Specify the Python version (e.g., 3.8) -->
- Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
### Steps to Reproduce (for bugs)
<!-- Provide detailed steps to reproduce the issue. Include code snippets, configuration files, or any other relevant information. -->
1. Step 1
1. Step 2
1. ...
### Expected Behavior
<!-- Describe what you expected to happen. -->
### Actual Behavior
<!-- Describe what actually happened. Include any error messages, stack traces, or unexpected behavior. -->
### Screenshots / Logs (if applicable)
<!-- If relevant, include screenshots or logs that help illustrate the issue. -->
### Additional Information
<!-- Include any additional information that might be helpful, such as specific configurations, data samples, or context about the environment. -->
### Possible Solution (if you have one)
<!-- If you have suggestions on how to address the issue, provide them here. -->
### Is this a Bug or Feature Request?
<!-- Choose one: Bug | Feature Request -->
### Priority
<!-- Choose one: High | Medium | Low -->
### Difficulty
<!-- Choose one: Easy | Moderate | Hard -->
### Any related issues?
<!-- If this is related to another issue, reference it here. -->
### Any relevant discussions?
<!-- If there are any discussions or forum threads related to this issue, provide links. -->
### Checklist
<!-- Please check the items that you have completed -->
- [ ] I have searched for similar issues and didn't find any duplicates.
- [ ] I have provided a clear and concise description of the issue.
- [ ] I have included the necessary environment details.
- [ ] I have outlined the steps to reproduce the issue.
- [ ] I have included any relevant logs or screenshots.
- [ ] I have indicated whether this is a bug or a feature request.
- [ ] I have set the priority and difficulty levels.
### Additional Comments
<!-- Any additional comments or context that you think would be helpful. -->

.github/ISSUE_TEMPLATE/bug_report.yml

@@ -0,0 +1,53 @@
name: Bug Report
description: File a bug report
title: "[Bug]: "
labels: ["bug"]
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:
        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: modelused
    attributes:
      label: Model Used
      description: A description of the model that was used when the error was encountered
      placeholder: gpt-4, mistral-7B etc
  - type: textarea
    id: expected_behavior
    attributes:
      label: Expected Behavior
      description: A clear and concise description of what you expected to happen.
      placeholder: What should have happened?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details

.github/ISSUE_TEMPLATE/config.yml

@@ -0,0 +1 @@
blank_issues_enabled: true


@@ -0,0 +1,26 @@
name: Feature Request
description: File a feature request
labels: ["enhancement"]
title: "[Feature Request]: "
body:
  - type: textarea
    id: problem_description
    attributes:
      label: Is your feature request related to a problem? Please describe.
      description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
      placeholder: What problem are you trying to solve?
  - type: textarea
    id: solution_description
    attributes:
      label: Describe the solution you'd like
      description: A clear and concise description of what you want to happen.
      placeholder: How do you envision the solution?
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context or screenshots about the feature request here.
      placeholder: Any additional information


@@ -0,0 +1,41 @@
name: General Issue
description: File a general issue
title: "[Issue]: "
labels: []
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the issue
      description: A clear and concise description of what the issue is.
      placeholder: What went wrong?
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to reproduce
      description: |
        Steps to reproduce the behavior:
        1. Step 1
        2. Step 2
        3. ...
        4. See error
      placeholder: How can we replicate the issue?
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots and logs
      description: If applicable, add screenshots and logs to help explain your problem.
      placeholder: Add screenshots here
  - type: textarea
    id: additional_information
    attributes:
      label: Additional Information
      description: |
        - FLAML Version: <!-- Specify the FLAML version (e.g., v0.2.0) -->
        - Operating System: <!-- Specify the OS (e.g., Windows 10, Ubuntu 20.04) -->
        - Python Version: <!-- Specify the Python version (e.g., 3.8) -->
        - Related Issues: <!-- Link to any related issues here (e.g., #1) -->
        - Any other relevant information.
      placeholder: Any additional details


@@ -12,8 +12,7 @@
## Checks
<!-- - I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same in integrated in our CI checks). -->
- [ ] I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same in integrated in our CI checks).
- [ ] I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.


@@ -12,26 +12,17 @@ jobs:
deploy:
strategy:
matrix:
os: ['ubuntu-latest']
python-version: [3.8]
os: ["ubuntu-latest"]
python-version: ["3.10"]
runs-on: ${{ matrix.os }}
environment: package
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Cache conda
uses: actions/cache@v3
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
path: ~/conda_pkgs_dir
key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
- name: Setup Miniconda
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
auto-activate-base: false
activate-environment: hcrystalball
python-version: ${{ matrix.python-version }}
use-only-tar-bz2: true
- name: Install from source
# This is required for the pre-commit tests
shell: pwsh
@@ -42,7 +33,7 @@ jobs:
- name: Build
shell: pwsh
run: |
pip install twine
pip install twine wheel setuptools
python setup.py sdist bdist_wheel
- name: Publish to PyPI
env:


@@ -37,11 +37,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0
- name: pydoc-markdown run
run: |
pydoc-markdown
@@ -73,11 +73,11 @@ jobs:
- name: setup python
uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.10"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==4.5.0
pip install pydoc-markdown==4.7.0
- name: pydoc-markdown run
run: |
pydoc-markdown


@@ -14,6 +14,12 @@ on:
- 'setup.py'
pull_request:
branches: ['main']
paths:
- 'flaml/**'
- 'test/**'
- 'notebook/**'
- '.github/workflows/python-package.yml'
- 'setup.py'
merge_group:
types: [checks_requested]
@@ -29,8 +35,8 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-2019]
python-version: ["3.8", "3.9", "3.10", "3.11"]
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
@@ -50,14 +56,19 @@ jobs:
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
- name: Install packages and dependencies
run: |
python -m pip install --upgrade pip wheel
python -m pip install --upgrade pip wheel setuptools
pip install -e .
python -c "import flaml"
pip install -e .[test]
- name: On Ubuntu python 3.8, install pyspark 3.2.3
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-latest'
- name: On Ubuntu python 3.10, install pyspark 3.4.1
if: matrix.python-version == '3.10' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.2.3
pip install pyspark==3.4.1
pip list | grep "pyspark"
- name: On Ubuntu python 3.11, install pyspark 3.5.1
if: matrix.python-version == '3.11' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.5.1
pip list | grep "pyspark"
- name: If linux and python<3.11, install ray 2
if: matrix.os == 'ubuntu-latest' && matrix.python-version != '3.11'
@@ -77,20 +88,15 @@ jobs:
if: matrix.python-version == '3.8' || matrix.python-version == '3.9'
run: |
pip install -e .[vw]
- name: Uninstall pyspark on (python 3.9) or windows
if: matrix.python-version == '3.9' || matrix.os == 'windows-2019'
run: |
# Uninstall pyspark to test env without pyspark
pip uninstall -y pyspark
- name: Test with pytest
if: matrix.python-version != '3.10'
run: |
pytest test
pytest test/ --ignore=test/autogen
- name: Coverage
if: matrix.python-version == '3.10'
run: |
pip install coverage
coverage run -a -m pytest test
coverage run -a -m pytest test --ignore=test/autogen
coverage xml
- name: Upload coverage to Codecov
if: matrix.python-version == '3.10'

.gitignore

@@ -163,6 +163,24 @@ output/
flaml/tune/spark/mylearner.py
*.pkl
data/
benchmark/pmlb/csv_datasets
benchmark/*.csv
checkpoints/
test/default
test/housing.json
test/nlp/default/transformer_ms/seq-classification.json
flaml/fabric/fanova/_fanova.c
# local config files
*.config.local
local_debug/
patch.diff
# Test things
notebook/lightning_logs/
lightning_logs/
flaml/autogen/extensions/tmp/
test/autogen/my_tmp/


@@ -23,6 +23,13 @@ repos:
- id: end-of-file-fixer
- id: no-commit-to-branch
- repo: https://github.com/asottile/pyupgrade
rev: v2.31.1
hooks:
- id: pyupgrade
args: [--py38-plus]
name: Upgrade code
- repo: https://github.com/psf/black
rev: 23.3.0
hooks:


@@ -1,5 +1,5 @@
# basic setup
FROM mcr.microsoft.com/devcontainers/python:3.8
FROM mcr.microsoft.com/devcontainers/python:3.10
RUN apt-get update && apt-get -y update
RUN apt-get install -y sudo git npm


@@ -1,7 +1,7 @@
[![PyPI version](https://badge.fury.io/py/FLAML.svg)](https://badge.fury.io/py/FLAML)
![Conda version](https://img.shields.io/conda/vn/conda-forge/flaml)
[![Build](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml/badge.svg)](https://github.com/microsoft/FLAML/actions/workflows/python-package.yml)
![Python Version](https://img.shields.io/badge/3.8%20%7C%203.9%20%7C%203.10-blue)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/FLAML)](https://pypi.org/project/FLAML/)
[![Downloads](https://pepy.tech/badge/flaml)](https://pepy.tech/project/flaml)
[![](https://img.shields.io/discord/1025786666260111483?logo=discord&style=flat)](https://discord.gg/Cppx2vSPVP)
@@ -14,6 +14,8 @@
<br>
</p>
:fire: FLAML supports AutoML and Hyperparameter Tuning in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric). In addition, we've introduced Python 3.11 support, along with a range of new estimators, and comprehensive integration with MLflow—thanks to contributions from the Microsoft Fabric product team.
:fire: Heads-up: We have migrated [AutoGen](https://microsoft.github.io/autogen/) into a dedicated [github repository](https://github.com/microsoft/autogen). Alongside this move, we have also launched a dedicated [Discord](https://discord.gg/pAbnFJrkgZ) server and a [website](https://microsoft.github.io/autogen/) for comprehensive documentation.
:fire: The automated multi-agent chat framework in [AutoGen](https://microsoft.github.io/autogen/) is in preview from v2.0.0.
@@ -22,8 +24,6 @@
:fire: [autogen](https://microsoft.github.io/autogen/) is released with support for ChatGPT and GPT-4, based on [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673).
:fire: FLAML supports Code-First AutoML & Tuning Private Preview in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/).
## What is FLAML
FLAML is a lightweight Python library for efficient automation of machine
@@ -40,7 +40,7 @@ FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source,
## Installation
FLAML requires **Python version >= 3.8**. It can be installed from pip:
FLAML requires **Python version >= 3.9**. It can be installed from pip:
```bash
pip install flaml
@@ -154,3 +154,9 @@ provided by the bot. You will only need to do this once across all repos using o
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Contributors Wall
<a href="https://github.com/microsoft/flaml/graphs/contributors">
<img src="https://contrib.rocks/image?repo=microsoft/flaml&max=204" />
</a>


@@ -1,10 +1,20 @@
import logging
import warnings
from flaml.automl import AutoML, logger_formatter
try:
from flaml.automl import AutoML, logger_formatter
has_automl = True
except ImportError:
has_automl = False
from flaml.onlineml.autovw import AutoVW
from flaml.tune.searcher import CFO, FLOW2, BlendSearch, BlendSearchTuner, RandomSearch
from flaml.version import __version__
# Set the root logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
if logger.level == logging.NOTSET:
logger.setLevel(logging.INFO)
if not has_automl:
warnings.warn("flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.")


@@ -156,7 +156,7 @@ class MathUserProxyAgent(UserProxyAgent):
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,


@@ -123,7 +123,7 @@ class RetrieveUserProxyAgent(UserProxyAgent):
can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
- customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
**kwargs (dict): other kwargs in [UserProxyAgent](../user_proxy_agent#__init__).
"""
super().__init__(
name=name,


@@ -125,7 +125,7 @@ def improve_function(file_name, func_name, objective, **config):
"""(work in progress) Improve the function to achieve the objective."""
params = {**_IMPROVE_FUNCTION_CONFIG, **config}
# read the entire file into a str
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
response = oai.Completion.create(
{"func_name": func_name, "objective": objective, "file_string": file_string}, **params
@@ -158,7 +158,7 @@ def improve_code(files, objective, suggest_only=True, **config):
code = ""
for file_name in files:
# read the entire file into a string
with open(file_name, "r") as f:
with open(file_name) as f:
file_string = f.read()
code += f"""{file_name}:
{file_string}


@@ -130,7 +130,7 @@ def _fix_a_slash_b(string: str) -> str:
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
assert string == f"{a}/{b}"
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:


@@ -126,7 +126,7 @@ def split_files_to_chunks(
"""Split a list of files into chunks of max_tokens."""
chunks = []
for file in files:
with open(file, "r") as f:
with open(file) as f:
text = f.read()
chunks += split_text_to_chunks(text, max_tokens, chunk_mode, must_break_at_empty_line)
return chunks


@@ -1,5 +1,9 @@
from flaml.automl.automl import AutoML, size
from flaml.automl.logger import logger_formatter
from flaml.automl.state import AutoMLState, SearchState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
try:
from flaml.automl.automl import AutoML, size
from flaml.automl.state import AutoMLState, SearchState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]
except ImportError:
__all__ = ["logger_formatter"]


@@ -7,8 +7,10 @@ from __future__ import annotations
import json
import logging
import os
import random
import sys
import time
from concurrent.futures import as_completed
from functools import partial
from typing import Callable, List, Optional, Union
@@ -16,7 +18,7 @@ import numpy as np
from flaml import tune
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.ml import train_estimator
from flaml.automl.ml import huggingface_metric_to_mode, sklearn_metric_name_set, spark_metric_name_dict, train_estimator
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.state import AutoMLState, SearchState
from flaml.automl.task.factory import task_factory
@@ -45,6 +47,7 @@ ERROR = (
try:
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
except ImportError:
BaseEstimator = object
ERROR = ERROR or ImportError("please install flaml[automl] option to use the flaml.automl package.")
@@ -54,6 +57,14 @@ try:
except ImportError:
mlflow = None
try:
from flaml.fabric.mlflow import MLflowIntegration, get_mlflow_log_latency, infer_signature, is_autolog_enabled
internal_mlflow = True
except ImportError:
internal_mlflow = False
try:
from ray import __version__ as ray_version
@@ -171,15 +182,22 @@ class AutoML(BaseEstimator):
'better' only logs configs with better loss than previos iters
'all' logs all the tried configs.
model_history: A boolean of whether to keep the best
model per estimator. Make sure memory is large enough if setting to True.
model per estimator. Make sure memory is large enough if setting to True. Default False.
log_training_metric: A boolean of whether to log the training
metric for each model.
mem_thres: A float of the memory size constraint in bytes.
pred_time_limit: A float of the prediction latency constraint in seconds.
It refers to the average prediction time per row in validation data.
train_time_limit: A float of the training time constraint in seconds.
train_time_limit: None or a float of the training time constraint in seconds for each trial.
Only valid for sequential search.
verbose: int, default=3 | Controls the verbosity, higher means more
messages.
verbose=0: logger level = CRITICAL
verbose=1: logger level = ERROR
verbose=2: logger level = WARNING
verbose=3: logger level = INFO
verbose=4: logger level = DEBUG
verbose>5: logger level = NOTSET
retrain_full: bool or str, default=True | whether to retrain the
selected model on the full training data when using holdout.
True - retrain only after search finishes; False - no retraining;
@@ -193,7 +211,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -247,7 +265,10 @@ class AutoML(BaseEstimator):
search is considered to converge.
force_cancel: boolean, default=False | Whether to forcely cancel Spark jobs if the
search time exceeded the time budget.
append_log: boolean, default=False | Whether to directly append the log
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
enable mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
same name as the basename of main entry file.
append_log: boolean, default=False | Whetehr to directly append the log
records to the input log file if it exists.
auto_augment: boolean, default=True | Whether to automatically
augment rare classes.
@@ -320,9 +341,7 @@ class AutoML(BaseEstimator):
}
}
```
mlflow_logging: boolean, default=True | Whether to log the training results to mlflow.
This requires mlflow to be installed and to have an active mlflow run.
FLAML will create nested runs.
mlflow_logging: boolean, default=True | Whether to log the training results to mlflow. Not valid if mlflow is not installed.
"""
if ERROR:
@@ -331,6 +350,8 @@ class AutoML(BaseEstimator):
self._state = AutoMLState()
self._state.learner_classes = {}
self._settings = settings
self._automl_user_configurations = settings.copy()
self._settings.pop("automl_user_configurations", None)
# no budget by default
settings["time_budget"] = settings.get("time_budget", -1)
settings["task"] = settings.get("task", "classification")
@@ -362,6 +383,7 @@ class AutoML(BaseEstimator):
settings["preserve_checkpoint"] = settings.get("preserve_checkpoint", True)
settings["early_stop"] = settings.get("early_stop", False)
settings["force_cancel"] = settings.get("force_cancel", False)
settings["mlflow_exp_name"] = settings.get("mlflow_exp_name", None)
settings["append_log"] = settings.get("append_log", False)
settings["min_sample_size"] = settings.get("min_sample_size", MIN_SAMPLE_TRAIN)
settings["use_ray"] = settings.get("use_ray", False)
@@ -377,6 +399,7 @@ class AutoML(BaseEstimator):
settings["mlflow_logging"] = settings.get("mlflow_logging", True)
self._estimator_type = "classifier" if settings["task"] in CLASSIFICATION else "regressor"
self.best_run_id = None
def get_params(self, deep: bool = False) -> dict:
return self._settings.copy()
@@ -409,6 +432,8 @@ class AutoML(BaseEstimator):
If `model_history` was set to True, then the returned model is trained.
"""
state = self._search_states.get(estimator_name)
if state and estimator_name == self._best_estimator:
return self.model
return state and getattr(state, "trained_estimator", None)
@property
@@ -475,14 +500,29 @@ class AutoML(BaseEstimator):
with open(filename, "w") as f:
json.dump(best, f)
@property
def supported_metrics(self):
"""
Returns a tuple of supported metrics for the task.
Returns:
metrics (Tuple): sklearn metrics from sklearn package;
huggingface metrics from datasets package;
spark metrics from pyspark package
"""
return sklearn_metric_name_set, huggingface_metric_to_mode.keys(), spark_metric_name_dict
@property
def feature_transformer(self):
"""Returns feature transformer which is used to preprocess data before applying training or inference."""
return getattr(self, "_transformer", None)
"""Returns AutoML Transformer"""
data_precessor = getattr(self, "_transformer", None)
return data_precessor
@property
def label_transformer(self):
"""Returns label transformer which is used to preprocess labels before scoring, and inverse transform labels after inference."""
"""Returns AutoML label transformer"""
return getattr(self, "_label_transformer", None)
@property
@@ -521,8 +561,8 @@ class AutoML(BaseEstimator):
def score(
self,
X: Union[DataFrame, psDataFrame],
y: Union[Series, psSeries],
X: DataFrame | psDataFrame,
y: Series | psSeries,
**kwargs,
):
estimator = getattr(self, "_trained_estimator", None)
@@ -536,7 +576,7 @@ class AutoML(BaseEstimator):
def predict(
self,
X: Union[np.array, DataFrame, List[str], List[List[str]], psDataFrame],
X: np.array | DataFrame | list[str] | list[list[str]] | psDataFrame,
**pred_kwargs,
):
"""Predict label from features.
@@ -611,7 +651,7 @@ class AutoML(BaseEstimator):
"""
self._state.learner_classes[learner_name] = learner_class
def get_estimator_from_log(self, log_file_name: str, record_id: int, task: Union[str, Task]):
def get_estimator_from_log(self, log_file_name: str, record_id: int, task: str | Task):
"""Get the estimator from log file.
Args:
@@ -653,7 +693,7 @@ class AutoML(BaseEstimator):
dataframe=None,
label=None,
time_budget=np.inf,
task: Optional[Union[str, Task]] = None,
task: str | Task | None = None,
eval_method=None,
split_ratio=None,
n_splits=None,
@@ -709,7 +749,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -779,7 +819,7 @@ class AutoML(BaseEstimator):
max_epochs: int, default = 20 | Maximum number of epochs to run training,
only used by TemporalFusionTransformerEstimator.
batch_size: int, default = 64 | Batch size for training model, only
used by TemporalFusionTransformerEstimator.
used by TemporalFusionTransformerEstimator and TCNEstimator.
"""
task = task or self._settings.get("task")
if isinstance(task, str):
@@ -802,7 +842,7 @@ class AutoML(BaseEstimator):
)
task.validate_data(self, self._state, X_train, y_train, dataframe, label, groups=groups)
logger.info("log file name {}".format(log_file_name))
logger.info(f"log file name {log_file_name}")
best_config = None
best_val_loss = float("+inf")
@@ -855,9 +895,7 @@ class AutoML(BaseEstimator):
else:
self._state.fit_kwargs_by_estimator[best_estimator] = self._state.fit_kwargs
logger.info(
"estimator = {}, config = {}, #training instances = {}".format(best_estimator, best_config, sample_size)
)
logger.info(f"estimator = {best_estimator}, config = {best_config}, #training instances = {sample_size}")
# Partially copied from fit() function
# Initilize some attributes required for retrain_from_log
self._split_type = task.decide_split_type(
@@ -1028,7 +1066,7 @@ class AutoML(BaseEstimator):
return points
@property
def resource_attr(self) -> Optional[str]:
def resource_attr(self) -> str | None:
"""Attribute of the resource dimension.
Returns:
@@ -1038,7 +1076,7 @@ class AutoML(BaseEstimator):
return "FLAML_sample_size" if self._sample else None
@property
def min_resource(self) -> Optional[float]:
def min_resource(self) -> float | None:
"""Attribute for pruning.
Returns:
@@ -1047,7 +1085,7 @@ class AutoML(BaseEstimator):
return self._min_sample_size if self._sample else None
@property
def max_resource(self) -> Optional[float]:
def max_resource(self) -> float | None:
"""Attribute for pruning.
Returns:
@@ -1069,7 +1107,7 @@ class AutoML(BaseEstimator):
pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
@property
def trainable(self) -> Callable[[dict], Optional[float]]:
def trainable(self) -> Callable[[dict], float | None]:
"""Training function.
Returns:
A function that evaluates each config and returns the loss.
@@ -1155,7 +1193,7 @@ class AutoML(BaseEstimator):
dataframe=None,
label=None,
metric=None,
task: Optional[Union[str, Task]] = None,
task: str | Task | None = None,
n_jobs=None,
# gpu_per_trial=0,
log_file_name=None,
@@ -1203,6 +1241,7 @@ class AutoML(BaseEstimator):
skip_transform=None,
mlflow_logging=None,
fit_kwargs_by_estimator=None,
mlflow_exp_name=None,
**fit_kwargs,
):
"""Find a model for a given task.
@@ -1296,14 +1335,15 @@ class AutoML(BaseEstimator):
'all' logs all the tried configs.
model_history: A boolean of whether to keep the trained best
model per estimator. Make sure memory is large enough if setting to True.
Default value is False: best_model_for_estimator would return a
Default value is False. If False, best_model_for_estimator would return a
untrained model for non-best learner.
log_training_metric: A boolean of whether to log the training
metric for each model.
mem_thres: A float of the memory size constraint in bytes.
pred_time_limit: A float of the prediction latency constraint in seconds.
It refers to the average prediction time per row in validation data.
train_time_limit: None or a float of the training time constraint in seconds.
train_time_limit: None or a float of the training time constraint in seconds for each trial.
Only valid for sequential search.
X_val: None or a numpy array or a pandas dataframe of validation data.
y_val: None or a numpy array or a pandas series of validation labels.
sample_weight_val: None or a numpy array of the sample weight of
@@ -1316,6 +1356,12 @@ class AutoML(BaseEstimator):
for training data.
verbose: int, default=3 | Controls the verbosity, higher means more
messages.
verbose=0: logger level = CRITICAL
verbose=1: logger level = ERROR
verbose=2: logger level = WARNING
verbose=3: logger level = INFO
verbose=4: logger level = DEBUG
verbose>5: logger level = NOTSET
retrain_full: bool or str, default=True | whether to retrain the
selected model on the full training data when using holdout.
True - retrain only after search finishes; False - no retraining;
@@ -1329,7 +1375,7 @@ class AutoML(BaseEstimator):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
@@ -1382,7 +1428,10 @@ class AutoML(BaseEstimator):
early_stop: boolean, default=False | Whether to stop early if the
search is considered to converge.
force_cancel: boolean, default=False | Whether to forcely cancel the PySpark job if overtime.
append_log: boolean, default=False | Whether to directly append the log
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified if
enable mlflow autologging on Spark. Otherwise it will log all the results into the experiment of the
same name as the basename of main entry file.
append_log: boolean, default=False | Whetehr to directly append the log
records to the input log file if it exists.
auto_augment: boolean, default=True | Whether to automatically
augment rare classes.
@@ -1467,9 +1516,7 @@ class AutoML(BaseEstimator):
skip_transform: boolean, default=False | Whether to pre-process data prior to modeling.
mlflow_logging: boolean, default=None | Whether to log the training results to mlflow.
Default value is None, which means the logging decision is made based on
AutoML.__init__'s mlflow_logging argument.
This requires mlflow to be installed and to have an active mlflow run.
FLAML will create nested runs.
AutoML.__init__'s mlflow_logging argument. Not valid if mlflow is not installed.
fit_kwargs_by_estimator: dict, default=None | The user specified keywords arguments, grouped by estimator name.
For TransformersEstimator, available fit_kwargs can be found from
[TrainingArgumentsForAuto](nlp/huggingface/training_args).
@@ -1519,7 +1566,7 @@ class AutoML(BaseEstimator):
max_epochs: int, default = 20 | Maximum number of epochs to run training,
only used by TemporalFusionTransformerEstimator.
batch_size: int, default = 64 | Batch size for training model, only
used by TemporalFusionTransformerEstimator.
used by TemporalFusionTransformerEstimator and TCNEstimator.
"""
self._state._start_time_flag = self._start_time_flag = time.time()
@@ -1570,6 +1617,7 @@ class AutoML(BaseEstimator):
)
early_stop = self._settings.get("early_stop") if early_stop is None else early_stop
force_cancel = self._settings.get("force_cancel") if force_cancel is None else force_cancel
mlflow_exp_name = self._settings.get("mlflow_exp_name") if mlflow_exp_name is None else mlflow_exp_name
# no search budget is provided?
no_budget = time_budget < 0 and max_iter is None and not early_stop
append_log = self._settings.get("append_log") if append_log is None else append_log
@@ -1592,6 +1640,13 @@ class AutoML(BaseEstimator):
_ch.setFormatter(logger_formatter)
logger.addHandler(_ch)
if model_history:
logger.warning(
"With `model_history` set to `True`, all intermediate models are retained in memory, "
"which may significantly increase memory usage and slow down training. "
"Consider setting `model_history=False` to optimize memory and accelerate the training process."
)
if not use_ray and not use_spark and n_concurrent_trials > 1:
if ray_available:
logger.warning(
@@ -1622,7 +1677,6 @@ class AutoML(BaseEstimator):
self._use_ray = use_ray
# use the following condition if we have an estimation of average_trial_time and average_trial_overhead
# self._use_ray = use_ray or n_concurrent_trials > ( average_trial_time + average_trial_overhead) / (average_trial_time)
if self._use_ray is not False:
import ray
@@ -1656,11 +1710,29 @@ class AutoML(BaseEstimator):
self._state.fit_kwargs = fit_kwargs
custom_hp = custom_hp or self._settings.get("custom_hp")
self._skip_transform = self._settings.get("skip_transform") if skip_transform is None else skip_transform
self._mlflow_logging = self._settings.get("mlflow_logging") if mlflow_logging is None else mlflow_logging
self._mlflow_logging = (
False
if mlflow is None
else self._settings.get("mlflow_logging")
if mlflow_logging is None
else mlflow_logging
)
fit_kwargs_by_estimator = fit_kwargs_by_estimator or self._settings.get("fit_kwargs_by_estimator")
self._state.fit_kwargs_by_estimator = fit_kwargs_by_estimator.copy() # shallow copy of fit_kwargs_by_estimator
self._state.weight_val = sample_weight_val
self._mlflow_exp_name = mlflow_exp_name
self.mlflow_integration = None
self.autolog_extra_tag = {
"extra_tag.sid": f"flaml_{flaml_version}_{int(time.time())}_{random.randint(1001, 9999)}"
}
if internal_mlflow and self._mlflow_logging and (mlflow.active_run() or is_autolog_enabled()):
try:
self.mlflow_integration = MLflowIntegration("automl", mlflow_exp_name, extra_tag=self.autolog_extra_tag)
self._mlflow_exp_name = self.mlflow_integration.experiment_name
if not (mlflow.active_run() is not None or is_autolog_enabled()):
self.mlflow_integration.only_history = True
except KeyError:
logger.info("Not running in Fabric; skipping MLflow integration")
task.validate_data(
self,
self._state,
@@ -1688,7 +1760,7 @@ class AutoML(BaseEstimator):
logger.info(f"Data split method: {self._split_type}")
eval_method = self._decide_eval_method(eval_method, time_budget)
self._state.eval_method = eval_method
logger.info("Evaluation method: {}".format(eval_method))
logger.info(f"Evaluation method: {eval_method}")
self._state.cv_score_agg_func = cv_score_agg_func or self._settings.get("cv_score_agg_func")
self._retrain_in_budget = retrain_full == "budget" and (eval_method == "holdout" and self._state.X_val is None)
@@ -1705,13 +1777,9 @@ class AutoML(BaseEstimator):
if sample_size:
_sample_size_from_starting_points[_estimator] = sample_size
elif _point_per_estimator and isinstance(_point_per_estimator, list):
_sample_size_set = set(
[
config["FLAML_sample_size"]
for config in _point_per_estimator
if "FLAML_sample_size" in config
]
)
_sample_size_set = {
config["FLAML_sample_size"] for config in _point_per_estimator if "FLAML_sample_size" in config
}
if _sample_size_set:
_sample_size_from_starting_points[_estimator] = min(_sample_size_set)
if len(_sample_size_set) > 1:
@@ -1729,6 +1797,11 @@ class AutoML(BaseEstimator):
self._min_sample_size_input = min_sample_size
self._prepare_data(eval_method, split_ratio, n_splits)
# infer the signature of the input/output data
if self.mlflow_integration is not None:
self.estimator_signature = infer_signature(self._state.X_train, self._state.y_train)
self.pipeline_signature = infer_signature(X_train, y_train, dataframe, label)
# TODO pull this to task as decide_sample_size
if isinstance(self._min_sample_size, dict):
self._sample = {
@@ -1827,6 +1900,11 @@ class AutoML(BaseEstimator):
and (max_iter > 0 or retrain_full is True)
or max_iter == 1
)
if self.mlflow_integration is not None and all(
[self.mlflow_integration.parent_run_id is None, not self.mlflow_integration.only_history]
):
# force retrain_final to False when there is no active parent run
self._state.retrain_final = False
# add custom learner
for estimator_name in estimator_list:
if estimator_name not in self._state.learner_classes:
@@ -1898,7 +1976,7 @@ class AutoML(BaseEstimator):
max_iter=max_iter / len(estimator_list) if self._learner_selector == "roundrobin" else max_iter,
budget=self._state.time_budget,
)
logger.info("List of ML learners in AutoML Run: {}".format(estimator_list))
logger.info(f"List of ML learners in AutoML Run: {estimator_list}")
self.estimator_list = estimator_list
self._active_estimators = estimator_list.copy()
self._ensemble = ensemble
@@ -1940,7 +2018,7 @@ class AutoML(BaseEstimator):
)
):
logger.warning(
"Time taken to find the best model is {0:.0f}% of the "
"Time taken to find the best model is {:.0f}% of the "
"provided time budget and not all estimators' hyperparameter "
"search converged. Consider increasing the time budget.".format(
self._time_taken_best_iter / self._state.time_budget * 100
@@ -1959,6 +2037,8 @@ class AutoML(BaseEstimator):
) # NOTE: this is after kwargs is updated to fit_kwargs_by_estimator
del self._state.groups, self._state.groups_all, self._state.groups_val
logger.setLevel(old_level)
if self.mlflow_integration is not None:
self.mlflow_integration.resume_mlflow()
def _search_parallel(self):
if self._use_ray is not False:
@@ -2055,6 +2135,14 @@ class AutoML(BaseEstimator):
if self._use_spark:
# use spark as parallel backend
mlflow_log_latency = (
get_mlflow_log_latency(model_history=self._state.model_history) if self.mlflow_integration else 0
)
(
logger.info(f"Estimated mlflow_log_latency: {mlflow_log_latency} seconds.")
if mlflow_log_latency > 0
else None
)
analysis = tune.run(
self.trainable,
search_alg=search_alg,
@@ -2067,6 +2155,9 @@ class AutoML(BaseEstimator):
use_ray=False,
use_spark=True,
force_cancel=self._force_cancel,
mlflow_exp_name=self._mlflow_exp_name,
automl_info=(mlflow_log_latency,), # pass automl info to tune.run
extra_tag=self.autolog_extra_tag,
# raise_on_failed_trial=False,
# keep_checkpoints_num=1,
# checkpoint_score_attr="min-val_loss",
@@ -2127,6 +2218,8 @@ class AutoML(BaseEstimator):
self._search_states[estimator].best_config = config
if better or self._log_type == "all":
self._log_trial(search_state, estimator)
if self.mlflow_integration:
self.mlflow_integration.record_state(self, search_state, estimator)
def _log_trial(self, search_state, estimator):
if self._training_log:
@@ -2140,36 +2233,6 @@ class AutoML(BaseEstimator):
estimator,
search_state.sample_size,
)
if self._mlflow_logging and mlflow is not None and mlflow.active_run():
with mlflow.start_run(nested=True):
mlflow.log_metric("iter_counter", self._track_iter)
if (search_state.metric_for_logging is not None) and (
"intermediate_results" in search_state.metric_for_logging
):
for each_entry in search_state.metric_for_logging["intermediate_results"]:
with mlflow.start_run(nested=True):
mlflow.log_metrics(each_entry)
mlflow.log_metric("iter_counter", self._iter_per_learner[estimator])
del search_state.metric_for_logging["intermediate_results"]
if search_state.metric_for_logging:
mlflow.log_metrics(search_state.metric_for_logging)
mlflow.log_metric("trial_time", search_state.trial_time)
mlflow.log_metric("wall_clock_time", self._state.time_from_start)
mlflow.log_metric("validation_loss", search_state.val_loss)
mlflow.log_params(search_state.config)
mlflow.log_param("learner", estimator)
mlflow.log_param("sample_size", search_state.sample_size)
mlflow.log_metric("best_validation_loss", search_state.best_loss)
mlflow.log_param("best_config", search_state.best_config)
mlflow.log_param("best_learner", self._best_estimator)
mlflow.log_metric(
self._state.metric if isinstance(self._state.metric, str) else self._state.error_metric,
1 - search_state.val_loss
if self._state.error_metric.startswith("1-")
else -search_state.val_loss
if self._state.error_metric.startswith("-")
else search_state.val_loss,
)
def _search_sequential(self):
try:
@@ -2323,9 +2386,18 @@ class AutoML(BaseEstimator):
verbose=max(self.verbose - 3, 0),
use_ray=False,
use_spark=False,
force_cancel=self._force_cancel,
mlflow_exp_name=self._mlflow_exp_name,
automl_info=(0,), # pass automl info to tune.run
extra_tag=self.autolog_extra_tag,
)
time_used = time.time() - start_run_time
better = False
(
logger.debug(f"result in automl: {analysis.trials}, {analysis.trials[-1].last_result}")
if analysis.trials
else logger.debug("result in automl: [], None")
)
if analysis.trials and analysis.trials[-1].last_result:
result = analysis.trials[-1].last_result
search_state.update(result, time_used=time_used)
@@ -2388,6 +2460,8 @@ class AutoML(BaseEstimator):
search_state.trained_estimator.cleanup()
if better or self._log_type == "all":
self._log_trial(search_state, estimator)
if self.mlflow_integration:
self.mlflow_integration.record_state(self, search_state, estimator)
logger.info(
" at {:.1f}s,\testimator {}'s best error={:.4f},\tbest estimator {}'s best error={:.4f}".format(
@@ -2440,7 +2514,7 @@ class AutoML(BaseEstimator):
state.best_config,
self.data_size_full,
)
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
self._retrained_config[best_config_sig] = state.best_config_train_time = retrain_time
est_retrain_time = 0
self._state.time_from_start = time.time() - self._start_time_flag
@@ -2462,8 +2536,8 @@ class AutoML(BaseEstimator):
self._time_taken_best_iter = 0
self._config_history = {}
self._max_iter_per_learner = 10000
self._iter_per_learner = dict([(e, 0) for e in self.estimator_list])
self._iter_per_learner_fullsize = dict([(e, 0) for e in self.estimator_list])
self._iter_per_learner = {e: 0 for e in self.estimator_list}
self._iter_per_learner_fullsize = {e: 0 for e in self.estimator_list}
self._fullsize_reached = False
self._trained_estimator = None
self._best_estimator = None
@@ -2479,6 +2553,21 @@ class AutoML(BaseEstimator):
self._selected = state = self._search_states[estimator]
state.best_config_sample_size = self._state.data_size[0]
state.best_config = state.init_config[0] if state.init_config else {}
self._track_iter = 0
self._config_history[self._track_iter] = (estimator, state.best_config, self._state.time_from_start)
self._best_iteration = self._track_iter
state.val_loss = getattr(state, "val_loss", float("inf"))
state.best_loss = getattr(state, "best_loss", float("inf"))
state.config = getattr(state, "config", state.best_config.copy())
state.metric_for_logging = getattr(state, "metric_for_logging", None)
state.sample_size = getattr(state, "sample_size", self._state.data_size[0])
state.learner_class = getattr(state, "learner_class", self._state.learner_classes.get(estimator))
if hasattr(self, "mlflow_integration") and self.mlflow_integration:
self.mlflow_integration.record_state(
automl=self,
search_state=state,
estimator=estimator,
)
elif self._use_ray is False and self._use_spark is False:
self._search_sequential()
else:
@@ -2488,6 +2577,12 @@ class AutoML(BaseEstimator):
self._training_log.checkpoint()
self._state.time_from_start = time.time() - self._start_time_flag
if self._best_estimator:
if self.mlflow_integration:
self.mlflow_integration.log_automl(self)
if mlflow.active_run() is None:
if self.mlflow_integration.parent_run_id is not None and self.mlflow_integration.autolog:
# ensure result of retrain autolog to parent run
mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
self._selected = self._search_states[self._best_estimator]
self.modelcount = sum(search_state.total_iter for search_state in self._search_states.values())
if self._trained_estimator:
@@ -2624,13 +2719,67 @@ class AutoML(BaseEstimator):
self._best_estimator,
state.best_config,
self.data_size_full,
is_retrain=True,
)
logger.info("retrain {} for {:.1f}s".format(self._best_estimator, retrain_time))
logger.info(f"retrain {self._best_estimator} for {retrain_time:.1f}s")
state.best_config_train_time = retrain_time
if self._trained_estimator:
logger.info(f"retrained model: {self._trained_estimator.model}")
if self.best_run_id is not None:
logger.info(f"Best MLflow run name: {self.best_run_name}")
logger.info(f"Best MLflow run id: {self.best_run_id}")
if self.mlflow_integration is not None:
# try to log the retrained model
if all(
[
self.mlflow_integration.manual_log,
not self.mlflow_integration.has_model,
self.mlflow_integration.parent_run_id is not None,
]
):
if mlflow.active_run() is None:
mlflow.start_run(run_id=self.mlflow_integration.parent_run_id)
if self.best_estimator.endswith("_spark"):
self.mlflow_integration.log_model(
self._trained_estimator.model,
self.best_estimator,
signature=self.estimator_signature,
run_id=self.mlflow_integration.parent_run_id,
)
else:
self.mlflow_integration.pickle_and_log_automl_artifacts(
self,
self.model,
self.best_estimator,
signature=self.pipeline_signature,
run_id=self.mlflow_integration.parent_run_id,
)
else:
logger.info("not retraining because the time budget is too small.")
logger.warning("not retraining because the time budget is too small.")
self.wait_futures()
def wait_futures(self):
if self.mlflow_integration is not None:
logger.debug("Collecting results from submitted record_state tasks")
t1 = time.perf_counter()
for future in as_completed(self.mlflow_integration.futures):
_task = self.mlflow_integration.futures[future]
try:
result = future.result()
logger.debug(f"Result for record_state task {_task}: {result}")
except Exception as e:
logger.warning(f"Exception for record_state task {_task}: {e}")
for future in as_completed(self.mlflow_integration.futures_log_model):
_task = self.mlflow_integration.futures_log_model[future]
try:
result = future.result()
logger.debug(f"Result for log_model task {_task}: {result}")
except Exception as e:
logger.warning(f"Exception for log_model task {_task}: {e}")
t2 = time.perf_counter()
logger.debug(f"Collecting results from tasks submitted to executors costs {t2-t1} seconds.")
else:
logger.debug("No futures to wait for.")
def __del__(self):
if (
@@ -2702,3 +2851,7 @@ class AutoML(BaseEstimator):
q += inv[i] / s
if p < q:
return estimator_list[i]
@property
def automl_pipeline(self):
return None
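The `wait_futures` method above drains two dictionaries of futures (`futures` and `futures_log_model`) with `concurrent.futures.as_completed`, logging each task's result or exception. A minimal standalone sketch of that pattern, with hypothetical task names rather than FLAML's exact code:

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("wait_futures_demo")


def record_state(i):
    # stand-in for the asynchronous mlflow work submitted during the search
    time.sleep(0.1)
    return f"logged trial {i}"


with ThreadPoolExecutor(max_workers=4) as executor:
    # mirrors the futures dict: future -> human-readable task description
    futures = {executor.submit(record_state, i): f"record_state-{i}" for i in range(5)}
    t1 = time.perf_counter()
    for future in as_completed(futures):
        task = futures[future]
        try:
            log.debug("Result for %s: %s", task, future.result())
        except Exception as e:
            # surface the failure without aborting collection, as wait_futures does
            log.warning("Exception for %s: %s", task, e)
    log.debug("Collecting results took %.3f seconds", time.perf_counter() - t1)
```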

View File

@@ -1,7 +1,7 @@
try:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
except ImportError:
pass
except ImportError as e:
print(f"scikit-learn is required for HistGradientBoostingEstimator. Please install it; error: {e}")
from flaml import tune
from flaml.automl.model import SKLearnEstimator

View File

@@ -2,13 +2,17 @@
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import json
import os
from datetime import datetime
import random
import uuid
from datetime import datetime, timedelta
from decimal import ROUND_HALF_UP, Decimal
from typing import TYPE_CHECKING, Union
import numpy as np
from flaml.automl.spark import DataFrame, Series, pd, ps, psDataFrame, psSeries
from flaml.automl.spark import DataFrame, F, Series, T, pd, ps, psDataFrame, psSeries
from flaml.automl.training_log import training_log_reader
try:
@@ -19,6 +23,7 @@ except ImportError:
if TYPE_CHECKING:
from flaml.automl.task import Task
TS_TIMESTAMP_COL = "ds"
TS_VALUE_COL = "y"
@@ -293,7 +298,7 @@ class DataTransformer:
y = y.rename(TS_VALUE_COL)
for column in X.columns:
# sklearn\utils\validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if X[column].nunique() == 1 or X[column].nunique(dropna=True) == n - X[column].isnull().sum():
X.drop(columns=column, inplace=True)
drop = True
@@ -445,3 +450,331 @@ class DataTransformer:
def group_counts(groups):
_, i, c = np.unique(groups, return_counts=True, return_index=True)
return c[np.argsort(i)]
def get_random_dataframe(n_rows: int = 200, ratio_none: float = 0.1, seed: int = 42) -> DataFrame:
"""Generate a random pandas DataFrame with various data types for testing.
This function creates a DataFrame with multiple column types including:
- Timestamps
- Integers
- Floats
- Categorical values
- Booleans
- Lists (tags)
- Decimal strings
- UUIDs
- Binary data (as hex strings)
- JSON blobs
- Nullable text fields
Parameters
----------
n_rows : int, default=200
Number of rows in the generated DataFrame
ratio_none : float, default=0.1
Probability of generating None values in applicable columns
seed : int, default=42
Random seed for reproducibility
Returns
-------
pd.DataFrame
A DataFrame with 14 columns of various data types
Examples
--------
>>> df = get_random_dataframe(100, 0.05, 123)
>>> df.shape
(100, 14)
>>> df.dtypes
timestamp datetime64[ns]
id int64
score float64
status object
flag object
count object
value object
tags object
rating object
uuid object
binary object
json_blob object
category category
nullable_text object
dtype: object
"""
np.random.seed(seed)
random.seed(seed)
def random_tags():
tags = ["AI", "ML", "data", "robotics", "vision"]
return random.sample(tags, k=random.randint(1, 3)) if random.random() > ratio_none else None
def random_decimal():
return (
str(Decimal(random.uniform(1, 5)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
if random.random() > ratio_none
else None
)
def random_json_blob():
blob = {"a": random.randint(1, 10), "b": random.random()}
return json.dumps(blob) if random.random() > ratio_none else None
def random_binary():
return bytes(random.randint(0, 255) for _ in range(4)).hex() if random.random() > ratio_none else None
data = {
"timestamp": [
datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1000)) if np.random.rand() > ratio_none else None
for _ in range(n_rows)
],
"id": range(1, n_rows + 1),
"score": np.random.uniform(0, 100, n_rows),
"status": np.random.choice(
["active", "inactive", "pending", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
),
"flag": np.random.choice(
[True, False, None], size=n_rows, p=[(1 - ratio_none) / 2, (1 - ratio_none) / 2, ratio_none]
),
"count": [np.random.randint(0, 100) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"value": [round(np.random.normal(50, 15), 2) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"tags": [random_tags() for _ in range(n_rows)],
"rating": [random_decimal() for _ in range(n_rows)],
"uuid": [str(uuid.uuid4()) if np.random.rand() > ratio_none else None for _ in range(n_rows)],
"binary": [random_binary() for _ in range(n_rows)],
"json_blob": [random_json_blob() for _ in range(n_rows)],
"category": pd.Categorical(
np.random.choice(
["A", "B", "C", None],
size=n_rows,
p=[(1 - ratio_none) / 3, (1 - ratio_none) / 3, (1 - ratio_none) / 3, ratio_none],
)
),
"nullable_text": [random.choice(["Good", "Bad", "Average", None]) for _ in range(n_rows)],
}
return pd.DataFrame(data)
def auto_convert_dtypes_spark(
df: psDataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 0.1,
) -> tuple[psDataFrame, dict]:
"""Automatically convert data types in a PySpark DataFrame using heuristics.
This function analyzes a sample of the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, numeric values, booleans,
and categorical fields.
Args:
df: A PySpark DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 0.1.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
Note:
- 'category' in the schema dict is conceptual as PySpark doesn't have a true
category type like pandas
- The function uses sampling for efficiency with large datasets
"""
n_rows = df.count()
if na_values is None:
na_values = ["NA", "na", "NULL", "null", ""]
# Normalize NA-like values
for colname, coltype in df.dtypes:
if coltype == "string":
df = df.withColumn(
colname,
F.when(F.trim(F.lower(F.col(colname))).isin([v.lower() for v in na_values]), None).otherwise(
F.col(colname)
),
)
schema = {}
for colname in df.columns:
# Sample once at an appropriate ratio
sample_ratio_to_use = min(1.0, sample_ratio if n_rows * sample_ratio > 100 else 100 / n_rows)
col_sample = df.select(colname).sample(withReplacement=False, fraction=sample_ratio_to_use).dropna()
sample_count = col_sample.count()
inferred_type = "string" # Default
if col_sample.dtypes[0][1] != "string":
schema[colname] = col_sample.dtypes[0][1]
continue
if sample_count == 0:
schema[colname] = "string"
continue
# Check if timestamp
ts_col = col_sample.withColumn("parsed", F.to_timestamp(F.col(colname)))
# Check numeric
if (
col_sample.withColumn("n", F.col(colname).cast("double")).filter("n is not null").count()
>= sample_count * convert_threshold
):
# All whole numbers?
all_whole = (
col_sample.withColumn("n", F.col(colname).cast("double"))
.filter("n is not null")
.withColumn("frac", F.abs(F.col("n") % 1))
.filter("frac > 0.000001")
.count()
== 0
)
inferred_type = "int" if all_whole else "double"
# Check low-cardinality (category-like)
elif (
sample_count > 0
and col_sample.select(F.countDistinct(F.col(colname))).collect()[0][0] / sample_count <= category_threshold
):
inferred_type = "category" # Will just be string, but marked as such
# Check if timestamp
elif ts_col.filter(F.col("parsed").isNotNull()).count() >= sample_count * convert_threshold:
inferred_type = "timestamp"
schema[colname] = inferred_type
# Apply inferred schema
for colname, inferred_type in schema.items():
if inferred_type == "int":
df = df.withColumn(colname, F.col(colname).cast(T.IntegerType()))
elif inferred_type == "double":
df = df.withColumn(colname, F.col(colname).cast(T.DoubleType()))
elif inferred_type == "boolean":
df = df.withColumn(
colname,
F.when(F.lower(F.col(colname)).isin("true", "yes", "1"), True)
.when(F.lower(F.col(colname)).isin("false", "no", "0"), False)
.otherwise(None),
)
elif inferred_type == "timestamp":
df = df.withColumn(colname, F.to_timestamp(F.col(colname)))
elif inferred_type == "category":
df = df.withColumn(colname, F.col(colname).cast(T.StringType())) # Marked conceptually
# otherwise keep as string (or original type)
return df, schema
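A hedged usage sketch for the Spark converter above. It assumes a local SparkSession is available and that the function is importable from flaml.automl.data (the module this diff appears to extend); the schema comment shows what the heuristics would typically infer for this toy frame, not a guarantee:

```python
from pyspark.sql import SparkSession

from flaml.automl.data import auto_convert_dtypes_spark  # assumed import path

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame(
    [("1", "2020-01-01", "NA"), ("2", "2020-01-02", "active"), ("3", "2020-01-03", "active")],
    ["id", "ts", "status"],
)
converted, schema = auto_convert_dtypes_spark(sdf, sample_ratio=1.0)
print(schema)            # e.g. {'id': 'int', 'ts': 'timestamp', 'status': 'string'}, depending on thresholds
converted.printSchema()
```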
def auto_convert_dtypes_pandas(
df: DataFrame,
na_values: list = None,
category_threshold: float = 0.3,
convert_threshold: float = 0.6,
sample_ratio: float = 1.0,
) -> tuple[DataFrame, dict]:
"""Automatically convert data types in a pandas DataFrame using heuristics.
This function analyzes the DataFrame to infer appropriate data types
and applies the conversions. It handles timestamps, timedeltas, numeric values,
and categorical fields.
Args:
df: A pandas DataFrame to convert.
na_values: List of strings to be considered as NA/NaN. Defaults to
['NA', 'na', 'NULL', 'null', ''].
category_threshold: Maximum ratio of unique values to total values
to consider a column categorical. Defaults to 0.3.
convert_threshold: Minimum ratio of successfully converted values required
to apply a type conversion. Defaults to 0.6.
sample_ratio: Fraction of data to sample for type inference. Defaults to 1.0 (use all rows);
included mainly for API compatibility with the Spark version.
Returns:
tuple: (The DataFrame with converted types, A dictionary mapping column names to
their inferred types as strings)
"""
if na_values is None:
na_values = {"NA", "na", "NULL", "null", ""}
df_converted = df.convert_dtypes()
schema = {}
# Sample if needed (for API compatibility)
if sample_ratio < 1.0:
df = df.sample(frac=sample_ratio)
n_rows = len(df)
for col in df.columns:
series = df[col]
# Replace NA-like values if string
series_cleaned = series.map(lambda x: np.nan if isinstance(x, str) and x.strip() in na_values else x)
# Skip conversion if the dtype is already non-object, except boolean/string extension dtypes, which may still be converted (e.g., to category)
if (
not isinstance(series_cleaned.dtype, pd.BooleanDtype)
and not isinstance(series_cleaned.dtype, pd.StringDtype)
and series_cleaned.dtype != "object"
):
# Keep the original data type for non-object dtypes
df_converted[col] = series
schema[col] = str(series_cleaned.dtype)
continue
# print(f"type: {series_cleaned.dtype}, column: {series_cleaned.name}")
if not isinstance(series_cleaned.dtype, pd.BooleanDtype):
# Try numeric (int or float)
numeric = pd.to_numeric(series_cleaned, errors="coerce")
if numeric.notna().sum() >= n_rows * convert_threshold:
if (numeric.dropna() % 1 == 0).all():
try:
df_converted[col] = numeric.astype("int")  # plain int; raises on NaN and falls back to the double cast below
schema[col] = "int"
continue
except Exception:
pass
df_converted[col] = numeric.astype("double")
schema[col] = "double"
continue
# Try datetime
datetime_converted = pd.to_datetime(series_cleaned, errors="coerce")
if datetime_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = datetime_converted
schema[col] = "timestamp"
continue
# Try timedelta
try:
timedelta_converted = pd.to_timedelta(series_cleaned, errors="coerce")
if timedelta_converted.notna().sum() >= n_rows * convert_threshold:
df_converted[col] = timedelta_converted
schema[col] = "timedelta"
continue
except TypeError:
pass
# Try category
try:
unique_ratio = series_cleaned.nunique(dropna=True) / n_rows if n_rows > 0 else 1.0
if unique_ratio <= category_threshold:
df_converted[col] = series_cleaned.astype("category")
schema[col] = "category"
continue
except Exception:
pass
df_converted[col] = series_cleaned.astype("string")
schema[col] = "string"
return df_converted, schema
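A quick end-to-end sketch combining the two pandas helpers above, assuming both are importable from flaml.automl.data (the module this diff appears to extend):

```python
from flaml.automl.data import auto_convert_dtypes_pandas, get_random_dataframe  # assumed import path

df = get_random_dataframe(n_rows=100, ratio_none=0.05, seed=7)
converted, schema = auto_convert_dtypes_pandas(df, category_threshold=0.3, convert_threshold=0.6)
print(schema)            # column name -> inferred type string ('int', 'double', 'timestamp', 'category', ...)
print(converted.dtypes)  # columns that already had a concrete dtype keep it
```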

View File

@@ -1,7 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
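A small sketch of attaching the colored formatter above to a handler. The import path is an assumption based on where this logger module appears to live; setting FLAML_LOG_NO_COLOR before flaml is imported disables the coloring:

```python
import logging

from flaml.automl.logger import ColoredFormatter  # assumed import path

handler = logging.StreamHandler()
handler.setFormatter(
    ColoredFormatter(
        "[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
        "%m-%d %H:%M:%S",
        use_color=True,
    )
)
demo_logger = logging.getLogger("color_demo")
demo_logger.addHandler(handler)
demo_logger.warning("rendered in yellow on ANSI-capable terminals")
```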

View File

@@ -13,6 +13,7 @@ from flaml.automl.model import BaseEstimator, TransformersEstimator
from flaml.automl.spark import ERROR as SPARK_ERROR
from flaml.automl.spark import DataFrame, Series, psDataFrame, psSeries
from flaml.automl.task.task import Task
from flaml.automl.time_series import TimeSeriesDataset
try:
from sklearn.metrics import (
@@ -33,7 +34,6 @@ except ImportError:
if SPARK_ERROR is None:
from flaml.automl.spark.metrics import spark_metric_loss_score
from flaml.automl.time_series import TimeSeriesDataset
logger = logging.getLogger(__name__)
@@ -89,6 +89,11 @@ huggingface_metric_to_mode = {
"wer": "min",
}
huggingface_submetric_to_metric = {"rouge1": "rouge", "rouge2": "rouge"}
spark_metric_name_dict = {
"Regression": ["r2", "rmse", "mse", "mae", "var"],
"Binary Classification": ["pr_auc", "roc_auc"],
"Multi-class Classification": ["accuracy", "log_loss", "f1", "micro_f1", "macro_f1"],
}
def metric_loss_score(
@@ -122,7 +127,7 @@ def metric_loss_score(
import datasets
datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
metric = datasets.load_metric(datasets_metric_name)
metric = datasets.load_metric(datasets_metric_name, trust_remote_code=True)
metric_mode = huggingface_metric_to_mode[datasets_metric_name]
if metric_name.startswith("seqeval"):
@@ -334,6 +339,14 @@ def compute_estimator(
if fit_kwargs is None:
fit_kwargs = {}
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
@@ -401,12 +414,21 @@ def train_estimator(
free_mem_ratio=0,
) -> Tuple[EstimatorSubclass, float]:
start_time = time.time()
fe_params = {}
for param, value in config_dic.items():
if param.startswith("fe."):
fe_params[param] = value
for param, value in fe_params.items():
config_dic.pop(param)
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
task=task,
n_jobs=n_jobs,
)
if fit_kwargs is None:
fit_kwargs = {}

File diff suppressed because it is too large

View File

@@ -32,7 +32,7 @@ class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = list(chain(*flattened_features))
batch = super(DataCollatorForMultipleChoiceClassification, self).__call__(flattened_features)
batch = super().__call__(flattened_features)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels

View File

@@ -245,7 +245,7 @@ def tokenize_row(
return_column_name=False,
):
if prefix:
this_row = tuple(["".join(x) for x in zip(prefix, this_row)])
this_row = tuple("".join(x) for x in zip(prefix, this_row))
# tokenizer.pad_token = tokenizer.eos_token
tokenized_example = tokenizer(

View File

@@ -32,7 +32,7 @@ def is_a_list_of_str(this_obj):
def _clean_value(value: Any) -> str:
if isinstance(value, float):
return "{:.5}".format(value)
return f"{value:.5}"
else:
return str(value).replace("/", "_")
@@ -86,7 +86,7 @@ class Counter:
@staticmethod
def get_trial_fold_name(local_dir, trial_config, trial_id):
Counter.counter += 1
experiment_tag = "{0}_{1}".format(str(Counter.counter), format_vars(trial_config))
experiment_tag = f"{str(Counter.counter)}_{format_vars(trial_config)}"
logdir = get_logdir_name(_generate_dirname(experiment_tag, trial_id=trial_id), local_dir)
return logdir

View File

@@ -1,97 +0,0 @@
ParamList_LightGBM_Base = [
"baggingFraction",
"baggingFreq",
"baggingSeed",
"binSampleCount",
"boostFromAverage",
"boostingType",
"catSmooth",
"categoricalSlotIndexes",
"categoricalSlotNames",
"catl2",
"chunkSize",
"dataRandomSeed",
"defaultListenPort",
"deterministic",
"driverListenPort",
"dropRate",
"dropSeed",
"earlyStoppingRound",
"executionMode",
"extraSeed" "featureFraction",
"featureFractionByNode",
"featureFractionSeed",
"featuresCol",
"featuresShapCol",
"fobj" "improvementTolerance",
"initScoreCol",
"isEnableSparse",
"isProvideTrainingMetric",
"labelCol",
"lambdaL1",
"lambdaL2",
"leafPredictionCol",
"learningRate",
"matrixType",
"maxBin",
"maxBinByFeature",
"maxCatThreshold",
"maxCatToOnehot",
"maxDeltaStep",
"maxDepth",
"maxDrop",
"metric",
"microBatchSize",
"minDataInLeaf",
"minDataPerBin",
"minDataPerGroup",
"minGainToSplit",
"minSumHessianInLeaf",
"modelString",
"monotoneConstraints",
"monotoneConstraintsMethod",
"monotonePenalty",
"negBaggingFraction",
"numBatches",
"numIterations",
"numLeaves",
"numTasks",
"numThreads",
"objectiveSeed",
"otherRate",
"parallelism",
"passThroughArgs",
"posBaggingFraction",
"predictDisableShapeCheck",
"predictionCol",
"repartitionByGroupingColumn",
"seed",
"skipDrop",
"slotNames",
"timeout",
"topK",
"topRate",
"uniformDrop",
"useBarrierExecutionMode",
"useMissing",
"useSingleDatasetMode",
"validationIndicatorCol",
"verbosity",
"weightCol",
"xGBoostDartMode",
"zeroAsMissing",
"objective",
]
ParamList_LightGBM_Classifier = ParamList_LightGBM_Base + [
"isUnbalance",
"probabilityCol",
"rawPredictionCol",
"thresholds",
]
ParamList_LightGBM_Regressor = ParamList_LightGBM_Base + ["tweedieVariancePower"]
ParamList_LightGBM_Ranker = ParamList_LightGBM_Base + [
"groupCol",
"evalAt",
"labelGain",
"maxPosition",
]

View File

@@ -1,3 +1,4 @@
import json
from typing import Union
import numpy as np
@@ -9,7 +10,7 @@ from pyspark.ml.evaluation import (
RegressionEvaluator,
)
from flaml.automl.spark import F, psSeries
from flaml.automl.spark import F, T, psDataFrame, psSeries, sparkDataFrame
def ps_group_counts(groups: Union[psSeries, np.ndarray]) -> np.ndarray:
@@ -36,6 +37,16 @@ def _compute_label_from_probability(df, probability_col, prediction_col):
return df
def string_to_array(s):
try:
return json.loads(s)
except json.JSONDecodeError:
return []
string_to_array_udf = F.udf(string_to_array, T.ArrayType(T.DoubleType()))
def spark_metric_loss_score(
metric_name: str,
y_predict: psSeries,
@@ -135,6 +146,11 @@ def spark_metric_loss_score(
)
elif metric_name == "log_loss":
# For log_loss, prediction_col should be probability, and we need to convert it to label
# handle data like "{'type': '1', 'values': '[1, 2, 3]'}"
# Fix cannot resolve "array_max(prediction)" due to data type mismatch: Parameter 1 requires the "ARRAY" type,
# however "prediction" has the type "STRUCT<type: TINYINT, size: INT, indices: ARRAY<INT>, values: ARRAY<DOUBLE>>"
df = df.withColumn(prediction_col, df[prediction_col].cast(T.StringType()))
df = df.withColumn(prediction_col, string_to_array_udf(df[prediction_col]))
df = _compute_label_from_probability(df, prediction_col, prediction_col + "_label")
evaluator = MulticlassClassificationEvaluator(
metricName="logLoss",

View File

@@ -65,6 +65,7 @@ class SearchState:
custom_hp=None,
max_iter=None,
budget=None,
featurization="auto",
):
self.init_eci = learner_class.cost_relative2lgbm() if budget >= 0 else 1
self._search_space_domain = {}
@@ -82,6 +83,7 @@ class SearchState:
else:
data_size = data.shape
search_space = learner_class.search_space(data_size=data_size, task=task)
self.data_size = data_size
if custom_hp is not None:
@@ -91,9 +93,7 @@ class SearchState:
starting_point = AutoMLState.sanitize(starting_point)
if max_iter > 1 and not self.valid_starting_point(starting_point, search_space):
# If the number of iterations is larger than 1, remove invalid point
logger.warning(
"Starting point {} removed because it is outside of the search space".format(starting_point)
)
logger.warning(f"Starting point {starting_point} removed because it is outside of the search space")
starting_point = None
elif isinstance(starting_point, list):
starting_point = [AutoMLState.sanitize(x) for x in starting_point]
@@ -208,7 +208,7 @@ class SearchState:
self.val_loss, self.config = obj, config
def get_hist_config_sig(self, sample_size, config):
config_values = tuple([config[k] for k in self._hp_names if k in config])
config_values = tuple(config[k] for k in self._hp_names if k in config)
config_sig = str(sample_size) + "_" + str(config_values)
return config_sig
@@ -290,9 +290,11 @@ class AutoMLState:
budget = (
None
if state.time_budget < 0
else state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
else (
state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
)
)
(
@@ -353,6 +355,7 @@ class AutoMLState:
estimator: str,
config_w_resource: dict,
sample_size: Optional[int] = None,
is_retrain: bool = False,
):
if not sample_size:
sample_size = config_w_resource.get("FLAML_sample_size", len(self.y_train_all))
@@ -378,9 +381,8 @@ class AutoMLState:
this_estimator_kwargs[
"groups"
] = groups # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
this_estimator_kwargs.update({"is_retrain": is_retrain})
budget = None if self.time_budget < 0 else self.time_budget - self.time_from_start
estimator, train_time = train_estimator(
X_train=sampled_X_train,
y_train=sampled_y_train,

View File

@@ -16,12 +16,7 @@ from flaml.automl.spark.utils import (
unique_pandas_on_spark,
unique_value_first_index,
)
from flaml.automl.task.task import (
TS_FORECAST,
TS_FORECASTPANEL,
Task,
get_classification_objective,
)
from flaml.automl.task.task import TS_FORECAST, TS_FORECASTPANEL, Task, get_classification_objective
from flaml.config import RANDOM_SEED
try:
@@ -53,13 +48,24 @@ class GenericTask(Task):
from flaml.automl.contrib.histgb import HistGradientBoostingEstimator
from flaml.automl.model import (
CatBoostEstimator,
ElasticNetEstimator,
ExtraTreesEstimator,
KNeighborsEstimator,
LassoLarsEstimator,
LGBMEstimator,
LRL1Classifier,
LRL2Classifier,
RandomForestEstimator,
SGDEstimator,
SparkAFTSurvivalRegressionEstimator,
SparkGBTEstimator,
SparkGLREstimator,
SparkLGBMEstimator,
SparkLinearRegressionEstimator,
SparkLinearSVCEstimator,
SparkNaiveBayesEstimator,
SparkRandomForestEstimator,
SVCEstimator,
TransformersEstimator,
TransformersEstimatorModelSelection,
XGBoostLimitDepthEstimator,
@@ -72,6 +78,7 @@ class GenericTask(Task):
"rf": RandomForestEstimator,
"lgbm": LGBMEstimator,
"lgbm_spark": SparkLGBMEstimator,
"rf_spark": SparkRandomForestEstimator,
"lrl1": LRL1Classifier,
"lrl2": LRL2Classifier,
"catboost": CatBoostEstimator,
@@ -80,6 +87,16 @@ class GenericTask(Task):
"transformer": TransformersEstimator,
"transformer_ms": TransformersEstimatorModelSelection,
"histgb": HistGradientBoostingEstimator,
"svc": SVCEstimator,
"sgd": SGDEstimator,
"nb_spark": SparkNaiveBayesEstimator,
"enet": ElasticNetEstimator,
"lassolars": LassoLarsEstimator,
"glr_spark": SparkGLREstimator,
"lr_spark": SparkLinearRegressionEstimator,
"svc_spark": SparkLinearSVCEstimator,
"gbt_spark": SparkGBTEstimator,
"aft_spark": SparkAFTSurvivalRegressionEstimator,
}
return self._estimators
@@ -271,8 +288,8 @@ class GenericTask(Task):
seed=RANDOM_SEED,
)
columns_to_drop = [c for c in df_all_train.columns if c in [stratify_column, "sample_weight"]]
X_train = df_all_train.drop(columns_to_drop)
X_val = df_all_val.drop(columns_to_drop)
X_train = df_all_train.drop(columns=columns_to_drop)
X_val = df_all_val.drop(columns=columns_to_drop)
y_train = df_all_train[stratify_column]
y_val = df_all_val[stratify_column]
@@ -425,8 +442,8 @@ class GenericTask(Task):
X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
if data_is_df:
X_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
X_train, y_train = X_train_all, y_train_all
state.groups_all = state.groups
@@ -497,14 +514,37 @@ class GenericTask(Task):
last = first[i] + 1
rest.extend(range(last, len(y_train_all)))
X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
X_rest = X_train_all.iloc[rest] if data_is_df else X_train_all[rest]
y_rest = (
y_train_all[rest]
if isinstance(y_train_all, np.ndarray)
else iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
)
if len(first) < len(y_train_all) / 2:
# Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
X_rest = (
np.delete(X_train_all, first, axis=0)
if isinstance(X_train_all, np.ndarray)
else X_train_all.drop(first.tolist())
if data_is_df
else X_train_all[rest]
)
y_rest = (
np.delete(y_train_all, first, axis=0)
if isinstance(y_train_all, np.ndarray)
else y_train_all.drop(first.tolist())
if data_is_df
else y_train_all[rest]
)
else:
X_rest = (
iloc_pandas_on_spark(X_train_all, rest)
if is_spark_dataframe
else X_train_all.iloc[rest]
if data_is_df
else X_train_all[rest]
)
y_rest = (
iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
if data_is_df
else y_train_all[rest]
)
stratify = y_rest if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_rest, y_rest, first, rest, split_ratio, stratify
@@ -513,6 +553,12 @@ class GenericTask(Task):
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
y_train = y_train[y_train.columns[0]]
y_val = y_val[y_val.columns[0]]
y_train.name = y_val.name = y_rest.name
elif self.is_regression():
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_train_all, y_train_all, split_ratio=split_ratio
@@ -659,7 +705,6 @@ class GenericTask(Task):
fit_kwargs = {}
if cv_score_agg_func is None:
cv_score_agg_func = default_cv_score_agg_func
start_time = time.time()
val_loss_folds = []
log_metric_folds = []
metric = None
@@ -701,7 +746,10 @@ class GenericTask(Task):
elif isinstance(kf, TimeSeriesSplit):
kf = kf.split(X_train_split, y_train_split)
else:
kf = kf.split(X_train_split)
try:
kf = kf.split(X_train_split)
except TypeError:
kf = kf.split(X_train_split, y_train_split)
for train_index, val_index in kf:
if shuffle:
@@ -724,10 +772,10 @@ class GenericTask(Task):
if not is_spark_dataframe:
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
if weight is not None:
fit_kwargs["sample_weight"], weight_val = (
weight[train_index],
weight[val_index],
fit_kwargs["sample_weight"] = (
weight[train_index] if isinstance(weight, np.ndarray) else weight.iloc[train_index]
)
weight_val = weight[val_index] if isinstance(weight, np.ndarray) else weight.iloc[val_index]
if groups is not None:
fit_kwargs["groups"] = (
groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]
@@ -766,8 +814,6 @@ class GenericTask(Task):
if is_spark_dataframe:
X_train.spark.unpersist() # uncache data to free memory
X_val.spark.unpersist() # uncache data to free memory
if budget and time.time() - start_time >= budget:
break
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
@@ -810,27 +856,23 @@ class GenericTask(Task):
elif self.is_ts_forecastpanel():
estimator_list = ["tft"]
else:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
"rf_spark",
"sgd",
]
try:
import catboost
estimator_list = [
"lgbm",
"rf",
"catboost",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
estimator_list += ["catboost"]
except ImportError:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
pass
# if self.is_ts_forecast():
# # catboost is removed because it has a `name` parameter, making it incompatible with hcrystalball
# if "catboost" in estimator_list:
@@ -862,9 +904,7 @@ class GenericTask(Task):
return metric
if self.is_nlp():
from flaml.automl.nlp.utils import (
load_default_huggingface_metric_for_task,
)
from flaml.automl.nlp.utils import load_default_huggingface_metric_for_task
return load_default_huggingface_metric_for_task(self.name)
elif self.is_binary():
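The try/except added around `kf.split(X_train_split)` earlier in this file covers custom splitters whose `split` signature requires the labels. A hedged sketch of such a splitter (names are illustrative, not part of FLAML); FLAML's fit accepts a splitter instance via `split_type`, which is how the fallback path would be exercised:

```python
from sklearn.model_selection import StratifiedKFold


class LabelAwareKFold:
    """Illustrative splitter whose split() requires y, triggering the TypeError fallback."""

    def __init__(self, n_splits=5):
        self._kf = StratifiedKFold(n_splits=n_splits)

    def split(self, X, y):  # y is mandatory here, unlike sklearn.model_selection.KFold.split
        return self._kf.split(X, y)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self._kf.get_n_splits()


# hypothetical usage:
# automl.fit(X_train, y_train, task="classification", eval_method="cv", split_type=LabelAwareKFold(5))
```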

View File

@@ -192,7 +192,7 @@ class Task(ABC):
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
For regression tasks, valid choices are ["auto", 'uniform', 'time', 'group'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.

View File

@@ -36,11 +36,17 @@ class TimeSeriesTask(Task):
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TCNEstimator,
TemporalFusionTransformerEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
@@ -57,8 +63,19 @@ class TimeSeriesTask(Task):
"holt-winters": HoltWinters,
"catboost": CatBoost_TS,
"tft": TemporalFusionTransformerEstimator,
"lassolars": LassoLars_TS,
"tcn": TCNEstimator,
"snaive": SeasonalNaive,
"naive": Naive,
"savg": SeasonalAverage,
"avg": Average,
}
if self._estimators["tcn"] is None:
# remove TCN if import failed
del self._estimators["tcn"]
logger.info("Couldn't import pytorch_lightning, skipping TCN estimator")
try:
from prophet import Prophet as foo
@@ -71,7 +88,7 @@ class TimeSeriesTask(Task):
self._estimators["orbit"] = Orbit
except ImportError:
logger.info("Couldn't import Prophet, skipping")
logger.info("Couldn't import orbit, skipping")
return self._estimators

View File

@@ -1,16 +1,27 @@
from .tft import TemporalFusionTransformerEstimator
from .ts_data import TimeSeriesDataset
from .ts_model import (
ARIMA,
LGBM_TS,
RF_TS,
SARIMAX,
Average,
CatBoost_TS,
ExtraTrees_TS,
HoltWinters,
LassoLars_TS,
Naive,
Orbit,
Prophet,
SeasonalAverage,
SeasonalNaive,
TimeSeriesEstimator,
XGBoost_TS,
XGBoostLimitDepth_TS,
)
try:
from .tcn import TCNEstimator
except ImportError:
TCNEstimator = None
from .ts_data import TimeSeriesDataset

View File

@@ -0,0 +1,285 @@
# This file is adapted from
# https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
# https://github.com/locuslab/TCN/blob/master/TCN/adding_problem/add_test.py
import datetime
import logging
import time
import pandas as pd
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.nn.utils import weight_norm
from torch.utils.data import DataLoader, TensorDataset
from flaml import tune
from flaml.automl.data import add_time_idx_col
from flaml.automl.logger import logger, logger_formatter
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.time_series.ts_model import TimeSeriesEstimator
class Chomp1d(nn.Module):
def __init__(self, chomp_size):
super().__init__()
self.chomp_size = chomp_size
def forward(self, x):
return x[:, :, : -self.chomp_size].contiguous()
class TemporalBlock(nn.Module):
def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
super().__init__()
self.conv1 = weight_norm(
nn.Conv1d(n_inputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout(dropout)
self.conv2 = weight_norm(
nn.Conv1d(n_outputs, n_outputs, kernel_size, stride=stride, padding=padding, dilation=dilation)
)
self.chomp2 = Chomp1d(padding)
self.relu2 = nn.ReLU()
self.dropout2 = nn.Dropout(dropout)
self.net = nn.Sequential(
self.conv1, self.chomp1, self.relu1, self.dropout1, self.conv2, self.chomp2, self.relu2, self.dropout2
)
self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
self.relu = nn.ReLU()
self.init_weights()
def init_weights(self):
self.conv1.weight.data.normal_(0, 0.01)
self.conv2.weight.data.normal_(0, 0.01)
if self.downsample is not None:
self.downsample.weight.data.normal_(0, 0.01)
def forward(self, x):
out = self.net(x)
res = x if self.downsample is None else self.downsample(x)
return self.relu(out + res)
class TCNForecaster(nn.Module):
def __init__(
self,
input_feature_num,
num_outputs,
num_channels,
kernel_size=2,
dropout=0.2,
):
super().__init__()
layers = []
num_levels = len(num_channels)
for i in range(num_levels):
dilation_size = 2**i
in_channels = input_feature_num if i == 0 else num_channels[i - 1]
out_channels = num_channels[i]
layers += [
TemporalBlock(
in_channels,
out_channels,
kernel_size,
stride=1,
dilation=dilation_size,
padding=(kernel_size - 1) * dilation_size,
dropout=dropout,
)
]
self.network = nn.Sequential(*layers)
self.linear = nn.Linear(num_channels[-1], num_outputs)
def forward(self, x):
y1 = self.network(x)
return self.linear(y1[:, :, -1])
class TCNForecasterLightningModule(pl.LightningModule):
def __init__(self, model: TCNForecaster, learning_rate: float = 1e-3):
super().__init__()
self.model = model
self.learning_rate = learning_rate
self.loss_fn = nn.MSELoss()
def forward(self, x):
return self.model(x)
def step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = self.loss_fn(y_hat, y)
return loss
def training_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("train_loss", loss)
return loss
def validation_step(self, batch, batch_idx):
loss = self.step(batch, batch_idx)
self.log("val_loss", loss)
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
class DataframeDataset(torch.utils.data.Dataset):
def __init__(self, dataframe, target_column, features_columns, sequence_length, train=True):
self.data = torch.tensor(dataframe[features_columns].to_numpy(), dtype=torch.float)
self.sequence_length = sequence_length
if train:
self.labels = torch.tensor(dataframe[target_column].to_numpy(), dtype=torch.float)
self.is_train = train
def __len__(self):
return len(self.data) - self.sequence_length + 1
def __getitem__(self, idx):
data = self.data[idx : idx + self.sequence_length]
data = data.permute(1, 0)
if self.is_train:
label = self.labels[idx : idx + self.sequence_length]
return data, label
else:
return data
class TCNEstimator(TimeSeriesEstimator):
"""The class for tuning TCN Forecaster"""
@classmethod
def search_space(cls, data, task, pred_horizon, **params):
space = {
"num_levels": {
"domain": tune.randint(lower=4, upper=20), # hidden = 2^num_hidden
"init_value": 4,
},
"num_hidden": {
"domain": tune.randint(lower=4, upper=8), # hidden = 2^num_hidden
"init_value": 5,
},
"kernel_size": {
"domain": tune.choice([2, 3, 5, 7]), # common choices for kernel size
"init_value": 3,
},
"dropout": {
"domain": tune.uniform(lower=0.0, upper=0.5), # standard range for dropout
"init_value": 0.1,
},
"learning_rate": {
"domain": tune.loguniform(lower=1e-4, upper=1e-1), # typical range for learning rate
"init_value": 1e-3,
},
}
return space
def __init__(self, task="ts_forecast", n_jobs=1, **params):
super().__init__(task, **params)
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
start_time = time.time()
if budget is not None:
deltabudget = datetime.timedelta(seconds=budget)
else:
deltabudget = None
X_train = self.enrich(X_train)
super().fit(X_train, y_train, budget, **kwargs)
self.batch_size = kwargs.get("batch_size", 64)
self.horizon = kwargs.get("period", 1)
self.feature_cols = X_train.time_varying_known_reals
self.target_col = X_train.target_names[0]
train_dataset = DataframeDataset(
X_train.train_data,
self.target_col,
self.feature_cols,
self.horizon,
)
train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=False)
if not X_train.test_data.empty:
val_dataset = DataframeDataset(
X_train.test_data,
self.target_col,
self.feature_cols,
self.horizon,
)
else:
val_dataset = DataframeDataset(
X_train.train_data.sample(frac=0.2, random_state=kwargs.get("random_state", 0)),
self.target_col,
self.feature_cols,
self.horizon,
)
val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)
model = TCNForecaster(
len(self.feature_cols),
self.horizon,
[2 ** self.params["num_hidden"]] * self.params["num_levels"],
self.params["kernel_size"],
self.params["dropout"],
)
pl_module = TCNForecasterLightningModule(model, self.params["learning_rate"])
# Training loop
# gpus is deprecated in v1.7 and removed in v2.0
# accelerator="auto" covers all cases (CPU or GPU).
trainer = pl.Trainer(
max_epochs=kwargs.get("max_epochs", 10),
accelerator="auto",
callbacks=[
EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"),
LearningRateMonitor(),
],
logger=TensorBoardLogger(kwargs.get("log_dir", "logs/lightning_logs")), # logging results to a tensorboard
max_time=deltabudget,
enable_model_summary=False,
enable_progress_bar=False,
)
trainer.fit(
pl_module,
train_dataloaders=train_loader,
val_dataloaders=val_loader,
)
best_model = trainer.model
self._model = best_model
train_time = time.time() - start_time
return train_time
def predict(self, X):
X = self.enrich(X)
if isinstance(X, TimeSeriesDataset):
df = X.X_val
else:
df = X
dataset = DataframeDataset(
df,
self.target_col,
self.feature_cols,
self.horizon,
train=False,
)
data_loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=False)
self._model.eval()
raw_preds = []
for batch_x in data_loader:
raw_pred = self._model(batch_x)
raw_preds.append(raw_pred)
raw_preds = torch.cat(raw_preds, dim=0)
preds = pd.Series(raw_preds.detach().numpy().ravel())
return preds
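To make the tensor shapes above concrete, here is a minimal forward pass through the TCN backbone defined in this file, independent of the AutoML wrapper. The import path and sizes are assumptions for illustration:

```python
import torch

from flaml.automl.time_series.tcn import TCNForecaster  # assumed import path

model = TCNForecaster(
    input_feature_num=5,        # time-varying features per step
    num_outputs=1,              # the estimator above sets this to the forecast horizon
    num_channels=[32, 32, 32],  # three TemporalBlocks with dilations 1, 2, 4
    kernel_size=3,
    dropout=0.1,
)
x = torch.randn(8, 5, 30)       # (batch, features, sequence_length), as DataframeDataset emits after permute
with torch.no_grad():
    y_hat = model(x)            # prediction taken from the last time step -> shape (8, 1)
print(y_hat.shape)
```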

View File

@@ -393,7 +393,7 @@ class DataTransformerTS:
for column in X.columns:
# sklearn/utils/validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].dtype.name in ("object", "category", "string"):
if (
# drop columns where all values are the same
X[column].nunique() == 1

View File

@@ -26,6 +26,7 @@ from flaml.automl.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.model import (
CatBoostEstimator,
ExtraTreesEstimator,
LassoLarsEstimator,
LGBMEstimator,
RandomForestEstimator,
SKLearnEstimator,
@@ -631,6 +632,125 @@ class HoltWinters(StatsModelsEstimator):
return train_time
class SimpleForecaster(StatsModelsEstimator):
"""Base class for naive forecasters such as SeasonalNaive, Naive, SeasonalAverage, and Average."""
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {
"season": {
"domain": tune.randint(1, pred_horizon),
"init_value": pred_horizon,
}
}
def joint_preprocess(self, X_train, y_train=None):
X_train = self.enrich(X_train)
self.regressors = []
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
# this class only supports univariate regression
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
else:
target_col = TS_VALUE_COL
train_df = self._join(X_train, y_train)
self.time_col = data.time_col
self.target_names = data.target_names
train_df = self._preprocess(train_df)
return train_df, target_col
def fit(self, X_train, y_train=None, budget=None, **kwargs):
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
self.season = self.params.get("season", 1)
current_time = time.time()
super().fit(X_train, y_train, budget=budget, **kwargs)
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = SimpleExpSmoothing(
train_df[[target_col]],
)
with suppress_stdout_stderr():
model = model.fit(smoothing_level=self.smoothing_level)
train_time = time.time() - current_time
self._model = model
return train_time
class SeasonalNaive(SimpleForecaster):
smoothing_level = 1.0
def predict(self, X, **kwargs):
if isinstance(X, int):
forecasts = []
for i in range(X):
forecast = self._model.forecast(steps=self.season)[0]
forecasts.append(forecast)
return pd.Series(forecasts)
else:
return super().predict(X, **kwargs)
class Naive(SimpleForecaster):
smoothing_level = 0.0
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def predict(self, X, **kwargs):
if isinstance(X, int):
last_observation = self._model.params["initial_level"]
return pd.Series([last_observation] * X)
else:
return super().predict(X, **kwargs)
class SeasonalAverage(SimpleForecaster):
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
start_time = time.time()
self.season = kwargs.get("season", 1) # seasonality period
train_df, target_col = self.joint_preprocess(X_train, y_train)
selection_res = ar_select_order(train_df[target_col], maxlag=self.season)
# Fit autoregressive model with optimal order
model = AutoReg(train_df[target_col], lags=selection_res.ar_lags)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
class Average(SimpleForecaster):
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
return {}
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from statsmodels.tsa.ar_model import AutoReg
start_time = time.time()
train_df, target_col = self.joint_preprocess(X_train, y_train)
model = AutoReg(train_df[target_col], lags=0)
self._model = model.fit()
end_time = time.time()
return end_time - start_time
class TS_SKLearn(TimeSeriesEstimator):
"""The class for tuning SKLearn Regressors for time-series forecasting"""
@@ -757,3 +877,7 @@ class XGBoostLimitDepth_TS(TS_SKLearn):
# catboost regressor is invalid because it has a `name` parameter, making it incompatible with hcrystalball
class CatBoost_TS(TS_SKLearn):
base_class = CatBoostEstimator
class LassoLars_TS(TS_SKLearn):
base_class = LassoLarsEstimator
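The test file later in this diff exercises these estimators under the names `snaive`, `naive`, `savg`, `avg`, and `lassolars`. A hedged sketch of requesting them through the public AutoML API (the synthetic series below is an assumption):

```python
import numpy as np
import pandas as pd
from flaml import AutoML

# Hypothetical univariate daily series with a timestamp column and a target column.
df = pd.DataFrame(
    {
        "ds": pd.date_range("2023-01-01", periods=200, freq="D"),
        "y": np.random.rand(200),
    }
)

automl = AutoML()
automl.fit(
    dataframe=df,
    label="y",
    task="ts_forecast",
    period=30,  # forecast horizon
    estimator_list=["snaive", "naive", "savg", "avg"],  # the new simple forecasters
    time_budget=10,
)
print(automl.best_estimator)
```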


@@ -11,7 +11,7 @@ from typing import IO
logger = logging.getLogger("flaml.automl")
class TrainingLogRecord(object):
class TrainingLogRecord:
def __init__(
self,
record_id: int,
@@ -52,7 +52,7 @@ class TrainingLogCheckPoint(TrainingLogRecord):
self.curr_best_record_id = curr_best_record_id
class TrainingLogWriter(object):
class TrainingLogWriter:
def __init__(self, output_filename: str):
self.output_filename = output_filename
self.file = None
@@ -79,7 +79,7 @@ class TrainingLogWriter(object):
sample_size,
):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if validation_loss is None:
raise ValueError("TEST LOSS NONE ERROR!!!")
record = TrainingLogRecord(
@@ -109,7 +109,7 @@ class TrainingLogWriter(object):
def checkpoint(self):
if self.file is None:
raise IOError("Call open() to open the output file first.")
raise OSError("Call open() to open the output file first.")
if self.current_best_loss_record_id is None:
logger.warning("flaml.training_log: checkpoint() called before any record is written, skipped.")
return
@@ -124,7 +124,7 @@ class TrainingLogWriter(object):
self.file = None # for pickle
class TrainingLogReader(object):
class TrainingLogReader:
def __init__(self, filename: str):
self.filename = filename
self.file = None
@@ -134,7 +134,7 @@ class TrainingLogReader(object):
def records(self):
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for line in self.file:
data = json.loads(line)
if len(data) == 1:
@@ -149,7 +149,7 @@ class TrainingLogReader(object):
def get_record(self, record_id) -> TrainingLogRecord:
if self.file is None:
raise IOError("Call open() before reading log file.")
raise OSError("Call open() before reading log file.")
for rec in self.records():
if rec.record_id == record_id:
return rec


@@ -69,7 +69,7 @@ def build_portfolio(meta_features, regret, strategy):
def load_json(filename):
"""Returns the contents of json file filename."""
with open(filename, "r") as f:
with open(filename) as f:
return json.load(f)


@@ -43,7 +43,7 @@ def meta_feature(task, X_train, y_train, meta_feature_names):
# 'numpy.ndarray' object has no attribute 'select_dtypes'
this_feature.append(1) # all features are numeric
else:
raise ValueError("Feature {} not implemented. ".format(each_feature_name))
raise ValueError(f"Feature {each_feature_name} not implemented. ")
return this_feature
@@ -57,7 +57,7 @@ def load_config_predictor(estimator_name, task, location=None):
task = "multiclass" if task == "multi" else task # TODO: multi -> multiclass?
try:
location = location or LOCATION
with open(f"{location}/{estimator_name}/{task}.json", "r") as f:
with open(f"{location}/{estimator_name}/{task}.json") as f:
CONFIG_PREDICTORS[key] = predictor = json.load(f)
except FileNotFoundError:
raise FileNotFoundError(f"Portfolio has not been built for {estimator_name} on {task} task.")

flaml/fabric/__init__.py (new empty file)

flaml/fabric/mlflow.py (new file, 1021 lines; diff suppressed because it is too large)

flaml/tune/logger.py (new file, 37 lines)

@@ -0,0 +1,37 @@
import logging
import os
class ColoredFormatter(logging.Formatter):
# ANSI escape codes for colors
COLORS = {
# logging.DEBUG: "\033[36m", # Cyan
# logging.INFO: "\033[32m", # Green
logging.WARNING: "\033[33m", # Yellow
logging.ERROR: "\033[31m", # Red
logging.CRITICAL: "\033[1;31m", # Bright Red
}
RESET = "\033[0m" # Reset to default
def __init__(self, fmt, datefmt, use_color=True):
super().__init__(fmt, datefmt)
self.use_color = use_color
def format(self, record):
formatted = super().format(record)
if self.use_color:
color = self.COLORS.get(record.levelno, "")
if color:
return f"{color}{formatted}{self.RESET}"
return formatted
logger = logging.getLogger(__name__)
use_color = True
if os.getenv("FLAML_LOG_NO_COLOR"):
use_color = False
logger_formatter = ColoredFormatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S", use_color
)
logger.propagate = False
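A short sketch of how the new module is meant to be consumed: attach a handler carrying `logger_formatter`, and optionally disable ANSI colors through the `FLAML_LOG_NO_COLOR` variable read above (assumption: the variable must be set before the module is first imported, since `use_color` is evaluated at import time).

```python
import logging
import os
import sys

os.environ["FLAML_LOG_NO_COLOR"] = "1"  # opt out of colored output

from flaml.tune.logger import logger, logger_formatter  # the module added in this diff

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logger_formatter)  # ColoredFormatter instance
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.warning("rendered in yellow unless FLAML_LOG_NO_COLOR is set")
```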


@@ -109,7 +109,7 @@ class FLOW2(Searcher):
else:
mode = "min"
super(FLOW2, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
# internally minimizes, so "max" => -1
if mode == "max":
self.metric_op = -1.0
@@ -350,7 +350,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
@@ -385,7 +385,7 @@ class FLOW2(Searcher):
else:
assert (
self.lexico_objectives["tolerances"][k_metric][-1] == "%"
), "String tolerance of {} should use %% as the suffix".format(k_metric)
), f"String tolerance of {k_metric} should use %% as the suffix"
tolerance_bound = self._f_best[k_metric] * (
1 + 0.01 * float(self.lexico_objectives["tolerances"][k_metric].replace("%", ""))
)
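The assertions above require string tolerances to end with a `%` suffix, while numeric tolerances act as absolute bounds. A hedged example of a `lexico_objectives` dict that would satisfy this check (key layout per FLAML's lexicographic optimization API):

```python
# Percentage tolerances are strings ending in "%"; absolute tolerances are plain floats.
lexico_objectives = {
    "metrics": ["val_loss", "pred_time"],
    "modes": ["min", "min"],
    "tolerances": {"val_loss": "5%", "pred_time": 0.0},
    "targets": {"val_loss": 0.0, "pred_time": 0.0},
}
```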


@@ -319,7 +319,7 @@ class ChampionFrontierSearcher(BaseSearcher):
candidate_configs = [set(seed_interactions) | set(item) for item in space]
final_candidate_configs = []
for c in candidate_configs:
new_c = set([e for e in c if len(e) > 1])
new_c = {e for e in c if len(e) > 1}
final_candidate_configs.append(new_c)
return final_candidate_configs


@@ -191,7 +191,7 @@ class ConcurrencyLimiter(Searcher):
self.batch = batch
self.live_trials = set()
self.cached_results = {}
super(ConcurrencyLimiter, self).__init__(metric=self.searcher.metric, mode=self.searcher.mode)
super().__init__(metric=self.searcher.metric, mode=self.searcher.mode)
def suggest(self, trial_id: str) -> Optional[Dict]:
assert trial_id not in self.live_trials, f"Trial ID {trial_id} must be unique: already found in set."
@@ -285,25 +285,21 @@ def validate_warmstart(
"""
if points_to_evaluate:
if not isinstance(points_to_evaluate, list):
raise TypeError("points_to_evaluate expected to be a list, got {}.".format(type(points_to_evaluate)))
raise TypeError(f"points_to_evaluate expected to be a list, got {type(points_to_evaluate)}.")
for point in points_to_evaluate:
if not isinstance(point, (dict, list)):
raise TypeError(f"points_to_evaluate expected to include list or dict, " f"got {point}.")
if validate_point_name_lengths and (not len(point) == len(parameter_names)):
raise ValueError(
"Dim of point {}".format(point)
+ " and parameter_names {}".format(parameter_names)
+ " do not match."
)
raise ValueError(f"Dim of point {point}" + f" and parameter_names {parameter_names}" + " do not match.")
if points_to_evaluate and evaluated_rewards:
if not isinstance(evaluated_rewards, list):
raise TypeError("evaluated_rewards expected to be a list, got {}.".format(type(evaluated_rewards)))
raise TypeError(f"evaluated_rewards expected to be a list, got {type(evaluated_rewards)}.")
if not len(evaluated_rewards) == len(points_to_evaluate):
raise ValueError(
"Dim of evaluated_rewards {}".format(evaluated_rewards)
+ " and points_to_evaluate {}".format(points_to_evaluate)
f"Dim of evaluated_rewards {evaluated_rewards}"
+ f" and points_to_evaluate {points_to_evaluate}"
+ " do not match."
)
@@ -547,7 +543,7 @@ class OptunaSearch(Searcher):
evaluated_rewards: Optional[List] = None,
):
assert ot is not None, "Optuna must be installed! Run `pip install optuna`."
super(OptunaSearch, self).__init__(metric=metric, mode=mode)
super().__init__(metric=metric, mode=mode)
if isinstance(space, dict) and space:
resolved_vars, domain_vars, grid_vars = parse_spec_vars(space)


@@ -252,7 +252,7 @@ def _try_resolve(v) -> Tuple[bool, Any]:
# Grid search values
grid_values = v["grid_search"]
if not isinstance(grid_values, list):
raise TuneError("Grid search expected list of values, got: {}".format(grid_values))
raise TuneError(f"Grid search expected list of values, got: {grid_values}")
return False, Categorical(grid_values).grid()
return True, v
@@ -302,13 +302,13 @@ def has_unresolved_values(spec: Dict) -> bool:
class _UnresolvedAccessGuard(dict):
def __init__(self, *args, **kwds):
super(_UnresolvedAccessGuard, self).__init__(*args, **kwds)
super().__init__(*args, **kwds)
self.__dict__ = self
def __getattribute__(self, item):
value = dict.__getattribute__(self, item)
if not _is_resolved(value):
raise RecursiveDependencyError("`{}` recursively depends on {}".format(item, value))
raise RecursiveDependencyError(f"`{item}` recursively depends on {value}")
elif isinstance(value, dict):
return _UnresolvedAccessGuard(value)
else:


@@ -162,6 +162,10 @@ def broadcast_code(custom_code="", file_name="mylearner"):
assert isinstance(MyLargeLGBM(), LGBMEstimator)
```
"""
# Check if Spark is available
spark_available, _ = check_spark()
# Write to local driver file system
flaml_path = os.path.dirname(os.path.abspath(__file__))
custom_code = textwrap.dedent(custom_code)
custom_path = os.path.join(flaml_path, file_name + ".py")
@@ -169,6 +173,24 @@ def broadcast_code(custom_code="", file_name="mylearner"):
with open(custom_path, "w") as f:
f.write(custom_code)
# If using Spark, broadcast the code content to executors
if spark_available:
spark = SparkSession.builder.getOrCreate()
bc_code = spark.sparkContext.broadcast(custom_code)
# Execute a job to ensure the code is distributed to all executors
def _write_code(bc):
code = bc.value
import os
module_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), file_name + ".py")
os.makedirs(os.path.dirname(module_path), exist_ok=True)
with open(module_path, "w") as f:
f.write(code)
return True
spark.sparkContext.parallelize(range(1)).map(lambda _: _write_code(bc_code)).collect()
return custom_path
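A hedged usage sketch of `broadcast_code` after this change (the import path is assumed from the surrounding file): the custom learner source is still written on the driver, and when a Spark session is available the same source is written on every executor via a broadcast variable.

```python
from flaml.tune.spark.utils import broadcast_code  # assumed location of this helper

custom_code = (
    "from flaml.automl.model import LGBMEstimator\n"
    "\n"
    "class MyLargeLGBM(LGBMEstimator):\n"
    "    pass\n"
)

# Writes mylearner.py next to the helper module on the driver and, if Spark is up,
# runs a trivial job so each executor materializes the same file from the broadcast.
path = broadcast_code(custom_code=custom_code, file_name="mylearner")
print(path)
```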


@@ -110,7 +110,7 @@ class Trial:
}
self.metric_n_steps[metric] = {}
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_analysis[metric][key] = value
# Store n as string for correct restore.
self.metric_n_steps[metric][str(n)] = deque([value], maxlen=n)
@@ -124,7 +124,7 @@ class Trial:
self.metric_analysis[metric]["last"] = value
for n in self.n_steps:
key = "last-{:d}-avg".format(n)
key = f"last-{n:d}-avg"
self.metric_n_steps[metric][str(n)].append(value)
self.metric_analysis[metric][key] = sum(self.metric_n_steps[metric][str(n)]) / len(
self.metric_n_steps[metric][str(n)]


@@ -21,16 +21,26 @@ except (ImportError, AssertionError):
from .analysis import ExperimentAnalysis as EA
else:
ray_available = True
import logging
from flaml.tune.spark.utils import PySparkOvertimeMonitor, check_spark
from .logger import logger, logger_formatter
from .result import DEFAULT_METRIC
from .trial import Trial
logger = logging.getLogger(__name__)
logger.propagate = False
try:
import mlflow
except ImportError:
mlflow = None
try:
from flaml.fabric.mlflow import MLflowIntegration, is_autolog_enabled
internal_mlflow = True
except ImportError:
internal_mlflow = False
_use_ray = True
_runner = None
_verbose = 0
@@ -44,6 +54,7 @@ class ExperimentAnalysis(EA):
"""Class for storing the experiment results."""
def __init__(self, trials, metric, mode, lexico_objectives=None):
self.best_run_id = None
try:
super().__init__(self, None, trials, metric, mode)
self.lexico_objectives = lexico_objectives
@@ -128,6 +139,16 @@ class ExperimentAnalysis(EA):
else:
return self.best_trial.last_result
@property
def best_iteration(self) -> Optional[int]:
"""The index of the best trial in ``self.trials``, or None if it cannot be found."""
best_trial = self.best_trial
best_trial_id = best_trial.trial_id
for i, trial in enumerate(self.trials):
if trial.trial_id == best_trial_id:
return i
return None
def report(_metric=None, **kwargs):
"""A function called by the HPO application to report final or intermediate
@@ -174,9 +195,16 @@ def report(_metric=None, **kwargs):
global _training_iteration
if _use_ray:
try:
from ray import tune
from ray import __version__ as ray_version
return tune.report(_metric, **kwargs)
if ray_version.startswith("1."):
from ray import tune
return tune.report(_metric, **kwargs)
else: # ray>=2
from ray.air import session
return session.report(metrics={"metric": _metric, **kwargs})
except ImportError:
# calling tune.report() outside tune.run()
return
@@ -234,6 +262,11 @@ def run(
lexico_objectives: Optional[dict] = None,
force_cancel: Optional[bool] = False,
n_concurrent_trials: Optional[int] = 0,
mlflow_exp_name: Optional[str] = None,
automl_info: Optional[Tuple[float]] = None,
extra_tag: Optional[dict] = None,
cost_attr: Optional[str] = "auto",
cost_budget: Optional[float] = None,
**ray_args,
):
"""The function-based way of performing HPO.
@@ -424,6 +457,10 @@ def run(
}
```
force_cancel: boolean, default=False | Whether to forcibly cancel the PySpark job if it runs over the time budget.
mlflow_exp_name: str, default=None | The name of the mlflow experiment. This should be specified when
mlflow autologging is enabled on Spark. Otherwise all results are logged to an experiment named after the
basename of the main entry file.
automl_info: tuple, default=None | Information about the automl run, as a tuple of (mlflow_log_latency,).
n_concurrent_trials: int, default=0 | The number of concurrent trials when performing hyperparameter
tuning with Spark. Only valid when use_spark=True and spark is required:
`pip install flaml[spark]`. Please check
@@ -431,6 +468,13 @@ def run(
for more details about installing Spark. When tune.run() is called from AutoML, it will be
overwritten by the value of `n_concurrent_trials` in AutoML. When <= 0, the concurrent trials
will be set to the number of executors.
extra_tag: dict, default=None | Extra tags to be added to the mlflow runs created by autologging.
cost_attr: None or str to specify the attribute to evaluate the cost of different trials.
Default is "auto", which means that we will automatically choose the cost attribute to use (depending
on the nature of the resource budget). When cost_attr is set to None, cost differences between different trials will be omitted
in our search algorithm. When cost_attr is set to a str different from "auto" and "time_total_s",
this cost_attr must be available in the result dict of the trial.
cost_budget: A float of the cost budget. Only valid when cost_attr is a str different from "auto" and "time_total_s".
**ray_args: keyword arguments to pass to ray.tune.run().
Only valid when use_ray=True.
"""
@@ -438,10 +482,12 @@ def run(
global _verbose
global _running_trial
global _training_iteration
global internal_mlflow
old_use_ray = _use_ray
old_verbose = _verbose
old_running_trial = _running_trial
old_training_iteration = _training_iteration
if log_file_name:
dir_name = os.path.dirname(log_file_name)
if dir_name:
@@ -473,10 +519,6 @@ def run(
elif not logger.hasHandlers():
# Add the console handler.
_ch = logging.StreamHandler(stream=sys.stdout)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s",
"%m-%d %H:%M:%S",
)
_ch.setFormatter(logger_formatter)
logger.addHandler(_ch)
if verbose <= 2:
@@ -486,6 +528,13 @@ def run(
else:
logger.setLevel(logging.CRITICAL)
if internal_mlflow and not automl_info and (mlflow.active_run() or is_autolog_enabled()):
mlflow_integration = MLflowIntegration("tune", mlflow_exp_name, extra_tag)
evaluation_function = mlflow_integration.wrap_evaluation_function(evaluation_function)
_internal_mlflow = not automl_info # True if mlflow_integration will be used for logging
else:
_internal_mlflow = False
from .searcher.blendsearch import CFO, BlendSearch, RandomSearch
if lexico_objectives is not None:
@@ -531,7 +580,7 @@ def run(
import optuna as _
SearchAlgorithm = BlendSearch
logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
except ImportError:
if search_alg == "BlendSearch":
raise ValueError("To use BlendSearch, run: pip install flaml[blendsearch]")
@@ -540,7 +589,7 @@ def run(
logger.warning("Using CFO for search. To use BlendSearch, run: pip install flaml[blendsearch]")
else:
SearchAlgorithm = locals()[search_alg]
logger.info("Using search algorithm {}.".format(SearchAlgorithm.__name__))
logger.info(f"Using search algorithm {SearchAlgorithm.__name__}.")
metric = metric or DEFAULT_METRIC
search_alg = SearchAlgorithm(
metric=metric,
@@ -560,6 +609,8 @@ def run(
metric_constraints=metric_constraints,
use_incumbent_result_in_evaluation=use_incumbent_result_in_evaluation,
lexico_objectives=lexico_objectives,
cost_attr=cost_attr,
cost_budget=cost_budget,
)
else:
if metric is None or mode is None:
@@ -695,10 +746,16 @@ def run(
max_concurrent = max(1, search_alg.max_concurrent)
else:
max_concurrent = max(1, max_spark_parallelism)
passed_in_n_concurrent_trials = max(n_concurrent_trials, max_concurrent)
n_concurrent_trials = min(
n_concurrent_trials if n_concurrent_trials > 0 else num_executors,
max_concurrent,
)
if n_concurrent_trials < passed_in_n_concurrent_trials:
logger.warning(
f"The actual concurrent trials is {n_concurrent_trials}. You can set the environment "
f"variable `FLAML_MAX_CONCURRENT` to '{passed_in_n_concurrent_trials}' to override the detected num of executors."
)
with parallel_backend("spark"):
with Parallel(n_jobs=n_concurrent_trials, verbose=max(0, (verbose - 1) * 50)) as parallel:
try:
@@ -713,11 +770,15 @@ def run(
time_budget_s = np.inf
num_failures = 0
upperbound_num_failures = (len(evaluated_rewards) if evaluated_rewards else 0) + max_failure
logger.debug(f"automl_info: {automl_info}")
while (
time.time() - time_start < time_budget_s
and (num_samples < 0 or num_trials < num_samples)
and num_failures < upperbound_num_failures
):
if automl_info and automl_info[0] > 0 and time_budget_s < np.inf:
time_budget_s -= automl_info[0] * n_concurrent_trials
logger.debug(f"Remaining time budget with mlflow log latency: {time_budget_s} seconds.")
while len(_runner.running_trials) < n_concurrent_trials:
# suggest trials for spark
trial_next = _runner.step()
@@ -750,6 +811,9 @@ def run(
trial_to_run = trials_to_run[0]
_runner.running_trial = trial_to_run
if result is not None:
if _internal_mlflow:
mlflow_integration.record_trial(result, trial_to_run, metric)
if isinstance(result, dict):
if result:
logger.info(f"Brief result: {result}")
@@ -758,7 +822,7 @@ def run(
# When the result returned is an empty dict, set the trial status to error
trial_to_run.set_status(Trial.ERROR)
else:
logger.info("Brief result: {}".format({metric: result}))
logger.info("Brief result: {metric: result}")
report(_metric=result)
_runner.stop_trial(trial_to_run)
num_failures = 0
@@ -768,6 +832,20 @@ def run(
mode=mode,
lexico_objectives=lexico_objectives,
)
analysis.search_space = config
if _internal_mlflow:
mlflow_integration.log_tune(analysis, metric)
# try:
# _best_config = analysis.best_config
# except Exception:
# _best_config = None
# if _best_config:
# parallel(
# delayed(mlflow_integration.retrain)(evaluation_function, analysis.best_config)
# for dummy in [0]
# )
return analysis
finally:
# recover the global variables in case of nested run
@@ -779,6 +857,8 @@ def run(
_runner = old_runner
logger.handlers = old_handlers
logger.setLevel(old_level)
if _internal_mlflow:
mlflow_integration.adopt_children()
# simple sequential run without using tune.run() from ray
time_start = time.time()
@@ -812,7 +892,11 @@ def run(
result = None
with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
result = evaluation_function(trial_to_run.config)
logger.debug(f"result in tune: {trial_to_run}, {result}")
if result is not None:
if _internal_mlflow:
mlflow_integration.record_trial(result, trial_to_run, metric)
if isinstance(result, dict):
if result:
report(**result)
@@ -838,6 +922,19 @@ def run(
mode=mode,
lexico_objectives=lexico_objectives,
)
analysis.search_space = config
if _internal_mlflow:
mlflow_integration.log_tune(analysis, metric)
if analysis.best_run_id is not None:
logger.info(f"Best MLflow run name: {analysis.best_run_name}")
logger.info(f"Best MLflow run id: {analysis.best_run_id}")
# try:
# _best_config = analysis.best_config
# except Exception:
# _best_config = None
# if _best_config:
# mlflow_integration.retrain(evaluation_function, analysis.best_config)
return analysis
finally:
# recover the global variables in case of nested run
@@ -849,6 +946,8 @@ def run(
_runner = old_runner
logger.handlers = old_handlers
logger.setLevel(old_level)
if _internal_mlflow:
mlflow_integration.adopt_children()
class Tuner:
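Taken together, the mlflow-related additions in this file mean that running `tune.run` inside an active MLflow run (or with autologging enabled) wraps the evaluation function, records trials, and reports the best run name/id. A hedged sketch of that flow, assuming a local MLflow tracking setup:

```python
import mlflow
from flaml import tune

def evaluate(config):
    return {"score": (config["x"] - 1) ** 2}

with mlflow.start_run(run_name="tune-parent"):
    analysis = tune.run(
        evaluate,
        config={"x": tune.uniform(-5, 5)},
        metric="score",
        mode="min",
        num_samples=10,
        mlflow_exp_name="flaml-tune-demo",  # new argument; defaults to the entry-file basename
    )

# Per the logging added above, the best MLflow run name and id are reported when available.
print(analysis.best_config)
```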


@@ -1 +1 @@
__version__ = "2.2.0"
__version__ = "2.3.6"


@@ -174,7 +174,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
@@ -390,7 +390,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[32m[I 2023-08-01 22:38:01,549]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
"\u001B[32m[I 2023-08-01 22:38:01,549]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
]
},
{


@@ -196,7 +196,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
"data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
"tune_data = [\n",
" {\n",
@@ -444,8 +444,8 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[32m[I 2023-07-30 04:19:08,150]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n",
"\u001b[32m[I 2023-07-30 04:19:08,153]\u001b[0m A new study created in memory with name: optuna\u001b[0m\n"
"\u001B[32m[I 2023-07-30 04:19:08,150]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n",
"\u001B[32m[I 2023-07-30 04:19:08,153]\u001B[0m A new study created in memory with name: optuna\u001B[0m\n"
]
},
{


@@ -152,7 +152,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
"data = datasets.load_dataset(\"openai_humaneval\", trust_remote_code=True)[\"test\"].shuffle(seed=seed)\n",
"data = data.select(range(len(data))).rename_column(\"prompt\", \"definition\").remove_columns([\"task_id\", \"canonical_solution\"])"
]
},


@@ -121,7 +121,7 @@
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"data = datasets.load_dataset(\"competition_math\", trust_remote_code=True)\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",


@@ -112,9 +112,7 @@
]
}
],
"source": [
"raw_dataset = datasets.load_dataset(\"glue\", TASK)"
]
"source": "raw_dataset = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)"
},
{
"cell_type": "code",
@@ -425,9 +423,7 @@
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"metric = datasets.load_metric(\"glue\", TASK)"
]
"source": "metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)"
},
{
"cell_type": "code",
@@ -646,7 +642,7 @@
"def train_distilbert(config: dict):\n",
"\n",
" # Load CoLA dataset and apply tokenizer\n",
" cola_raw = datasets.load_dataset(\"glue\", TASK)\n",
" cola_raw = datasets.load_dataset(\"glue\", TASK, trust_remote_code=True)\n",
" cola_encoded = cola_raw.map(tokenize, batched=True)\n",
" train_dataset, eval_dataset = cola_encoded[\"train\"], cola_encoded[\"validation\"]\n",
"\n",
@@ -654,7 +650,7 @@
" MODEL_CHECKPOINT, num_labels=NUM_LABELS\n",
" )\n",
"\n",
" metric = datasets.load_metric(\"glue\", TASK)\n",
" metric = datasets.load_metric(\"glue\", TASK, trust_remote_code=True)\n",
" def compute_metrics(eval_pred):\n",
" predictions, labels = eval_pred\n",
" predictions = np.argmax(predictions, axis=1)\n",
@@ -847,7 +843,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)\n",
" 0%| | 0/9 [00:00<?, ?ba/s]\n",
" 22%|██▏ | 2/9 [00:00<00:00, 19.41ba/s]\n",
" 56%|█████▌ | 5/9 [00:00<00:00, 20.98ba/s]\n",
@@ -856,25 +852,25 @@
"100%|██████████| 2/2 [00:00<00:00, 42.79ba/s]\n",
" 0%| | 0/2 [00:00<?, ?ba/s]\n",
"100%|██████████| 2/2 [00:00<00:00, 41.48ba/s]\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m To disable this warning, you can either:\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001b[2m\u001b[36m(pid=11344)\u001b[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m To disable this warning, you can either:\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Avoid using `tokenizers` before the fork if possible\n",
"\u001B[2m\u001B[36m(pid=11344)\u001B[0m \t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],

pytest.ini (new file, 3 lines)

@@ -0,0 +1,3 @@
[pytest]
markers =
spark: mark a test as requiring Spark
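The marker is applied module-wide in the new Spark test file below via `pytestmark = pytest.mark.spark`. A hedged sketch of selecting or excluding those tests programmatically (the `test/` directory is an assumption):

```python
import pytest

# Run only the tests that require Spark.
pytest.main(["-m", "spark", "test/"])

# Or skip them on machines without a Spark installation.
pytest.main(["-m", "not spark", "test/"])
```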


@@ -4,7 +4,7 @@ import setuptools
here = os.path.abspath(os.path.dirname(__file__))
with open("README.md", "r", encoding="UTF-8") as fh:
with open("README.md", encoding="UTF-8") as fh:
long_description = fh.read()
@@ -55,7 +55,8 @@ setuptools.setup(
"lightgbm>=2.3.1",
"xgboost>=0.90,<2.0.0",
"scipy>=1.4.1",
"pandas>=1.1.4",
"pandas>=1.1.4,<2.0.0; python_version<'3.10'",
"pandas>=1.1.4; python_version>='3.10'",
"scikit-learn>=1.0.0",
"thop",
"pytest>=6.1.1",
@@ -72,14 +73,14 @@ setuptools.setup(
"psutil==5.8.0",
"dataclasses",
"transformers[torch]==4.26",
"datasets",
"nltk",
"datasets<=3.5.0",
"nltk<=3.8.1", # 3.8.2 doesn't work with mlflow
"rouge_score",
"hcrystalball==0.1.10",
"seqeval",
"pytorch-forecasting>=0.9.0,<=0.10.1; python_version<'3.11'",
"mlflow",
"pyspark>=3.2.0",
# "pytorch-forecasting==0.10.1; python_version=='3.11'",
"mlflow==2.15.1",
"joblibspark>=0.5.0",
"joblib<=1.3.2",
"nbconvert",
@@ -92,6 +93,7 @@ setuptools.setup(
"pydantic==1.10.9",
"sympy",
"wolframalpha",
"dill", # a drop in replacement of pickle
],
"catboost": [
"catboost>=0.26,<1.2; python_version<'3.11'",
@@ -117,14 +119,14 @@ setuptools.setup(
"hf": [
"transformers[torch]==4.26",
"datasets",
"nltk",
"nltk<=3.8.1",
"rouge_score",
"seqeval",
],
"nlp": [ # for backward compatibility; hf is the new option name
"transformers[torch]==4.26",
"datasets",
"nltk",
"nltk<=3.8.1",
"rouge_score",
"seqeval",
],
@@ -139,7 +141,8 @@ setuptools.setup(
"prophet>=1.0.1",
"statsmodels>=0.12.2",
"hcrystalball==0.1.10",
"pytorch-forecasting>=0.9.0",
"pytorch-forecasting>=0.9.0; python_version<'3.11'",
# "pytorch-forecasting==0.10.1; python_version=='3.11'",
"pytorch-lightning==1.9.0",
"tensorboardX==2.6",
],
@@ -163,9 +166,13 @@ setuptools.setup(
"autozero": ["scikit-learn", "pandas", "packaging"],
},
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
# Specify the Python versions you support here.
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
],
python_requires=">=3.6",
python_requires=">=3.9",
)
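The split pandas requirement above relies on PEP 508 environment markers. A hedged illustration (not part of the diff) of how such a marker resolves on the running interpreter, using the `packaging` library:

```python
from packaging.requirements import Requirement

# The same conditional pin added to setup.py above.
req = Requirement("pandas>=1.1.4,<2.0.0; python_version<'3.10'")

# marker.evaluate() is True on Python 3.9 and False on 3.10+, so the <2.0.0 cap
# only applies to older interpreters.
print(req.name, req.specifier, req.marker.evaluate())
```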


@@ -178,7 +178,7 @@ def test_tsp(human_input_mode="NEVER", max_consecutive_auto_reply=10):
class TSPUserProxyAgent(UserProxyAgent):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
with open(f"{here}/tsp_prompt.txt", "r") as f:
with open(f"{here}/tsp_prompt.txt") as f:
self._prompt = f.read()
def generate_init_message(self, question) -> str:


@@ -187,7 +187,7 @@ def test_humaneval(num_samples=1):
)
seed = 41
data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
data = datasets.load_dataset("openai_humaneval", trust_remote_code=True)["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
{
@@ -334,7 +334,7 @@ def test_math(num_samples=-1):
return
seed = 41
data = datasets.load_dataset("competition_math")
data = datasets.load_dataset("competition_math", trust_remote_code=True)
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20
@@ -356,7 +356,7 @@ def test_math(num_samples=-1):
]
print(
"max tokens in tuning data's canonical solutions",
max([len(x["solution"].split()) for x in tune_data]),
max(len(x["solution"].split()) for x in tune_data),
)
print(len(tune_data), len(test_data))
# prompt template


@@ -1,11 +1,15 @@
import unittest
from datetime import datetime
from test.conftest import evaluate_cv_folds_with_underlying_model
import numpy as np
import pandas as pd
import pytest
import scipy.sparse
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import (
train_test_split,
)
from flaml import AutoML, tune
from flaml.automl.model import LGBMEstimator
@@ -420,6 +424,122 @@ class TestClassification(unittest.TestCase):
print(automl_experiment.best_estimator)
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
# "lrl1",
"lrl2",
"rf",
"svc",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_classification_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "classification",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "f1",
"keep_search_state": True,
"skip_transform": True,
}
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="f1",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
# "lrl1",
"lrl2",
"svc",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_underlying_classification_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
Ideally, FLAMLised models should perform identically to the underlying model, when fitted
to the same data, with no budget. This verifies that this is the case for classification models.
In this test we take the best model which FLAML provided us, extract the underlying model,
before retraining and testing it on the same folds - to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "classification",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "f1",
"keep_search_state": True,
"skip_transform": True,
}
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
val_loss_flaml = automl.best_result["val_loss"]
reproduced_val_loss_underlying_model = np.mean(
evaluate_cv_folds_with_underlying_model(
automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "classification"
)
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model
if __name__ == "__main__":
test = TestClassification()
test.test_preprocess()


@@ -125,14 +125,12 @@ def test_metric_constraints_custom():
print(automl.estimator_list)
print(automl.search_space)
print(automl.points_to_evaluate)
print("Best minimization objective on validation data: {0:.4g}".format(automl.best_loss))
print(f"Best minimization objective on validation data: {automl.best_loss:.4g}")
print(
"pred_time of the best config on validation data: {0:.4g}".format(
automl.metrics_for_best_config[1]["pred_time"]
)
"pred_time of the best config on validation data: {:.4g}".format(automl.metrics_for_best_config[1]["pred_time"])
)
print(
"val_train_loss_gap of the best config on validation data: {0:.4g}".format(
"val_train_loss_gap of the best config on validation data: {:.4g}".format(
automl.metrics_for_best_config[1]["val_train_loss_gap"]
)
)


@@ -0,0 +1,312 @@
import os
import sys
import unittest
import warnings
from collections import defaultdict
import mlflow
import numpy as np
import pandas as pd
import pytest
import scipy
from packaging.version import Version
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.model_selection import train_test_split
from flaml import AutoML
from flaml.automl.ml import sklearn_metric_loss_score
from flaml.tune.spark.utils import check_spark
pytestmark = pytest.mark.spark
leaderboard = defaultdict(dict)
warnings.simplefilter(action="ignore")
if sys.platform == "darwin" or "nt" in os.name:
# skip this test if the platform is not linux
skip_spark = True
else:
try:
import pyspark
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from flaml.automl.spark.utils import to_pandas_on_spark
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
"com.microsoft.azure:synapseml_2.12:1.0.2,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
.getOrCreate()
)
spark.sparkContext._conf.set(
"spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
"https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
)
# spark.sparkContext.setLogLevel("ERROR")
spark_available, _ = check_spark()
skip_spark = not spark_available
except ImportError:
skip_spark = True
def _test_regular_models(estimator_list, task):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
if task == "classification":
load_dataset_func = load_iris
metric = "accuracy"
else:
load_dataset_func = load_diabetes
metric = "r2"
x, y = load_dataset_func(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7654321)
automl_experiment = AutoML()
automl_settings = {
"max_iter": 5,
"task": task,
"estimator_list": estimator_list,
"metric": metric,
}
automl_experiment.fit(X_train=x_train, y_train=y_train, **automl_settings)
predictions = automl_experiment.predict(x_test)
score = sklearn_metric_loss_score(metric, predictions, y_test)
for estimator_name in estimator_list:
leaderboard[task][estimator_name] = score
def _test_spark_models(estimator_list, task):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
if task == "classification":
load_dataset_func = load_iris
evaluator = MulticlassClassificationEvaluator(
labelCol="target", predictionCol="prediction", metricName="accuracy"
)
metric = "accuracy"
elif task == "regression":
load_dataset_func = load_diabetes
evaluator = RegressionEvaluator(labelCol="target", predictionCol="prediction", metricName="r2")
metric = "r2"
elif task == "binary":
load_dataset_func = load_breast_cancer
evaluator = MulticlassClassificationEvaluator(
labelCol="target", predictionCol="prediction", metricName="accuracy"
)
metric = "accuracy"
final_cols = ["target", "features"]
extra_args = {}
if estimator_list is not None and "aft_spark" in estimator_list:
# survival analysis task
pd_df = pd.read_csv(
"https://raw.githubusercontent.com/CamDavidsonPilon/lifelines/master/lifelines/datasets/rossi.csv"
)
pd_df.rename(columns={"week": "target"}, inplace=True)
final_cols += ["arrest"]
extra_args["censorCol"] = "arrest"
else:
pd_df = load_dataset_func(as_frame=True).frame
rename = {}
for attr in pd_df.columns:
rename[attr] = attr.replace(" ", "_")
pd_df = pd_df.rename(columns=rename)
df = spark.createDataFrame(pd_df)
df = df.repartition(4)
train, test = df.randomSplit([0.8, 0.2], seed=7654321)
feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)[final_cols]
test_data = featurizer.transform(test)[final_cols]
automl = AutoML()
settings = {
"max_iter": 1,
"estimator_list": estimator_list, # ML learner we intend to test
"task": task, # task type
"metric": metric, # metric to optimize
}
settings.update(extra_args)
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
automl.fit(
dataframe=df,
label="target",
**settings,
)
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show(5)
score = evaluator.evaluate(predictions)
if estimator_list is not None:
for estimator_name in estimator_list:
leaderboard[task][estimator_name] = score
def _test_sparse_matrix_classification(estimator):
automl_experiment = AutoML()
automl_settings = {
"estimator_list": [estimator],
"time_budget": 2,
"metric": "auto",
"task": "classification",
"log_file_name": "test/sparse_classification.log",
"split_type": "uniform",
"n_jobs": 1,
"model_history": True,
}
X_train = scipy.sparse.random(1554, 21, dtype=int)
y_train = np.random.randint(3, size=1554)
automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings)
def load_multi_dataset():
"""multivariate time series forecasting dataset"""
import pandas as pd
# pd.set_option("display.max_rows", None, "display.max_columns", None)
df = pd.read_csv(
"https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/nyc_energy_consumption.csv"
)
# preprocessing data
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df = df.set_index("timeStamp")
df = df.resample("D").mean()
df["temp"] = df["temp"].fillna(method="ffill")
df["precip"] = df["precip"].fillna(method="ffill")
df = df[:-2] # last two rows are NaN for 'demand' column so remove them
df = df.reset_index()
return df
def _test_forecast(estimator_list, budget=10):
if isinstance(estimator_list, str):
estimator_list = [estimator_list]
df = load_multi_dataset()
# split data into train and test
time_horizon = 180
num_samples = df.shape[0]
split_idx = num_samples - time_horizon
train_df = df[:split_idx]
test_df = df[split_idx:]
# test dataframe must contain values for the regressors / multivariate variables
X_test = test_df[["timeStamp", "precip", "temp"]]
y_test = test_df["demand"]
# return
automl = AutoML()
settings = {
"time_budget": budget, # total running time in seconds
"metric": "mape", # primary metric
"task": "ts_forecast", # task type
"log_file_name": "test/energy_forecast_numerical.log", # flaml log file
"log_dir": "logs/forecast_logs", # tcn/tft log folder
"eval_method": "holdout",
"log_type": "all",
"label": "demand",
"estimator_list": estimator_list,
}
"""The main flaml automl API"""
automl.fit(dataframe=train_df, **settings, period=time_horizon)
print(automl.best_config)
pred_y = automl.predict(X_test)
mape = sklearn_metric_loss_score("mape", pred_y, y_test)
for estimator_name in estimator_list:
leaderboard["forecast"][estimator_name] = mape
class TestExtraModel(unittest.TestCase):
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_rf_spark(self):
tasks = ["classification", "regression"]
for task in tasks:
_test_spark_models("rf_spark", task)
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_nb_spark(self):
_test_spark_models("nb_spark", "classification")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_glr(self):
_test_spark_models("glr_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_lr(self):
_test_spark_models("lr_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_svc_spark(self):
_test_spark_models("svc_spark", "binary")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_gbt_spark(self):
tasks = ["binary", "regression"]
for task in tasks:
_test_spark_models("gbt_spark", task)
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_aft(self):
_test_spark_models("aft_spark", "regression")
@unittest.skipIf(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_default_spark(self):
_test_spark_models(None, "classification")
def test_svc(self):
_test_regular_models("svc", "classification")
_test_sparse_matrix_classification("svc")
def test_sgd(self):
tasks = ["classification", "regression"]
for task in tasks:
_test_regular_models("sgd", task)
_test_sparse_matrix_classification("sgd")
def test_enet(self):
_test_regular_models("enet", "regression")
def test_lassolars(self):
_test_regular_models("lassolars", "regression")
_test_forecast("lassolars")
def test_seasonal_naive(self):
_test_forecast("snaive")
def test_naive(self):
_test_forecast("naive")
def test_seasonal_avg(self):
_test_forecast("savg")
def test_avg(self):
_test_forecast("avg")
@unittest.skipIf(skip_spark, reason="Skip on Mac or Windows")
def test_tcn(self):
_test_forecast("tcn")
if __name__ == "__main__":
unittest.main()
print(leaderboard)


@@ -1,4 +1,5 @@
import datetime
import os
import sys
import numpy as np
@@ -95,6 +96,7 @@ def test_forecast_automl(budget=10, estimators_when_no_prophet=["arima", "sarima
)
@pytest.mark.skipif(sys.platform == "darwin" or "nt" in os.name, reason="skip on mac or windows")
def test_models(budget=3):
n = 200
X = pd.DataFrame(
@@ -151,6 +153,10 @@ def test_numpy():
print(automl.predict(12))
@pytest.mark.skipif(
sys.platform in ["darwin"],
reason="do not run on mac os",
)
def test_numpy_large():
import numpy as np
import pandas as pd
@@ -471,7 +477,10 @@ def test_forecast_classification(budget=5):
def get_stalliion_data():
from pytorch_forecasting.data.examples import get_stallion_data
data = get_stallion_data()
# data = get_stallion_data()
data = pd.read_parquet(
"https://raw.githubusercontent.com/sktime/pytorch-forecasting/refs/heads/main/examples/data/stallion.parquet"
)
# add time index - For datasets with no missing values, FLAML will automate this process
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()
@@ -567,7 +576,7 @@ def test_forecast_panel(budget=5):
print(f"Training duration of best run: {automl.best_config_train_time}s")
print(automl.model.estimator)
""" pickle and save the automl object """
import pickle
import dill as pickle
with open("automl.pkl", "wb") as f:
pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)


@@ -0,0 +1,51 @@
import mlflow
import numpy as np
import pandas as pd
from flaml import AutoML
def test_max_iter_1():
date_rng = pd.date_range(start="2024-01-01", periods=100, freq="H")
X = pd.DataFrame({"ds": date_rng})
y_train_24h = np.random.rand(len(X)) * 100
# AutoML
settings = {
"max_iter": 1,
"estimator_list": ["xgboost", "lgbm"],
"starting_points": {"xgboost": {}, "lgbm": {}},
"task": "ts_forecast",
"log_file_name": "test_max_iter_1.log",
"seed": 41,
"mlflow_exp_name": "TestExp-max_iter-1",
"use_spark": False,
"n_concurrent_trials": 1,
"verbose": 1,
"featurization": "off",
"metric": "rmse",
"mlflow_logging": True,
}
automl = AutoML(**settings)
with mlflow.start_run(run_name="AutoMLModel-XGBoost-and-LGBM-max_iter_1"):
automl.fit(
X_train=X,
y_train=y_train_24h,
period=24,
X_val=X,
y_val=y_train_24h,
split_ratio=0,
force_cancel=False,
)
assert automl.model is not None, "AutoML failed to return a model"
assert automl.best_run_id is not None, "Best run ID should not be None with mlflow logging"
print("Best model:", automl.model)
print("Best run ID:", automl.best_run_id)
if __name__ == "__main__":
test_max_iter_1()


@@ -1,3 +1,5 @@
import pickle
import mlflow
import mlflow.entities
import pytest
@@ -8,58 +10,113 @@ from flaml import AutoML
class TestMLFlowLoggingParam:
def test_update_and_install_requirements(self):
import mlflow
from sklearn import tree
from flaml.fabric.mlflow import update_and_install_requirements
with mlflow.start_run(run_name="test") as run:
sk_model = tree.DecisionTreeClassifier()
mlflow.sklearn.log_model(sk_model, "model", registered_model_name="test")
update_and_install_requirements(run_id=run.info.run_id)
def test_should_start_new_run_by_default(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML()
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"
def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_init(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML(mlflow_logging=False)
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"
def test_should_not_start_new_run_when_mlflow_logging_set_to_false_in_fit(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML()
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=False, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) == 0, "Expected 0 child runs, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) == 0, f"Expected 0 child runs, got {len(children)}"
def test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(self, automl_settings):
with mlflow.start_run():
parent = mlflow.last_active_run()
with mlflow.start_run() as parent_run:
automl = AutoML(mlflow_logging=False)
X_train, y_train = load_iris(return_X_y=True)
automl.fit(X_train=X_train, y_train=y_train, mlflow_logging=True, **automl_settings)
try:
self._check_mlflow_parameters(automl, parent_run.info)
except FileNotFoundError:
print("[WARNING]: No file found")
children = self._get_child_runs(parent)
assert len(children) >= 1, "Expected at least 1 child run, got {}".format(len(children))
children = self._get_child_runs(parent_run)
assert len(children) >= 1, f"Expected at least 1 child run, got {len(children)}"
@staticmethod
def _get_child_runs(parent_run: mlflow.entities.Run) -> DataFrame:
experiment_id = parent_run.info.experiment_id
return mlflow.search_runs(
[experiment_id], filter_string="tags.mlflow.parentRunId = '{}'".format(parent_run.info.run_id)
[experiment_id], filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
)
@staticmethod
def _check_mlflow_parameters(automl: AutoML, run_info: mlflow.entities.RunInfo):
with open(
f"./mlruns/{run_info.experiment_id}/{run_info.run_id}/artifacts/automl_pipeline/model.pkl", "rb"
) as f:
t = pickle.load(f)
if __name__ == "__main__":
print(t)
if not hasattr(automl.model._model, "_get_param_names"):
return
for param in automl.model._model._get_param_names():
assert eval("t._final_estimator._model" + f".{param}") == eval(
"automl.model._model" + f".{param}"
), "The mlflow logging not consistent with automl model"
if __name__ == "__main__":
print(param, "\t", eval("automl.model._model" + f".{param}"))
print("[INFO]: Successfully Logged")
@pytest.fixture(scope="class")
def automl_settings(self):
mlflow.end_run()
return {
"time_budget": 2, # in seconds
"time_budget": 5, # in seconds
"metric": "accuracy",
"task": "classification",
"log_file_name": "iris.log",
}
if __name__ == "__main__":
s = TestMLFlowLoggingParam()
automl_settings = {
"time_budget": 5, # in seconds
"metric": "accuracy",
"task": "classification",
"log_file_name": "iris.log",
}
s.test_should_start_new_run_by_default(automl_settings)
s.test_should_start_new_run_when_mlflow_logging_set_to_true_in_fit(automl_settings)


@@ -143,4 +143,5 @@ def test_prep():
if __name__ == "__main__":
test_lrl2()
test_prep()


@@ -187,7 +187,6 @@ class TestMultiClass(unittest.TestCase):
def test_custom_metric(self):
df, y = load_iris(return_X_y=True, as_frame=True)
df["label"] = y
automl = AutoML()
settings = {
"dataframe": df,
"label": "label",
@@ -204,7 +203,8 @@ class TestMultiClass(unittest.TestCase):
"pred_time_limit": 1e-5,
"ensemble": True,
}
automl.fit(**settings)
automl = AutoML(**settings) # test safe_json_dumps
automl.fit(dataframe=df, label="label")
print(automl.classes_)
print(automl.model)
print(automl.config_history)
@@ -438,8 +438,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
starting_points = automl.best_config_per_estimator
print("starting_points", starting_points)
@@ -461,8 +461,8 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
print("Best ML leaner:", new_automl.best_estimator)
print("Best hyperparmeter config:", new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")
def test_fit_w_starting_point_2(self, as_frame=True):
try:
@@ -493,8 +493,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
starting_points = {}
log_file_name = settings["log_file_name"]
@@ -508,7 +508,7 @@ class TestMultiClass(unittest.TestCase):
if learner not in starting_points:
starting_points[learner] = []
starting_points[learner].append(config)
max_iter = sum([len(s) for k, s in starting_points.items()])
max_iter = sum(len(s) for k, s in starting_points.items())
settings_resume = {
"time_budget": 2,
"metric": "accuracy",
@@ -528,7 +528,7 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
# print('Best ML leaner:', new_automl.best_estimator)
# print('Best hyperparmeter config:', new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))

View File
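The warm-start hunks above boil down to one pattern: run a first search, take `best_config_per_estimator`, and feed it back in as `starting_points` for a second search. A minimal sketch of that pattern, assuming the iris data and deliberately tiny budgets (the values are illustrative, not taken from the tests):

# Minimal sketch of FLAML warm starting: reuse the best configs of a first search
# as starting points for a second search. Budgets are illustrative, not tuned.
from flaml import AutoML
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

first = AutoML()
first.fit(X_train=X, y_train=y, task="classification", metric="accuracy", time_budget=2)
starting_points = first.best_config_per_estimator  # dict: estimator name -> best config (or None)

second = AutoML()
second.fit(
    X_train=X,
    y_train=y,
    task="classification",
    metric="accuracy",
    time_budget=2,
    starting_points=starting_points,
)
print("Best ML learner:", second.best_estimator)
print("Best hyperparameter config:", second.best_config)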

@@ -65,8 +65,8 @@ def test_automl(budget=5, dataset_format="dataframe", hpo_method=None):
""" retrieve best config and best learner """
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
print(automl.model.estimator)
print(automl.best_config_per_estimator)
print("time taken to find best model:", automl.time_to_find_best_model)

View File

@@ -1,9 +1,12 @@
import unittest
from test.conftest import evaluate_cv_folds_with_underlying_model
import numpy as np
import pytest
import scipy.sparse
from sklearn.datasets import (
fetch_california_housing,
make_regression,
)
from flaml import AutoML
@@ -205,7 +208,6 @@ class TestRegression(unittest.TestCase):
def test_multioutput():
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
@@ -230,5 +232,210 @@ def test_multioutput():
print(model.predict(X_test))
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"enet",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_regression_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 2,
"time_budget": -1,
"task": "regression",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 3,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
def test_reproducibility_of_catboost_regression_model():
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues around the catboost model - see here:
https://github.com/microsoft/FLAML/issues/1317
In this test we take the best catboost regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"time_budget": 7,
"task": "regression",
"n_jobs": 1,
"estimator_list": ["catboost"],
"eval_method": "cv",
"n_splits": 10,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss
def test_reproducibility_of_lgbm_regression_model():
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues around LGBMs - see here:
https://github.com/microsoft/FLAML/issues/1368
In this test we take the best LGBM regression model which FLAML provided us, and then retrain and test it on the
same folds, to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"time_budget": 3,
"task": "regression",
"n_jobs": 1,
"estimator_list": ["lgbm"],
"eval_method": "cv",
"n_splits": 9,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": True,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
config = best_model.get_params()
val_loss_flaml = automl.best_result["val_loss"]
# Take the best model, and see if we can reproduce the best result
reproduced_val_loss, metric_for_logging, train_time, pred_time = automl._state.task.evaluate_model_CV(
config=config,
estimator=best_model,
X_train_all=automl._state.X_train_all,
y_train_all=automl._state.y_train_all,
budget=None,
kf=automl._state.kf,
eval_metric="r2",
best_val_loss=None,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0,
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss or val_loss_flaml > reproduced_val_loss
@pytest.mark.parametrize(
"estimator",
[
"catboost",
"enet",
"extra_tree",
"histgb",
"kneighbor",
"lgbm",
"rf",
"xgboost",
"xgb_limitdepth",
],
)
def test_reproducibility_of_underlying_regression_models(estimator: str):
"""FLAML finds the best model for a given dataset, which it then provides to users.
However, there are reported issues where FLAML was providing an incorrect model - see here:
https://github.com/microsoft/FLAML/issues/1317
FLAML defines FLAMLised models, which wrap around the underlying (SKLearn/XGBoost/CatBoost) model.
Ideally, FLAMLised models should perform identically to the underlying model, when fitted
to the same data, with no budget. This verifies that this is the case for regression models.
In this test we take the best model which FLAML provided us, extract the underlying model,
before retraining and testing it on the same folds - to verify that the result is reproducible.
"""
automl = AutoML()
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"task": "regression",
"n_jobs": 1,
"estimator_list": [estimator],
"eval_method": "cv",
"n_splits": 10,
"metric": "r2",
"keep_search_state": True,
"skip_transform": True,
"retrain_full": False,
}
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
automl.fit(X_train=X, y_train=y, **automl_settings)
best_model = automl.model
assert best_model is not None
val_loss_flaml = automl.best_result["val_loss"]
reproduced_val_loss_underlying_model = np.mean(
evaluate_cv_folds_with_underlying_model(
automl._state.X_train_all, automl._state.y_train_all, automl._state.kf, best_model.model, "regression"
)
)
assert pytest.approx(val_loss_flaml) == reproduced_val_loss_underlying_model
if __name__ == "__main__":
unittest.main()

View File
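The reproducibility tests above all follow the same recipe: keep the search state, re-run cross-validation with the exact same splitter, and compare against `automl.best_result["val_loss"]` with `pytest.approx`. A stripped-down sketch of that recipe with a plain scikit-learn estimator (illustrative only, not FLAML's internal `evaluate_model_CV`):

# Minimal sketch: check that a model's CV score is reproducible when the folds
# and the estimator's random_state are held fixed. Plain scikit-learn only.
import numpy as np
import pytest
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = fetch_california_housing(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=20, random_state=0)

first = np.mean(cross_val_score(model, X, y, cv=kf, scoring="r2"))
second = np.mean(cross_val_score(model, X, y, cv=kf, scoring="r2"))
assert pytest.approx(first) == second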

@@ -195,7 +195,7 @@ class TestScore:
automl_settings = {
"time_budget": 2,
"task": "rank",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"groups": np.array([0] * 200 + [1] * 200 + [2] * 100), # group labels
"learner_selector": "roundrobin",

View File

@@ -1,4 +1,6 @@
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml, load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold, KFold, train_test_split
@@ -16,7 +18,7 @@ def _test(split_type):
"time_budget": 2,
# "metric": 'accuracy',
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"log_training_metric": True,
"split_type": split_type,
@@ -48,7 +50,7 @@ def test_time():
_test(split_type="time")
def test_groups():
def test_groups_for_classification_task():
from sklearn.externals._arff import ArffException
try:
@@ -58,17 +60,15 @@ def test_groups():
X, y = load_wine(return_X_y=True)
import numpy as np
automl = AutoML()
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"eval_method": "cv",
"groups": np.random.randint(low=0, high=10, size=len(y)),
"estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
"estimator_list": ["catboost", "lgbm", "rf", "xgboost", "kneighbor"],
"learner_selector": "roundrobin",
}
automl.fit(X, y, **automl_settings)
@@ -88,6 +88,72 @@ def test_groups():
automl.fit(X, y, **automl_settings)
def test_groups_for_regression_task():
"""Append nonsensical groups to iris dataset and use it to test that GroupKFold works for regression tasks"""
iris_dict_data = load_iris(as_frame=True) # sklearn Bunch containing a pandas frame
iris_data = iris_dict_data["frame"] # pandas dataframe data + target
rng = np.random.default_rng(42)
iris_data["cluster"] = rng.integers(
low=0, high=5, size=iris_data.shape[0]
) # np.random.randint(0, 5, iris_data.shape[0])
automl = AutoML()
X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
y = iris_data["petal width (cm)"]
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
X, y, iris_data["cluster"], random_state=42
)
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"metric": "r2",
"task": "regression",
"estimator_list": ["lgbm", "rf", "xgboost", "kneighbor"],
"eval_method": "cv",
"split_type": "uniform",
"groups": groups_train,
}
automl.fit(X_train, y_train, **automl_settings)
def test_groups_with_sample_weights():
"""Verifies that sample weights can be used with group splits i.e. that https://github.com/microsoft/FLAML/issues/1396 remains fixed"""
iris_dict_data = load_iris(as_frame=True) # sklearn Bunch containing a pandas frame
iris_data = iris_dict_data["frame"] # pandas dataframe data + target
iris_data["cluster"] = np.random.randint(0, 5, iris_data.shape[0])
automl = AutoML()
X = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]].to_numpy()
y = iris_data["petal width (cm)"]
sample_weight = pd.Series(np.random.rand(X.shape[0]))
(
X_train,
X_test,
y_train,
y_test,
groups_train,
groups_test,
sample_weight_train,
sample_weight_test,
) = train_test_split(X, y, iris_data["cluster"], sample_weight, random_state=42)
automl_settings = {
"max_iter": 5,
"time_budget": -1,
"metric": "r2",
"task": "regression",
"log_file_name": "error.log",
"log_type": "all",
"estimator_list": ["lgbm"],
"eval_method": "cv",
"split_type": "group",
"groups": groups_train,
"sample_weight": sample_weight_train,
}
automl.fit(X_train, y_train, **automl_settings)
assert automl.model is not None
def test_stratified_groupkfold():
from minio.error import ServerError
from sklearn.model_selection import StratifiedGroupKFold
@@ -108,6 +174,7 @@ def test_stratified_groupkfold():
"split_type": splitter,
"groups": X_train["Airline"],
"estimator_list": [
"catboost",
"lgbm",
"rf",
"xgboost",
@@ -136,7 +203,7 @@ def test_rank():
automl_settings = {
"time_budget": 2,
"task": "rank",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"eval_method": "cv",
"groups": np.array([0] * 200 + [1] * 200 + [2] * 200 + [3] * 200 + [4] * 100 + [5] * 100), # group labels
@@ -149,7 +216,7 @@ def test_rank():
"time_budget": 2,
"task": "rank",
"metric": "ndcg@5", # 5 can be replaced by any number
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"groups": [200] * 4 + [100] * 2, # alternative way: group counts
# "estimator_list": ['lgbm', 'xgboost'], # list of ML learners
@@ -188,7 +255,7 @@ def test_object():
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/{}.log".format(dataset),
"log_file_name": f"test/{dataset}.log",
"model_history": True,
"log_training_metric": True,
"split_type": TestKFold(5),
@@ -203,4 +270,4 @@ def test_object():
if __name__ == "__main__":
test_groups()
test_groups_for_classification_task()

View File
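`test_groups_for_regression_task` and `test_groups_with_sample_weights` both rely on group-aware splitting, i.e. no group ever appears on both the training and validation side of a fold. A small scikit-learn-only sketch of that guarantee, using synthetic groups just as the tests do:

# Minimal sketch: GroupKFold keeps every group entirely on one side of a split.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
y = rng.normal(size=150)
groups = rng.integers(low=0, high=5, size=150)  # synthetic cluster labels, like the tests above

for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))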

@@ -29,8 +29,8 @@ class TestWarmStart(unittest.TestCase):
automl_val_accuracy = 1.0 - automl.best_loss
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
# 1. Get starting points from previous experiments.
starting_points = automl.best_config_per_estimator
print("starting_points", starting_points)
@@ -97,8 +97,8 @@ class TestWarmStart(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl.best_loss
print("Best ML leaner:", new_automl.best_estimator)
print("Best hyperparmeter config:", new_automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl.best_config_train_time:.4g} s")
def test_nobudget(self):
automl = AutoML()

test/conftest.py (new file, 42 lines)
View File

@@ -0,0 +1,42 @@
from typing import Any, Dict, List, Union
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
from sklearn.metrics import f1_score, r2_score
def evaluate_cv_folds_with_underlying_model(X_train_all, y_train_all, kf, model: Any, task: str) -> pd.DataFrame:
"""Mimic the FLAML CV process to calculate the metrics across each fold.
:param X_train_all: X training data
:param y_train_all: y training data
:param kf: The splitter object to use to generate the folds
:param model: The estimator to fit to the data during the CV process
:param task: classification or regression
:return: A list containing the metric value for each fold
"""
rng = np.random.RandomState(2020)
all_fold_metrics: List[Dict[str, Union[int, float]]] = []
for train_index, val_index in kf.split(X_train_all, y_train_all):
X_train_split, y_train_split = X_train_all, y_train_all
train_index = rng.permutation(train_index)
X_train = X_train_split.iloc[train_index]
X_val = X_train_split.iloc[val_index]
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
model_type = type(model)
if model_type is not CatBoostClassifier and model_type is not CatBoostRegressor:
model.fit(X_train, y_train)
else:
use_best_model = True
n = max(int(len(y_train) * 0.9), len(y_train) - 1000) if use_best_model else len(y_train)
X_tr, y_tr = (X_train)[:n], y_train[:n]
eval_set = Pool(data=X_train[n:], label=y_train[n:], cat_features=[]) if use_best_model else None
model.fit(X_tr, y_tr, eval_set=eval_set, use_best_model=True)
y_pred_classes = model.predict(X_val)
if task == "classification":
reproduced_metric = 1 - f1_score(y_val, y_pred_classes)
else:
reproduced_metric = 1 - r2_score(y_val, y_pred_classes)
all_fold_metrics.append(reproduced_metric)
return all_fold_metrics

View File
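A hedged usage sketch for the new `evaluate_cv_folds_with_underlying_model` helper, assuming it is importable as `test.conftest` (as in the regression tests above), with a plain scikit-learn regressor standing in for the underlying model:

# Minimal usage sketch for evaluate_cv_folds_with_underlying_model.
# Assumes it is run from the repository root so that test.conftest is importable,
# as in the regression tests above.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from test.conftest import evaluate_cv_folds_with_underlying_model

X, y = load_diabetes(return_X_y=True, as_frame=True)
y = y.to_numpy()  # the helper indexes y positionally, so pass a numpy array
kf = KFold(n_splits=3, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0)

fold_losses = evaluate_cv_folds_with_underlying_model(X, y, kf, model, "regression")
print(np.mean(fold_losses))  # mean of (1 - r2) across the folds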

@@ -30,7 +30,7 @@ def test_hf_data():
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:

View File

@@ -24,6 +24,8 @@ model_path_list = [
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)
pytestmark = pytest.mark.spark # set to spark as parallel testing raised RuntimeError
def test_switch_1_1():
data_idx, model_path_idx = 0, 0

View File

@@ -5,6 +5,8 @@ import sys
import pytest
from utils import get_automl_settings, get_toy_data_seqclassification
pytestmark = pytest.mark.spark # set to spark as parallel testing raised MlflowException of changing parameter
@pytest.mark.skipif(sys.platform in ["darwin", "win32"], reason="do not run on mac os or windows")
def test_cv():

View File

@@ -44,7 +44,7 @@ def test_tokenclassification_idlabel():
# perf test
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:
@@ -86,7 +86,7 @@ def test_tokenclassification_tokenlabel():
# perf test
import json
with open("seqclass.log", "r") as fin:
with open("seqclass.log") as fin:
for line in fin:
each_log = json.loads(line.strip("\n"))
if "validation_loss" in each_log:

View File

@@ -10,6 +10,10 @@ from flaml.default import portfolio
if sys.platform.startswith("darwin") and sys.version_info[0] == 3 and sys.version_info[1] == 11:
pytest.skip("skipping Python 3.11 on MacOS", allow_module_level=True)
pytestmark = (
pytest.mark.spark
) # set to spark as parallel testing raised ValueError: Feature NonExisting not implemented.
def pop_args(fit_kwargs):
fit_kwargs.pop("max_iter", None)

View File

@@ -25,7 +25,7 @@ logger = logging.getLogger("mnist_AutoML")
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
super().__init__()
self.conv1 = nn.Conv2d(1, 20, 5, 1)
self.conv2 = nn.Conv2d(20, 50, 5, 1)
self.fc1 = nn.Linear(4 * 4 * 50, hidden_size)

View File

@@ -3,10 +3,13 @@ import sys
import warnings
import mlflow
import numpy as np
import pytest
import sklearn.datasets as skds
from packaging.version import Version
from flaml import AutoML
from flaml.automl.data import auto_convert_dtypes_pandas, auto_convert_dtypes_spark, get_random_dataframe
from flaml.tune.spark.utils import check_spark
warnings.simplefilter(action="ignore")
@@ -20,23 +23,26 @@ else:
from flaml.automl.spark.utils import to_pandas_on_spark
postfix_version = "-spark3.3," if pyspark.__version__ > "3.2" else ","
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
f"com.microsoft.azure:synapseml_2.12:0.11.3{postfix_version}"
"com.microsoft.azure:synapseml_2.12:1.0.4,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark:2.6.0"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
# .config("spark.executor.memory", "48G")
# .config("spark.driver.memory", "48G")
.getOrCreate()
)
spark.sparkContext._conf.set(
@@ -49,8 +55,12 @@ else:
except ImportError:
skip_spark = True
if sys.version_info >= (3, 11):
skip_py311 = True
else:
skip_py311 = False
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def _test_spark_synapseml_lightgbm(spark=None, task="classification"):
@@ -159,10 +169,11 @@ def test_spark_input_df():
settings = {
"time_budget": 30, # total running time in seconds
"metric": "roc_auc",
"estimator_list": ["lgbm_spark"], # list of ML learners; we tune lightgbm in this example
# "estimator_list": ["lgbm_spark"], # list of ML learners; we tune lightgbm in this example
"task": "classification", # task type
"log_file_name": "flaml_experiment.log", # flaml log file
"seed": 7654321, # random seed
"eval_method": "holdout",
}
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
@@ -176,17 +187,17 @@ def test_spark_input_df():
try:
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show()
# from synapse.ml.train import ComputeModelStatistics
# metrics = ComputeModelStatistics(
# evaluationMetric="classification",
# labelCol="Bankrupt?",
# scoredLabelsCol="prediction",
# ).transform(predictions)
# metrics.show()
from synapse.ml.train import ComputeModelStatistics
if not skip_py311:
# ComputeModelStatistics doesn't support python 3.11
metrics = ComputeModelStatistics(
evaluationMetric="classification",
labelCol="Bankrupt?",
scoredLabelsCol="prediction",
).transform(predictions)
metrics.show()
except AttributeError:
print("No fitted model because of too short training time.")
@@ -207,16 +218,173 @@ def test_spark_input_df():
assert "No estimator is left." in str(excinfo.value)
def _test_spark_large_df():
"""Test with large dataframe, should not run in pipeline."""
import os
import time
import pandas as pd
from pyspark.sql import functions as F
import flaml
os.environ["FLAML_MAX_CONCURRENT"] = "8"
start_time = time.time()
def load_higgs():
# 11M rows, 29 columns, 1.1GB
df = (
spark.read.format("csv")
.option("header", False)
.option("inferSchema", True)
.load("/datadrive/datasets/HIGGS.csv")
.withColumnRenamed("_c0", "target")
.withColumn("target", F.col("target").cast("integer"))
.limit(1000000)
.fillna(0)
.na.drop(how="any")
.repartition(64)
.cache()
)
print("Number of rows in data: ", df.count())
return df
def load_bosch():
# 1.184M rows, 969 cols, 1.5GB
df = (
spark.read.format("csv")
.option("header", True)
.option("inferSchema", True)
.load("/datadrive/datasets/train_numeric.csv")
.withColumnRenamed("Response", "target")
.withColumn("target", F.col("target").cast("integer"))
.limit(1000000)
.fillna(0)
.drop("Id")
.repartition(64)
.cache()
)
print("Number of rows in data: ", df.count())
return df
def prepare_data(dataset_name="higgs"):
df = load_higgs() if dataset_name == "higgs" else load_bosch()
train, test = df.randomSplit([0.75, 0.25], seed=7654321)
feature_cols = [col for col in df.columns if col not in ["target", "arrest"]]
final_cols = ["target", "features"]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)[final_cols]
test_data = featurizer.transform(test)[final_cols]
train_data = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
return train_data, test_data
train_data, test_data = prepare_data("higgs")
end_time = time.time()
print("time cost in minutes for prepare data: ", (end_time - start_time) / 60)
automl = flaml.AutoML()
automl_settings = {
"max_iter": 3,
"time_budget": 7200,
"metric": "accuracy",
"task": "classification",
"seed": 1234,
"eval_method": "holdout",
}
automl.fit(dataframe=train_data, label="target", ensemble=False, **automl_settings)
model = automl.model.estimator
predictions = model.transform(test_data)
predictions.show(5)
end_time = time.time()
print("time cost in minutes: ", (end_time - start_time) / 60)
def test_get_random_dataframe():
# Test with default parameters
df = get_random_dataframe(n_rows=50, ratio_none=0.2, seed=123)
assert df.shape == (50, 14) # Default is 200 rows and 14 columns
# Test column types
assert "timestamp" in df.columns and np.issubdtype(df["timestamp"].dtype, np.datetime64)
assert "id" in df.columns and np.issubdtype(df["id"].dtype, np.integer)
assert "score" in df.columns and np.issubdtype(df["score"].dtype, np.floating)
assert "category" in df.columns and df["category"].dtype.name == "category"
def test_auto_convert_dtypes_pandas():
# Create a test DataFrame with various types
import pandas as pd
test_df = pd.DataFrame(
{
"int_col": ["1", "2", "3", "4", "5", "6", "6"],
"float_col": ["1.1", "2.2", "3.3", "NULL", "5.5", "6.6", "6.6"],
"date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01", "2021-06-01", "2021-06-01"],
"cat_col": ["A", "B", "A", "A", "B", "A", "B"],
"string_col": ["text1", "text2", "text3", "text4", "text5", "text6", "text7"],
}
)
# Convert dtypes
converted_df, schema = auto_convert_dtypes_pandas(test_df)
# Check conversions
assert schema["int_col"] == "int"
assert schema["float_col"] == "double"
assert schema["date_col"] == "timestamp"
assert schema["cat_col"] == "category"
assert schema["string_col"] == "string"
def test_auto_convert_dtypes_spark():
"""Test auto_convert_dtypes_spark function with various data types."""
import pandas as pd
# Create a test DataFrame with various types
test_pdf = pd.DataFrame(
{
"int_col": ["1", "2", "3", "4", "NA"],
"float_col": ["1.1", "2.2", "3.3", "NULL", "5.5"],
"date_col": ["2021-01-01", "2021-02-01", "NA", "2021-04-01", "2021-05-01"],
"cat_col": ["A", "B", "A", "C", "B"],
"string_col": ["text1", "text2", "text3", "text4", "text5"],
}
)
# Convert pandas DataFrame to Spark DataFrame
test_df = spark.createDataFrame(test_pdf)
# Convert dtypes
converted_df, schema = auto_convert_dtypes_spark(test_df)
# Check conversions
assert schema["int_col"] == "int"
assert schema["float_col"] == "double"
assert schema["date_col"] == "timestamp"
assert schema["cat_col"] == "string" # Conceptual category in schema
assert schema["string_col"] == "string"
# Verify the actual data types from the Spark DataFrame
spark_dtypes = dict(converted_df.dtypes)
assert spark_dtypes["int_col"] == "int"
assert spark_dtypes["float_col"] == "double"
assert spark_dtypes["date_col"] == "timestamp"
assert spark_dtypes["cat_col"] == "string" # In Spark, categories are still strings
assert spark_dtypes["string_col"] == "string"
if __name__ == "__main__":
test_spark_synapseml_classification()
test_spark_synapseml_regression()
test_spark_synapseml_rank()
test_spark_input_df()
test_get_random_dataframe()
test_auto_convert_dtypes_pandas()
test_auto_convert_dtypes_spark()
# import cProfile
# import pstats
# from pstats import SortKey
# cProfile.run("test_spark_input_df()", "test_spark_input_df.profile")
# p = pstats.Stats("test_spark_input_df.profile")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats("utils.py")
# cProfile.run("_test_spark_large_df()", "_test_spark_large_df.profile")
# p = pstats.Stats("_test_spark_large_df.profile")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(50)

View File
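`test_auto_convert_dtypes_pandas` above asserts a schema of int / double / timestamp / category / string for the sample frame. Purely as an illustration of the coercions being asserted (this is not the implementation of `auto_convert_dtypes_pandas`), the same conversions can be reproduced with plain pandas:

# Illustration of the dtype coercions asserted above; not FLAML's implementation
# of auto_convert_dtypes_pandas.
import pandas as pd

df = pd.DataFrame(
    {
        "int_col": ["1", "2", "3"],
        "float_col": ["1.1", "2.2", None],
        "date_col": ["2021-01-01", "2021-02-01", None],
        "cat_col": ["A", "B", "A"],
    }
)
converted = df.assign(
    int_col=pd.to_numeric(df["int_col"]),
    float_col=pd.to_numeric(df["float_col"], errors="coerce"),
    date_col=pd.to_datetime(df["date_col"], errors="coerce"),
    cat_col=df["cat_col"].astype("category"),
)
print(converted.dtypes)  # int64, float64, datetime64[ns], category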

@@ -25,7 +25,7 @@ os.environ["FLAML_MAX_CONCURRENT"] = "2"
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_parallel_xgboost(hpo_method=None, data_size=1000):

View File

@@ -1,6 +1,7 @@
import os
import unittest
import pytest
from sklearn.datasets import load_wine
from flaml import AutoML
@@ -24,6 +25,8 @@ if os.path.exists(os.path.join(os.getcwd(), "test", "spark", "custom_mylearner.p
else:
skip_my_learner = True
pytestmark = pytest.mark.spark
class TestEnsemble(unittest.TestCase):
def setUp(self) -> None:

View File

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -41,8 +41,8 @@ def base_automl(n_concurrent_trials=1, use_ray=False, use_spark=False, verbose=0
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
def test_both_ray_spark():

test/spark/test_mlflow.py (new file, 343 lines)
View File

@@ -0,0 +1,343 @@
import importlib
import os
import sys
import time
import warnings
import mlflow
import pytest
from packaging.version import Version
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import flaml
from flaml.automl.spark.utils import to_pandas_on_spark
try:
import pyspark
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
except ImportError:
pass
pytestmark = pytest.mark.spark
warnings.filterwarnings("ignore")
skip_spark = importlib.util.find_spec("pyspark") is None
client = mlflow.tracking.MlflowClient()
if (sys.platform.startswith("darwin") or sys.platform.startswith("win")) and (
sys.version_info[0] == 3 and sys.version_info[1] >= 10
):
# TODO: remove this block when tests are stable
# Below tests will fail, but the functions run without error if run individually.
# test_tune_autolog_parentrun_nonparallel()
# test_tune_autolog_noparentrun_nonparallel()
# test_tune_noautolog_parentrun_nonparallel()
# test_tune_noautolog_noparentrun_nonparallel()
pytest.skip("skipping MacOS and Windows for python 3.10 and 3.11", allow_module_level=True)
"""
The Spark session used in the tests below should be initiated in test_0sparkml.py when run with pytest.
"""
def _sklearn_tune(config):
is_autolog = config.pop("is_autolog")
is_parent_run = config.pop("is_parent_run")
is_parallel = config.pop("is_parallel")
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
rf = RandomForestRegressor(**config)
rf.fit(train_x, train_y)
pred = rf.predict(test_x)
r2 = r2_score(test_y, pred)
if not is_autolog and not is_parent_run and not is_parallel:
with mlflow.start_run(nested=True):
mlflow.log_metric("r2", r2)
return {"r2": r2}
def _test_tune(is_autolog, is_parent_run, is_parallel):
mlflow.end_run()
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
params = {
"n_estimators": flaml.tune.randint(100, 1000),
"min_samples_leaf": flaml.tune.randint(1, 10),
"is_autolog": is_autolog,
"is_parent_run": is_parent_run,
"is_parallel": is_parallel,
}
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"tune_autolog_{is_autolog}_sparktrial_{is_parallel}")
flaml.tune.run(
_sklearn_tune,
params,
metric="r2",
mode="max",
num_samples=3,
use_spark=True if is_parallel else False,
n_concurrent_trials=2 if is_parallel else 1,
mlflow_exp_name=mlflow_exp_name,
)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
def _check_mlflow_logging(possible_num_runs, metric, is_parent_run, experiment_id, is_automl=False, skip_tags=False):
if isinstance(possible_num_runs, int):
possible_num_runs = [possible_num_runs]
if is_parent_run:
parent_run = mlflow.last_active_run()
child_runs = client.search_runs(
experiment_ids=[experiment_id],
filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'",
)
else:
child_runs = client.search_runs(experiment_ids=[experiment_id])
experiment_name = client.get_experiment(experiment_id).name
metrics = [metric in run.data.metrics for run in child_runs]
tags = ["flaml.version" in run.data.tags for run in child_runs]
params = ["learner" in run.data.params for run in child_runs]
assert (
len(child_runs) in possible_num_runs
), f"The number of child runs is not correct on experiment {experiment_name}."
if possible_num_runs[0] > 0:
assert all(metrics), f"The metrics are not logged correctly on experiment {experiment_name}."
assert (
all(tags) if not skip_tags else True
), f"The tags are not logged correctly on experiment {experiment_name}."
assert (
all(params) if is_automl else True
), f"The params are not logged correctly on experiment {experiment_name}."
# mlflow.delete_experiment(experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_parentrun_parallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id)
def test_tune_autolog_parentrun_nonparallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=True, is_parallel=False)
_check_mlflow_logging(3, "r2", True, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_autolog_noparentrun_parallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", False, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_parentrun_parallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id)
def test_tune_autolog_noparentrun_nonparallel():
experiment_id = _test_tune(is_autolog=True, is_parent_run=False, is_parallel=False)
_check_mlflow_logging(3, "r2", False, experiment_id)
def test_tune_noautolog_parentrun_nonparallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=True, is_parallel=False)
_check_mlflow_logging(3, "r2", True, experiment_id)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_tune_noautolog_noparentrun_parallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=True)
_check_mlflow_logging(0, "r2", False, experiment_id)
def test_tune_noautolog_noparentrun_nonparallel():
experiment_id = _test_tune(is_autolog=False, is_parent_run=False, is_parallel=False)
_check_mlflow_logging(3, "r2", False, experiment_id, skip_tags=True)
def _test_automl_sparkdata(is_autolog, is_parent_run):
mlflow.end_run()
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"automl_sparkdata_autolog_{is_autolog}")
spark = pyspark.sql.SparkSession.builder.getOrCreate()
pd_df = load_diabetes(as_frame=True).frame
df = spark.createDataFrame(pd_df)
df = df.repartition(4).cache()
train, test = df.randomSplit([0.8, 0.2], seed=1)
feature_cols = df.columns[:-1]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train)["target", "features"]
featurizer.transform(test)["target", "features"]
automl = flaml.AutoML()
settings = {
"max_iter": 3,
"metric": "mse",
"task": "regression", # task type
"log_file_name": "flaml_experiment.log", # flaml log file
"mlflow_exp_name": mlflow_exp_name,
"log_type": "all",
"n_splits": 2,
"model_history": True,
}
df = to_pandas_on_spark(to_pandas_on_spark(train_data).to_spark(index_col="index"))
automl.fit(
dataframe=df,
label="target",
**settings,
)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
def _test_automl_nonsparkdata(is_autolog, is_parent_run):
mlflow_exp_name = f"test_mlflow_integration_{int(time.time())}"
mlflow_experiment = mlflow.set_experiment(mlflow_exp_name)
if is_autolog:
mlflow.autolog()
else:
mlflow.autolog(disable=True)
if is_parent_run:
mlflow.start_run(run_name=f"automl_nonsparkdata_autolog_{is_autolog}")
automl_experiment = flaml.AutoML()
automl_settings = {
"max_iter": 3,
"metric": "r2",
"task": "regression",
"n_concurrent_trials": 2,
"use_spark": True,
"mlflow_exp_name": None if is_parent_run else mlflow_exp_name,
"log_type": "all",
"n_splits": 2,
"model_history": True,
}
X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
automl_experiment.fit(X_train=train_x, y_train=train_y, **automl_settings)
mlflow.end_run() # end current run
mlflow.autolog(disable=True)
return mlflow_experiment.experiment_id
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_parentrun():
experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=True)
_check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_autolog_noparentrun():
experiment_id = _test_automl_sparkdata(is_autolog=True, is_parent_run=False)
_check_mlflow_logging(3, "mse", False, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_parentrun():
experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=True)
_check_mlflow_logging(3, "mse", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_sparkdata_noautolog_noparentrun():
experiment_id = _test_automl_sparkdata(is_autolog=False, is_parent_run=False)
_check_mlflow_logging(0, "mse", False, experiment_id, is_automl=True) # no logging
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_parentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_autolog_noparentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=True, is_parent_run=False)
_check_mlflow_logging([4, 3], "r2", False, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_parentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=True)
_check_mlflow_logging([4, 3], "r2", True, experiment_id, is_automl=True)
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_automl_nonsparkdata_noautolog_noparentrun():
experiment_id = _test_automl_nonsparkdata(is_autolog=False, is_parent_run=False)
_check_mlflow_logging(0, "r2", False, experiment_id, is_automl=True) # no logging
@pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
def test_exit_pyspark_autolog():
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.sparkContext._gateway.shutdown_callback_server() # this is to avoid getting stuck
mlflow.autolog(disable=True)
def _init_spark_for_main():
import pyspark
spark = (
pyspark.sql.SparkSession.builder.appName("MyApp")
.master("local[2]")
.config(
"spark.jars.packages",
(
"com.microsoft.azure:synapseml_2.12:1.0.4,"
"org.apache.hadoop:hadoop-azure:3.3.5,"
"com.microsoft.azure:azure-storage:8.6.6,"
f"org.mlflow:mlflow-spark_2.12:{mlflow.__version__}"
if Version(mlflow.__version__) >= Version("2.9.0")
else f"org.mlflow:mlflow-spark:{mlflow.__version__}"
),
)
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
.config("spark.sql.debug.maxToStringFields", "100")
.config("spark.driver.extraJavaOptions", "-Xss1m")
.config("spark.executor.extraJavaOptions", "-Xss1m")
.getOrCreate()
)
spark.sparkContext._conf.set(
"spark.mlflow.pysparkml.autolog.logModelAllowlistFile",
"https://mmlspark.blob.core.windows.net/publicwasb/log_model_allowlist.txt",
)
if __name__ == "__main__":
_init_spark_for_main()
# test_tune_autolog_parentrun_parallel()
# test_tune_autolog_parentrun_nonparallel()
test_tune_autolog_noparentrun_parallel() # TODO: runs not removed
# test_tune_noautolog_parentrun_parallel()
# test_tune_autolog_noparentrun_nonparallel()
# test_tune_noautolog_parentrun_nonparallel()
# test_tune_noautolog_noparentrun_parallel()
# test_tune_noautolog_noparentrun_nonparallel()
# test_automl_sparkdata_autolog_parentrun()
# test_automl_sparkdata_autolog_noparentrun()
# test_automl_sparkdata_noautolog_parentrun()
# test_automl_sparkdata_noautolog_noparentrun()
# test_automl_nonsparkdata_autolog_parentrun()
# test_automl_nonsparkdata_autolog_noparentrun() # TODO: runs not removed
# test_automl_nonsparkdata_noautolog_parentrun()
# test_automl_nonsparkdata_noautolog_noparentrun()
test_exit_pyspark_autolog()

View File
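The autolog/parent-run matrix above is driven by one core call: `flaml.tune.run` over a small random-forest search space, optionally under an active parent run and with autologging enabled. A pared-down sketch of that call without the MLflow parametrisation (the search space and sample count are illustrative; `analysis.best_config` is assumed to be available on the returned analysis object):

# Pared-down sketch of the flaml.tune.run call exercised by the MLflow tests above.
# Search space and num_samples are illustrative.
import flaml
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def train_rf(config):
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
    rf = RandomForestRegressor(**config)
    rf.fit(train_x, train_y)
    return {"r2": r2_score(test_y, rf.predict(test_x))}


analysis = flaml.tune.run(
    train_rf,
    {
        "n_estimators": flaml.tune.randint(50, 200),
        "min_samples_leaf": flaml.tune.randint(1, 10),
    },
    metric="r2",
    mode="max",
    num_samples=3,
)
print(analysis.best_config)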

@@ -2,6 +2,7 @@ import os
import unittest
import numpy as np
import pytest
import scipy.sparse
from sklearn.datasets import load_iris, load_wine
@@ -12,6 +13,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.spark
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -344,8 +346,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl_experiment.best_loss
print("Best ML leaner:", automl_experiment.best_estimator)
print("Best hyperparmeter config:", automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")
starting_points = automl_experiment.best_config_per_estimator
print("starting_points", starting_points)
@@ -369,8 +371,8 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
print("Best ML leaner:", new_automl_experiment.best_estimator)
print("Best hyperparmeter config:", new_automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(new_automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
print(f"Training duration of best run: {new_automl_experiment.best_config_train_time:.4g} s")
def test_fit_w_starting_points_list(self, as_frame=True):
automl_experiment = AutoML()
@@ -394,8 +396,8 @@ class TestMultiClass(unittest.TestCase):
automl_val_accuracy = 1.0 - automl_experiment.best_loss
print("Best ML leaner:", automl_experiment.best_estimator)
print("Best hyperparmeter config:", automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(automl_val_accuracy))
print("Training duration of best run: {0:.4g} s".format(automl_experiment.best_config_train_time))
print(f"Best accuracy on validation data: {automl_val_accuracy:.4g}")
print(f"Training duration of best run: {automl_experiment.best_config_train_time:.4g} s")
starting_points = {}
log_file_name = automl_settings["log_file_name"]
@@ -409,7 +411,7 @@ class TestMultiClass(unittest.TestCase):
if learner not in starting_points:
starting_points[learner] = []
starting_points[learner].append(config)
max_iter = sum([len(s) for k, s in starting_points.items()])
max_iter = sum(len(s) for k, s in starting_points.items())
automl_settings_resume = {
"time_budget": 2,
"metric": "accuracy",
@@ -431,7 +433,7 @@ class TestMultiClass(unittest.TestCase):
new_automl_val_accuracy = 1.0 - new_automl_experiment.best_loss
# print('Best ML leaner:', new_automl_experiment.best_estimator)
# print('Best hyperparmeter config:', new_automl_experiment.best_config)
print("Best accuracy on validation data: {0:.4g}".format(new_automl_val_accuracy))
print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}")
# print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time))

View File

@@ -9,7 +9,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
here = os.path.abspath(os.path.dirname(__file__))
os.environ["FLAML_MAX_CONCURRENT"] = "2"

View File

@@ -25,7 +25,7 @@ try:
except ImportError:
skip_spark = True
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_overtime():
@@ -55,7 +55,7 @@ def test_overtime():
start_time = time.time()
automl_experiment.fit(**automl_settings)
elapsed_time = time.time() - start_time
print("time budget: {:.2f}s, actual elapsed time: {:.2f}s".format(time_budget, elapsed_time))
print(f"time budget: {time_budget:.2f}s, actual elapsed time: {elapsed_time:.2f}s")
# assert abs(elapsed_time - time_budget) < 5 # cancel assertion because github VM sometimes is super slow, causing the test to fail
print(automl_experiment.predict(df))
print(automl_experiment.model)

View File

@@ -11,7 +11,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
@@ -75,8 +75,8 @@ def run_automl(budget=3, dataset_format="dataframe", hpo_method=None):
""" retrieve best config and best learner """
print("Best ML leaner:", automl.best_estimator)
print("Best hyperparmeter config:", automl.best_config)
print("Best accuracy on validation data: {0:.4g}".format(1 - automl.best_loss))
print("Training duration of best run: {0:.4g} s".format(automl.best_config_train_time))
print(f"Best accuracy on validation data: {1 - automl.best_loss:.4g}")
print(f"Training duration of best run: {automl.best_config_train_time:.4g} s")
print(automl.model.estimator)
print(automl.best_config_per_estimator)
print("time taken to find best model:", automl.time_to_find_best_model)

View File

@@ -14,7 +14,7 @@ from flaml.tune.spark.utils import check_spark
spark_available, _ = check_spark()
skip_spark = not spark_available
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
os.environ["FLAML_MAX_CONCURRENT"] = "2"
X, y = load_breast_cancer(return_X_y=True)

View File

@@ -36,7 +36,7 @@ except ImportError:
print("Spark is not installed. Skip all spark tests.")
skip_spark = True
pytestmark = pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests.")
pytestmark = [pytest.mark.skipif(skip_spark, reason="Spark is not installed. Skip all spark tests."), pytest.mark.spark]
def test_with_parameters_spark():
@@ -167,7 +167,7 @@ def test_len_labels():
assert len_labels(y1) == 4
ll, la = len_labels(y2, return_labels=True)
assert ll == 4
assert set(la.to_numpy()) == set([1, 2, 5, 4])
assert set(la.to_numpy()) == {1, 2, 5, 4}
def test_unique_value_first_index():

View File

@@ -50,11 +50,11 @@ def oml_to_vw_w_grouping(X, y, ds_dir, fname, orginal_dim, group_num, grouping_m
for i in range(len(X)):
NS_content = []
for zz in range(len(group_indexes)):
ns_features = " ".join("{}:{:.6f}".format(ind, X[i][ind]) for ind in group_indexes[zz])
ns_features = " ".join(f"{ind}:{X[i][ind]:.6f}" for ind in group_indexes[zz])
NS_content.append(ns_features)
ns_line = "{} |{}".format(
str(y[i]),
"|".join("{} {}".format(NS_LIST[j], NS_content[j]) for j in range(len(group_indexes))),
"|".join(f"{NS_LIST[j]} {NS_content[j]}" for j in range(len(group_indexes))),
)
f.write(ns_line)
f.write("\n")
@@ -67,7 +67,7 @@ def save_vw_dataset_w_ns(X, y, did, ds_dir, max_ns_num, is_regression):
"""convert openml dataset to vw example and save to file"""
print("is_regression", is_regression)
if is_regression:
fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
fname = f"ds_{did}_{max_ns_num}_{0}.vw"
print("dataset size", X.shape[0], X.shape[1])
print("saving data", did, ds_dir, fname)
dim = X.shape[1]
@@ -131,7 +131,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):
if is_regression:
# the second field specifies the largest number of namespaces using.
fname = "ds_{}_{}_{}.vw".format(did, max_ns_num, 0)
fname = f"ds_{did}_{max_ns_num}_{0}.vw"
vw_dataset_file = os.path.join(ds_dir, fname)
# if file does not exist, generate and save the datasets
if not os.path.exists(vw_dataset_file) or os.stat(vw_dataset_file).st_size < 1000:
@@ -139,7 +139,7 @@ def load_vw_dataset(did, ds_dir, is_regression, max_ns_num):
print(ds_dir, vw_dataset_file)
if not os.path.exists(ds_dir):
os.makedirs(ds_dir)
with open(os.path.join(ds_dir, fname), "r") as f:
with open(os.path.join(ds_dir, fname)) as f:
vw_content = f.read().splitlines()
print(type(vw_content), len(vw_content))
return vw_content

View File
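`oml_to_vw_w_grouping` above emits one Vowpal Wabbit text line per example, with features bucketed into namespaces. A tiny self-contained sketch of the same line format, assuming two namespaces named `a` and `b` and made-up feature values:

# Minimal sketch of the VW line format produced above: "<label> |a i:v ... |b i:v ...".
import numpy as np

NS_LIST = ["a", "b"]
X = np.array([[0.5, 1.25, 3.0], [2.0, 0.0, -1.5]])
y = np.array([1, -1])
group_indexes = [[0, 1], [2]]  # feature indices assigned to each namespace

for i in range(len(X)):
    NS_content = [
        " ".join(f"{ind}:{X[i][ind]:.6f}" for ind in group_indexes[zz]) for zz in range(len(group_indexes))
    ]
    ns_line = "{} |{}".format(
        str(y[i]),
        "|".join(f"{NS_LIST[j]} {NS_content[j]}" for j in range(len(group_indexes))),
    )
    print(ns_line)  # e.g. "1 |a 0:0.500000 1:1.250000|b 2:3.000000"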

@@ -59,6 +59,17 @@ def _test_hf_data():
except requests.exceptions.ConnectionError:
return
# Tests will only run if there is a GPU available
try:
import ray
pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
if not pg.wait(timeout_seconds=10): # Wait 10 seconds for resources
raise RuntimeError("No available node types can fulfill resource request!")
except RuntimeError:
return
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"

View File
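The GPU gate added above can be read as a small reusable check: request a 1-CPU/1-GPU placement group and bail out if Ray cannot schedule it within a timeout. A hedged standalone sketch of that check (`gpu_available` is a hypothetical helper name, not part of the test suite):

# Hedged sketch of the Ray GPU-availability gate used above; gpu_available is a
# hypothetical helper name. Assumes Ray is installed; ray.init starts a local
# cluster if none is running.
import ray


def gpu_available(timeout_seconds: int = 10) -> bool:
    ray.init(ignore_reinit_error=True)
    pg = ray.util.placement_group([{"CPU": 1, "GPU": 1}])
    ok = pg.wait(timeout_seconds=timeout_seconds)
    ray.util.remove_placement_group(pg)
    return ok


if __name__ == "__main__":
    print("GPU available:", gpu_available())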

@@ -75,10 +75,10 @@ def test_lexiflow():
layers = []
in_features = 28 * 28
for i in range(n_layers):
out_features = configuration["n_units_l{}".format(i)]
out_features = configuration[f"n_units_l{i}"]
layers.append(nn.Linear(in_features, out_features))
layers.append(nn.ReLU())
p = configuration["dropout_{}".format(i)]
p = configuration[f"dropout_{i}"]
layers.append(nn.Dropout(p))
in_features = out_features
layers.append(nn.Linear(in_features, 10))

View File
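The lexiflow hunk above builds a torch MLP from flat configuration keys of the form `n_units_l{i}` and `dropout_{i}`. A self-contained sketch of that construction with an illustrative configuration (the values are not taken from the test):

# Self-contained sketch: build an MLP from flat config keys n_units_l{i} / dropout_{i},
# mirroring the loop in the lexiflow test above. Config values are illustrative.
import torch.nn as nn

configuration = {"n_units_l0": 128, "dropout_0": 0.2, "n_units_l1": 64, "dropout_1": 0.1}
n_layers = 2

layers = []
in_features = 28 * 28
for i in range(n_layers):
    out_features = configuration[f"n_units_l{i}"]
    layers.append(nn.Linear(in_features, out_features))
    layers.append(nn.ReLU())
    layers.append(nn.Dropout(configuration[f"dropout_{i}"]))
    in_features = out_features
layers.append(nn.Linear(in_features, 10))
model = nn.Sequential(*layers)
print(model)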

@@ -24,7 +24,7 @@ try:
# __net_begin__
class Net(nn.Module):
def __init__(self, l1=120, l2=84):
super(Net, self).__init__()
super().__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
@@ -277,7 +277,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
logger.info(f"#trials={len(result.trials)}")
logger.info(f"time={time.time()-start_time}")
best_trial = result.get_best_trial("loss", "min", "all")
logger.info("Best trial config: {}".format(best_trial.config))
logger.info(f"Best trial config: {best_trial.config}")
logger.info("Best trial final validation loss: {}".format(best_trial.metric_analysis["loss"]["min"]))
logger.info("Best trial final validation accuracy: {}".format(best_trial.metric_analysis["accuracy"]["max"]))
@@ -296,7 +296,7 @@ def cifar10_main(method="BlendSearch", num_samples=10, max_num_epochs=100, gpus_
best_trained_model.load_state_dict(model_state)
test_acc = _test_accuracy(best_trained_model, device)
logger.info("Best trial test set accuracy: {}".format(test_acc))
logger.info(f"Best trial test set accuracy: {test_acc}")
# __main_end__

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

View File

@@ -1,5 +1,6 @@
Please find tutorials on FLAML below:
- [AutoML 2024](flaml-tutorial-automl-24.md)
- [PyData Seattle 2023](flaml-tutorial-pydata-23.md)
- [A hands-on tutorial on FLAML presented at KDD 2022](flaml-tutorial-kdd-22.md)
- [A lab forum on FLAML at AAAI 2023](flaml-tutorial-aaai-23.md)

Some files were not shown because too many files have changed in this diff.