This tutorial shows how to use SynapseML to identify the best combination of hyperparameters for your chosen classifiers, resulting in more accurate and reliable models. It demonstrates distributed randomized grid search hyperparameter tuning to build a model that identifies breast cancer.
Set up the dependencies
Import pandas and set up a Spark session:
import pandas as pd
from pyspark.sql import SparkSession
# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()
Read the data, and split it into tuning and test sets:
# Load the breast cancer dataset from public blob storage and cache it
data = spark.read.parquet(
"wasbs://publicwasb@mmlspark.blob.core.windows.net/BreastCancer.parquet"
).cache()
tune, test = data.randomSplit([0.80, 0.20])
tune.limit(10).toPandas()
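If you want the tuning and test split to be reproducible across runs, randomSplit also accepts an optional seed; the seed value shown here is arbitrary:
# Optional: pass a seed so the 80/20 split is the same on every run (seed value is arbitrary)
tune, test = data.randomSplit([0.80, 0.20], seed=1)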
Define the models to use:
from synapse.ml.automl import TuneHyperparameters
from synapse.ml.train import TrainClassifier
from pyspark.ml.classification import (
LogisticRegression,
RandomForestClassifier,
GBTClassifier,
)
logReg = LogisticRegression()
randForest = RandomForestClassifier()
gbt = GBTClassifier()
smlmodels = [logReg, randForest, gbt]
# Wrap each Spark ML classifier in a SynapseML TrainClassifier that trains on the Label column
mmlmodels = [TrainClassifier(model=model, labelCol="Label") for model in smlmodels]
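Optionally, check the class balance of the tuning set before training. This sketch assumes the label column is named Label, as used by TrainClassifier above:
# Optional: inspect how many rows fall into each class of the Label column
tune.groupBy("Label").count().show()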
Use AutoML to find the best model
Import the SynapseML AutoML classes from synapse.ml.automl. Specify the hyperparameters with HyperparamBuilder. Add either DiscreteHyperParam or RangeHyperParam hyperparameters. TuneHyperparameters randomly chooses values from a uniform distribution:
from synapse.ml.automl import *
paramBuilder = (
HyperparamBuilder()
.addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3))
.addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5, 10]))
.addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3, 5]))
.addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8, 16))
.addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3, 5]))
)
searchSpace = paramBuilder.build()
# The search space is a list of params to tuples of estimator and hyperparam
print(searchSpace)
randomSpace = RandomSpace(searchSpace)
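Because the built search space behaves like a list of parameter-to-(estimator, hyperparameter) pairs, you can also print its entries one per line. This is a minimal sketch that assumes the object returned by build() is iterable:
# Optional: print each search-space entry on its own line for readability
for entry in searchSpace:
    print(entry)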
Run TuneHyperparameters to get the best model:
bestModel = TuneHyperparameters(
evaluationMetric="accuracy",
models=mmlmodels,
numFolds=2,
numRuns=len(mmlmodels) * 2,
parallelism=1,
paramSpace=randomSpace.space(),
seed=0,
).fit(tune)
Evaluate the model
View the parameters of the best model, and retrieve the underlying best model pipeline:
print(bestModel.getBestModelInfo())
print(bestModel.getBestModel())
Score against the test set, and view the metrics:
from synapse.ml.train import ComputeModelStatistics
prediction = bestModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
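You can also inspect a few scored rows directly. The column names added to the predictions depend on the trained classifier, so print them first:
# Optional: list the columns added by scoring, then preview a few predictions
print(prediction.columns)
prediction.limit(5).toPandas()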