CV Class

Cross Validation

Inheritance
builtins.object
CV

Constructor

CV(pipeline)

Parameters

Name        Description
pipeline    Pipeline object, or a list of pipeline steps, that is used for
            cross validation.

Remarks

Cross validation is a technique for training and testing a model when only one dataset is available. The dataset is partitioned into k parts (k is specified by the user), called folds. Each fold, in turn, is used as the test set, while the rest of the data is used as the training set. The result is k separate models. The metrics for each model are reported separately, as well as the average of each metric across all models.

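As a conceptual illustration (plain Python only, not the library's internal implementation), the partitioning described above can be sketched as follows:

   # Sketch of k-fold partitioning: each fold serves once as the test set,
   # and the remaining rows form the training set for that model.
   n_rows, k = 10, 5                          # k is chosen by the user
   indices = list(range(n_rows))
   folds = [indices[i::k] for i in range(k)]  # k disjoint folds

   for i, test_fold in enumerate(folds):
       train = [idx for idx in indices if idx not in test_fold]
       print('model %d trains on %s and is evaluated on %s' % (i, train, test_fold))
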
Methods

fit

Cross validate the pipeline and return the results.

The input data X can be in any form acceptable by the input pipeline.

Sometimes there is a need to specify which examples must not be separated into different folds. Take, for instance, a ranking problem where instances have a "query" and a "url" feature. Instances that have the same query value should always be placed in the same fold (otherwise the algorithm "cheats" by seeing other examples for the same query). In such cases, the groups column can be used: data rows that have the same value in the groups column are kept in the same fold.

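A minimal, hypothetical sketch of this is shown below; the column names are invented, and it assumes groups accepts the name of a column in the input data, as the description above suggests:

   import pandas as pd
   from nimbusml.linear_model import FastLinearRegressor
   from nimbusml.model_selection import CV

   # Toy ranking-style data: rows sharing a 'query' value must stay together.
   df = pd.DataFrame({
       'query': ['q1', 'q1', 'q1', 'q2', 'q2', 'q3', 'q3', 'q3'],
       'relevance': [3.0, 1.0, 0.0, 2.0, 1.0, 3.0, 2.0, 0.0],
       'feat1': [0.1, 0.4, 0.9, 0.2, 0.5, 0.3, 0.8, 0.7],
       'feat2': [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]})

   pipeline_steps = [FastLinearRegressor(feature=['feat1', 'feat2'],
                                         label='relevance')]

   # Rows with the same 'query' value are never separated into different folds.
   cv_results = CV(pipeline_steps).fit(df, cv=2, groups='query')
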
split_start can be 'before_transforms' or 'after_transforms'. When the pipeline has many transforms, it is more efficient to run the transforms before splitting the data, so that the transforms run only once instead of once per fold. However, some transforms learn from the data, which could cause data leakage, so extra care must be taken when using this option. split_start specifies precisely where the data is split (see the sketch after this list for the integer form):

  • 'before_transforms': split the data before all the transforms in the pipeline. This is the default behavior and does not cause any data leakage.

  • 'after_transforms': split the data after all the transforms in the pipeline. This is the fastest option, but it could cause data leakage, depending on the transforms. The results from this option can be compared with the 'before_transforms' results to confirm that no data leakage occurs.

  • For precise control, split_start can be given as an int: pipeline_steps[:split_start] is applied before the split, and pipeline_steps[split_start:] is applied after the split. Note that 'after_transforms' is equivalent to -1, and 'before_transforms' is equivalent to 0.
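
For instance, in this small sketch (which reuses the infert setup from Case 1 of the Examples below), an integer split_start pins the split point explicitly:

   import numpy as np
   from nimbusml import FileDataStream, DataSchema
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   from nimbusml.linear_model import LogisticRegressionClassifier
   from nimbusml.model_selection import CV

   path = get_dataset('infert').as_filepath()
   schema = DataSchema.read_schema(path, numeric_dtype=np.float32)
   data = FileDataStream.read_csv(path, schema=schema)

   pipeline_steps = [
       OneHotVectorizer(columns={'edu': 'education'}),
       LogisticRegressionClassifier(feature=['age', 'spontaneous', 'edu'],
                                    label='induced')]

   # pipeline_steps[:1] (the OneHotVectorizer) runs once, before the split;
   # pipeline_steps[1:] (the learner) runs once per fold. For this two-step
   # pipeline, split_start=1 is therefore equivalent to 'after_transforms' (-1).
   cv_results = CV(pipeline_steps).fit(data, cv=3, split_start=1)
   print(cv_results['metrics_summary'])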

The return value is a dict of pandas dataframes. The possible keys for this dict are:

  • 'predictions': dataframe containing the predictions for the input data. The prediction for each data point is the one produced when the fold containing that data point was used as the test set.

  • 'models': dataframe containing the model file path per fold.

  • 'metrics': dataframe containing the metrics per fold.

  • 'metrics_summary': dataframe containing the summary statistics of the metrics.

  • 'confusion_matrix': dataframe containing the confusion matrix per fold (only applicable to classification).
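
For example, continuing from the sketch above (or from Case 1 in the Examples section), each entry can be read as an ordinary pandas dataframe:

   # cv_results as produced by a CV(...).fit(...) call such as the ones shown
   # in the sketches and Examples on this page
   print(cv_results['predictions'].head())  # out-of-fold prediction per input row
   print(cv_results['models'])              # model file path per fold
   print(cv_results['metrics'])             # metrics per fold
   print(cv_results['metrics_summary'])     # summary statistics of the metrics
   print(cv_results['confusion_matrix'])    # confusion matrix per fold (classification only)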

fit(X, y=None, cv=2, groups=None, split_start='before_transforms', **params)

Parameters

Name          Description
params        Required. Additional arguments sent to the compute engine.
y             Default value: None
cv            Number of folds. Default value: 2
groups        Default value: None (see the description above)
split_start   Default value: 'before_transforms' (see the description above)

Returns

Type   Description
dict   A dict of pandas dataframes. The possible keys are 'predictions',
       'models', 'metrics', 'metrics_summary', and 'confusion_matrix'
       (see the descriptions above).

Examples


   ###############################################################################
   # CV - cross-validate data
   import numpy as np
   from nimbusml import Pipeline, FileDataStream, DataSchema
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   from nimbusml.linear_model import LogisticRegressionClassifier, \
       FastLinearRegressor
   from nimbusml.model_selection import CV
   from nimbusml.preprocessing.missing_values import Indicator, Handler

   # Case 1: Default usage of CV

   path = get_dataset('infert').as_filepath()
   schema = DataSchema.read_schema(path, numeric_dtype=np.float32)
   data = FileDataStream.read_csv(path, schema=schema)

   pipeline = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       LogisticRegressionClassifier(feature=['age', 'spontaneous', 'edu'],
                                    label='induced')])

   # Do 3-fold cross-validation
   cv_results = CV(pipeline).fit(data, cv=3)

   # print summary statistic of metrics
   print(cv_results['metrics_summary'])

   # print metrics for all folds
   print(cv_results['metrics'])

   # print confusion matrix for fold 1
   cm = cv_results['confusion_matrix']
   print(cm[cm.Fold == 1])

   # Case 2: Using CV with split_start option

   path = get_dataset("airquality").as_filepath()
   schema = DataSchema.read_schema(path)
   data = FileDataStream(path, schema)

   # CV also accepts the list of pipeline steps directly as input
   pipeline_steps = [
       Indicator() << {'Ozone_ind': 'Ozone', 'Solar_R_ind': 'Solar_R'},
       Handler(replace_with='Mean') << {'Solar_R': 'Solar_R', 'Ozone': 'Ozone'},
       FastLinearRegressor(
           feature=['Ozone', 'Solar_R', 'Ozone_ind', 'Solar_R_ind', 'Temp'],
           label='Wind')]

   # Since the Indicator and Handler transforms don't learn from data,
   # they can be run once before splitting the data into folds, instead of
   # being repeated once per fold. The split_start='after_transforms' option
   # achieves this optimization.
   cv_results = CV(pipeline_steps).fit(data, split_start='after_transforms')

   # Results can be accessed the same way as in Case 1 above.
   print(cv_results['metrics_summary'])

Attributes

fold_column_name

fold_column_name = 'Fold'
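
For instance (assuming cv_results from Case 1 of the Examples above), the per-fold confusion matrix can be filtered through this attribute instead of the hard-coded column name:

   # Equivalent to cm[cm.Fold == 1] from Case 1 above.
   cm = cv_results['confusion_matrix']
   print(cm[cm[CV.fold_column_name] == 1])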