CV Class
Cross Validation
Inheritance
builtins.object → CV
Constructor
CV(pipeline)
Parameters
Name | Description |
---|---|
pipeline | Pipeline object, or a list of pipeline steps, used for cross validation |
Remarks
Cross validation is a technique for training and testing a model when there is only one dataset. The dataset is partitioned into k parts (with k specified by the user), called folds. Each fold, in turn, is used as a test set, while the rest of the data is used as the training set. The result is k separate models. The metrics for each model are reported separately, as is the average of each metric across all models.
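As an illustrative sketch (plain Python, not the nimbusml API), the k-fold partitioning described above can be pictured as follows: each of the k folds serves once as the test set while the remaining folds form the training set.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross validation over n_samples data points."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder
        end = (i + 1) * fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

# 6 samples, 3 folds: each sample appears in exactly one test set.
for train, test in kfold_indices(6, 3):
    print(train, test)
# -> [2, 3, 4, 5] [0, 1]
#    [0, 1, 4, 5] [2, 3]
#    [0, 1, 2, 3] [4, 5]
```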
Methods
Name | Description |
---|---|
fit | Cross validate the pipeline and return the results. |
fit
Cross validate the pipeline and return the results.
X: input data, in any format acceptable by the input pipeline.

groups: column that controls which rows may not be separated into different folds. Sometimes there is a need to specify which examples should not be separated into different folds. Take, for instance, a ranking problem, where instances have a "query" and a "url" feature. Instances that have the same query value should always be in the same fold (otherwise the algorithm "cheats" by seeing examples for the same query). In such cases, the groups column can be used: data rows that have the same value in the groups column will be placed in the same fold.
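The grouping constraint can be illustrated with a small pure-Python sketch (not nimbusml code): rows that share a group key are always assigned to the same fold.

```python
def assign_folds_by_group(groups, cv):
    """Assign each row to a fold such that rows with the same group
    value always land in the same fold (round-robin over groups)."""
    fold_of_group = {}
    folds = []
    for g in groups:
        if g not in fold_of_group:
            # each new distinct group gets the next fold, cyclically
            fold_of_group[g] = len(fold_of_group) % cv
        folds.append(fold_of_group[g])
    return folds

# Rows 0 and 2 share query "q1", so they end up in the same fold.
queries = ["q1", "q2", "q1", "q3", "q2"]
print(assign_folds_by_group(queries, cv=2))  # -> [0, 1, 0, 0, 1]
```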
split_start: 'before_transforms' or 'after_transforms'. When the pipeline has many transforms, it can be more efficient to run the transforms before splitting the data, so that the transforms run only once instead of once per fold. However, with transforms that learn from the data, this could cause data leak, so extra care must be taken when using this option. split_start specifies precisely where the data split happens:

- 'before_transforms' means the data is split before all the transforms in the pipeline. This is the default behavior and does not cause any data leak.
- 'after_transforms' means the data is split after all the transforms in the pipeline. This is the fastest option, but could cause data leak, depending on the transforms. The results from this option can be compared with the 'before_transforms' results to ensure that no data leak happens.
- For precise control, split_start can be specified as an int, in which case pipeline_steps[:split_start] is applied before the split and pipeline_steps[split_start:] after it. Note that 'after_transforms' is equivalent to -1, and 'before_transforms' is equivalent to 0.
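The int form of split_start is ordinary Python slicing. A small sketch (using hypothetical step names, not nimbusml objects) shows how a pipeline is divided into the steps that run once before the split and the steps that run once per fold:

```python
def split_pipeline(steps, split_start):
    """Divide pipeline steps into those applied before the data
    split (run once) and those applied after it (run per fold)."""
    if split_start == "before_transforms":
        split_start = 0
    elif split_start == "after_transforms":
        split_start = -1
    return steps[:split_start], steps[split_start:]

# Hypothetical three-step pipeline: two transforms, then a learner.
pipeline_steps = ["indicator", "mean_handler", "learner"]

# 'after_transforms' (== -1): everything but the last step runs once.
print(split_pipeline(pipeline_steps, "after_transforms"))
# -> (['indicator', 'mean_handler'], ['learner'])

# 'before_transforms' (== 0): every step runs once per fold.
print(split_pipeline(pipeline_steps, "before_transforms"))
# -> ([], ['indicator', 'mean_handler', 'learner'])
```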
The return value is a dict of pandas dataframes. The possible keys for this dict are:

- 'predictions': dataframe containing the predictions for the input data. The prediction for each data point corresponds to the prediction when the fold containing that data point was used as test data.
- 'models': dataframe containing the model file path per fold.
- 'metrics': dataframe containing the metrics per fold.
- 'metrics_summary': dataframe containing the summary statistics of the metrics.
- 'confusion_matrix': dataframe containing the confusion matrix per fold (only applicable to classification).
fit(X, y=None, cv=2, groups=None, split_start='before_transforms', **params)
Parameters
Name | Description |
---|---|
params | Required. Additional arguments sent to compute engine. |
y | Default value: None |
cv | Default value: 2 |
groups | Default value: None |
split_start | Default value: before_transforms |
Returns
Type | Description |
---|---|
dict | dict of pandas dataframes. The possible keys for this dict are described under fit above. |
Examples
###############################################################################
# CV - cross-validate data
import numpy as np
from nimbusml import Pipeline, FileDataStream, DataSchema
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.categorical import OneHotVectorizer
from nimbusml.linear_model import LogisticRegressionClassifier, \
FastLinearRegressor
from nimbusml.model_selection import CV
from nimbusml.preprocessing.missing_values import Indicator, Handler
# Case 1: Default usage of CV
path = get_dataset('infert').as_filepath()
schema = DataSchema.read_schema(path, numeric_dtype=np.float32)
data = FileDataStream.read_csv(path, schema=schema)
pipeline = Pipeline([
OneHotVectorizer(columns={'edu': 'education'}),
LogisticRegressionClassifier(feature=['age', 'spontaneous', 'edu'],
label='induced')])
# Do 3-fold cross-validation
cv_results = CV(pipeline).fit(data, cv=3)
# print summary statistic of metrics
print(cv_results['metrics_summary'])
# print metrics for all folds
print(cv_results['metrics'])
# print confusion matrix for fold 1
cm = cv_results['confusion_matrix']
print(cm[cm.Fold == 1])
# Case 2: Using CV with split_start option
path = get_dataset("airquality").as_filepath()
schema = DataSchema.read_schema(path)
data = FileDataStream(path, schema)
# CV also accepts the list of pipeline steps directly as input
pipeline_steps = [
Indicator() << {
'Ozone_ind': 'Ozone',
'Solar_R_ind': 'Solar_R'},
Handler(
replace_with='Mean') << {
'Solar_R': 'Solar_R',
'Ozone': 'Ozone'},
FastLinearRegressor(
feature=[
'Ozone',
'Solar_R',
'Ozone_ind',
'Solar_R_ind',
'Temp'],
label='Wind')]
# Since the Indicator and Handler transforms don't learn from data,
# they could be run once before splitting the data into folds, instead of
# repeating them once per fold. We use 'split_start=after_transforms' option
# to achieve this optimization.
cv_results = CV(pipeline_steps).fit(data, split_start='after_transforms')
# Results can be accessed the same way as in Case 1 above.
print(cv_results['metrics_summary'])
Attributes
fold_column_name
fold_column_name = 'Fold'