PcaAnomalyDetector Class
Trains an anomaly detection model using approximate PCA via a randomized SVD algorithm.
- Inheritance: PcaAnomalyDetector derives from
  - nimbusml.internal.core.decomposition._pcaanomalydetector.PcaAnomalyDetector
  - nimbusml.base_predictor.BasePredictor
  - sklearn.base.ClassifierMixin
Constructor
PcaAnomalyDetector(normalize='Auto', caching='Auto', rank=20, oversampling=20, center=True, random_state=None, feature=None, weight=None, **params)
Parameters
Name | Description |
---|---|
feature | See Columns. |
weight | See Columns. |
normalize | Specifies the type of automatic normalization used. Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures the distances between data points are proportional and enables various optimization methods such as gradient descent to converge much faster. If normalization is performed, a MaxMin normalizer is used. |
caching | Whether the trainer should cache the input training data. |
rank | The number of components in the PCA. |
oversampling | Oversampling parameter for randomized PCA training. |
center | If enabled, the data is centered to have zero mean. |
random_state | The seed for random number generation. |
params | Additional arguments sent to the compute engine. |
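As a quick illustration of these arguments, the sketch below constructs a detector with the most commonly tuned parameters set explicitly; the feature column names 'f1', 'f2', 'f3' are hypothetical placeholders rather than columns from any particular dataset.
from nimbusml.decomposition import PcaAnomalyDetector
# hypothetical feature columns; substitute the columns of your own data
detector = PcaAnomalyDetector(
    rank=5,             # number of principal components to keep
    oversampling=20,    # extra dimensions used by the randomized SVD
    center=True,        # subtract the per-column mean before factorization
    normalize='Auto',   # let the trainer decide whether to rescale columns
    random_state=42,    # seed for the randomized algorithm
    feature=['f1', 'f2', 'f3'])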
Examples
###############################################################################
# PcaAnomalyDetector
from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.decomposition import PcaAnomalyDetector
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path)
print(data.head())
#    age  case education  induced  parity ... row_num  spontaneous ...
# 0   26     1    0-5yrs        1       6 ...       1            2 ...
# 1   42     1    0-5yrs        1       1 ...       2            0 ...
# 2   39     1    0-5yrs        2       6 ...       3            0 ...
# 3   34     1    0-5yrs        2       4 ...       4            0 ...
# 4   35     1   6-11yrs        1       3 ...       5            1 ...

# define the training pipeline
pipeline = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    PcaAnomalyDetector(rank=3, feature=['induced', 'edu'])
])

# train, predict, and evaluate
metrics, predictions = pipeline.fit(data).test(
    data, 'case', output_scores=True)

# print predictions
print(predictions.head())
#       Score
# 0  0.026155
# 1  0.026155
# 2  0.018055
# 3  0.018055
# 4  0.004043

# print evaluation metrics
print(metrics)
#         AUC  DR @K FP  DR @P FPR  DR @NumPos  Threshold @K FP ...
# 0  0.547718  0.084337         0    0.433735         0.009589 ...
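If only per-row scores are needed rather than evaluation metrics, the fitted pipeline can also score data directly. This is a minimal sketch assuming nimbusml's sklearn-style Pipeline.predict method; the exact output columns may differ from the test() output above.
# score rows with the already-fitted pipeline (sketch, not part of the original example)
scores = pipeline.predict(data)
print(scores.head())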
Remarks
PcaAnomalyDetector uses an approximate singular value decomposition (SVD) of the data covariance matrix to find the principal components (eigenvectors), using a randomized algorithm that allows large datasets to be factorized efficiently.
PCA produces a low-rank approximation of the matrix containing the data to be analyzed. Since most of the variance in the data is captured in the subspace spanned by the principal components, the distance of an instance to that subspace can be used as a measure for detecting outliers.
The rank argument specifies how many of the largest principal components are used to approximate the data matrix. A larger score at prediction time indicates that the instance lies farther from the principal subspace than expected and is therefore more likely to be an outlier.
Normalization of the dimensions (columns) is required and is turned on by default. Setting the normalize argument to 'No' will therefore result in poor performance.
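To make the distance-to-subspace idea concrete, the following standalone sketch (plain numpy, for illustration only, not the nimbusml implementation) computes the same kind of score: center the data, keep the top-k principal directions, and score each row by the norm of its residual after projection onto that subspace.
import numpy as np

rng = np.random.default_rng(0)
# normal rows lie (almost) in a 2-D subspace of the 5-D feature space
W = rng.normal(size=(2, 5))                  # basis of the "normal" subspace
X = rng.normal(size=(200, 2)) @ W            # 200 normal rows
X += 0.05 * rng.normal(size=X.shape)         # small noise off the subspace
X[0] += 5 * rng.normal(size=5)               # make row 0 an obvious outlier

k = 2                                        # analogous to the rank argument
Xc = X - X.mean(axis=0)                      # centering (the center argument)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k]                                  # top-k principal directions

proj = Xc @ Vk.T @ Vk                        # projection onto the PCA subspace
scores = np.linalg.norm(Xc - proj, axis=1)   # residual distance = anomaly score
print(scores[:5])                            # row 0's score stands out
The nimbusml trainer follows the same principle but estimates the top-k directions with a randomized algorithm (controlled by the oversampling argument) rather than a full SVD.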
Reference
- Randomized Methods for Computing the Singular Value Decomposition (SVD) of very large matrices
- A randomized algorithm for principal component analysis
- Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
Methods
Name | Description |
---|---|
get_params | Get the parameters for this operator. |

get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | Default value: False |
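A minimal usage sketch, assuming the sklearn-style behavior of get_params (a dict of constructor argument names and values):
# inspect the parameters of a constructed operator
detector = PcaAnomalyDetector(rank=3)
print(detector.get_params())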