# CyberML

## access anomalies: complement_access.py

- Talk at European Spark Conference 2019
- (Internal Microsoft) Talk at MLADS November 2018
- (Internal Microsoft) Talk at MLADS June 2019

- ComplementAccessTransformer is a SparkML Transformer. Given a dataframe, it returns a new dataframe comprised of access patterns sampled from the set of possible access patterns not present in the original dataframe. In other words, it returns a sample from the complement set.
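The idea of sampling from the complement set can be sketched in plain Python. This is a minimal illustration, not the ComplementAccessTransformer API: the function name `sample_complement`, the `factor` parameter, and the example users and resources are all hypothetical, and the real transformer operates on Spark dataframes rather than Python lists.

```python
import random
from itertools import product


def sample_complement(observed, users, resources, factor=2, seed=0):
    """Sample (user, resource) pairs that do NOT appear in `observed`.

    Returns up to factor * (number of distinct observed pairs) distinct pairs,
    or the whole complement if it is smaller than that.
    `factor` is a hypothetical knob for this sketch.
    """
    seen = set(observed)
    # Every possible pair minus the pairs that were actually observed.
    complement = [pair for pair in product(users, resources) if pair not in seen]
    k = min(len(complement), factor * len(seen))
    # random.sample draws without replacement, so the result has no duplicates.
    return random.Random(seed).sample(complement, k)


observed = [("alice", "hr_db"), ("bob", "fin_db"), ("carol", "hr_db")]
users = ["alice", "bob", "carol"]
resources = ["hr_db", "fin_db", "wiki"]

sampled = sample_complement(observed, users, resources, factor=1)
```

Here 9 pairs are possible and 3 were observed, so the complement holds 6 pairs and `sampled` contains 3 of them.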

## feature engineering: indexers.py

- IdIndexer is a SparkML Estimator. Given a dataframe, it creates an IdIndexerModel (described next) for categorical features. The model maps each partition and column value seen in the given dataframe to an ID, using either a separate consecutive range for each partition or one consecutive range for all partition and column values.
- IdIndexerModel is a SparkML Transformer. Given a dataframe, it maps each partition and column field to a consecutive integer ID. Partition or column values not encountered by the estimator are mapped to 0. The model can operate in two modes: it creates consecutive integer IDs either independently for each partition or as one consecutive range for all partition and column values.
- MultiIndexer is a SparkML Estimator. It uses multiple IdIndexers to generate a MultiIndexerModel (described next) for categorical features. The model contains multiple IdIndexers for multiple partitions and columns.
- MultiIndexerModel is a SparkML Transformer. Given a dataframe, it maps each partition and column field to a consecutive integer ID. Partition or column values not encountered by the estimator are mapped to 0. The model can operate in two modes: it creates consecutive integer IDs either independently for each partition or as one consecutive range for all partition and column values.
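The two indexing modes described above can be sketched in plain Python. This is an illustration of the mapping only, not the IdIndexer API: the names `fit_id_index`, `transform_id_index`, and `reset_per_partition` are hypothetical, and starting the IDs at 1 is an assumption that follows from 0 being reserved for unseen values.

```python
from collections import defaultdict


def fit_id_index(rows, reset_per_partition=True):
    """Build a {(partition, value): id} mapping from (partition, value) rows.

    IDs start at 1 (assumption) so that 0 can stand for unseen values.
    reset_per_partition=True  -> each partition gets its own range 1, 2, 3, ...
    reset_per_partition=False -> one consecutive range shared by all partitions.
    """
    mapping = {}
    last_id = defaultdict(int)  # last ID handed out, per counter key
    for partition, value in rows:
        if (partition, value) in mapping:
            continue  # already indexed
        counter_key = partition if reset_per_partition else None
        last_id[counter_key] += 1
        mapping[(partition, value)] = last_id[counter_key]
    return mapping


def transform_id_index(mapping, rows):
    """Map rows to IDs; anything not seen during fitting becomes 0."""
    return [mapping.get((partition, value), 0) for partition, value in rows]


rows = [("t1", "alice"), ("t1", "bob"), ("t2", "carol"), ("t1", "alice")]

per_partition = fit_id_index(rows, reset_per_partition=True)
global_range = fit_id_index(rows, reset_per_partition=False)

ids_per_partition = transform_id_index(per_partition, rows)
ids_global = transform_id_index(global_range, rows)
ids_unseen = transform_id_index(per_partition, [("t2", "dave")])
```

In per-partition mode `carol` gets ID 1 because she is the first value in partition `t2`; in the shared mode she gets ID 3.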

## feature engineering: scalers.py

- StandardScalarScaler is a SparkML Estimator. Given a dataframe, it creates a StandardScalarScalerModel (described next), which normalizes any given dataframe according to the mean and standard deviation calculated on the dataframe given to the estimator.
- StandardScalarScalerModel is a SparkML Transformer. Given a dataframe with a value column x, the transformer replaces each value as follows: x' = (x - mean) / stddev. That is, if the transformer is given the same dataframe the estimator was given, the value column will have a mean of 0.0 and a standard deviation of 1.0.
- LinearScalarScaler is a SparkML Estimator. Given a dataframe, it creates a LinearScalarScalerModel (described next), which normalizes any given dataframe according to the minimum and maximum values calculated on the dataframe given to the estimator.
- LinearScalarScalerModel is a SparkML Transformer. Given a dataframe with a value column x, the transformer rescales each value linearly so that, if the transformer is given the same dataframe the estimator was given, the value column spans exactly the requested output range.
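The arithmetic behind both scalers can be sketched in plain Python. This is not the scalers.py API: the function names and the `out_min`/`out_max` parameters are hypothetical, and the use of the population standard deviation is an assumption (the library may use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def fit_standard(values):
    """'Estimator' step: learn mean and (population) standard deviation."""
    return mean(values), pstdev(values)


def standard_scale(x, mu, sigma):
    """'Transformer' step: x' = (x - mean) / stddev."""
    return (x - mu) / sigma


def fit_linear(values):
    """'Estimator' step: learn the minimum and maximum."""
    return min(values), max(values)


def linear_scale(x, lo, hi, out_min=0.0, out_max=1.0):
    """'Transformer' step: map [lo, hi] linearly onto [out_min, out_max]."""
    return out_min + (x - lo) * (out_max - out_min) / (hi - lo)


# Note: a constant column (stddev == 0 or min == max) is not handled here.
train = [2.0, 4.0, 6.0, 8.0]

mu, sigma = fit_standard(train)
standardized = [standard_scale(x, mu, sigma) for x in train]

lo, hi = fit_linear(train)
rescaled = [linear_scale(x, lo, hi, out_min=0.0, out_max=10.0) for x in train]
```

Applied to the training data itself, `standardized` has mean 0.0 and standard deviation 1.0 (up to floating-point error), and `rescaled` starts at 0.0 and ends at 10.0. Values outside the fitted range would land outside the output range.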

## access anomalies: collaborative_filtering.py

- AccessAnomaly is a SparkML Estimator. Given a dataframe, the estimator generates an AccessAnomalyModel (described next). The model can detect anomalous access of users to resources, where the access falls outside the user's or resource's profile: for instance, a user from HR accessing a resource from Finance. The result is based solely on access patterns rather than explicit features. Internally, the code is based on Collaborative Filtering as implemented in Spark, using Matrix Factorization with Alternating Least Squares.
- AccessAnomalyModel is a SparkML Transformer. Given a dataframe, the transformer computes a score in the range (-inf, inf), where positive values indicate an anomaly. Anomaly scores are normalized to have a mean of 0.0 and a standard deviation of 1.0 over the original dataframe given to the estimator.
- ModelNormalizeTransformer is a SparkML Transformer. This transformer is used internally by AccessAnomaly to normalize a model to generate anomaly scores with mean 0.0 and standard deviation of 1.0.
- AccessAnomalyConfig contains the default values for AccessAnomaly.
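How a factorization-based score might be turned into a normalized anomaly score can be sketched in plain Python. Everything here is an assumption made for illustration: the latent vectors are hand-picked rather than learned with Alternating Least Squares, the raw score is taken to be the negated dot product of the user and resource vectors, and none of the names correspond to the AccessAnomaly API.

```python
from statistics import mean, pstdev


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical latent factors; in practice these would come from matrix
# factorization over the observed (user, resource) access patterns.
user_vecs = {"alice": [1.0, 0.1], "bob": [0.9, 0.2], "carol": [0.1, 1.0]}
res_vecs = {"hr_db": [1.0, 0.0], "fin_db": [0.0, 1.0]}

# Accesses seen during fitting.
train_pairs = [("alice", "hr_db"), ("bob", "hr_db"), ("carol", "fin_db")]


def raw_score(user, res):
    # Assumption: low affinity (small dot product) means a more unusual access,
    # so the dot product is negated to make "unusual" score high.
    return -dot(user_vecs[user], res_vecs[res])


# Normalization statistics are computed once, over the training accesses.
train_raw = [raw_score(u, r) for u, r in train_pairs]
mu, sigma = mean(train_raw), pstdev(train_raw)


def anomaly_score(user, res):
    """Standardized score: mean 0.0 and stddev 1.0 over the training pairs."""
    return (raw_score(user, res) - mu) / sigma


train_scores = [anomaly_score(u, r) for u, r in train_pairs]
cross_dept = anomaly_score("alice", "fin_db")  # HR-like user, Finance resource
```

In this toy setup the cross-department access scores well above every training access, which matches the intuition of a user from HR accessing a resource from Finance.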