nanotune.classification.classifier

class nanotune.classification.classifier.Classifier(data_filenames: List[str], category: str, data_types: Optional[List[str]] = None, folder_path: Optional[str] = None, test_size: float = 0.2, classifier_type: Optional[str] = None, hyper_parameters: Optional[Dict[str, Union[str, float, int]]] = None, retained_variance: float = 0.99, file_fractions: Optional[List[float]] = None)[source]

Bases: object

Emulates binary classifiers implemented in scikit-learn. It includes methods for loading labelled data saved to numpy files, splitting it into balanced sets, and training and evaluating the classifier. The data to load is expected to be in the same format as produced by nanotune.data.export_data.export_data.

hyper_parameters

hyperparameters/keyword arguments to pass to the binary classifier.

category

which measurements will be classified, e.g. ‘pinchoff’, ‘singledot’, ‘dotregime’. Supported categories correspond to keys of nt.config[‘core’][‘features’].

file_fractions

fraction of all loaded data that should be used. For a value less than 1, the data is chosen at random. This can be used to estimate the variation in accuracy when different random subsets of the available data are used for training.

folder_path

path to the folder where the numpy data files are located.

file_paths

list of paths of numpy data files.

classifier_type

string indicating which binary classifier to use, e.g. ‘SVC’ or ‘MLPClassifier’. For a full list see the keys of DEFAULT_CLF_PARAMETERS.

retained_variance

variance to retain when calculating principal components.

data_types

data types, one or more of: ‘signal’, ‘gradient’, ‘frequencies’, ‘features’.

test_size

size of the test set, given as the fraction of all loaded data.

clf

instance of a scikit-learn binary classifier.

original_data

all data loaded.

labels

labels of original data.
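
Example (a minimal usage sketch; the file name, folder path and classifier choice below are hypothetical, and the file is expected to contain labelled data written by nanotune.data.export_data.export_data):

    from nanotune.classification.classifier import Classifier

    # Hypothetical file name and folder; the file must hold labelled data
    # exported by nanotune.data.export_data.export_data.
    classifier = Classifier(
        data_filenames=["pinchoff.npy"],
        category="pinchoff",
        data_types=["signal"],
        folder_path="./labelled_data",
        classifier_type="MLPClassifier",
        test_size=0.2,
    )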

load_data(file_paths: List[str], data_types: List[str], file_fractions: Optional[List[float]] = [1.0]) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[int64]]][source]

Loads data including labels from numpy files.

Parameters
  • file_paths – list of paths of numpy data files.

  • data_types – data types, one or more of: ‘signal’, ‘gradient’, ‘frequencies’, ‘features’.

  • file_fractions – fraction of all loaded data that should be used. For a value less than 1, the data is chosen at random. This can be used to estimate the variation in accuracy when different random subsets of the available data are used for training.

Returns
  • np.array – data

  • np.array – labels
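
Example (a sketch continuing from the constructor example above; the file path is hypothetical):

    # Hypothetical path; data and labels are read from the numpy file.
    data, labels = classifier.load_data(
        file_paths=["./labelled_data/pinchoff.npy"],
        data_types=["signal"],
        file_fractions=[1.0],
    )
    print(data.shape, labels.shape)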

select_relevant_data(data: ndarray[Any, dtype[float64]], labels: ndarray[Any, dtype[int64]]) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[int64]]][source]

Extracts a subset of the data depending on which labels we would like to predict. Only relevant for dot data, as the data file may contain four different labels.

Parameters
  • data – original data

  • labels – original labels
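
Example (continuing the sketch above; mainly of interest for dot data):

    # For dot data the file may hold up to four label values; only those
    # relevant to the classifier's category are retained.
    data, labels = classifier.select_relevant_data(data, labels)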

train(data: Optional[ndarray[Any, dtype[float64]]] = None, labels: Optional[ndarray[Any, dtype[int64]]] = None) None[source]

Trains binary classifier. Equal populations of data are selected before the sklearn classifier is fitted.

Parameters
  • data – prepped data

  • labels – corresponding labels
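
Example (continuing the sketch above):

    # Train on the data loaded when the Classifier was constructed ...
    classifier.train()
    # ... or on explicitly prepared arrays.
    classifier.train(data=data, labels=labels)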

prep_data(train_data: Optional[ndarray[Any, dtype[float64]]] = None, test_data: Optional[ndarray[Any, dtype[float64]]] = None, perform_pca: Optional[bool] = False, scale_pc: Optional[bool] = False) Tuple[Optional[ndarray[Any, dtype[float64]]], Optional[ndarray[Any, dtype[float64]]]][source]

Prepares data for training and testing. It scales the data and, if desired, extracts principal components. Transformations are applied to both train and test data, although they are first fitted on the training data and then applied to the test data.

Parameters
  • train_data – dataset to be used for training

  • test_data – dataset to be used for testing

  • perform_pca – whether or not to perform a PCA and thus use principal components for training and testing.

  • scale_pc – whether to scale principal components.

Returns
  • np.array – prepped training data

  • np.array – prepped testing data
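
Example (a sketch; train_data and test_data as returned by split_data below):

    prepped_train, prepped_test = classifier.prep_data(
        train_data=train_data,
        test_data=test_data,
        perform_pca=True,   # project onto principal components
        scale_pc=True,      # scale the resulting components
    )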

select_equal_populations(data: ndarray[Any, dtype[float64]], labels: ndarray[Any, dtype[int64]]) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[int64]]][source]

Selects a subset of the data at random so that each population/label appears equally often, yielding a balanced dataset.

Parameters
  • data – the data to subsample and balance

  • labels – labels corresponding to data.
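
The balancing idea can be sketched in plain numpy (an illustration of the technique, not the library's implementation): keep as many samples of each label as the rarest label provides, chosen at random.

    import numpy as np

    def balance(data, labels, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Keep n samples per label, where n is the size of the smallest class.
        classes, counts = np.unique(labels, return_counts=True)
        n = counts.min()
        idx = np.concatenate(
            [rng.choice(np.where(labels == c)[0], size=n, replace=False)
             for c in classes]
        )
        rng.shuffle(idx)
        return data[idx], labels[idx]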

split_data(data: ndarray[Any, dtype[float64]], labels: ndarray[Any, dtype[int64]]) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[int64]], ndarray[Any, dtype[int64]]][source]

Splits data into train and test sets. Emulates sklearn’s train_test_split.

Parameters
  • data – the data to split

  • labels – labels of data

Returns
  • np.array – train data

  • np.array – test data

  • np.array – train labels

  • np.array – test labels
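
Example (continuing the sketch above, with data and labels as previously loaded):

    train_data, test_data, train_labels, test_labels = classifier.split_data(
        data, labels
    )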

scale_raw_data(train_data: Optional[ndarray[Any, dtype[float64]]] = None, test_data: Optional[ndarray[Any, dtype[float64]]] = None) Tuple[Optional[ndarray[Any, dtype[float64]]], Optional[ndarray[Any, dtype[float64]]]][source]

Scales data loaded from numpy files by emulating sklearn’s StandardScaler. The scaler is first fitted using the train set before also scaling the test set.

Parameters
  • train_data – dataset to be used for training

  • test_data – dataset to be used for testing

Returns
  • np.array – scaled training data

  • np.array – scaled test data
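
The fit-on-train, apply-to-both pattern this emulates can be sketched with sklearn directly (an illustration, not the library's internal code):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    # Fit the scaling parameters on the training set only, then apply the
    # same transformation to the test set to avoid leaking test statistics.
    scaled_train = scaler.fit_transform(train_data)
    scaled_test = scaler.transform(test_data)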

scale_compressed_data(train_data: Optional[ndarray[Any, dtype[float64]]] = None, test_data: Optional[ndarray[Any, dtype[float64]]] = None) Tuple[Optional[ndarray[Any, dtype[float64]]], Optional[ndarray[Any, dtype[float64]]]][source]

Scales previously computed principal components.

Parameters
  • train_data – dataset to be used for training, containing principal components.

  • test_data – dataset to be used for testing, containing principal components.

Returns
  • np.array – scaled training data containing scaled principal components.

  • np.array – scaled test data containing scaled principal components.

get_principle_components(train_data: Optional[ndarray[Any, dtype[float64]]] = None, test_data: Optional[ndarray[Any, dtype[float64]]] = None) Tuple[Optional[ndarray[Any, dtype[float64]]], Optional[ndarray[Any, dtype[float64]]]][source]

Computes principal components.

Parameters
  • train_data – dataset to be used for training

  • test_data – dataset to be used for testing

Returns
  • np.array – scaled training data containing principal components.

  • np.array – scaled test data containing principal components.
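
Retaining a given fraction of variance can be sketched with sklearn’s PCA, which interprets a float n_components as the fraction of variance to keep (an illustration of what retained_variance controls, not necessarily the library's exact code):

    from sklearn.decomposition import PCA

    # A float n_components keeps as many components as needed to explain
    # at least that fraction of the variance.
    pca = PCA(n_components=0.99)
    train_pc = pca.fit_transform(train_data)
    test_pc = pca.transform(test_data)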

score(test_data: ndarray[Any, dtype[float64]], test_labels: ndarray[Any, dtype[float64]]) float[source]

Emulates the binary classifier’s score method and returns the mean accuracy for the given test set.

Parameters
  • test_data – prepped test data

  • test_labels – labels of test_data.

Returns

float – mean accuracy.
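
Example (continuing the sketches above, with prepped test data and labels):

    accuracy = classifier.score(prepped_test, test_labels)
    print(f"mean accuracy: {accuracy:.3f}")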

predict(dataid: int, db_name: str, db_folder: Optional[str] = None, readout_method_to_use: str = 'transport') List[Any][source]

Classifies the trace of a QCoDeS dataset. Ideally this data has been measured with nanotune and/or has normalization constants saved to metadata under the nt.meta_tag key; otherwise correct classification cannot be guaranteed.

Parameters
  • dataid – QCoDeS run ID.

  • db_name – name of database

  • db_folder – path to folder where database is located.

Returns

list – the classification result as integers.
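
Example (a sketch; the run ID, database name and folder are hypothetical and must refer to an existing QCoDeS database):

    # Hypothetical run ID and database.
    result = classifier.predict(
        dataid=101,
        db_name="experiments.db",
        db_folder="./data",
    )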

compute_metrics(n_iter: Optional[int] = None, save_to_file: bool = True, filename: str = '', supp_train_data: Optional[List[str]] = None, n_supp_train: Optional[int] = None, perform_pca: bool = False, scale_pc: bool = False) Tuple[Dict[str, Dict[str, Any]], ndarray[Any, dtype[float64]]][source]

Computes different metrics of a classifier, averaging over a number of iterations. Besides the metrics, the training time is tracked too. All extracted information is saved to a dict. Metrics computed are accuracy_score, brier_score_loss, auc (area under curve), average_precision_recall and the confusion matrix, all implemented in sklearn.metrics.

Parameters
  • n_iter – number of train and test iterations over which the metrics statistic should be computed.

  • save_to_file – whether to save metrics info to file.

  • filename – name of json file if metrics are saved.

  • supp_train_data – list of paths to files with additional training data.

  • n_supp_train – number of datasets which should be added to the train set from additional data.

  • perform_pca – whether the metrics should be computed using principal components for training and testing.

  • scale_pc – whether principal components should be scaled.

Returns
  • dict – summary of results, mapping a string indicating the quantity onto the quantity itself. Contains the mean and std of each metric.

  • np.array – metric results of all iterations. Shape is (len(METRIC_NAMES), n_iter), where the metrics appear in the order defined by METRIC_NAMES.
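
Example (a sketch averaging the metrics over ten iterations; the output file name is hypothetical, and the keys of the summary dict are assumed to follow the metric names listed above):

    summary, all_results = classifier.compute_metrics(
        n_iter=10,
        save_to_file=True,
        filename="pinchoff_metrics.json",
        perform_pca=False,
        scale_pc=False,
    )
    # Key name assumed from the metric names listed above.
    print(summary["accuracy_score"])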