pe.histogram package
Subpackages
Submodules
pe.histogram.histogram module
- class pe.histogram.histogram.Histogram[source]
Bases: ABC
The abstract class for computing the histogram over synthetic samples. The histogram values indicate how good each synthetic sample is in terms of its closeness to the private data.
- abstract compute_histogram(priv_data, syn_data)[source]
Compute the histogram over the synthetic data using the private data.
- Parameters:
priv_data (pe.data.data.Data) – The private data
syn_data (pe.data.data.Data) – The synthetic data
pe.histogram.nearest_neighbors module
- class pe.histogram.nearest_neighbors.NearestNeighbors(embedding, mode, lookahead_degree, lookahead_log_folder=None, voting_details_log_folder=None, api=None, num_nearest_neighbors=1, backend='auto')[source]
Bases: Histogram
Compute the nearest neighbors histogram. Each private sample votes for its num_nearest_neighbors closest synthetic samples to construct the histogram. The votes from each private sample are scaled so that their l2 norm is 1.
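The voting scheme described above can be sketched with plain NumPy. This is an illustration only, not the library's implementation: the function name nearest_neighbor_histogram and its arguments are hypothetical, the embeddings are assumed to be already computed, only l2 distance is handled, and each of the k votes is assumed to carry equal weight 1/sqrt(k) so that the vote vector of one private sample has l2 norm 1.

```python
import numpy as np


def nearest_neighbor_histogram(priv_emb, syn_emb, k=1):
    """Toy l2 nearest-neighbor voting (hypothetical helper, not the pe API)."""
    priv_emb = np.asarray(priv_emb, dtype=float)
    syn_emb = np.asarray(syn_emb, dtype=float)
    # Pairwise squared l2 distances, shape (num_private, num_synthetic).
    diff = priv_emb[:, None, :] - syn_emb[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Indices of the k closest synthetic samples for each private sample.
    ids = np.argsort(dist, axis=1)[:, :k]
    # Assumption: k equal votes of weight 1/sqrt(k) per private sample,
    # which gives each private sample's vote vector an l2 norm of 1.
    histogram = np.zeros(syn_emb.shape[0])
    np.add.at(histogram, ids.ravel(), 1.0 / np.sqrt(k))
    return histogram, ids
```

For example, with private embeddings [[0, 0], [1, 1], [0.9, 1.1]], synthetic embeddings [[0, 0], [1, 1], [5, 5]] and k=1, the first synthetic sample receives one vote, the second receives two, and the third receives none.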
- __init__(embedding, mode, lookahead_degree, lookahead_log_folder=None, voting_details_log_folder=None, api=None, num_nearest_neighbors=1, backend='auto')[source]
Constructor.
- Parameters:
embedding (pe.embedding.embedding.Embedding) – The pe.embedding.embedding.Embedding object to compute the embedding of samples
mode (str) – The distance metric to use for finding the nearest neighbors. It should be one of the following: “l2” (l2 distance), “cos_sim” (cosine similarity), “ip” (inner product). Not all backends support all modes
lookahead_degree (int) – The degree of lookahead to compute the embedding of synthetic samples. If it is 0, the original embedding is used. If it is greater than 0, the embedding of the synthetic samples is computed by calling the variation API lookahead_degree times and averaging the embeddings of the generated samples
lookahead_log_folder (str, optional) – The folder to save the logs of the lookahead. If it is None, the logs are not saved. Defaults to None
voting_details_log_folder (str, optional) – The folder to save the logs of the voting details. If it is None, the logs are not saved. Defaults to None
api (pe.api.api.API, optional) – The API to generate synthetic samples. It should be provided when lookahead_degree is greater than 0. Defaults to None
num_nearest_neighbors (int, optional) – The number of nearest neighbors to consider for each private sample. Defaults to 1
backend (str, optional) – The backend to use for finding the nearest neighbors. It should be one of the following: “faiss” (FAISS), “sklearn” (scikit-learn), “auto” (using FAISS if available, otherwise scikit-learn). Defaults to “auto”. FAISS supports GPU and is much faster when the number of synthetic samples and/or private samples is large. It requires the faiss-gpu or faiss-cpu package to be installed. See https://faiss.ai/
- Raises:
ValueError – If the api is not provided when lookahead_degree is greater than 0
ValueError – If the backend is unknown
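The documented behavior of the backend argument ("auto" prefers FAISS when it is installed, and unknown values raise ValueError) can be sketched as follows. The helper resolve_backend is hypothetical and only mirrors the description above; how the library itself detects FAISS is not stated in this reference, so the importlib check is an assumption.

```python
import importlib.util


def resolve_backend(backend="auto"):
    """Hypothetical helper mirroring the documented `backend` options."""
    if backend == "auto":
        # Assumption: FAISS counts as available when the `faiss` module
        # can be found on the import path.
        if importlib.util.find_spec("faiss") is not None:
            return "faiss"
        return "sklearn"
    if backend in ("faiss", "sklearn"):
        return backend
    # Matches the documented ValueError for an unknown backend.
    raise ValueError(f"Unknown backend: {backend}")
```

Resolving the choice once, up front, keeps the later nearest-neighbor search free of repeated availability checks.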
- _compute_lookahead_embedding(syn_data)[source]
Compute the embedding of synthetic samples with lookahead.
- Parameters:
syn_data (pe.data.data.Data) – The synthetic data
- Returns:
The synthetic data with the computed embedding in the column pe.constant.data.LOOKAHEAD_EMBEDDING_COLUMN_NAME
- Return type:
- _log_lookahead(syn_data, lookahead_id)[source]
Log the lookahead data.
- Parameters:
syn_data (pe.data.data.Data) – The lookahead data
lookahead_id (int) – The ID of the lookahead
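The lookahead embedding controlled by lookahead_degree can be illustrated with a small NumPy sketch. Everything here is a stand-in: lookahead_embedding, variation_fn and embed_fn are hypothetical names, with variation_fn playing the role of the variation API and embed_fn the role of the embedding object. The library operates on pe.data.data.Data objects rather than raw arrays.

```python
import numpy as np


def lookahead_embedding(samples, variation_fn, embed_fn, lookahead_degree):
    """Average the embeddings of `lookahead_degree` variations (toy sketch)."""
    samples = np.asarray(samples, dtype=float)
    if lookahead_degree == 0:
        # Degree 0: use the embedding of the original samples.
        return embed_fn(samples)
    # Call the (stand-in) variation API once per lookahead round and
    # embed each batch of variations.
    embeddings = [
        embed_fn(variation_fn(samples)) for _ in range(lookahead_degree)
    ]
    # Element-wise mean over the rounds, one embedding per sample.
    return np.mean(embeddings, axis=0)
```

In practice variation_fn would be stochastic, so averaging over several rounds smooths out the randomness of any single variation.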
- _log_voting_details(priv_data, syn_data, ids)[source]
Log the voting details.
- Parameters:
priv_data (pe.data.data.Data) – The private data
syn_data (pe.data.data.Data) – The synthetic data
ids (np.ndarray) – The IDs of the nearest neighbors for each private sample
- compute_histogram(priv_data, syn_data)[source]
Compute the nearest neighbors histogram.
- Parameters:
priv_data (pe.data.data.Data) – The private data
syn_data (pe.data.data.Data) – The synthetic data
- Returns:
The private data, possibly with the additional embedding column, and the synthetic data, with the computed histogram in the column pe.constant.data.CLEAN_HISTOGRAM_COLUMN_NAME and possibly with the additional embedding column
- Return type:
tuple[pe.data.data.Data, pe.data.data.Data]
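Which synthetic sample counts as "nearest" depends on the mode passed to the constructor. The sketch below shows, under stated assumptions, how the three documented modes can rank the same candidates differently. The function nearest_ids is hypothetical and uses brute-force NumPy rather than FAISS or scikit-learn; for "ip" and "cos_sim" it assumes a larger value means a closer neighbor.

```python
import numpy as np


def nearest_ids(priv_emb, syn_emb, mode="l2"):
    """Index of the closest synthetic sample per private sample (toy sketch)."""
    priv_emb = np.asarray(priv_emb, dtype=float)
    syn_emb = np.asarray(syn_emb, dtype=float)
    if mode == "l2":
        # Smaller distance is closer, so negate to get "higher is closer".
        diff = priv_emb[:, None, :] - syn_emb[None, :, :]
        score = -np.linalg.norm(diff, axis=-1)
    elif mode == "ip":
        # Raw inner product; sensitive to vector length.
        score = priv_emb @ syn_emb.T
    elif mode == "cos_sim":
        # Inner product of unit-normalized vectors; ignores vector length.
        p = priv_emb / np.linalg.norm(priv_emb, axis=1, keepdims=True)
        s = syn_emb / np.linalg.norm(syn_emb, axis=1, keepdims=True)
        score = p @ s.T
    else:
        raise ValueError(f"Unknown mode: {mode}")
    return np.argmax(score, axis=1)
```

With the private embedding [[1, 0]] and synthetic embeddings [[0.9, 0.5], [10, 10], [3, 0]], "l2" selects index 0, "ip" selects index 1 (the longest vector), and "cos_sim" selects index 2 (the same direction as the query).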