This function runs the NCLUSION method.

run_nclusion(datafilename1, KMax, alpha1, gamma1,
    seed, elbo_ep,
    dataset,
    outdir, logger=nothing,
    num_iter=150, save_metrics=false)

Arguments

datafilename1

Absolute path to a preprocessed AnnData object containing scRNA-seq expression values (cells x genes). AnnData.obs_names must contain the unique cell IDs, AnnData.var_names the unique gene identifiers, AnnData.X the expression matrix, and AnnData.obs["condition"] the condition label for each cell. Optional: if the dataset has been pre-annotated, AnnData.obs["cell_type"] may contain these annotations. For more detail on the required input data structure, see Get started.

KMax

Maximum number of clusters NCLUSION will be initialized with

alpha1

Second-level concentration parameter (smaller values give fewer clusters)

gamma1

Top-level concentration parameter (smaller values give fewer clusters)

seed

Random seed

elbo_ep

Minimum tolerance for the change in ELBO values between iterations. This determines whether convergence has been reached.

dataset

Name of the data set being analyzed (for output naming purposes)

outdir

Absolute path to where NCLUSION's output will be written

logger

Logging object that writes the algorithm's progress to stdout [Default: nothing]

num_iter

Maximum number of iterations the variational inference algorithm runs before stopping if convergence is not reached [Default: 150]

save_metrics

Boolean that indicates whether or not NCLUSION should calculate label-dependent clustering metrics (Note: Requires reference labels to be provided in the AnnData.obs["cell_type"] of the input data object) [Default: false]
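
The way elbo_ep and num_iter interact can be illustrated with a generic stopping rule. The sketch below is not NCLUSION's internal code: the function name run_until_converged, the stand-in update_elbo, and the use of an absolute (rather than relative) change in ELBO are all assumptions made for illustration.

```julia
# Illustrative only: a generic ELBO-based stopping rule, not NCLUSION's internals.
# `update_elbo` is a hypothetical stand-in for one variational inference update
# that returns the ELBO after iteration `iter`.
function run_until_converged(update_elbo; elbo_ep=1e-6, num_iter=150)
    prev_elbo = -Inf
    for iter in 1:num_iter
        elbo = update_elbo(iter)
        # Assumed criterion: absolute change in ELBO falls below the tolerance.
        if abs(elbo - prev_elbo) < elbo_ep
            return (converged=true, iterations=iter, elbo=elbo)
        end
        prev_elbo = elbo
    end
    # The iteration cap was hit before the tolerance was met.
    return (converged=false, iterations=num_iter, elbo=prev_elbo)
end

# Toy ELBO trajectory approaching 0 from below: -1, -1/2, -1/4, ...
toy_elbo(iter) = -1 / 2.0^(iter - 1)

result = run_until_converged(toy_elbo; elbo_ep=1e-3, num_iter=150)
println(result)
```

A looser elbo_ep stops earlier, while a small num_iter can end the run before the tolerance is met, in which case the sketch reports converged=false.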

Value

Dictionary of all estimated parameters, including PIPs (yjk_) and cluster probability (rtik_)
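
A call following the signature above might look like the sketch below. It assumes run_nclusion is already in scope (for example, after loading the NCLUSION Julia package). The wrapper run_example, the file paths, the .h5ad extension, and every numeric value are hypothetical placeholders, not recommended settings. Only the eight required arguments are passed, so logger, num_iter, and save_metrics keep their defaults.

```julia
# Sketch of a call; assumes `run_nclusion` is already in scope.
# All values below are hypothetical placeholders, not recommended settings.
params = (
    datafilename1 = "/path/to/preprocessed_data.h5ad",  # hypothetical input path
    KMax = 25,
    alpha1 = 1e-7,
    gamma1 = 1e-7,
    seed = 12345,
    elbo_ep = 1e-6,
    dataset = "my_dataset",                              # used for output naming
    outdir = "/path/to/output_dir",                      # hypothetical output path
)

# Hypothetical wrapper: runs NCLUSION with the required arguments only.
function run_example(p)
    results = run_nclusion(p.datafilename1, p.KMax, p.alpha1, p.gamma1,
        p.seed, p.elbo_ep,
        p.dataset,
        p.outdir)
    # The return value is documented as a dictionary; listing its keys shows
    # which estimated parameters are available.
    println(collect(keys(results)))
    return results
end
```

Calling run_example(params) starts the run and prints the keys of the returned dictionary.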

Saved Output Files

After NCLUSION runs, the following output files are saved to outputs/EXPERIMENT_nclusion_{DATA-NAME}/DATASET_{DATA-NAME}_{NUM-GENES}HVGs-{NUM-CELLS}N/{YEAR-MONTH-DATE-TIME}/, a directory that NCLUSION creates automatically inside the specified output directory, outdir.

NOTE: {DATA-NAME} corresponds to the dataset input variable. {NUM-GENES} corresponds to the number of genes included in the input AnnData object. {NUM-CELLS} corresponds to the number of cells included in the input AnnData object. {YEAR-MONTH-DATE-TIME} corresponds to the time-stamp at which the outputs were generated (calculated automatically by NCLUSION).

  1. {DATA-NAME}_{NUM-GENES}HVGs-{NUM-CELLS}N_nclusion-{YEAR-MONTH-DATE-TIME}.csv
    • .csv file that contains the NCLUSION clustering assignments, where each row displays relevant metadata and clustering results for each cell.
    • The 'condition' column displays the cell's experimental condition label (all cells in this study have the same condition label).
    • The 'cell_id' column displays the cell's unique identifier label.
    • The 'inferred_label' column displays the cell's NCLUSION cluster assignment.
    NOTE: the following columns are only filled if alternative cell-type annotations are provided in the AnnData.obs['cell_type'] layer of the input data. Otherwise they remain empty:
    • The 'cell_type' column displays the cell's alternative cell-type annotation obtained from a different study.
    • The 'called_label' column displays the numerical equivalent of the 'cell_type' label, mapped automatically by NCLUSION.
  2. {NUM-GENES}G-{YEAR-MONTH-DATE-TIME}-pips.csv
    • .csv file that contains the Posterior Inclusion Probabilities (PIPs) of genes across the clusters identified by NCLUSION. PIPs summarize the evidence that a gene drives the identity of a phenotypic cluster. For a particular cluster, a higher PIP indicates that a gene is more significant to the formation of that cluster.
  3. _QuickSummary_{YEAR-MONTH-DATE-TIME}.txt
    • .txt file that contains the input parameters that were used to run NCLUSION.
  4. output.jld2
    • .jld2 file that contains all model parameters.
  5. {YEAR-MONTH-DATE-TIME}-Nk.csv
    • .csv file that contains the estimated number of cells in each cluster identified by NCLUSION.
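
As a rough sketch of how the saved outputs might be located and inspected, the helpers below compose the documented directory pattern and count cells per value of the 'inferred_label' column in the clustering assignments file. The names nclusion_run_dir and count_cells_per_cluster are hypothetical, the timestamp is passed in as a string because its exact format is generated by NCLUSION, and the parser assumes a plain comma-separated file with a header row and no quoted commas. The mock rows exist only so the sketch runs without real output.

```julia
# Hypothetical helper: composes the documented run-directory pattern.
function nclusion_run_dir(outdir, dataset, num_genes, num_cells, timestamp)
    return joinpath(outdir, "outputs",
        "EXPERIMENT_nclusion_$(dataset)",
        "DATASET_$(dataset)_$(num_genes)HVGs-$(num_cells)N",
        timestamp)
end

# Hypothetical helper: counts cells per cluster from the assignments .csv.
# Assumes a header row and no quoted commas inside fields.
function count_cells_per_cluster(csv_path)
    lines = readlines(csv_path)
    header = split(lines[1], ',')
    col = findfirst(==("inferred_label"), header)
    col === nothing && error("no 'inferred_label' column found")
    counts = Dict{String,Int}()
    for line in lines[2:end]
        isempty(strip(line)) && continue  # skip blank lines
        label = String(split(line, ',')[col])
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# Tiny mock file (made-up rows) standing in for a real assignments file.
mock_path = joinpath(mktempdir(), "mock_nclusion.csv")
write(mock_path, """
cell_id,condition,inferred_label,cell_type,called_label
cellA,ctrl,1,,
cellB,ctrl,2,,
cellC,ctrl,1,,
""")

counts = count_cells_per_cluster(mock_path)
println(counts)
```

The resulting counts can be compared against the Nk.csv file, keeping in mind that the estimated cluster sizes saved by NCLUSION need not match hard-assignment counts exactly.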