Getting Started with NCLUSION using Simulated Data

First, install NCLUSION and its dependencies. You can find installation instructions here. If you have any questions or concerns regarding installation, please submit an issue.

To get started withNCLUSION, the input data must be in the AnnData format. Your expression data must be an n_obs x n_vars matrix, where n_obs is the number of cells and n_vars is the number of genes in the dataset, and should be stored in the AnnData.X layer. The unique cell ids must be stored in the AnnData.obs_names layer, while the gene names must be stored in the AnnData.var_names layer, with both in the form of a one-dimensional array of strings. They serve as index annotations for the rows and columns of the expression matrix, respectively. Each cell in the expression data must also contain 'condition' labels in the AnnData.obs['condition'] observation layer. This is a one-dimensional array of categorical labels that maps each cell to a timepoint or experimental condition. If cell conditions are uniform, annotate all cells with the same 'condition' label (i.e. "1"). If your dataset contains cell-type annotations and you wish to compare them to the NCLUSION-inferred labels, you can provide them in the AnnData.obs["cell_type"] layer as a pandas.Categorical object. If provided, NCLUSION will calculate various extrinsic evaluation metrics, as described in the Reference documentation. However, it is not required to provide manually curated cell-type annotations to run NCLUSION. If your dataset is unannotated, leave the AnnData.obs["cell_type"] layer empty.

We provide a simluated dataset as a .csv file because we find it useful for explaining how to set up your data as an AnnData object and use NCLUSION. This data set can be downloaded from here. Alternatively, enter the following wget command into a bash terminal to download the simulated data directory directly onto your machine and unzip it:

wget https://microsoft.github.io/nclusion/simulated_data.zip
unzip simulated_data.zip

For this next step, be sure to set up a conda virtual environment activated with the following packages installed:

For more information on how to install Anaconda and set up a virtual environment, click here. The simulated data can be loaded in as a pandas dataframe, and can be converted into the proper AnnData format described above by entering the following commands into a Python terminal with your virtual environment activated:

import pandas as pd
import scanpy as sc
import anndata as ad 
import numpy as np

expression_matrix = pd.read_csv("simulated_data/simulated_data.csv")

adata = ad.AnnData(expression_matrix, dtype=np.float32)

condition = np.ones(shape=adata.n_obs)
adata.obs['condition'] = pd.Categorical(condition)

cell_type_labels = pd.read_csv("simulated_data/simulated_labels.csv")['z1']
cell_type_labels.index = cell_type_labels.index.map(str)
adata.obs["cell_type"] = pd.Categorical(cell_type_labels)

adata.write_h5ad('simulated_data/simulated_data.h5ad')

print('Expression Matrix: ', adata.X)
print('Cell IDs: ', adata.obs_names)
print('Gene Names: ', adata.var_names)
print('Condition Labels: ', adata.obs.condition)
print('Manually-curated Cell Type Labels: ', adata.obs.cell_type)

Real data should be filtered to include only high-quality cells, and preprocessed. We recommend that you normalize the expression values such that the total counts per cell is uniform, and log-transformed. However, this is not necessary for the simulated dataset.

Running NCLUSION

To run NCLUSION, start a julia REPL in your nclusion environment by entering julia --project=/PATH/TO/JULIA/ENV --thread=auto, with the --project input being set to the path that points to your Julia nclusion environment. Then, enter the following code into the activated Julia REPL. NOTE: this code can also be pasted into a Julia script (i.e. nclusion.jl) and executed by entering julia --project=PATH/TO/JULIA/ENV nclusion.jl into a bash terminal.

ENV["GKSwstype"] = "100"
curr_dir = ENV["PWD"]

using nclusion

logger = FormatLogger() do io, args
    println(io, args._module, " | ", "[", args.level, "] ", args.message)
end;


datafilename1 = "simulated_data/simulated_data.h5ad" 
alpha1 = 1 * 10^(-7.0)
gamma1 = 1 * 10^(-7.0)
KMax = 25
seed = 12345
elbo_ep = 10^(-0.0)
dataset = "simulated_data"
outdir = curr_dir
num_iter = 150
calc_metrics = true

outputs_dict = run_nclusion(datafilename1,KMax,alpha1,gamma1,seed,elbo_ep,dataset,outdir; logger = logger, num_iter=num_iter,save_metrics=calc_metrics)
filepath = outputs_dict[:filepath]
filename = "$filepath/output.jld2"

jldsave(filename,true;outputs_dict=outputs_dict)

If you ran NCLUSION in a Julia REPL, exit the REPL by entering exit(). NCLUSION clustering assignments can be found in a file named outputs/EXPERIMENT_nclusion_simulated_data/DATASET_simulated_data_5HVGs-1800N/YEAR_MONTH_DATE_TIME/simulated_data_5HVGs-1800N_nclusion-YEAR-MONTH-DATE-TIME.csv where YEAR-MONTH-DATE-TIME corresponds to the time stamp of when the results were generated (produced automatically by NCLUSION). Run the following command in your terminal to compare your output to the expected output. Replace the PATH/TO/NCLUSION/OUTPUTS variable with the path to the simulated data clustering assignments you just generated.

cmp --silent PATH/TO/NCLUSION/OUTPUTS simulated_data/simulated_data_expected_output.csv && echo '### SUCCESS: Files Are Identical! ###'|| echo '### WARNING: Files Are Different! ###'

If NCLUSION ran successfully, the command will print '### SUCCESS: Files Are Identical! ###'. Otherwise, it will print '### WARNING: Files Are Different! ###'