First, install NCLUSION and its dependencies. You can find installation instructions here. If you have any questions or concerns regarding installation, please submit an issue.
To get started with NCLUSION, the input data must be in the AnnData format. Your expression data must be an n_obs x n_vars matrix, where n_obs is the number of cells and n_vars is the number of genes in the dataset, and should be stored in AnnData.X. The unique cell IDs must be stored in AnnData.obs_names, while the gene names must be stored in AnnData.var_names, with both in the form of a one-dimensional array of strings. They serve as index annotations for the rows and columns of the expression matrix, respectively. Each cell in the expression data must also carry a 'condition' label in the AnnData.obs['condition'] column. This is a one-dimensional array of categorical labels that maps each cell to a timepoint or experimental condition. If cell conditions are uniform, annotate all cells with the same 'condition' label (e.g. "1").
If your dataset contains cell-type annotations and you wish to compare them to the NCLUSION-inferred labels, you can provide them in the AnnData.obs["cell_type"] column as a pandas.Categorical object. If provided, NCLUSION will calculate various extrinsic evaluation metrics, as described in the Reference documentation. Manually curated cell-type annotations are not required to run NCLUSION: if your dataset is unannotated, leave the AnnData.obs["cell_type"] column empty.
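The layout requirements above can be checked before handing data to NCLUSION. The following is a minimal sketch of such a check; `check_nclusion_input` is a hypothetical helper and is not part of NCLUSION. It only reads attributes that AnnData objects expose (X, obs_names, var_names, obs), and the demonstration uses a small stand-in object so that it runs without anndata installed.

```python
from types import SimpleNamespace


def check_nclusion_input(adata):
    """Return a list of layout problems; an empty list means none were found.

    Hypothetical helper (not part of NCLUSION). Works on any object that has
    the AnnData attributes X, obs_names, var_names, and obs.
    """
    problems = []
    n_obs, n_vars = adata.X.shape

    # Cell IDs: one unique string per row of the expression matrix.
    cell_ids = list(adata.obs_names)
    if len(cell_ids) != n_obs:
        problems.append("obs_names length does not match the rows of X")
    if len(set(cell_ids)) != len(cell_ids):
        problems.append("obs_names contains duplicate cell IDs")
    if not all(isinstance(c, str) for c in cell_ids):
        problems.append("obs_names contains non-string cell IDs")

    # Gene names: one string per column of the expression matrix.
    genes = list(adata.var_names)
    if len(genes) != n_vars:
        problems.append("var_names length does not match the columns of X")

    # Every cell needs a 'condition' label.
    if "condition" not in adata.obs:
        problems.append("obs['condition'] is missing")
    elif len(adata.obs["condition"]) != n_obs:
        problems.append("obs['condition'] does not have one label per cell")

    return problems


# Stand-in with the same attribute names as AnnData (illustration only).
toy = SimpleNamespace(
    X=SimpleNamespace(shape=(3, 2)),
    obs_names=["cell0", "cell1", "cell2"],
    var_names=["geneA", "geneB"],
    obs={"condition": ["1", "1", "1"]},
)
problems = check_nclusion_input(toy)
print(problems)  # → []
```

With a real AnnData object, the same call would be `check_nclusion_input(adata)`. The optional "cell_type" column is deliberately not checked, since NCLUSION does not require it.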
We provide a simulated dataset as a .csv file because we find it useful for explaining how to set up your data as an AnnData object and use NCLUSION. This dataset can be downloaded from here. Alternatively, enter the following wget commands into a bash terminal to download the simulated data directory directly onto your machine and unzip it:
wget https://microsoft.github.io/nclusion/simulated_data.zip
unzip simulated_data.zip
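If wget is not available on your machine, the same download-and-unzip step can be done from Python. The following is a sketch using only the standard library; `download_and_unzip` is a hypothetical helper name, and the archive is assumed to be an ordinary zip file.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, dest_dir="."):
    """Download the zip archive at `url` into `dest_dir` and extract it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last component of the URL.
    archive = dest / Path(url).name
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return archive
```

Calling download_and_unzip("https://microsoft.github.io/nclusion/simulated_data.zip") from the directory where you plan to work should leave the same files in place as the two shell commands above.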
For this next step, be sure to activate a conda virtual environment that has the packages imported below (pandas, scanpy, anndata, and numpy) installed.
For more information on how to install Anaconda and set up a virtual environment, click here. The simulated data can be loaded as a pandas DataFrame and converted into the AnnData format described above by entering the following commands into a Python terminal with your virtual environment activated:
import pandas as pd
import scanpy as sc
import anndata as ad
import numpy as np
# Load the simulated expression matrix (rows are cells, columns are genes).
expression_matrix = pd.read_csv("simulated_data/simulated_data.csv")
adata = ad.AnnData(expression_matrix, dtype=np.float32)
# All simulated cells share one condition, so every cell gets the same label.
condition = np.ones(shape=adata.n_obs)
adata.obs['condition'] = pd.Categorical(condition)
# Attach the known cell-type labels so they can be compared to the NCLUSION-inferred labels.
cell_type_labels = pd.read_csv("simulated_data/simulated_labels.csv")['z1']
cell_type_labels.index = cell_type_labels.index.map(str)
adata.obs["cell_type"] = pd.Categorical(cell_type_labels)
# Save the AnnData object to disk; this is the file passed to NCLUSION.
adata.write_h5ad('simulated_data/simulated_data.h5ad')
# Inspect the fields described above.
print('Expression Matrix: ', adata.X)
print('Cell IDs: ', adata.obs_names)
print('Gene Names: ', adata.var_names)
print('Condition Labels: ', adata.obs.condition)
print('Manually-curated Cell Type Labels: ', adata.obs.cell_type)
Real data should be filtered to include only high-quality cells and then preprocessed. We recommend normalizing the expression values so that the total count per cell is uniform, and then log-transforming them. However, this is not necessary for the simulated dataset.
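To make the recommended normalization concrete, here is a small NumPy sketch of the two steps on a dense count matrix. The function name `normalize_and_log` and the default target of 10,000 counts per cell are assumptions made for illustration, not values prescribed by NCLUSION. On an AnnData object, scanpy's `sc.pp.normalize_total` and `sc.pp.log1p` perform the corresponding operations. Cell filtering is not shown because suitable thresholds depend on the dataset.

```python
import numpy as np


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    # Leave all-zero cells unchanged instead of dividing by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)


# Two toy cells with the same gene proportions but different sequencing depth.
raw = np.array([[1, 3, 0], [10, 30, 0]])
norm = normalize_and_log(raw)
# After normalization the two cells are indistinguishable.
print(np.allclose(norm[0], norm[1]))  # → True
```

The toy example shows why the step matters: the two cells differ only in depth, and normalization removes that difference before clustering.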
To run NCLUSION, start a Julia REPL in your nclusion environment by entering the following command, with the --project input set to the path that points to your Julia nclusion environment:
julia --project=/PATH/TO/JULIA/ENV --threads=auto
NOTE: the Julia code below can also be pasted into a Julia script (e.g. nclusion.jl) and executed from a bash terminal by entering:
julia --project=PATH/TO/JULIA/ENV nclusion.jl
Otherwise, enter the following code into the activated Julia REPL:
ENV["GKSwstype"] = "100"
curr_dir = ENV["PWD"]
using nclusion
logger = FormatLogger() do io, args
println(io, args._module, " | ", "[", args.level, "] ", args.message)
end;
datafilename1 = "simulated_data/simulated_data.h5ad"
alpha1 = 1 * 10^(-7.0)
gamma1 = 1 * 10^(-7.0)
KMax = 25
seed = 12345
elbo_ep = 10^(-0.0)
dataset = "simulated_data"
outdir = curr_dir
num_iter = 150
calc_metrics = true
outputs_dict = run_nclusion(datafilename1,KMax,alpha1,gamma1,seed,elbo_ep,dataset,outdir; logger = logger, num_iter=num_iter,save_metrics=calc_metrics)
filepath = outputs_dict[:filepath]
filename = "$filepath/output.jld2"
jldsave(filename,true;outputs_dict=outputs_dict)
If you ran NCLUSION in a Julia REPL, exit the REPL by entering exit(). NCLUSION clustering assignments can be found in a file named
outputs/EXPERIMENT_nclusion_simulated_data/DATASET_simulated_data_5HVGs-1800N/YEAR_MONTH_DATE_TIME/simulated_data_5HVGs-1800N_nclusion-YEAR-MONTH-DATE-TIME.csv
where YEAR-MONTH-DATE-TIME corresponds to the time stamp of when the results were generated (produced automatically by NCLUSION). Run the following command in your terminal to compare your output to the expected output. Replace PATH/TO/NCLUSION/OUTPUTS with the path to the simulated data clustering assignments you just generated.
cmp --silent PATH/TO/NCLUSION/OUTPUTS simulated_data/simulated_data_expected_output.csv && echo '### SUCCESS: Files Are Identical! ###'|| echo '### WARNING: Files Are Different! ###'
If NCLUSION ran successfully, the command will print '### SUCCESS: Files Are Identical! ###'. Otherwise, it will print '### WARNING: Files Are Different! ###'.
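On systems without the cmp utility, the same byte-for-byte comparison can be sketched in Python with the standard library. `report_match` is a hypothetical helper name, and the temporary files below contain made-up toy content purely to demonstrate both outcomes.

```python
import filecmp
import tempfile
from pathlib import Path


def report_match(output_csv, expected_csv):
    """Compare two files byte for byte and return the matching status message."""
    # shallow=False forces a comparison of file contents, not just metadata.
    if filecmp.cmp(output_csv, expected_csv, shallow=False):
        return "### SUCCESS: Files Are Identical! ###"
    return "### WARNING: Files Are Different! ###"


# Self-contained demonstration on temporary files with toy content.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    c = Path(tmp) / "c.csv"
    a.write_text("x,y\n1,2\n")
    b.write_text("x,y\n1,2\n")
    c.write_text("x,y\n1,3\n")
    same_message = report_match(a, b)
    diff_message = report_match(a, c)

print(same_message)
print(diff_message)
```

For the simulated data, the call would be report_match("PATH/TO/NCLUSION/OUTPUTS", "simulated_data/simulated_data_expected_output.csv"), with the first argument replaced by the path to your own clustering assignments.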