NCLUSION package documentation, examples, and tutorials, can be found here: NCLUSION Documentation and Tutorials.
Clustering is commonly used in single-cell RNA-sequencing (scRNA-seq) pipelines to characterize cellular heterogeneity. However, current methods face two main limitations. First, they require user-specified heuristics which add time and complexity to bioinformatic workflows; and, second, they rely on post-selective differential expression analyses to identify marker genes driving cluster differences which has been shown to be subject to inflated false discovery rates. In this repository, we present a solution to those challenges by introducing "nonparametric clustering of single-cell populations" (NCLUSION): an infinite mixture model that leverages Bayesian sparse priors to simultaneously identify marker genes while performing clustering on single-cell expression data. NCLUSION uses a scalable variational inference algorithm to perform these analyses on datasets with up to millions of cells. In this package and corresponding documentation we demonstrate that NCLUSION (i) matches the performance of other state-of-the-art clustering techniques with significantly reduced runtime and (ii) provides statistically robust and biologically-relevant transcriptomic signatures for each of the clusters it identifies. Overall, NCLUSION represents a reliable hypothesis generating tool for understanding patterns of expression variation present in single-cell populations.
NCLUSION uses a sparse hierarchical Dirichlet process normal mixture model to reduce the number of choices that users need to make while simultaneously performing variable selection to identify top cluster-specific marker genes for downstream analyses. There are three key components to NCLUSION. First, NCLUSION is fit directly on single-cell expression matrices and does not require the data to be embedded within a lower-dimensional space. Second, it assumes that cells can belong to one of infinitely many different clusters. This is captured by placing a Dirichlet process prior over the cluster assignment probabilities for each cell. By always setting the number of possible clusters K = ∞, we remove the need for users to have to iterate through different values until they find the optimal choice. Instead, NCLUSION uses a stick-breaking proccess to naturally allow for more clusters as the amount of variation in a single cell dataset grows. Third, NCLUSION assumes that not all genes are important when assigning cells to a given cluster. To model this, we place a spike and slab prior on the mean expression of each gene within each cluster. This prior shrinks the mean of genes that play an insignificant role in cluster formation towards zero. The learning of the model parameters is done via a variational-EM algorithm that allows the method to scale with the number of cells in the data.
This package requires that Julia 1.6+ and its dependencies are installed. Instructions for installation can be found here.
To create a new Julia project environment and install the NCLUSION package, follow these steps:
julia
in a bash terminal while in
your working directory.]
.
Create a new Julia project environment by typing generate
followed by the name of your project (i.e. generate nclusion_demo
). This creates a project
directory named after the environment you created, and two files: Project.toml
and Manifest.toml
.
These files, which can be found in the project
directory, specify the packages that are downloaded in the environment.activate
followed by the name of the Julia environment (i.e. activate
nclusion_demo
).add nclusion
You can exit the Julia Pkg REPL by typing Ctrl + C
, and
the Julia REPL by entering exit()
.To install NCLUSION into an existing Julia project environment, follow these steps:
julia --project=/path/to/julia/env --thread=auto
into a bash terminal, where --project
is set equal to the path to your existing Julia project environment.using Pkg;Pkg.add("nclusion")
You can exit the Julia Pkg REPL by typing Ctrl + C
, and
the Julia REPL by entering exit()
.Alternatively, you can install NCLUSION directly using git by following these steps:
git clone git@github.com:microsoft/nclusion.git
cd nclusion/
julia
]
. This command starts the Pkg REPL. In this Pkg REPL, type the following Julia command: activate .
This activates the nclusion environment. You can verify that the environment is active if the new Pkg REPL prompt is (nclusion) pkg>
.
instantiate
precompile
This completes the installation process. You can exit the Julia Pkg REPL by typing Ctrl + C
, and the Julia REPL by entering exit()
.C. Nwizu, M. Hughes, M. Ramseier, A. Navia, A. Shalek, N. Fusi, S. Raghavan, P. Winter, A. Amini, and L. Crawford. Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data. bioRxiv. https://doi.org/10.1101/2024.02.11.579839
If you have any questions or concerns regarding NCLUSION or the tutorials, please contact Chibuikem Nwizu, Lorin Crawford, Ava Amini, or Madeline Hughes. All feedback on the repository, manuscript, and tutorials is appreciated.