Constructing differential co-expression gene programs

SSpMosaic generate program

This tutorial demonstrates the process of constructing differential co-expression gene programs.

Loading package

library(SSpMosaic)

Set the working directory

Users can customize this directory by specifying an appropriate path as needed.

setwd('./')
current_dir <- getwd()
result_dir <- paste0(current_dir,'/result/')
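This tutorial does not state whether generate_module creates the output directory itself. A minimal sketch, assuming it is safer to create the directory up front, is shown below. A temporary directory stands in for the project root, and the names out_root and result_dir_demo are illustrative only.

```r
# Assumption: the output directory should exist before programs are written.
# tempdir() stands in for the project root so the sketch runs anywhere.
out_root <- file.path(tempdir(), "result")

# Create the directory if it is missing; no warning if it already exists.
dir.create(out_root, recursive = TRUE, showWarnings = FALSE)

# Keep the trailing slash used by the paste0() call in this tutorial.
result_dir_demo <- paste0(out_root, "/")
```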

Loading data

The input for SSpMosaic is a Seurat object containing dimensionality reduction data (e.g., PCA results) and cell labels (e.g., from unsupervised clustering or cell type annotations). The example data required to run this tutorial can be downloaded from:

https://drive.google.com/drive/folders/12abksyeOY0xTHCo25c-jugRn_MlZzDK1

We recommend providing a predefined set of genes of interest (e.g., marker genes) for each cluster. Using this input, the algorithm will extract SSpMosaic programs accordingly. If no genes of interest are specified, the algorithm will automatically identify marker genes for each cluster and use them as candidate genes for SSpMosaic programs.

seurat_object <- readRDS(paste0(current_dir, "/data/human_brain4_preprocessed.rds"))
markers <- readRDS(paste0(current_dir, "/data/Inh_markers.rds"))
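Before generating programs, it can help to check which clusters have a gene set of interest and which would fall back to automatically identified markers. The sketch below uses toy stand-ins: the labels and gene names are hypothetical, and it only assumes that the gene list is a named list keyed by cluster label.

```r
# Toy stand-ins (hypothetical values) for the cluster labels and gene list.
cell_labels <- c("Inh.PVALB", "Inh.SST", "Inh.SST", "Inh.VIP")
markers_demo <- list(
  Inh.PVALB = c("PVALB", "GAD1"),
  Inh.SST   = c("SST", "GAD2")
)

clusters <- unique(cell_labels)

# Clusters with a user-supplied gene set.
with_genes <- intersect(clusters, names(markers_demo))

# Clusters without one; these would rely on automatic marker detection.
without_genes <- setdiff(clusters, names(markers_demo))

# List names that match no cluster label often indicate a typo.
unmatched <- setdiff(names(markers_demo), clusters)
```

With the real data, cell_labels would be the metadata column holding the cluster assignment (celltype in this tutorial) and markers_demo would be the loaded markers object.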

Generate candidate programs

The function generate_module produces several candidate programs for each cell cluster in the Seurat object. These programs are potential representatives of the cell clusters, but further screening and filtering are required to identify the final representative programs. The parameters of generate_module are as follows:

object: A Seurat object containing reduction and cell cluster information.
cluster_col: A string indicating the column name in the Seurat object metadata corresponding to the cell cluster assignment.
meta_cell: A named integer vector, with names corresponding to cell clusters. Specifies the number of meta cells in hdWGCNA for each cluster. If NULL, the default value is used (the ceiling of the number of cells in each cluster divided by 30).
max_share: A named integer vector, with names corresponding to cell clusters. Specifies the maximum number of shared cells for meta cells in hdWGCNA for each cluster. If NULL, the default value is used.
soft_power: A named integer vector, with names corresponding to cell clusters. Specifies the soft_power parameter in hdWGCNA for each cluster. If NULL, the default value is used (the lowest power achieving a 0.8 scale-free topology fit).
normalize_metacell: A named boolean vector, with names corresponding to cell clusters. Specifies whether to normalize metacells in hdWGCNA for each cluster. If NULL, the default value is used.
cluster_chosen: A character vector specifying the cell clusters to generate SSpMosaic programs. If NULL, all clusters are selected by default.
min_cell_number: A positive integer specifying the minimum number of cells required for a cell cluster.
sample_name: A string specifying the sample name.
out_dir: A string specifying the output directory path.
gene_use: A named list, with names corresponding to cell clusters, specifying the gene candidates for SSpMosaic. If NULL, marker genes are used.
log2FC_thres: A positive double specifying the log2 fold-change threshold for identifying cell cluster markers. Used only when no gene set of interest is provided for the cluster.
min.pct: A non-negative double specifying the minimum fraction of cells in the cluster that must detect a gene for it to be tested. Used only when no gene set of interest is provided for the cluster.
min_metacell: A positive integer specifying the minimum number of meta cells.
assay: A named vector, with names corresponding to cell clusters, specifying the assay used for hdWGCNA. If NULL, the default assay is used.
slot: A named vector, with names corresponding to cell clusters, specifying the slot used to extract data.
layer: A named vector, with names corresponding to cell clusters, specifying the layer used to extract data. Applicable only to Seurat v5.
verbose: A logical value indicating whether to display detailed information during program generation.
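Several of these parameters are named vectors keyed by cluster. As an illustration, the default described for meta_cell (the ceiling of the number of cells in each cluster divided by 30) can be reproduced in base R. This is a sketch on toy labels, not the package's internal code; cell_labels and meta_cell_demo are hypothetical names.

```r
# Toy cluster labels (hypothetical): 95, 300 and 31 cells in clusters A, B, C.
cell_labels <- rep(c("A", "B", "C"), times = c(95, 300, 31))

# Number of cells per cluster.
n_cells <- table(cell_labels)

# ceiling(cells in cluster / 30), as a named integer vector keyed by cluster.
meta_cell_demo <- setNames(as.integer(ceiling(n_cells / 30)), names(n_cells))
meta_cell_demo
```

A vector built this way could then be adjusted for individual clusters (for example meta_cell_demo["A"] <- 5L) before being supplied as meta_cell.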

# Normalize the metacell counts for every cluster
clusters <- unique(seurat_object$celltype)
norm <- rep(TRUE, length(clusters))
names(norm) <- clusters
# Use expression values from the 'data' layer for every cluster
layer <- rep('data', length(clusters))
names(layer) <- clusters
# Run candidate program generation
generate_module(object = seurat_object, sample_name = 'human_brain4',
                cluster_col = 'celltype', out_dir = result_dir,
                normalize_metacell = norm, min_cell_number = 100,
                gene_use = markers, layer = layer, verbose = FALSE)
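The per-cluster vectors above all follow the same pattern: one value repeated for every cluster and named by cluster. A small helper can express this once. per_cluster is a hypothetical convenience function, not part of SSpMosaic, and the cluster labels are toy values.

```r
# Hypothetical helper: repeat one value for every cluster and name the
# result by cluster label.
per_cluster <- function(value, clusters) {
  setNames(rep(value, length(clusters)), clusters)
}

# Toy cluster labels standing in for unique(seurat_object$celltype).
clusters_demo <- c("A", "B", "C")

norm_demo  <- per_cluster(TRUE, clusters_demo)
layer_demo <- per_cluster("data", clusters_demo)

# Individual clusters can still be overridden afterwards.
norm_demo["B"] <- FALSE
```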

Read candidate programs

The function get_module reads the candidate programs generated by generate_module. The parameters of get_module are as follows:

sample_name: A string specifying the sample name; it must match the sample_name parameter used in generate_module.
read_dir: A string specifying the output directory path; it must match the out_dir parameter used in generate_module.

# Read the generated candidate programs
m <- get_module(sample_name = 'human_brain4', read_dir = result_dir)

Calculating candidate program score on the dataset

The function score_module calculates the score of each candidate program generated by generate_module on the dataset. The parameters of score_module are as follows:

module: The return value of the get_module function.
sample_name: A string specifying the sample name; it must match the sample_name parameter used in generate_module.
read_dir: A string specifying the output directory path; it must match the out_dir parameter used in generate_module.
cluster_col: A string indicating the column name in the Seurat object metadata corresponding to the cell cluster assignment; it must match the cluster_col parameter used in generate_module.
nbin: An integer specifying the number of bins of aggregate expression levels for all analyzed features used in Seurat::AddModuleScore.

# Calculate the candidate program scores on the dataset
score_module(module = m, sample_name = 'human_brain4', read_dir = result_dir,
             cluster_col = 'celltype', nbin = 20)

Program selection

The function filter_module screens and filters the candidate programs, based on the scores calculated by score_module, to identify the final representative programs. The parameters of filter_module are as follows:

object: A Seurat object; it must match the object parameter used in generate_module.
module: The return value of the get_module function; it must match the module parameter used in score_module.
sample_name: A string specifying the sample name; it must match the sample_name parameter used in generate_module.
read_dir: A string specifying the output directory path; it must match the out_dir parameter used in generate_module.
cluster_col: A string indicating the column name in the Seurat object metadata corresponding to the cell cluster assignment; it must match the cluster_col parameter used in generate_module.
sd_thres: A non-negative double that specifies the threshold for the difference between two standard deviations; the default value is 0.01, or it can be set to NA to disable filtering based on this metric.
mean_thres: A non-negative double specifying the threshold for the difference between two mean values; the default value is NA, which disables filtering based on this metric.
pct_thres: A non-negative double specifying the threshold for the proportion of cells with a program score greater than zero; the default value is NA, which disables filtering based on this metric.
merge_module: A boolean specifying whether to merge modules generated from the same cluster; the default value is TRUE.
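To make the quantity behind pct_thres concrete, the sketch below computes the proportion of cells with a positive program score in each cluster, using toy values. Which cells filter_module evaluates and how it compares the proportion with the threshold are not described here, so this illustrates only the statistic, not the package's filtering rule.

```r
# Toy program scores and cluster labels (hypothetical values).
scores <- c(0.8, 0.5, -0.1, 0.3, -0.4, -0.2, 0.1, -0.6)
labels <- rep(c("target", "other"), each = 4)

# Proportion of cells with a score greater than zero, per cluster.
pct_positive <- tapply(scores > 0, labels, mean)
pct_positive
```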

# Filter the candidate programs to generate the final SSpMosaic programs for this dataset
res <- filter_module(object = seurat_object, module = m, sample_name = 'human_brain4',
                     read_dir = result_dir, cluster_col = 'celltype')

## Reading module score
## Filtering module
## [1] "human.brain4.Inh.PVALB.1"
## [1] "human.brain4.Inh.SNCG.1"
## [1] "human.brain4.Inh.SNCG.2"
## [1] "human.brain4.Inh.SNCG.3"
## [1] "human.brain4.Inh.VIP.1"
## [1] "human.brain4.Inh.LAMP5.1"
## [1] "human.brain4.Inh.LAMP5.2"
## [1] "human.brain4.Inh.LAMP5.3"
## [1] "human.brain4.Inh.LAMP5.4"
## [1] "human.brain4.Inh.LAMP5.5"
## [1] "human.brain4.Inh.SST.1"
## [1] "human.brain4.Inh.SST.2"
## [1] "human.brain4.Inh.SST.3"
## [1] "human.brain4.Inh.SST.4"
## [1] "human.brain4.Inh.CHANDELIER.1"
## [1] "human.brain4.Inh.CHANDELIER.2"
## Saving module
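The printed names appear to follow a sample.cluster.index pattern. Assuming that pattern, which is inferred from the output above rather than documented, the number of retained programs per cluster can be tallied in base R. The names are copied from the output because the structure of the returned res object is not described here.

```r
# Program names copied from the console output above.
program_names <- c(
  "human.brain4.Inh.PVALB.1",
  "human.brain4.Inh.SNCG.1", "human.brain4.Inh.SNCG.2",
  "human.brain4.Inh.SNCG.3",
  "human.brain4.Inh.VIP.1",
  "human.brain4.Inh.LAMP5.1", "human.brain4.Inh.LAMP5.2",
  "human.brain4.Inh.LAMP5.3", "human.brain4.Inh.LAMP5.4",
  "human.brain4.Inh.LAMP5.5",
  "human.brain4.Inh.SST.1", "human.brain4.Inh.SST.2",
  "human.brain4.Inh.SST.3", "human.brain4.Inh.SST.4",
  "human.brain4.Inh.CHANDELIER.1", "human.brain4.Inh.CHANDELIER.2"
)

# Assumed naming pattern: <sample>.<cluster>.<index>.
# Strip the trailing index, then the sample prefix, to recover the cluster.
cluster <- sub("\\.[0-9]+$", "", program_names)
cluster <- sub("^human\\.brain4\\.", "", cluster)

# Number of retained programs per cluster.
programs_per_cluster <- table(cluster)
programs_per_cluster
```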