In this tutorial, we employs a network propagation algorithm to assess similarities between programs across different clusters, generating a structured distance matrix that supports hierarchical clustering. This process results in the formation of metaprograms, which are higher-order representations of gene programs with weighted gene contributions.
library(SSpMosaic)
library(pheatmap)
Users can replace this directory with an appropriate path when using
setwd('./')
current_dir <- getwd()
result_dir <- paste0(current_dir,'/result/')
data_dir <- paste0(current_dir,'/data/')
The input for SSpMosaic network propagation includes the collection
of outputs obtained by running filter_module
separately on each dataset, and all resulting files
must reside in the same directory.
The example data required to run this tutorial can be downloaded from https://drive.google.com/drive/folders/12abksyeOY0xTHCo25c-jugRn_MlZzDK1.
The get_all_modules
function is used to retrieve
SSpMosaic programs from multiple datasets. The parameters of
get_all_modules
are as follows:
sample_names: A string vector including the
sample_name
parameters used in the
filter_module
function for each dataset.
read_dir: A string indicating the directory
path;note that the filter_module
outputs of multiple
datasets should be saved in the same directory.
#Specify the dataset names
samples <- c('human_brain1','human_brain2','human_brain3','human_brain4')
#Read the filtered SSpMosaic programs generated from multiple datasets
moudules <- get_all_modules(sample_names = samples,read_dir = data_dir)
Summarize the SSpMosaic programs into a data frame to serve as the
input for the gene_program_network_propagation_dist
function.
#Get the length of the longest gene set
max_length <- max(sapply(moudules, length))
#Merge the gene sets of different programs
df <- do.call(rbind, lapply(moudules, function(x) {
length(x) <- max_length
x
}))
#Convert to data.frame type
df <- as.data.frame(df, stringsAsFactors = FALSE)
The protein-protein interaction (PPI) matrix is used as a biological background network. Users can choose different biological background networks based on their research needs, such as pathway-based biological networks.
The example ppi matrix can be downloaded from https://drive.google.com/drive/folders/12abksyeOY0xTHCo25c-jugRn_MlZzDK1.
#Load ppi matrix
ppi <- readRDS(paste0(data_dir,'human_ppi_matrix.rds'))
The function gene_program_network_propagation_dist
is
used to run the network propagation algorithm and calculate the
distances between programs based on the results of network
propagation.
The parameters of gene_program_network_propagation_dist
are as follows:
gene_programs: A data.frame as generated above,including the gene set corresponding to each gene program
biological_network: A matrix where both rows and columns represent genes, and the values represent interaction intensity..
#Generate an inter-program distance matrix using network propagation
distance_matrix <- gene_program_network_propagation_dist(gene_programs = df,biological_network = ppi)
Generate SSpMosaic metaprograms using Hierarchical Clustering.
#Only keep Inh
distance_matrix <- distance_matrix[grepl('Inh',rownames(distance_matrix)),grepl('Inh',colnames(distance_matrix))]
#Drop cell types that only appear in a single sample
distance_matrix <- distance_matrix[!grepl('Inh.PAX6|Inh.ADARB2',rownames(distance_matrix)),!grepl('Inh.PAX6|Inh.ADARB2',colnames(distance_matrix))]
#Run hierarchical Clustering
p <- pheatmap::pheatmap(1-distance_matrix,fontsize_row = 7,fontsize_col = 7,cellwidth = 7,cellheight = 7)
#Cut the hierarchical clustering tree
clusters <- cutree(p$tree_row, k = 8)
Summarize the correspondence between SSpMosaic metaprograms and programs, and generate metaprogram names.
#Initialize a list to store the correspondence between SSpMosaic metaprograms and programs
meta_moudule<- list()
#Get the hierarchical clustering ids corresponding to all metaprograms
unique_clusters <- unique(clusters)
#Define function to handle one-to-many cases.
count_string_occurrences <- function(arr, target) {
count <- sum(grepl(target, arr))
return(count)
}
#Iterate over each metaprogram
for (cluster in unique_clusters) {
#Retrieve all programs contained within this metaprogram
variables <- names(clusters[clusters == cluster])
#Obtain the original clusters to which these programs belonged
extracted_chars <- gsub("^.*_", "", variables)
u <- unique(extracted_chars)
#Generate the metaprogram name based on the original clusters to which the items within the metaprogram belonged
meta_name <- "meta_program"
for(l in 1:length(u))
{
meta_name <- paste(meta_name,u[l],sep = '_')
}
#Handle one-to-many cases
if(count_string_occurrences(names(meta_moudule),meta_name) >= 1)
{
meta_name <- paste(meta_name,(count_string_occurrences(names(meta_moudule),meta_name)+1),sep = '_')
}
#Save the metaprogram into the list
meta_moudule[[meta_name]] <- variables
}
Establish the correspondence between SSpMopsaic metaprograms and genes through the correspondence between metaprograms and programs.
#Initialize a list to store the correspondence between SSpMosaic metaprograms and genes
meta_moudule_genes <- list()
#Iterate over each metaprogram
for(meta in names(meta_moudule))
{
#Iterate over each program within the metaprogram
for(program in meta_moudule[[meta]])
{
#Merge the gene set corresponding to the program into the overall gene set of the metaprogram
meta_moudule_genes[[meta]] <- c(meta_moudule_genes[[meta]],moudules[[program]])
}
}