Network propagation

In this tutorial, we employs a network propagation algorithm to assess similarities between programs across different clusters, generating a structured distance matrix that supports hierarchical clustering. This process results in the formation of metaprograms, which are higher-order representations of gene programs with weighted gene contributions.

Loading package

library(SSpMosaic)
library(pheatmap)

Setting the working directory

Users can replace this directory with an appropriate path when using

setwd('./')
current_dir <- getwd()
result_dir <- paste0(current_dir,'/result/')
data_dir <- paste0(current_dir,'/data/')

Loading SSpMosaic programs from multiple samples

The input for SSpMosaic network propagation includes the collection of outputs obtained by running filter_module separately on each dataset, and all resulting files must reside in the same directory.

The example data required to run this tutorial can be downloaded from https://drive.google.com/drive/folders/12abksyeOY0xTHCo25c-jugRn_MlZzDK1.

The get_all_modules function is used to retrieve SSpMosaic programs from multiple datasets. The parameters of get_all_modules are as follows:

sample_names: A string vector including the sample_name parameters used in the filter_module function for each dataset.
read_dir: A string indicating the directory path;note that the filter_module outputs of multiple datasets should be saved in the same directory.

#Specify the dataset names
samples <- c('human_brain1','human_brain2','human_brain3','human_brain4')
#Read the filtered SSpMosaic programs generated from multiple datasets
moudules <- get_all_modules(sample_names = samples,read_dir = data_dir)

Converting to data frame

Summarize the SSpMosaic programs into a data frame to serve as the input for the gene_program_network_propagation_dist function.

#Get the length of the longest gene set
max_length <- max(sapply(moudules, length))

#Merge the gene sets of different programs
df <- do.call(rbind, lapply(moudules, function(x) {
  
  length(x) <- max_length  
  x  
}))

#Convert to data.frame type
df <- as.data.frame(df, stringsAsFactors = FALSE)

Loading ppi matrix

The protein-protein interaction (PPI) matrix is used as a biological background network. Users can choose different biological background networks based on their research needs, such as pathway-based biological networks.

The example ppi matrix can be downloaded from https://drive.google.com/drive/folders/12abksyeOY0xTHCo25c-jugRn_MlZzDK1.

#Load ppi matrix
ppi <- readRDS(paste0(data_dir,'human_ppi_matrix.rds'))

Network propagation

The function gene_program_network_propagation_dist is used to run the network propagation algorithm and calculate the distances between programs based on the results of network propagation.

The parameters of gene_program_network_propagation_dist are as follows:

gene_programs: A data.frame as generated above,including the gene set corresponding to each gene program
biological_network: A matrix where both rows and columns represent genes, and the values represent interaction intensity..

#Generate an inter-program distance matrix using network propagation
distance_matrix <- gene_program_network_propagation_dist(gene_programs = df,biological_network = ppi)

Hierarchical Clustering

Generate SSpMosaic metaprograms using Hierarchical Clustering.

#Only keep Inh
distance_matrix <- distance_matrix[grepl('Inh',rownames(distance_matrix)),grepl('Inh',colnames(distance_matrix))] 
#Drop cell types that only appear in a single sample
distance_matrix <- distance_matrix[!grepl('Inh.PAX6|Inh.ADARB2',rownames(distance_matrix)),!grepl('Inh.PAX6|Inh.ADARB2',colnames(distance_matrix))]
#Run hierarchical Clustering
p <- pheatmap::pheatmap(1-distance_matrix,fontsize_row = 7,fontsize_col = 7,cellwidth = 7,cellheight = 7)

#Cut the hierarchical clustering tree
clusters <- cutree(p$tree_row, k = 8)

Generate SSpMosaic metaprograms

Summarize the correspondence between SSpMosaic metaprograms and programs, and generate metaprogram names.

#Initialize a list to store the correspondence between SSpMosaic metaprograms and programs
meta_moudule<- list()
#Get the hierarchical clustering ids corresponding to all metaprograms
unique_clusters <- unique(clusters)
#Define function to handle one-to-many cases.
count_string_occurrences <- function(arr, target) {
  count <- sum(grepl(target, arr))
  return(count)
}
#Iterate over each metaprogram
for (cluster in unique_clusters) {
  #Retrieve all programs contained within this metaprogram
  variables <- names(clusters[clusters == cluster])
  #Obtain the original clusters to which these programs belonged
  extracted_chars <- gsub("^.*_", "", variables)
  u <- unique(extracted_chars)
  #Generate the metaprogram name based on the original clusters to which the items within the metaprogram belonged
  meta_name <-  "meta_program"
  for(l in 1:length(u))
  {
    meta_name <- paste(meta_name,u[l],sep = '_')
  }
  #Handle one-to-many cases
  if(count_string_occurrences(names(meta_moudule),meta_name) >= 1)
  {
    
    meta_name <- paste(meta_name,(count_string_occurrences(names(meta_moudule),meta_name)+1),sep = '_')
  }
  #Save the metaprogram into the list
  meta_moudule[[meta_name]] <- variables
}

Generate the gene sets corresponding to the SSpMosaic metaprograms

Establish the correspondence between SSpMopsaic metaprograms and genes through the correspondence between metaprograms and programs.

#Initialize a list to store the correspondence between SSpMosaic metaprograms and genes
meta_moudule_genes <- list()
#Iterate over each metaprogram
for(meta in names(meta_moudule))
{
  #Iterate over each program within the metaprogram
  for(program in meta_moudule[[meta]])
  {
    #Merge the gene set corresponding to the program into the overall gene set of the metaprogram
    meta_moudule_genes[[meta]] <- c(meta_moudule_genes[[meta]],moudules[[program]])
  }
}