4. Running Tensor-cell2cell to perform a tensor factorization

After generating a 4D-Communication Tensor, we can run Tensor-cell2cell to perform a tensor factorization and obtain the context-driven patterns of cell-cell communication (factors).

This tutorial will show you how to load a previously exported 4D-Communication Tensor and run Tensor-cell2cell on this tensor.

Initial Setup

Enable GPU use as discussed in Tutorial 03

[1]:
gpu_use = TRUE

library(reticulate, quietly = TRUE)

if (gpu_use){
    device <- 'cuda:0'
    tensorly <- reticulate::import('tensorly')
    tensorly$set_backend('pytorch')
}else{
    device <- NULL
}

Import Libraries

[2]:
library(liana, quietly = TRUE)

library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(tibble, quietly = TRUE)

library(ggplot2, quietly = TRUE)

library(Seurat, quietly = TRUE)

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘magrittr’


The following object is masked from ‘package:tidyr’:

    extract


Attaching SeuratObject

Seurat v4 was just loaded with SeuratObject v5; disabling v5 assays and
validation routines, and ensuring assays work in strict v3/v4
compatibility mode

We will use reticulate to run Tensor-cell2cell in R.

[3]:
use_condaenv('ccc_protocols')
c2c <- reticulate::import(module = "cell2cell")

Directories

[4]:
output_folder <- file.path('..', '..', 'data', 'tc2c-outputs')

Load Tensor

Now, we can load our 4D-Communication Tensor.

[5]:
tensor <- reticulate::py_load_object(filename = file.path(output_folder, 'BALF-Tensor-R.pkl'))
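
Before factorizing, we can sanity-check the loaded object; its shape reflects the four tensor dimensions built in Tutorial 03 (contexts × LR pairs × sender cells × receiver cells):

print(tensor$shape)
# torch.Size([12, 1054, 10, 10])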

Perform Tensor Factorization

The liana::decompose_tensor function used below executes two important steps of the tensor factorization: an elbow analysis to select the rank, and a decomposition to estimate the low-rank tensor. A sketch of these two underlying steps is shown after the parameter list. For additional details on these two steps, please see the Supplementary Section of the companion Python tutorial.

Key parameters are:

  • rank is the number of factors or latent patterns we want to obtain from the analysis. You can either indicate a specific number or leave it as NULL to obtain a suggested number through an elbow analysis.

  • tf_optimization indicates whether to run the analysis in the 'regular' or the 'robust' way. In the regular mode, the tensor decomposition is run 10 times per rank evaluated in the elbow analysis and 1 time in the final decomposition; the optimization algorithm also uses fewer iterations (100 vs 500) and a lower precision (tolerance of 1e-7 vs 1e-8) than in the robust mode. The robust mode runs the tensor decomposition 20 times per rank evaluated in the elbow analysis and 100 times in the final decomposition. Here we could use tf_optimization='regular', which is faster but generates less robust results. We recommend tf_optimization='robust', which takes longer to run (more iterations and higher precision). It is important to note that multiple runs of a tensor decomposition differ in their initialization values (regardless of using tf_init='random' or tf_init='svd').

  • seed is the seed for randomization. It controls the randomization used when initializing the optimization algorithm that performs the tensor decomposition. It is useful for reproducing the same result every time that the analysis is run. If NULL, a different randomization will be used each time.

  • elbow_metric is the metric used in the elbow analysis to obtain a suggested number of factors. Options are 'error' and 'similarity', indicating respectively the error of tensor reconstruction and the similarity of the tensor factorization across multiple runs for a given rank. The similarity metric may result in non-smooth curves given its high variability, especially at higher ranks. If so, we recommend setting smooth_elbow=TRUE.

  • upper_rank is the maximum number of ranks to try in the elbow analysis.

  • init is the initialization method of the tensor decomposition algorithm. Options are 'random' and 'svd'. The latter helps to obtain results that are closer to a global optimum of the underlying optimization method. If your tensor contains a mask (tensor$mask), init='random' will be used automatically.

  • svd is the method used to compute the SVD when initializing the tensor factorization. This is only considered when init='svd'.
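
For reference, these are the two underlying steps that liana::decompose_tensor wraps, sketched here as direct calls on the cell2cell tensor object via reticulate. The method names come from cell2cell's InteractionTensor; the runs value shown mirrors the 'regular' mode described above and is an assumption rather than the exact value LIANA passes internally:

# Step 1: elbow analysis to suggest a rank ('regular' mode: 10 runs per rank; an assumption)
tensor$elbow_rank_selection(upper_rank = as.integer(25),
                            runs = as.integer(10),
                            init = 'random',
                            random_state = as.integer(0))

# Step 2: final decomposition at the suggested rank
tensor$compute_tensor_factorization(rank = tensor$rank,
                                    init = 'random',
                                    random_state = as.integer(0))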

[6]:
tensor <- liana::decompose_tensor(tensor = tensor,
                                  rank = NULL, # Number of factors to perform the factorization. If NULL, it is automatically determined by an elbow analysis
                                  tf_optimization = 'robust', # To define how robust we want the analysis to be.
                                  seed = 0, # Random seed for reproducibility
                                  elbow_metric = 'error', # Metric to use in the elbow analysis
                                  smooth_elbow = FALSE, # Whether smoothing the metric of the elbow analysis
                                  upper_rank=25, # Max number of factors to try in the elbow analysis
                                  init = 'random', # Initialization method of the tensor factorization
                                  svd = 'numpy_svd', # Type of SVD to use if the initialization is 'svd'
                                  factors_only = FALSE
                                  )
Estimating ranks...

Decomposing the tensor...

[7]:
print(paste0('The estimated tensor rank is: ', tensor$rank))
[1] "The estimated tensor rank is: 10"

Let’s see how the automated elbow analysis arrived at its rank estimation:

[8]:
# Estimate standard error
error_average <- tensor$elbow_metric_raw %>%
    t() %>%
    as.data.frame() %>%
    mutate(rank=row_number()) %>%
    pivot_longer(-rank, names_to = "run_no", values_to = "error") %>%
    group_by(rank) %>%
    summarize(average = mean(error),
              N = n(),
              SE.low = average - (sd(error)/sqrt(N)),
              SE.high = average + (sd(error)/sqrt(N))
           )

# plot
error_average %>%
    ggplot(aes(x=rank, y=average), group=1) +
    geom_line(col='red') +
    geom_ribbon(aes(ymin = SE.low, ymax = SE.high), alpha = 0.1) +
    geom_vline(xintercept = tensor$rank, colour='darkblue') + # rank of interest
    theme_bw() +
    labs(y="Error", x="Rank")
[Figure: elbow analysis — average reconstruction error (± SE) vs. rank, with the selected rank marked in blue]

Next, we can visualize what the factors look like.

LIANA has a helper function to visualize the factors. However, it requires a SingleCellExperiment object as input to access the associated metadata. So, let’s load our Seurat object, convert it to a SingleCellExperiment object, and format it for visualizing the tensor decomposition:

[9]:
# load processed data with liana results from 02-Infer-Communication-Scores.ipynb
covid_data <- readRDS(file.path('..', '..', 'data', 'covid_balf_norm.rds'))

# format the factor loadings for visualization
covid_data@metadata$tensor_res <- liana:::format_c2c_factors(tensor$factors)

# optionally, save the formatted loadings for later use:
# saveRDS(covid_data@metadata$tensor_res, file.path(output_folder, 'Loadings.rds'))

Save data for downstream analyses

[10]:
saveRDS(covid_data, file.path('..' , '..', 'data', 'covid_data_sce.rds'))
[11]:
h_ = 13
w_ = 15
options(repr.plot.height=h_, repr.plot.width=w_)

plot_c2c_overview(sce = covid_data, group_col = 'condition', sample_col = 'sample_new')
[Figure: overview of the factor loadings across contexts, ligand-receptor pairs, sender cells, and receiver cells]

Factorization Results

After running the decomposition, the results are stored in the factors attribute (tensor$factors). This attribute is a dictionary (a named list in R) containing the loadings for each of the elements in every tensor dimension. Keys are the names of the different dimensions.

[12]:
names(tensor$factors)
  1. 'contexts'
  2. 'interactions'
  3. 'senders'
  4. 'receivers'
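
Each entry of this named list is a data.frame of loadings, with one row per element of that dimension and one column per factor. A quick way to check their shapes (a minimal sketch over the list above):

sapply(tensor$factors, dim)
# contexts: 12 x 10, interactions: 1054 x 10, senders: 10 x 10, receivers: 10 x 10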

We can inspect the loadings of the samples, for example, located under the key 'contexts'.

[13]:
tensor$factors[['contexts']]
A data.frame: 12 × 10
     Factor 1    Factor 2      Factor 3   Factor 4    Factor 5   Factor 6    Factor 7   Factor 8    Factor 9    Factor 10
HC1  0.57167751  0.0040322351  0.2486676  0.06951163  0.1992737  0.09138259  0.2109948  0.09901118  0.08743461  0.10624311
HC2  0.61345816  0.0009374141  0.2422200  0.07904458  0.1662704  0.01086045  0.1696340  0.05779721  0.05216325  0.09475538
HC3  0.28084552  0.0640010983  0.3343133  0.38925144  0.4410195  0.25950903  0.2944204  0.21048699  0.06730253  0.13640665
M1   0.17811552  0.0243369564  0.2788751  0.54256505  0.2649768  0.51976120  0.3917096  0.31559205  0.29760960  0.25299737
M2   0.15766080  0.0367470123  0.2851860  0.50009239  0.2689672  0.50643295  0.4055026  0.30275965  0.29582947  0.22803977
M3   0.25531387  0.1038309112  0.2800323  0.24746110  0.1943755  0.35284221  0.3694525  0.23726170  0.15349986  0.11353971
S1   0.16227683  0.6068011522  0.2968922  0.16380185  0.2758042  0.09672241  0.1598114  0.26295593  0.31008309  0.16405572
S2   0.05848442  0.4525174201  0.2955828  0.27341583  0.3142951  0.25263739  0.2916600  0.35102302  0.42103592  0.36295748
S3   0.17109768  0.2214141041  0.3145112  0.29934934  0.3125799  0.12573658  0.2879774  0.29225123  0.16905822  0.30484933
S4   0.10837201  0.2505245507  0.3078077  0.05209281  0.3427818  0.32328588  0.2711394  0.40252998  0.43096238  0.47937286
S5   0.04310614  0.4705861211  0.2580555  0.16272153  0.2119863  0.20446612  0.2239244  0.35042289  0.43328699  0.40211794
S6   0.15356158  0.2773903906  0.3073083  0.10633880  0.3504498  0.19429857  0.2616548  0.36386305  0.33164164  0.43490598
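
Because the loadings are plain data frames, they can also be charted directly with the tidyverse packages loaded earlier. Below is a minimal sketch (not one of LIANA's helpers) of a context-loadings heatmap, useful for eyeballing which factors separate the HC, M, and S sample groups:

tensor$factors[['contexts']] %>%
    rownames_to_column('sample') %>%
    pivot_longer(-sample, names_to = 'factor', values_to = 'loading') %>%
    mutate(factor = factor(factor, levels = paste('Factor', 1:10))) %>% # keep Factor 1..10 order
    ggplot(aes(x = factor, y = sample, fill = loading)) +
    geom_tile() +
    theme_bw() +
    labs(x = NULL, y = NULL, fill = 'Loading')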

If we are interested in the top elements of a given dimension, here for example 'interactions', we can easily obtain them:

[14]:
tensor$get_top_factor_elements(order_name='interactions',
                                factor_name='Factor 10',
                                top_number=as.integer(10))
CALM2^PDE1B        0.202606901526451
HLA-C^LILRA1       0.164520159363747
CSF1^CSF2RA        0.159353837370872
HLA-A^LILRA1       0.139374673366547
TIGIT^NECTIN2      0.13768395781517
CD70^TNFRSF17      0.12326680123806
ICAM1^ITGAM_ITGB2  0.104458123445511
HLA-B^LILRA1       0.10074632614851
PLAU^ST14          0.0999555811285973
LGALS9^MET         0.097864530980587
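
Equivalently, because the interaction loadings are a data frame, the same top elements can be retrieved with dplyr (a sketch using the packages already loaded; the backticks are needed because the column names contain spaces):

tensor$factors[['interactions']] %>%
    rownames_to_column('lr_pair') %>%
    arrange(desc(`Factor 10`)) %>%
    head(10)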

Supplementary Information about the Tensor Decomposition

The tensor decomposition performed here is a non-negative canonical polyadic decomposition (CPD), the same method used during the elbow analysis. Briefly, tensor decomposition identifies a low-rank tensor (here, a rank of 10) that approximates the full tensor. This low-rank tensor can be represented as the sum of a set of rank-1 tensors (10 of them in this case). Each rank-1 tensor represents a factor in the decomposition and can be further represented as the outer product of n vectors, where n is the number of tensor dimensions. Each vector corresponds to one of the n tensor dimensions for that factor, and its values, corresponding to individual elements in each dimension, are the factor loadings. In our case, each factor contains loadings for the context, LR-pair, sender-cell, and receiver-cell dimensions. The elements with high loadings within a factor drive that factor-specific communication pattern.

[Figure: schematic of the canonical polyadic decomposition, representing the 4D tensor as a sum of rank-1 tensors (factors)]
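
To make the formula concrete, the sketch below assembles the low-rank approximation in R from the loadings using the outer-product operator %o%. Note that cell2cell returns normalized loadings by default, so this recovers the tensor's structure only up to per-factor scaling weights; treat it as an illustration of the CPD sum of rank-1 tensors, not an exact reconstruction:

# each factor k contributes a rank-1 tensor: the outer product of its four
# loading vectors (contexts, interactions, senders, receivers)
loadings <- lapply(tensor$factors, as.matrix)

rank1_tensor <- function(k) {
    loadings$contexts[, k] %o% loadings$interactions[, k] %o%
        loadings$senders[, k] %o% loadings$receivers[, k]
}

# the low-rank tensor is the sum of the rank-1 tensors across all factors
approx_tensor <- Reduce(`+`, lapply(seq_len(tensor$rank), rank1_tensor))
dim(approx_tensor) # 12 1054 10 10, matching tensor$shape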

Pipeline

To combine the tensor building from Tutorial 03 with the rank selection and decomposition steps from this tutorial, we can run the liana_tensor_c2c function while setting the build_only argument to FALSE:

First, preprocess the context_df_dict variable as in Tutorial 03:

[15]:
# preprocess.scores<-function(context_df_dict,
#                             score_col = "LRscore",
#                             sender_col = "source", receiver_col = "target",
#                             ligand_col = "ligand.complex", receptor_col = "receptor.complex",
#                             outer_fraction = 0, invert = TRUE, non_negative = TRUE, non_negative_fill = 0
#                            ){
#     source.cells<-sapply(context_df_dict, function(x) x[[sender_col]])
#     target.cells<-sapply(context_df_dict, function(x) x[[receiver_col]])
#     all.cells.persample<-sapply(names(source.cells),
#                                 function(sample.name) union(source.cells[[sample.name]], target.cells[[sample.name]]))
#     all.cells<-Reduce(union, unname(unlist(all.cells.persample)))
#     cell.counts.persample<- sapply(all.cells,
#                                    function(ct) length(which(sapply(all.cells.persample, function(x) ct %in% x))))
#     cells.todrop<-names(which(cell.counts.persample < (outer_fraction*length(context_df_dict))))


#     lrs.persample<-sapply(context_df_dict, function(x) paste(x[[ligand_col]], x[[receptor_col]],
#                                                              sep = '^'))
#     all.lrs<- Reduce(union, unname(unlist(lrs.persample)))
#     lr.counts.persample<-sapply(all.lrs, function (lr) length(which(sapply(lrs.persample, function(x) lr %in% x))))
#     lrs.todrop<-names(which(lr.counts.persample < (outer_fraction*length(context_df_dict))))

#     for (sample.name in names(context_df_dict)){
#         ccc.sample<-context_df_dict[[sample.name]]

#         if (invert){ccc.sample[[score_col]]=(1- ccc.sample[[score_col]])}# invert score
#         if (non_negative){ccc.sample[[score_col]][ccc.sample[[score_col]] < 0] = non_negative_fill} # make non-negative

#         # apply the outer_fraction parameter
#         ccc.sample<-ccc.sample[(!(ccc.sample[[sender_col]] %in% cells.todrop)) &
#                                (!(ccc.sample[[receiver_col]] %in% cells.todrop)),]
#         ccc.sample<-ccc.sample[!(paste(ccc.sample[[ligand_col]], ccc.sample[[receptor_col]],
#                                    sep = '^') %in% lrs.todrop),]

#         context_df_dict[[sample.name]]<-ccc.sample
#     }
#     return(context_df_dict)
# }

# data_folder <- file.path('..', '..', 'data', 'liana-outputs')
# env_name <- 'ccc_protocols' # conda environment created by ../../env_setup/setup_env.sh
# context_df_dict <- readRDS(file.path(data_folder, 'LIANA_by_sample_R.rds'))
# context_df_dict <- preprocess.scores(context_df_dict, outer_fraction = 1/3,
#                                      invert = TRUE, non_negative = TRUE, non_negative_fill = 0)
[16]:
# tensor <- liana_tensor_c2c(context_df_dict = context_df_dict,
#                     # build args
#                           sender_col = "source", # Column name of the sender cells
#                           receiver_col = "target", # Column name of the receiver cells
#                           ligand_col = "ligand.complex", # Column name of the ligands
#                           receptor_col = "receptor.complex", # Column name of the receptors
#                         score_col = 'LRscore', # Column name of the communication scores to use
#                         how='outer',  # What to include across all samples
#                         lr_fill = NaN, # What to fill missing LRs with
#                         cell_fill = NaN, # What to fill missing cell types with
#                         lr_sep='^', # How to separate ligand and receptor names to name LR pair
#                         context_order=sorted_names, # Order to store the contexts in the tensor (sorted_names was defined in Tutorial 03)
#                         sort_elements = TRUE, # Whether to sort element names of each tensor dimension alphabetically. Does not apply to contexts if context_order is passed.
#                         conda_env = env_name, # used to pass an existing conda env with cell2cell
#                     # decomposition args
#                         build_only = FALSE, # set this to FALSE to combine the downstream rank selection and decomposition steps all here
#                         device = device, # Device to use when the backend is pytorch
#                         rank = NULL,
#                         seed = 0,
#                         init = 'random',
#                         upper_rank = 25,
#                         runs = 3,
#                         factors_only = FALSE,
#                         inplace = FALSE
#                             )

Subset of a Tensor

From this analysis, we may also be interested in a subset of the original tensor, to inspect the cell-cell communication of a subset of samples, LR pairs, sender cells, or receiver cells.

We can subset the original tensor without having to rebuild it from scratch. For example, suppose we are interested in the CCC of NK and T cells as the senders, and B cells, Macrophages, mDC and pDC cells as the receivers. We can do so by creating a dictionary specifying the elements of the dimensions that we want to subset. Here, the dimensions correspond to:

  • 0 for the Contexts

  • 1 for the Ligand-Receptor Pairs

  • 2 for the Sender Cells

  • 3 for the Receiver Cells

If we do not include a dimension in the following dictionary, that dimension will not be subset.

[17]:
sender_subset = list('NK', 'T')
receiver_subset = list('B', 'Macrophages', 'mDC', 'pDC')

subset_dict<-reticulate::py_dict(keys = c(as.integer(2), as.integer(3)),
                                 values = list(sender_subset, receiver_subset))
subset_dict
{2: ['NK', 'T'], 3: ['B', 'Macrophages', 'mDC', 'pDC']}

Some key parameters here are:

  • original_order indicates whether to keep the element order of the original tensor or that of the subset lists.

  • remove_duplicates indicates whether to remove duplicate elements from the subset lists. Duplicates can be introduced especially when these lists are built manually.

  • keep indicates which duplicated element to keep. Options are:

    • ‘first’ : Drop duplicates except for the first occurrence.

    • ‘last’ : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

[18]:
subset_tensor = c2c$tensor$subset_tensor(interaction_tensor=tensor,
                        subset_dict=subset_dict,
                        original_order=TRUE)

If we compare the shape of the original tensor with that of the subset, we can see that only the sender and receiver cell dimensions were modified.

[19]:
print(tensor$shape)
print(subset_tensor$shape)
torch.Size([12, 1054, 10, 10])
torch.Size([12, 1054, 2, 4])
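
The subset tensor behaves like the original one, so it can be factorized with the same workflow. A minimal sketch re-running the decomposition on the subset (same arguments as used above; the elbow analysis may suggest a different rank for the smaller tensor):

subset_tensor <- liana::decompose_tensor(tensor = subset_tensor,
                                         rank = NULL, # re-estimate the rank for the subset
                                         tf_optimization = 'robust',
                                         seed = 0,
                                         upper_rank = 25,
                                         init = 'random',
                                         factors_only = FALSE)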