LIANA x Tensor-cell2cell Quickstart (R)

This tutorial provides an abbreviated version of Tutorials 01-06 in order to give a quick overview of the pipeline, going from a raw counts matrix to downstream cell-cell communication (CCC) analyses. By combining LIANA and Tensor-cell2cell, we obtain a general framework that can robustly incorporate many existing CCC inference tools, ultimately retrieving consensus communication scores for each sample and analyzing all samples together to identify context-dependent communication programs.

Initial Setups

Enabling GPU use

First, if you are using an NVIDIA GPU with CUDA cores, set use_gpu=TRUE and enable the PyTorch backend with the following code block. Otherwise, set use_gpu=FALSE or skip this part.

[1]:
use_gpu <- TRUE
library(reticulate, quietly = TRUE)
if (use_gpu){
    device <- 'cuda:0'
    tensorly <- reticulate::import('tensorly')
    tensorly$set_backend('pytorch')
}else{
    device <- NULL
}

Libraries

Then, import all the packages we will use in this tutorial:

[2]:
suppressMessages({
    library(dplyr, quietly = TRUE)
    library(tidyr, quietly = TRUE)
    library(purrr, quietly = TRUE)
    library(forcats, quietly = TRUE)
    library(textshape, quietly = TRUE)
    library(ggpubr, quietly = TRUE)
    library(rstatix, quietly = TRUE)
    library(reshape2, quietly = TRUE)
    library(tibble, quietly = TRUE)

    library(scater, quietly = TRUE)
    library(scuttle, quietly = TRUE)

    library(ggplot2, quietly = TRUE)

    library(liana, quietly = TRUE)
    library(decoupleR, quietly = TRUE)
    c2c <- reticulate::import(module = "cell2cell")
})

Directories

Afterwards, specify the data and output directories:

[3]:
data_folder <- '../../data/quickstart/'
output_folder <- '../../data/quickstart/outputs'

for (folder_ in c(data_folder, output_folder)){
    if (!dir.exists(folder_)){
        dir.create(folder_, recursive = TRUE) # create parent directories if needed
    }
}

Loading Data

We begin by loading the single-cell transcriptomics data. For this tutorial, we will use a lung dataset of 63k immune and epithelial cells across three control, three moderate, and six severe COVID-19 patients [21]. We can download the data and store it in the SingleCellExperiment format:

[4]:
options(timeout=600)
covid_data <- readRDS(url('https://zenodo.org/record/7706962/files/BALF-COVID19-Liao_et_al-NatMed-2020_SCE.rds'))

Preprocess Expression

Note that we do not include a batch-correction step, as Tensor-cell2cell can obtain robust decomposition results without it; however, we provide an extensive analysis and discussion of this topic in Supplementary Tutorial 01.

Quality-Control Filtering

The loaded data has already been pre-processed to a degree and comes with cell annotations. Nevertheless, let's highlight some of the key steps. To remove noisy measurements, we filter out non-informative cells and genes:

[5]:
# basic filters
covid_data <- scater::addPerCellQC(covid_data)
min.features <- 200
min.cells <- 3

sce_counts <- counts(covid_data)

if (min.features > 0) {
  nfeatures <- Matrix::colSums(x = sce_counts > 0)
  sce_counts <- sce_counts[, which(x = nfeatures >= min.features)]
}
# filter genes on the number of cells expressing
if (min.cells > 0) {
  num.cells <- Matrix::rowSums(x = sce_counts > 0)
  sce_counts <- sce_counts[which(x = num.cells >= min.cells), ]
}

covid_data<-covid_data[rownames(sce_counts), colnames(sce_counts)]

We additionally remove cells with high mitochondrial content:

[6]:
is.mito <- grep("MT-", rownames(covid_data))
per.cell <- scuttle::perCellQCMetrics(covid_data, subsets=list(Mito=is.mito))
colData(covid_data)[['subsets_Mito_percent']] <- per.cell$subsets_Mito_percent
covid_data <- covid_data[, colData(covid_data)$subsets_Mito_percent < 15]

This is followed by removing cells with a high number of detected features, which potentially represent more than one cell (doublets):

[7]:
covid_data <- covid_data[, colData(covid_data)$detected < 5500]

Normalization

Normalized counts are usually obtained in two essential steps: the first is count-depth scaling, which ensures that the measured count depths are comparable across cells; this is then typically followed by a log1p transformation, which stabilizes the variance of the counts and enables the use of linear metrics downstream:

[8]:
scale.factor <- 1e4
lib.sizes <- colSums(counts(covid_data)) / scale.factor
assay(covid_data, 'logcounts') <- scuttle::normalizeCounts(covid_data, size.factors=lib.sizes, log=FALSE, center.size.factors=FALSE)
assay(covid_data, 'logcounts') <- log1p(assay(covid_data, 'logcounts'))

Let’s see what our data looks like:

[9]:
head(colData(covid_data))
DataFrame with 6 rows and 13 columns
                      orig.ident      sample  sample_new     disease
                     <character> <character> <character> <character>
AAACCCACAGCTACAT-1_1        C100        C100         HC3           N
AAACCCATCCACGGGT-1_1        C100        C100         HC3           N
AAACCCATCCCATTCG-1_1        C100        C100         HC3           N
AAACGAACAAACAGGC-1_1        C100        C100         HC3           N
AAACGAAGTCGCACAC-1_1        C100        C100         HC3           N
AAACGAAGTCTATGAC-1_1        C100        C100         HC3           N
                         hasnCoV   cluster   cell.type   condition    ident
                     <character> <integer> <character> <character> <factor>
AAACCCACAGCTACAT-1_1           N        27           B     Control     C100
AAACCCATCCACGGGT-1_1           N        23 Macrophages     Control     C100
AAACCCATCCCATTCG-1_1           N         6           T     Control     C100
AAACGAACAAACAGGC-1_1           N        10 Macrophages     Control     C100
AAACGAAGTCGCACAC-1_1           N        10 Macrophages     Control     C100
AAACGAAGTCTATGAC-1_1           N         9           T     Control     C100
                           sum  detected     total subsets_Mito_percent
                     <numeric> <integer> <numeric>            <numeric>
AAACCCACAGCTACAT-1_1      3124      1377      3124             9.189882
AAACCCATCCACGGGT-1_1      1430       836      1430             0.909727
AAACCCATCCCATTCG-1_1      2342      1105      2342             6.319385
AAACGAACAAACAGGC-1_1     31378      4530     31378             9.981516
AAACGAAGTCGCACAC-1_1     12767      3409     12767             5.161745
AAACGAAGTCTATGAC-1_1      2198      1094      2198             8.871702

Deciphering Cell-Cell Communication

We will use LIANA to infer the ligand-receptor interactions for each sample. LIANA is highly modularized and natively implements the formulations of a number of methods, including CellPhoneDBv2, Connectome, log2FC, NATMI, SingleCellSignalR, and CellChat, as well as a geometric mean and a consensus in the form of a rank aggregate from any combination of methods.

[10]:
liana::show_methods()
  1. 'connectome'
  2. 'logfc'
  3. 'natmi'
  4. 'sca'
  5. 'cellphonedb'
  6. 'cytotalk'
  7. 'call_squidpy'
  8. 'call_cellchat'
  9. 'call_connectome'
  10. 'call_sca'
  11. 'call_italk'
  12. 'call_natmi'

LIANA classifies the scoring functions from the different methods into two categories: those that infer the “Magnitude” and “Specificity” of interactions. We define the “Magnitude” of an interaction as a measure of the strength of the interaction’s expression, and the “Specificity” of an interaction is a measure of how specific an interaction is to a given pair of clusters. Generally, these categories are complementary, and the magnitude of the interaction is often a proxy of the specificity of the interaction. For example, a ligand-receptor interaction with a high magnitude score in a given pair of cell types is likely to also be specific, and vice versa.

For clarity, here we map each output score (column names in LIANA's output dataframe, shown further below) to the corresponding method and score type:

Method Name               Magnitude Score   Specificity Score
CellPhoneDB               lr.mean           cellphonedb.pvalue
Connectome                prod_weight       weight_sc
log2FC                    None              logfc_comb
NATMI                     prod_weight       edge_specificity
SingleCellSignalR (sca)   LRscore           None

When considering ligand-receptor prior-knowledge resources, a common theme is the trade-off between coverage and quality, and each resource likewise comes with its own biases. In this regard, LIANA builds on OmniPath [29], as all of the resources in LIANA are obtained via OmniPath. These include the expert-curated resources of CellPhoneDBv2 [23], CellChat [27], ICELLNET [30], connectomeDB2020 [25], and CellTalkDB [31], as well as 10 others. LIANA further provides a consensus expert-curated resource built from the aforementioned five resources, along with some curated interactions from SignaLink. In this protocol, we will use the consensus resource from LIANA, though any of the other resources are available via LIANA, and one can also use LIANA with a custom resource.

[11]:
liana::show_resources()
  1. 'Default'
  2. 'Consensus'
  3. 'Baccin2019'
  4. 'CellCall'
  5. 'CellChatDB'
  6. 'Cellinker'
  7. 'CellPhoneDB'
  8. 'CellTalkDB'
  9. 'connectomeDB2020'
  10. 'EMBRACE'
  11. 'Guide2Pharma'
  12. 'HPMR'
  13. 'ICELLNET'
  14. 'iTALK'
  15. 'Kirouac2010'
  16. 'LRdb'
  17. 'Ramilowski2015'
  18. 'OmniPath'
  19. 'MouseConsensus'

Selecting any of the lists of ligand-receptor pairs in LIANA can be done with the following command (here we select the aforementioned 'Consensus' resource):

[12]:
lr_pairs <- liana::select_resource('Consensus')


Next, we can run LIANA on each sample across the available methods. By default, LIANA calculates an aggregate rank across these methods using a re-implementation of the RobustRankAggregate method, which models the probability that a given ligand-receptor interaction is ranked consistently better than expected under a null hypothesis (see Appendix 2). The consensus rank of a ligand-receptor interaction across methods can therefore be treated as a p-value.

[13]:
covid_data <- liana_bysample(sce = covid_data,
                             sample_col = "sample_new", # context dimension column from SCE metadata
                             idents_col = "cell.type", # cell types column from SCE metadata
                             resource = 'Consensus',
                             expr_prop = 0.1, # must be expressed in expr_prop fraction of cells
                              min_cells = 5,
                             assay.type = 'logcounts', # run on log- and library-normalized counts
                             aggregate_how = 'both', # aggregate magnitude and specificity
                             verbose = TRUE,
                             inplace = TRUE,
                             permutation.params=list(nperms=100),
#                              parallelize = TRUE,
#                              workers = 40
                        )
`sample_new` was converted to a factor!

Current sample: HC1

Running LIANA with `cell.type` as labels!

`Idents` were converted to factor

Cell identities with less than 5 cells: Plasma were removed!Cell identities with less than 5 cells: pDC were removed!

Warning message in exec(output, ...):
“5711 genes and/or 0 cells were removed as they had no counts!”
Warning message:
“`invoke()` is deprecated as of rlang 0.4.0.
Please use `exec()` or `inject()` instead.
This warning is displayed once every 8 hours.”
LIANA: LR summary stats calculated!

Now Running: Natmi

Now Running: Connectome

Now Running: Logfc

Now Running: Sca

Now Running: Cellphonedb

Warning message:
“`progress_estimated()` was deprecated in dplyr 1.0.0.
 The deprecated feature was likely used in the liana package.
  Please report the issue at <https://github.com/saezlab/liana/issues>.”
Now aggregating natmi

Warning message in exec(output, ...):
“Unknown method name or missing specifics for: connectome”
Warning message in exec(output, ...):
“Unknown method name or missing specifics for: logfc”
Now aggregating sca

Now aggregating cellphonedb

Aggregating Ranks

Now aggregating natmi

Now aggregating connectome

Now aggregating logfc

Warning message in exec(output, ...):
“Unknown method name or missing specifics for: sca”
Now aggregating cellphonedb

Aggregating Ranks

(Analogous progress messages follow for the remaining samples HC2, HC3, M1-M3, and S1-S6; in each sample, cell identities with fewer than 5 cells (e.g. NK, B, Plasma, pDC, Mast, or Neutrophil, depending on the sample) and genes without counts are removed before the methods are run and their ranks aggregated.)

The parameters used here are as follows:

  • sce is the SingleCellExperiment object containing all samples; liana_bysample splits it into one object per sample/context using sample_col.

  • sample_col is the column in the colData of the SCE object that defines the sample/context of each cell.

  • resource is the name of any of the resources available in LIANA.

  • idents_col corresponds to the cell group (cell type) labels stored in the colData of the SCE object.

  • assay.type indicates which assay in the SCE object to use; here the log- and library-normalized counts are stored in logcounts, while the raw UMI counts are stored in counts.

  • expr_prop is the expression-proportion threshold, i.e. the minimum fraction of cells per cell type that must express every protein subunit involved in an interaction for that interaction to be kept.

  • min_cells is the minimum number of cells per cell type required for a cell type to be considered in the analysis.

  • aggregate_how specifies whether to aggregate the magnitude score types, the specificity score types, or both.

  • verbose is a boolean that indicates whether to print the progress of the function.

  • inplace is a boolean that indicates whether to store the results in place, i.e. in the metadata of the SCE object (covid_data@metadata$liana_res).
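
Since we ran liana_bysample with inplace=TRUE, the per-sample results are stored as a named list in the SCE metadata. As a quick optional check (not part of the original protocol), we can list the processed samples and peek at one of them:

# the results are a named list with one tibble of ligand-receptor scores per sample
names(covid_data@metadata$liana_res)
head(covid_data@metadata$liana_res[["HC1"]], 3)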

Let’s see what the results look like:

[14]:
liana_res <- covid_data@metadata$liana_res %>%
            bind_rows(.id = "sample_new") %>%
            mutate(sample_new = factor(sample_new, levels = unique(sample_new)))

head(liana_res)
A tibble: 6 × 16
sample_new | source      | target | ligand.complex | receptor.complex | magnitude_rank | specificity_rank | magnitude_mean_rank | natmi.prod_weight | sca.LRscore | cellphonedb.lr.mean | specificity_mean_rank | natmi.edge_specificity | connectome.weight_sc | logfc.logfc_comb | cellphonedb.pvalue
HC1        | T           | NK     | B2M            | CD3D             | 3.983438e-11   | 0.0134542495     | 1.000000            | 8.059861          | 0.9610254   | 3.410586            | 487.75                | 0.08327606             | 1.3008562            | 0.8719799        | 0
HC1        | Macrophages | NK     | B2M            | CD3D             | 3.186750e-10   | 0.0135072450     | 2.000000            | 8.059599          | 0.9610248   | 3.410500            | 487.00                | 0.08327335             | 1.3005565            | 0.8832177        | 0
HC1        | NK          | NK     | B2M            | CD3D             | 4.979297e-09   | 0.0181475450     | 3.666667            | 7.614378          | 0.9599465   | 3.264099            | 571.25                | 0.07867324             | 0.7909129            | 0.6978778        | 0
HC1        | T           | NK     | B2M            | KLRD1            | 1.366319e-08   | 0.0003588748     | 5.666667            | 6.865250          | 0.9579074   | 3.297900            | 211.75                | 0.17129308             | 6.9609202            | 0.8624079        | 0
HC1        | B           | NK     | B2M            | CD3D             | 2.039520e-08   | 0.0192306877     | 5.333333            | 7.518543          | 0.9597023   | 3.232586            | 595.00                | 0.07768305             | 0.6812107            | 0.6910179        | 0
HC1        | Macrophages | NK     | B2M            | KLRD1            | 2.039520e-08   | 0.0003588748     | 6.666667            | 6.865027          | 0.9579067   | 3.297814            | 210.75                | 0.17128751             | 6.9606205            | 0.8736456        | 0

This dataframe provides the results from running each method, as well as the consensus scores across the methods. In our case, we are interested in the consensus magnitude score, denoted by the 'magnitude_rank' column.
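
Because magnitude_rank behaves like an adjusted p-value, one straightforward use, shown here purely as an illustrative sketch with an arbitrary 0.05 cutoff, is to count how many consensus interactions per sample pass that threshold:

# count interactions per sample whose consensus magnitude rank falls below an illustrative 0.05 cutoff
liana_res %>%
    dplyr::filter(magnitude_rank <= 0.05) %>%
    dplyr::count(sample_new, name = "n_interactions")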

We can visualize the output as a dotplot across multiple samples:

[15]:
ligand_complex <- 'B2M'
receptor_complex <- c('CD3D', 'KLRD1')
source_labels <- c("B", "pDC", "Macrophages")
target_labels <- c("T", "Mast", "pDC", "NK")
sample_key <- 'sample_new'
colour <- 'magnitude_rank'
size <- "specificity_rank"

lv <- liana_res %>%
    filter(source %in% source_labels) %>%
    filter(target %in% target_labels) %>%
    filter(ligand.complex %in% ligand_complex) %>%
    filter(receptor.complex %in% receptor_complex) %>%
    unite(lr.pairs, ligand.complex, receptor.complex, sep = ' -> ') %>%
    mutate(lr.pairs = factor(lr.pairs, levels = unique(lr.pairs))) %>%
    # inverse ranks
    mutate_at(vars(specificity_rank, magnitude_rank), ~-log10(. + 1e-10))

h_ = 7
w_ = 15
options(repr.plot.height=h_, repr.plot.width=w_)

ggplot(lv, aes(x = target, y = source, color = .data[[colour]], size=.data[[size]])) +
        geom_point() +
        facet_wrap(sample_new ~ lr.pairs, ncol = 12, scales='free_x') +
        theme_bw()
../../_images/notebooks_ccc_R_QuickStart_38_0.png

Let’s export it in case we want to access it later:

[16]:
write.csv(liana_res, file.path(output_folder, 'rLIANA_by_sample.csv'))

Alternatively, one could just export the whole SingleCellExperiment object, together with the ligand-receptor results stored:

[17]:
saveRDS(covid_data, file.path(output_folder, 'sce_processed.rds'))

Comparing cell-cell communication across multiple samples

We can use Tensor-cell2cell to infer context-dependent CCC patterns from multiple samples simultaneously. To do so, we must first restructure the communication scores (LIANA’s output) into a 4D-Communication Tensor.

The tensor is built as follows: we create matrices with the communication scores for each of the ligand-receptor pairs within the same sample, then generate a 3D tensor for each sample, and finally concatenate them to form the 4D tensor:

ccc-scores
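
To make the layout of the 4D-communication tensor concrete, here is a small toy illustration (not part of the pipeline; the names and sizes below are made up) of how samples, LR pairs, sender cells, and receiver cells map onto the four dimensions:

# Toy example only: a 4D array indexed as
# samples x LR pairs x sender cell types x receiver cell types
toy_tensor <- array(
    runif(2 * 3 * 4 * 4),
    dim = c(2, 3, 4, 4),
    dimnames = list(samples   = c("HC1", "S1"),
                    lr_pairs  = c("B2M^CD3D", "B2M^KLRD1", "CCL2^CCR2"),
                    senders   = c("B", "T", "NK", "Macrophages"),
                    receivers = c("B", "T", "NK", "Macrophages"))
)
dim(toy_tensor)                     # 2 3 4 4
toy_tensor["HC1", "B2M^CD3D", , ]   # sender x receiver score matrix of one LR pair in one sample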

First, we generate a list containing all samples from our SingleCellExperiment object. For visualization purposes, we sort them by COVID-19 severity. The sample names can be found in the sample_new column of the colData, and here also as the names of the per-sample LIANA results:

[18]:
sorted_samples = sort(names(covid_data@metadata$liana_res))

Then we can directly pass the communication scores from LIANA to build the 3D tensor for each sample (panel c in the figure above) and concatenate them into the 4D tensor, using the following functions:

[19]:
# build the tensor
context_df_dict <- liana:::preprocess_scores(context_df_dict = covid_data@metadata[['liana_res']],
                                             outer_frac = 1/3, # Fraction of samples as threshold to include cells and LR pairs.
                                             invert = TRUE, # transform the scores
                                             invert_fun = function(x) 1-x, # Transformation function
                                             non_negative = TRUE, # fills negative values
                                             non_negative_fill = 0 # set negative values to 0
                             )


tensor <- liana_tensor_c2c(context_df_dict = context_df_dict,
                          sender_col = "source", # Column name of the sender cells
                          receiver_col = "target", # Column name of the receiver cells
                          ligand_col = "ligand.complex", # Column name of the ligands
                          receptor_col = "receptor.complex", # Column name of the receptors
                          score_col = 'magnitude_rank', # Column name of the communication scores to use
                          how='outer',  # What to include across all samples
                          context_order=sorted_samples, # Order to store the contexts in the tensor
                          conda_env = 'ccc_protocols', # used to pass an existing conda env with cell2cell
                          build_only = TRUE, # set this to FALSE to combine the downstream rank selection and decomposition steps here
                          device = device # Device to use when backend is pytorch.
                        )
Inverting `magnitude_rank`! (repeated once per sample, 12 times in total)

[1] 0
Loading `ccc_protocols` Conda Environment

Building the tensor using magnitude_rank...

The key parameters when building a tensor are:

  • context_df_dict is a named list with one LIANA result dataframe per sample/context; here it comes from covid_data@metadata[['liana_res']] after pre-processing with preprocess_scores.

  • sender_col, receiver_col, ligand_col, receptor_col, and score_col are the column names in each dataframe containing the sender cells, receiver cells, ligands, receptors, and communication scores, respectively. Each row of a dataframe contains a unique combination of these elements.

  • invert and invert_fun control the function used to transform the communication score before building the tensor. The 'magnitude_rank' score generated by LIANA ranges from 0 to 1 and considers low values as the most important ones. In contrast, Tensor-cell2cell requires higher values to be the most important scores, so here we pass a function (function(x) 1-x) that subtracts LIANA's score from 1.

  • how controls which ligand-receptor pairs and cell types to include when building the tensor. This decision depends on whether missing values across a number of samples, for either ligand-receptor interactions or sender-receiver cell pairs, are considered biologically relevant. Options are:

    • 'inner' is the strictest option, since it considers only cell types and LR pairs that are present in all contexts (intersection).

    • 'outer' considers all cell types and LR pairs that are present across contexts (union).

    • 'outer_lrs' considers only cell types that are present in all contexts (intersection), but all LR pairs that are present across contexts (union).

    • 'outer_cells' considers only LR pairs that are present in all contexts (intersection), but all cell types that are present across contexts (union).

  • outer_frac controls which elements to include in the union scenario of the how options. Only elements present in at least this fraction of samples/contexts are included. When this value is 0, the tensor includes all elements found across the samples; when it is 1, it behaves like how='inner'.

  • context_order is a list specifying the order of the samples. The order of samples does not affect the results, but it is useful for posterior visualizations.

We can check the shape of this tensor to verify the number of samples, LR pairs, sender cells, and receiver cells, respectively:

[20]:
tensor$shape
torch.Size([12, 1054, 10, 10])
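
Because we built the tensor with how='outer', some sample/LR-pair/cell-type combinations are missing and are treated as masked values during the decomposition. Assuming the returned object is a cell2cell InteractionTensor, which exposes an excluded_value_fraction() helper, we can check how sparse the tensor is (an optional sanity check, not part of the original quickstart):

# fraction of tensor entries that are missing (NaN) after the union ('outer') build
tensor$excluded_value_fraction()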

We can export our tensor:

[21]:
reticulate::py_save_object(object = tensor,
                           filename = file.path(output_folder, 'BALF-Tensor-R.pkl'))

Then, we can load it with:

[22]:
tensor <- reticulate::py_load_object(filename = file.path(output_folder, 'BALF-Tensor-R.pkl'))

Perform Tensor Factorization

Now that we have built the tensor and its metadata, we can run Tensor Component Analysis via Tensor-cell2cell with one simple command that we implemented for our unified framework:

[23]:
tensor <- liana::decompose_tensor(tensor = tensor,
                                  rank = NULL, # Number of factors for the factorization. If NULL, it is automatically determined by an elbow analysis
                                  tf_optimization = 'robust', # To define how robust we want the analysis to be
                                  seed = 0, # Random seed for reproducibility
                                  factors_only = FALSE)
Estimating ranks...

Decomposing the tensor...

Key parameters are:

  • rank is the number of factors or latent patterns we want to obtain from the analysis. You can either indicate a specific number or leave it as NULL to perform the decomposition with the number suggested by an elbow analysis (see the sketch after this list for re-running with a manually chosen rank).

  • tf_optimization indicates whether to run the analysis in the 'regular' or the 'robust' way. The 'regular' option runs the tensor decomposition fewer times when selecting an optimal result and uses less strict convergence parameters, which makes it faster. Important: with tf_optimization='robust' the analysis takes much longer to run than with 'regular', but the 'regular' option may generate less robust results.

  • seed is the seed for randomization. It controls the initialization of the optimization algorithm that performs the tensor decomposition, and is useful for reproducing the same result every time the analysis is run. If NULL, a different randomization will be used each time.
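
If the elbow suggestion does not match your expectations, the decomposition can be re-run with a manually chosen rank; a minimal sketch (the rank of 8 used here is purely illustrative; pick a value based on the elbow plot below):

# re-run the factorization with an explicitly chosen number of factors
tensor_fixed <- liana::decompose_tensor(tensor = tensor,
                                        rank = 8, # illustrative user-defined number of factors
                                        tf_optimization = 'regular', # faster, less strict settings
                                        seed = 0,
                                        factors_only = FALSE)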

[24]:
print(paste0('The estimated tensor rank is: ', tensor$rank))
[1] "The estimated tensor rank is: 10"
[25]:
# Estimate standard error
error_average <- tensor$elbow_metric_raw %>%
    t() %>%
    as.data.frame() %>%
    mutate(rank=row_number()) %>%
    pivot_longer(-rank, names_to = "run_no", values_to = "error") %>%
    group_by(rank) %>%
    summarize(average = mean(error),
              N = n(),
              SE.low = average - (sd(error)/sqrt(N)),
              SE.high = average + (sd(error)/sqrt(N))
           )

# plot
error_average %>%
    ggplot(aes(x=rank, y=average), group=1) +
    geom_line(col='red') +
    geom_ribbon(aes(ymin = SE.low, ymax = SE.high), alpha = 0.1) +
    geom_vline(xintercept = tensor$rank, colour='darkblue') + # rank of interest
    theme_bw() +
    labs(y="Error", x="Rank")
../../_images/notebooks_ccc_R_QuickStart_60_0.png
[26]:
covid_data@metadata$tensor_res <- liana:::format_c2c_factors(tensor$factors)

h_ = 13
w_ = 15
options(repr.plot.height=h_, repr.plot.width=w_)

plot_c2c_overview(sce = covid_data, group_col = 'condition', sample_col = 'sample_new')
../../_images/notebooks_ccc_R_QuickStart_61_0.png

The figure of loadings per factor generated here can be interpreted by interconnecting all dimensions within a single factor. For example, if we take one factor, the corresponding cell-cell communication program occurs in each sample proportionally to that sample's loading. The program itself is then interpreted through the loadings of the ligand-receptor pairs, sender cells, and receiver cells: ligands in high-loading ligand-receptor pairs are sent predominantly by high-loading sender cells and interact with their cognate receptors on high-loading receiver cells.
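
For example, to see which ligand-receptor pairs drive a given program, we can sort the interaction loadings of that factor. A minimal sketch, assuming (as in the cells below) that get_c2c_factors returns an interactions data frame with an lr column and Factor.1-Factor.10 columns:

# top-5 ligand-receptor pairs for Factor 8, ranked by their interaction loadings
factors <- liana::get_c2c_factors(sce = covid_data,
                                  sample_col = "sample_new",
                                  group_col = "condition")
factors$interactions %>%
    dplyr::arrange(dplyr::desc(Factor.8)) %>%
    dplyr::select(lr, Factor.8) %>%
    head(5)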

Downstream Visualizations: Making sense of the factors

After running the decomposition, the results are stored in the factors attribute of the tensor object. This attribute behaves like a named list containing the loadings of the elements in every tensor dimension. Loadings delineate the importance of a particular element to a given factor. The keys are the names of the different dimensions.

[27]:
names(tensor$factors)
  1. 'contexts'
  2. 'interactions'
  3. 'senders'
  4. 'receivers'

We can inspect the loadings of the samples, for example, located under the key 'contexts':

[28]:
tensor$factors[['contexts']]
A data.frame: 12 × 10
      Factor 1    Factor 2      Factor 3   Factor 4    Factor 5   Factor 6    Factor 7   Factor 8    Factor 9    Factor 10
HC1   0.57167751  0.0040322351  0.2486676  0.06951163  0.1992737  0.09138259  0.2109948  0.09901118  0.08743461  0.10624311
HC2   0.61345816  0.0009374141  0.2422200  0.07904458  0.1662704  0.01086045  0.1696340  0.05779721  0.05216325  0.09475538
HC3   0.28084552  0.0640010983  0.3343133  0.38925144  0.4410195  0.25950903  0.2944204  0.21048699  0.06730253  0.13640665
M1    0.17811552  0.0243369564  0.2788751  0.54256505  0.2649768  0.51976120  0.3917096  0.31559205  0.29760960  0.25299737
M2    0.15766080  0.0367470123  0.2851860  0.50009239  0.2689672  0.50643295  0.4055026  0.30275965  0.29582947  0.22803977
M3    0.25531387  0.1038309112  0.2800323  0.24746110  0.1943755  0.35284221  0.3694525  0.23726170  0.15349986  0.11353971
S1    0.16227683  0.6068011522  0.2968922  0.16380185  0.2758042  0.09672241  0.1598114  0.26295593  0.31008309  0.16405572
S2    0.05848442  0.4525174201  0.2955828  0.27341583  0.3142951  0.25263739  0.2916600  0.35102302  0.42103592  0.36295748
S3    0.17109768  0.2214141041  0.3145112  0.29934934  0.3125799  0.12573658  0.2879774  0.29225123  0.16905822  0.30484933
S4    0.10837201  0.2505245507  0.3078077  0.05209281  0.3427818  0.32328588  0.2711394  0.40252998  0.43096238  0.47937286
S5    0.04310614  0.4705861211  0.2580555  0.16272153  0.2119863  0.20446612  0.2239244  0.35042289  0.43328699  0.40211794
S6    0.15356158  0.2773903906  0.3073083  0.10633880  0.3504498  0.19429857  0.2616548  0.36386305  0.33164164  0.43490598

Downstream Visualizations

We can use these loadings to compare the major sample groups (conditions) in a pairwise manner, with boxplots and statistical tests:

[29]:
sample_loadings <- liana::get_c2c_factors(covid_data,
                                          sample_col='sample_new',
                                          group_col='condition') %>%
                                          purrr::pluck("contexts") %>%
                                          pivot_longer(-c("condition", "context"),
                                                names_to = "factor",
                                                values_to = "loadings") %>%
                                          mutate(factor = factor(factor)) %>%
                                          mutate(factor = fct_relevel(factor, "Factor.10", after=9))

h_ = 10
w_ = 15
options(repr.plot.height=h_, repr.plot.width=w_)

# Do pairwise tests
pwc <- sample_loadings %>%
    group_by(factor) %>%
    rstatix::pairwise_t_test(
      loadings ~ condition,
      p.adjust.method = "fdr"
      )

# plot
pwc <- pwc %>% add_xy_position(x = "condition")
ggboxplot(sample_loadings, x = "condition", y = "loadings", add = "point", fill="condition") +
    facet_wrap(~factor, ncol = 4) +
    stat_pvalue_manual(pwc) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
../../_images/notebooks_ccc_R_QuickStart_69_0.png

Using the loadings for any dimension, we can also generate heatmaps for the elements with loadings above a certain threshold. Additionally, we can cluster these elements by the similarity of their loadings across all factors:

[30]:
h_ = 10
w_ = 15
options(repr.plot.height=h_, repr.plot.width=w_)

liana::plot_lr_heatmap(sce = covid_data,  n = 5)
../../_images/notebooks_ccc_R_QuickStart_71_0.png

Here, samples (patients) are clustered by the importance that each communication pattern (factor) has for them relative to the other samples. This captures combinations of related communication patterns that explain similarities and differences at a sample-specific resolution. In this case, the samples cluster almost perfectly by COVID-19 severity, with moderate cases being more similar to the controls.

Overall CCI potential: Heatmap and network visualizations of sender-receiver cell pairs

In addition, we can evaluate the overall interaction between sender-receiver cell pairs that is determinant for a given factor or program. We can do this with a heatmap where the X-axis represents the receiver cells and the Y-axis the sender cells. Here, the potential of interaction is calculated as the outer product between the loadings of the sender and receiver cell dimensions of a particular factor. To illustrate this, we chose Factor 8, but this can be repeated for every factor obtained in the decomposition.
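
For reference, the heatmap below simply displays this outer product. A minimal sketch of computing it by hand, assuming the senders and receivers data frames returned by get_c2c_factors hold the cell-type labels in their first column and the loadings in columns named Factor.1-Factor.10:

# sender x receiver interaction potential for Factor 8, computed as the outer
# product of the sender and receiver loadings
factors <- liana::get_c2c_factors(sce = covid_data,
                                  sample_col = "sample_new",
                                  group_col = "condition")
send <- textshape::column_to_rownames(as.data.frame(factors$senders))
recv <- textshape::column_to_rownames(as.data.frame(factors$receivers))
cci_potential <- outer(send[, "Factor.8"], recv[, "Factor.8"])
dimnames(cci_potential) <- list(sender = rownames(send), receiver = rownames(recv))
round(cci_potential, 3)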

[31]:
selected_factor <- 'Factor.8'
liana::plot_c2c_cells(sce = covid_data,
                   factor_of_int = selected_factor,
                   name = "Loadings \nProduct")
../../_images/notebooks_ccc_R_QuickStart_75_0.png

Similarly, an interaction network can be created for each factor by using the loading product between sender and receiver cells. First, we need to choose a threshold to indicate which pairs of cells are considered to be interacting. This can be done as shown in the extended tutorial.

[32]:
threshold = 0.075

Then, we can plot all networks we are interested in:

[33]:
# format for use with cell2cell
format_py_df <- function(df){
    # convert to a base data.frame, move the first column to rownames,
    # and drop the grouping (condition) column if present
    df <- df %>%
        as.data.frame() %>%
        textshape::column_to_rownames() %>%
        select(-contains("condition"))
    return(df)
}
factors <- liana::get_c2c_factors(sce = covid_data,
                                  sample_col = "sample_new",
                                  group_col = "condition")
py_factors <- unname(lapply(factors, function(df) format_py_df(df)))
py_factors <- reticulate::py_dict(keys = names(factors), values = py_factors)

c2c$plotting$ccc_networks_plot(py_factors,
                               included_factors=c('Factor.2', 'Factor.8', 'Factor.9'),
                               sender_label='senders',
                               receiver_label='receivers',
                               ccc_threshold=threshold, # Only important communication
                               nrows=as.integer(1),
                               filename=file.path(output_folder, 'network_plot_qs_r.png')
                              )
[[1]]
<Figure size 2400x800 with 3 Axes>

[[2]]
[[2]][[1]]
<Axes: title={'center': 'Factor.2'}>

[[2]][[2]]
<Axes: title={'center': 'Factor.8'}>

[[2]][[3]]
<Axes: title={'center': 'Factor.9'}>


network_plot_qs

Pathway Enrichment Analysis: Interpreting the context-driven communication

Classical Pathway Enrichment

As the number of inferred interactions increases, the interpretation of the inferred cell-cell communication networks becomes more challenging. To address this, we can perform pathway enrichment analysis to identify the general biological processes that are enriched in the inferred interactions. Here, we will perform classical gene set enrichment analysis (GSEA) with KEGG pathways.

For the pathway enrichment analysis with GSEA, we use ligand-receptor pairs instead of individual genes. Since KEGG gene sets are defined at the level of individual genes, we first need to generate ligand-receptor sets for each of its pathways:

[34]:
# Generate list with ligand-receptors pairs in DB
lr_pairs <- liana::select_resource('Consensus')[[1]] %>%
    select(ligand = source_genesymbol, receptor = target_genesymbol)
lr_list <- lr_pairs %>%
     unite('interaction', ligand, receptor, sep = '^') %>%
     pull(interaction)

# Specify the organism and pathway database to use for building the LR set
organism = "human"
pathwaydb = "KEGG"

# Generate ligand-receptor gene sets
lr_set <- c2c$external$generate_lr_geneset(lr_list = lr_list,
                                           complex_sep='_', # Separation symbol of the genes in the protein complex
                                           lr_sep='^', # Separation symbol between a ligand and a receptor complex
                                           organism=organism,
                                           pathwaydb=pathwaydb,
                                           readable_name=TRUE
                                           )

Next, we can perform enrichment analysis on each factor using the loadings of the ligand-receptor pairs to obtain the normalized-enrichment scores (NES) and corresponding P-values from GSEA:

[35]:
factors <- liana::get_c2c_factors(sce = covid_data,
                            sample_col = "sample_new",
                            group_col = "condition")

lr_loadings = factors[['interactions']] %>%
                column_to_rownames("lr")

gsea_res <- c2c$external$run_gsea(loadings=lr_loadings,
                               lr_set=lr_set,
                               output_folder=output_folder,
                               weight=as.integer(1),
                               min_size=as.integer(15),
                               permutations=as.integer(999),
                               processes=as.integer(6),
                               random_state=as.integer(6),
                               significance_threshold=as.numeric(0.05)
                              )
names(gsea_res)<-c('pvals', 'scores', 'gsea_df')

The enriched pathways for each factor are:

[36]:
gsea_df <- gsea_res$gsea_df
gsea_df %>%
    filter(`Adj. P-value` < 0.05 & NES > 0) %>%
    arrange(desc(abs(NES)))
A data.frame: 14 × 5
      Factor     Term                                     NES       P-value      Adj. P-value
52    Factor.5   CHEMOKINE SIGNALING PATHWAY              1.546406  0.001000000  0.01444444
39    Factor.4   ANTIGEN PROCESSING AND PRESENTATION      1.479729  0.001000000  0.01444444
117   Factor.10  ANTIGEN PROCESSING AND PRESENTATION      1.364777  0.001000000  0.01444444
78    Factor.7   ANTIGEN PROCESSING AND PRESENTATION      1.361638  0.001000000  0.01444444
13    Factor.2   ANTIGEN PROCESSING AND PRESENTATION      1.356078  0.001000000  0.01444444
0     Factor.1   ANTIGEN PROCESSING AND PRESENTATION      1.345314  0.001000000  0.01444444
53    Factor.5   CYTOKINE CYTOKINE RECEPTOR INTERACTION   1.319703  0.001000000  0.01444444
40    Factor.4   CELL ADHESION MOLECULES CAMS             1.317278  0.001000000  0.01444444
91    Factor.8   CHEMOKINE SIGNALING PATHWAY              1.265677  0.005005005  0.04647505
14    Factor.2   CHEMOKINE SIGNALING PATHWAY              1.257363  0.003003003  0.03549004
26    Factor.3   ANTIGEN PROCESSING AND PRESENTATION      1.229543  0.004004004  0.04337671
104   Factor.9   ANTIGEN PROCESSING AND PRESENTATION      1.225197  0.005005005  0.04647505
1     Factor.1   CELL ADHESION MOLECULES CAMS             1.220750  0.001000000  0.01444444
16    Factor.2   CYTOKINE CYTOKINE RECEPTOR INTERACTION   1.170583  0.002002002  0.02602603

The depleted pathways are:

[37]:
gsea_df %>%
    filter(`Adj. P-value` < 0.05 & NES < 0) %>%
    arrange(desc(abs(NES)))
A data.frame: 0 × 5
Factor   Term    NES     P-value   Adj. P-value
<chr>    <chr>   <dbl>   <dbl>     <dbl>

Finally, we can visualize the enrichment results using a dotplot:

[38]:
gsea.dotplot <- function(pval_df, score_df, significance = 0.05, font_size = 15){
    # keep only rows/columns with at least one significant p-value, then -log10-transform
    pval_df <- pval_df[apply(pval_df < significance, 1, any), apply(pval_df < significance, 2, any)] %>%
            mutate_all(function(x) -1*log10(x + 1e-9))
    score_df <- score_df[rownames(pval_df), colnames(pval_df)]
    pval_df <- pval_df %>%
                    tibble::rownames_to_column('Annotation')
    score_df <- score_df %>%
                    tibble::rownames_to_column('Annotation')

    # combine significance and NES into one long dataframe for plotting
    viz_df <- cbind(melt(pval_df), melt(score_df)[[3]])
    names(viz_df) <- c('Annotation', 'Factor', 'Significance', 'NES')

    dotplot <- ggplot(viz_df, aes(x = Factor, y = Annotation)) +
                geom_point(aes(size = Significance, color = NES)) +
                scale_color_gradient2() +
                theme_bw(base_size = font_size) +
                theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5)) +
                scale_size(range = c(0, 8)) +
                guides(size = guide_legend(title = "-log10(p-value)"))
    return(dotplot)
}

[39]:
h_ = 8
w_ = 10
options(repr.plot.height=h_, repr.plot.width=w_)

dotplot <- gsea.dotplot(pval_df = gsea_res$pvals,
                       score_df = gsea_res$scores,
                       significance = 0.05)
dotplot
Using Annotation as id variables

Using Annotation as id variables

../../_images/notebooks_ccc_R_QuickStart_93_1.png

Footprint Enrichment

Footprint enrichment analysis builds upon classical gene set enrichment analysis: instead of considering the genes involved in a biological activity, it considers the genes affected by that activity, in other words the genes that change downstream of it (Dugourd and Saez-Rodriguez, 2019).

In this case, we will use the PROGENy pathway resource to perform footprint enrichment analysis. PROGENy was built in a data-driven manner using perturbation and cancer-lineage data (Schubert et al, 2019); as a consequence, it also assigns different importances, or weights, to each gene in its pathway gene sets. We therefore need an enrichment method that can take these weights into account, and here we will use univariate linear models, as implemented in the decoupleR package (Badia-i-Mompel et al., 2022).
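
To build intuition for what this weighted enrichment does, here is a self-contained toy sketch (using made-up numbers, not the real data) of a univariate linear model activity score: the loadings are regressed on the set's weights, and the t-value of the slope summarizes whether high-weight members tend to have high loadings.

# Toy illustration only: the idea behind a univariate linear model (ulm) enrichment score
set.seed(42)
loadings <- runif(200)                # pretend LR-pair loadings for one factor
weights  <- numeric(200)              # weight 0 for LR pairs outside the gene set
weights[1:20] <- rnorm(20, mean = 1)  # weights of the 20 LR pairs inside the set
fit <- lm(loadings ~ weights)
summary(fit)$coefficients["weights", "t value"]  # toy activity score for this set and factor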

Let’s load the PROGENy genesets and then convert them to sets of weighted ligand-receptor pairs:

[40]:
# obtain progeny gene sets
net <- decoupleR::get_progeny(organism = 'human', top=5000) %>%
    dplyr::select(-p_value)

# convert to LR sets
progeny_lr <- liana::generate_lr_geneset(sce = covid_data,
                                          resource = net)

Next, we can run the footprint enrichment analysis:

[41]:
# interaction loadings to matrix
mat <- factors$interactions %>%
    textshape::column_to_rownames("lr") %>%
  as.matrix()

# run enrichment analysis with decoupler
# (we fit a univariate linear model for each gene set)
# We don't consider genesets with minsize < 10
res <- decoupleR::run_ulm(mat = mat,
                          network = progeny_lr,
                          .source = "set",
                          .target = "lr",
                          minsize=10) %>%
  mutate(p_adj = p.adjust(p_value, method = "fdr"))

Finally, we can visualize the results:

[42]:
res %>% # add a significant / not-significant flag for plotting
  mutate(significant = if_else(p_adj <= 0.05, "signif.", "not")) %>%
  ggplot(aes(x=source, y=condition, shape=significant,
             colour=score, size=-log10(p_value+1e-36))) +
  geom_point() +
  scale_colour_gradient2(high = "red", low="blue") +
  scale_size_continuous(range = c(3, 12)) +
  scale_shape_manual(values=c(21, 16)) +
  theme_bw(base_size = 15) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x="Pathway",
       y="Factor",
       colour="Activity"
       )
../../_images/notebooks_ccc_R_QuickStart_100_0.png

Let's zoom in on Factor 8, which is associated with COVID-19 cases.

[43]:
selected_factor = 'Factor.8'
sample_loadings <- liana::get_c2c_factors(covid_data,
                                          sample_col='sample_new',
                                          group_col='condition') %>%
                                          purrr::pluck("contexts") %>%
                                          pivot_longer(-c("condition", "context"),
                                                names_to = "factor",
                                                values_to = "loadings") %>%
                                          mutate(factor = factor(factor)) %>%
                                          mutate(factor = forcats::fct_relevel(factor, "Factor.10", after=9))

res.factor <- res %>%
                filter(condition == selected_factor) %>%
                mutate(p_adj = p.adjust(p_value, method = "fdr")) %>%
                arrange(desc(score)) %>%
                mutate(source = factor(source, levels = source))

ggplot(data = res.factor, aes(x = score, y = source, fill = score)) +
    geom_bar(stat = "identity", width = 0.5) +
    theme_bw() +
    scale_fill_gradient2(trans = 'reverse') +
    ylab('') +
    xlab('Activity')
../../_images/notebooks_ccc_R_QuickStart_102_0.png