S1. Batch Correction

Batch correction removes technical variation while preserving biological variation between samples. It is a processing step that can be applied to the dataset after normalization (see Tutorial 01 for the processing steps up to normalization). An additional requirement for expression data to be compatible with CCC inference and Tensor-cell2cell is that the values be non-negative.
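
To make the non-negativity requirement concrete, here is a minimal base R sketch on a toy matrix. The helper `is_nonnegative()` and the values in `expr` are hypothetical and introduced only for illustration; they are not part of this tutorial's dataset or of any package. The sketch shows how a gene-wise centering step, as used in z-scoring, produces negative values that would violate the requirement.

```r
# Toy log-normalized matrix (genes x cells); values are made up for illustration
expr <- matrix(c(0.0, 1.2, 2.5,
                 3.1, 0.4, 0.0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3")))

# Hypothetical helper (not from any package): TRUE if no value is negative or missing
is_nonnegative <- function(mat) {
    vals <- as.matrix(mat)
    !anyNA(vals) && all(vals >= 0)
}

# Centering each gene (as z-scoring does) pushes below-average values under zero
centered <- t(scale(t(expr), center = TRUE, scale = FALSE))

is_nonnegative(expr)      # the original matrix passes the check
is_nonnegative(centered)  # the centered matrix does not
```

A check of this kind can be run on any corrected matrix before it is passed to a CCC scoring step.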

Here, we demonstrate batch correction with scVI, a method that yields non-negative corrected counts and has also performed well in benchmarks.

[1]:
suppressMessages({
    library(SingleCellExperiment, quietly = TRUE)
    library(zellkonverter, quietly = TRUE) # SCE <-> AnnData conversion

    library(reticulate, quietly = TRUE)
    # import the Python scvi module; keep results as Python objects
    scvi <- import("scvi", convert = FALSE)
})
[2]:
data.path <- file.path('..', '..', 'data')

# fix the scVI seed for reproducibility
seed <- 888
scvi$settings$seed <- as.integer(seed)

use_gpu <- TRUE
if (!use_gpu) {
    # on CPU, restrict scVI to a single thread and data-loader worker
    n.cores <- 1
    scvi$settings$num_threads <- 1L
    scvi$settings$dl_num_workers <- as.integer(n.cores)
}

Load and prepare the data:

[3]:
covid_data <- readRDS(file.path(data.path, 'covid_balf_norm.rds'))
adata.batch <- zellkonverter::SCE2AnnData(covid_data,
                                          X_name = 'logcounts' # use logcounts as X; the raw UMI counts assay is then stored in the "counts" layer
                                         )

According to the scVI tutorial, scVI models expect raw UMI counts rather than log-normalized counts as input. Keep in mind that this is tool-specific: the version of the expression data you provide will depend on the batch correction method you use.
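
Because the expected input differs between tools, it can help to confirm that a matrix actually looks like raw counts before passing it to a count-based model. The following is a minimal base R sketch under that assumption. The helper `looks_like_raw_counts()` and the toy matrices are hypothetical and not part of scVI, zellkonverter, or this tutorial's data.

```r
# Hypothetical helper: TRUE when every value is a non-negative whole number
looks_like_raw_counts <- function(mat) {
    vals <- as.matrix(mat)
    !anyNA(vals) && all(vals >= 0) && all(vals == round(vals))
}

# Toy matrices (genes x cells), made up for illustration
raw.toy <- matrix(c(0, 3, 12, 1), nrow = 2)
log.toy <- log1p(raw.toy) # log-transformed values are no longer whole numbers

looks_like_raw_counts(raw.toy) # raw UMI-like counts pass
looks_like_raw_counts(log.toy) # log-normalized values do not
```

This is only a heuristic: it flags obviously transformed data but cannot prove that whole-number values are unprocessed UMI counts.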

In the cell above, the raw counts were stored in the "counts" layer. Let’s set up the batch correction model:

[4]:
reticulate::py_set_seed(as.integer(seed))
raw_counts_layer <- 'counts' # raw UMI counts layer
batch_col <- 'sample' # metadata column specifying batches
# register the counts layer and batch labels with scVI
scvi$model$SCVI$setup_anndata(adata.batch, layer = raw_counts_layer, batch_key = batch_col)
# negative binomial ("nb") gene likelihood
model <- scvi$model$SCVI(adata.batch, n_layers = 2L, n_latent = 30L, gene_likelihood = "nb")

Now, we can run the batch correction and transform the output similarly to how we did the normalization. Note that this differs from the latent representation typically taken from scVI. Many batch correction tools output a reduced latent space rather than per-gene values. For CCC analysis, however, we need tools that output corrected counts for each gene in order to score the communication between ligands and receptors.

[5]:
model$train()
# request per-gene corrected expression conditioned on all batches, scaled to 1e4 per cell
corrected.data <- model$get_normalized_expression(transform_batch = sort(unique(colData(covid_data)[[batch_col]])),
                                                  library_size = 1e4)
# convert to R, log1p-transform, and transpose to genes x cells
corrected.data <- t(log1p(reticulate::py_to_r(corrected.data)))
assays(covid_data)[['batch.corrected.counts']] <- corrected.data # store the corrected data in the SCE object
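
The scaling and log transform applied to the corrected output above can be illustrated on a toy matrix. This is a minimal base R sketch, assuming the model output behaves like per-cell expression proportions that are scaled to a common library size. The matrix `props` is made up for illustration and does not come from scVI.

```r
# Toy per-cell expression proportions (genes x cells); each column sums to 1.
# Values are made up for illustration and do not come from scVI.
props <- matrix(c(0.50, 0.10,
                  0.30, 0.20,
                  0.20, 0.70),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))

library.size <- 1e4            # same target depth as in the tutorial cell
scaled <- props * library.size # every cell now has the same total
log.expr <- log1p(scaled)      # log(1 + x) maps zero to zero

colSums(scaled)                    # each column equals library.size
all(log.expr >= 0)                 # log1p of non-negative input is non-negative
all.equal(expm1(log.expr), scaled) # expm1() undoes the transform
```

Because `log1p()` never returns a negative value for non-negative input, the transformed matrix still satisfies the non-negativity requirement for CCC inference.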