S1. Batch Correction

Batch correction removes technical variation while preserving biological variation between samples. This processing step can be applied to our dataset after normalization (see Tutorial 01 for processing steps up to normalization). An additional consideration for expression data to be compatible with CCC inference and Tensor-cell2cell is that the corrected counts should be non-negative.

Here, we demonstrate batch correction with scVI, a method that yields non-negative corrected counts and has also been benchmarked to perform well.
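
Whichever tool you use, a quick way to screen a corrected expression matrix for this requirement is to check its minimum value. Below is a minimal sketch (the helper min_value is our own illustration, not part of any library):

import numpy as np
import scipy.sparse as sp

def min_value(X):
    """Minimum entry of a dense or scipy-sparse expression matrix."""
    return X.min() if sp.issparse(X) else np.min(np.asarray(X))

# A corrected matrix is compatible with CCC scoring only if min_value(X) >= 0.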

[1]:
import os
import scanpy as sc
import scvi
import numpy as np
/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
  self.seed = seed
/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
  self.dl_pin_memory_gpu_training = (
[2]:
data_path = '../../data/'

seed = 888
scvi.settings.seed = seed


use_gpu = True
if not use_gpu:  # restrict scvi-tools to a single CPU core
    n_cores = 1
    scvi.settings.num_threads = n_cores
    scvi.settings.dl_num_workers = n_cores
Global seed set to 888

Load the data:

[3]:
adata = sc.read_h5ad(os.path.join(data_path, 'processed.h5ad'))

According to the scVI tutorial, scVI models expect raw UMI counts rather than log-normalized counts as input. Keep in mind that this requirement is tool-specific; the version of the expression dataset you input will depend on the batch correction method you employ.
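
Before setting up the model, it is worth confirming that the input layer really holds raw, non-negative integer counts. A minimal sketch (assuming the raw counts live in adata.layers['counts'], as arranged in Tutorial 01):

import scipy.sparse as sp

counts = adata.layers['counts']
vals = counts.data if sp.issparse(counts) else np.asarray(counts)
assert vals.min() >= 0, 'counts layer contains negative values'
assert np.allclose(vals, np.round(vals)), 'counts layer does not look like raw integer UMI counts'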

The raw counts were stored in the "counts" layer during preprocessing in Tutorial 01. Let’s set up the batch correction model:

[4]:
raw_counts_layer = 'counts' # layer containing raw UMI counts
batch_col = 'sample' # metadata column specifying batches
scvi.model.SCVI.setup_anndata(adata, layer=raw_counts_layer, batch_key=batch_col)
model = scvi.model.SCVI(adata, n_layers=2, n_latent=30, gene_likelihood="nb")
/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
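
Optionally, you can double-check how scVI registered the data before training (a quick sketch; view_anndata_setup is available in recent scvi-tools releases):

model.view_anndata_setup()  # summary of the registered layer, batch key, and covariates
print(f"{adata.obs[batch_col].nunique()} batches registered")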

Now, we can run the batch correction and transform the output similarly to how we transformed the normalized counts. Note that this differs slightly from the typical scVI output: many batch correction tools return corrected counts only in a reduced latent space. For CCC analysis, however, we need tools that output corrected counts for each gene, so that communication can be scored between ligands and receptors.

[5]:
model.train()

# Decode batch-corrected expression for every gene, averaging over all batches,
# then log1p-transform to mirror the normalization step.
corrected_data = model.get_normalized_expression(transform_batch=sorted(adata.obs[batch_col].unique()),
                                                 library_size=1e4)
corrected_data.iloc[:, :] = np.log1p(corrected_data.values)
adata.layers['batch_corrected_counts'] = corrected_data
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 1/128:   0%|                                      | 0/128 [00:00<?, ?it/s]
Epoch 128/128: 100%|█| 128/128 [11:11<00:00,  5.25s/it, v_num=1, train_loss_step
`Trainer.fit` stopped: `max_epochs=128` reached.
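
As an optional sanity check, we can visualize how well batches mix after correction and confirm the corrected counts meet the non-negativity requirement before saving. A minimal sketch (the output filename is illustrative, not fixed by the protocol):

# Quick visual check of batch mixing using scVI's latent embedding; downstream
# CCC scoring still uses the per-gene corrected counts computed above.
adata.obsm['X_scVI'] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep='X_scVI')
sc.tl.umap(adata)
sc.pl.umap(adata, color=batch_col)

# Corrected counts must be non-negative for CCC inference and Tensor-cell2cell.
assert np.asarray(adata.layers['batch_corrected_counts']).min() >= 0

adata.write_h5ad(os.path.join(data_path, 'batch_corrected.h5ad'))  # illustrative filename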