1. Preprocessing expression data

This tutorial demonstrate how to pre-process single-cell raw UMI counts to generate expression matrices that can be used as input to cell-cell communication tools. We recommend the quality control chapter in the Single-cell Best Practices book as a starting point for a detailed overview of QC and single-cell RNAseq analysis pipelines in general.

We demonstrate a typical workflow using the popular single-cell analysis software scanpy to generate an AnnData object which can be used downstream. For these tutorials, we will use a lung dataset of 63k immune and epithelial cells across three control, three moderate, and six severe COVID-19 patients.

Details and caveats regarding batch correction, which removes technical variation while preserving biological variation between samples, can be viewed in the additional examples tutorial entitled “S1_Batch_Correction”.

Preparare the object for cell-cell communication analysis

Import the required packages

[1]:
import os

import scanpy as sc
import pandas as pd
import numpy as np

import cell2cell as c2c
[2]:
import warnings
warnings.filterwarnings('ignore')
[3]:
data_path = '../../data/'
c2c.io.directories.create_directory(data_path)
../../data/ already exists.

Loading

The 12 samples can be downloaded as .h5 files from here. You can also download the cell metadata from here

Alternatively, cell2cell has a helper function to load the data as an AnnData object:

[4]:
adata = c2c.datasets.balf_covid(os.path.join(data_path, 'BALF-COVID19-Liao_et_al-NatMed-2020.h5ad'))
adata.obs.head()
[4]:
sample sample_new group disease hasnCoV cluster celltype condition
AAACCCACAGCTACAT_3 C100 HC3 HC N N 27.0 B Control
AAACCCATCCACGGGT_3 C100 HC3 HC N N 23.0 Macrophages Control
AAACCCATCCCATTCG_3 C100 HC3 HC N N 6.0 T Control
AAACGAACAAACAGGC_3 C100 HC3 HC N N 10.0 Macrophages Control
AAACGAAGTCGCACAC_3 C100 HC3 HC N N 10.0 Macrophages Control
[5]:
adata.obs.condition.unique()
[5]:
['Control', 'Moderate COVID-19', 'Severe COVID-19']
Categories (3, object): ['Control', 'Moderate COVID-19', 'Severe COVID-19']

Calculate Basic QC metrics

We will calculate some basic QC metrics associated with the library size that will be used for filtering:

[6]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata,
                           qc_vars=['mt'],
                           percent_top=None,
                           log1p=False,
                           inplace=True)

Quality-Control Filtering

Exclude cells that visually do not fall within the normal range of standard QC metrics (see chapter) – fraction of genes in a cell that are mitochondrial, number of unique genes, and potential doublets.

Note that since this is a pre-processed dataset, these steps are largely to demonstrate the workflow.

[7]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
[8]:
sc.pl.violin(adata, ['n_genes', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
../../_images/notebooks_ccc_python_01-Preprocess-Expression_15_0.png
[9]:
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes')
../../_images/notebooks_ccc_python_01-Preprocess-Expression_16_0.png
../../_images/notebooks_ccc_python_01-Preprocess-Expression_16_1.png

We can remove any droplets with high mitochondrial content or those that potentially represent doublets:

[10]:
adata = adata[adata.obs.pct_counts_mt < 15, :]
adata = adata[adata.obs.n_genes < 5500, :]

Note that in this case we filter out cells with high total counts, but a best practice approach would be to instead filter out doublets. As using different doublet detection techniques in R and Python introduces some differences between the workflows, we will not do this here.

Nevertheless, the code for doublet detection is provided below:

sc.external.pp.scrublet(adata, batch_key="sample_new", verbose=False)
adata = adata[~adata.obs['predicted_doublet']]

Normalize

For single-cell inference across sample and across cell types, most CCC tools require the library sizes to be comparable. We can use the scanpy function sc.pp.normalize_total to normalize the library sizes. This function divides each cell by the total counts per cell and multiplies by the median of the total counts per cell. Furthermore, we log1p-transform the data to make it more Gaussian-like, as this is a common assumption for the analyses downstream. Finally, such a normalization maintains non-negative counts, which is important for tensor decomposition.

[11]:
# save the raw counts to a layer
adata.layers["counts"] = adata.X.copy()

# Normalize the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Dimensionality Reduction

While dimensionality reduction is not used directly in these tutorials, a number of these steps are necessary to obtain cell group labels when processing your own data. Steps such as filtering for highly variable genes (HVGs) and scaling the data improve dimensionality reudction and clustering results. However, for CCC, we recommend using the entire gene expression matrix unscaled (either raw or library- and log-normalized). Scaling introduces negative counts, which poses challenges for CCC inference or tensor decompsoition. Filtering for HVGs reduces the list of potential ligand-receptor pairs that can be tested for interactions.

This tutorial diverges with its companion tutorial in R here due to minor algorithmic differences, but since the inputs will be the full expression matrices above, these discrepancies will not affect downstream tutorials.

Note that these are not a necessary steps for CCC analysis, but are common steps in single-cell analysis pipelines. We thus include them for demonstration purposes and to illustrate the data.

[12]:
sc.pp.highly_variable_genes(adata, flavor = 'seurat', n_top_genes = 2000, batch_key="sample")
[13]:
sc.tl.pca(adata, svd_solver='arpack', use_highly_variable = True)

sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 20) # visualize PCA variance explained
../../_images/notebooks_ccc_python_01-Preprocess-Expression_24_0.png
[14]:
sc.pp.neighbors(adata, n_neighbors=5, n_pcs=20)
sc.tl.umap(adata)

# plot pre-annotated cell types
sc.pl.umap(adata, color=['celltype'])
../../_images/notebooks_ccc_python_01-Preprocess-Expression_25_0.png
[15]:
adata
[15]:
AnnData object with n_obs × n_vars = 62551 × 24798
    obs: 'sample', 'sample_new', 'group', 'disease', 'hasnCoV', 'cluster', 'celltype', 'condition', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'celltype_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'distances', 'connectivities'

Save the data for downsteam analysis

[16]:
adata.write_h5ad(os.path.join(data_path, 'processed.h5ad'), compression='gzip')