6. Pathway Enrichment of LR loadings

Cell-cell communication inference from single-cell data provides unprecedented insights into the molecular mechanisms underlying cell-cell communication. In particular, it enables the capture of CCC interactions at a systems level, which is not possible with traditional approaches. However, as the number of inferred interactions increases, the interpretation of the inferred cell-cell communication networks becomes more challenging. To this end, as done in other omics studies, we can perform pathway enrichment analysis to identify the general biological processes that are enriched in the inferred interactions. Pathway enrichment thus serves two purposes; it reduces the dimensionality of the inferred interactions, and also provides a biological summary of the inferred interactions.

In this tutorial, we will show how to use the cell2cell and liana packages to perform classical gene set enrichment analysis using GSEA with KEGG Pathways, and also we will use perform a footprint enrichment analysis using a multi-variate linear model with the PROGENy pathway resource. We will use the `GSEAPY <https://gseapy.readthedocs.io/en/latest/>`__ and `decoupler-py <https://decoupler-py.readthedocs.io/en/latest>`__ packages to perform these analyses.

[1]:

import cell2cell as c2c
import liana as li

import pandas as pd

import decoupler as dc

import matplotlib.pyplot as plt
import plotnine as p9
import seaborn as sns

%matplotlib inline

[2]:

import logging
logging.disable(logging.WARNING) # This is to avoid too many logs because of GSEApy

import warnings
warnings.filterwarnings('ignore')

Directories

[3]:

output_folder = '../../data/tc2c-outputs/'
c2c.io.directories.create_directory(output_folder)

../../data/tc2c-outputs/ already exists.

Load Data

Open the loadings obtained from the tensor factorization

[4]:

factors = c2c.io.load_tensor_factors(output_folder + 'Loadings.xlsx')

Load list of ligand-receptor pairs used as reference torun LIANA

[5]:

lr_pairs = li.resource.select_resource('consensus')
lr_pairs.head()

[5]:

	ligand	receptor
0	LGALS9	PTPRC
1	LGALS9	MET
2	LGALS9	CD44
3	LGALS9	LRP1
4	LGALS9	CD47

Generate a list with the LR pair names

KEGG was originally designed to work with list of genes. Here we are using ligand-receptor pairs, so instead of using Gene Sets we need to build our LR-gene sets. This list is a first step to build the LR-gene sets.

[6]:

lr_list = ['^'.join(row) for idx, row in lr_pairs.iterrows()]

Gene Set Enrichment Analysis

In this case, we will run the PreRank method of GSEA since our LR pairs are pre-ranked based on their loadings in each factor obtained by Tensor-cell2cell.

Generate LR-gene set for running GSEA

cell2cell includes some resources associated with the Gene sets employed in GSEA. It includes annotations of GO Terms (BP), KEGG, and Reactome, for human and mouse.

First, specify which organism and annotation DB to use

[7]:

organism = 'human' # For the COVID-19 data analyzed with LIANA+Tensor-cell2cell
pathwaydb = 'KEGG' # KEGG Pathways

Generate the LR-gene set that will be used for running GSEA

[8]:

lr_set = c2c.external.generate_lr_geneset(lr_list,
                                          complex_sep='_',
                                          lr_sep='^',
                                          organism=organism,
                                          pathwaydb=pathwaydb,
                                          readable_name=True,
                                          output_folder=output_folder
                                          )

Extract LR loadings for each factor

[9]:

lr_loadings = factors['Ligand-Receptor Pairs']

Run GSEA

This external function implemented in cell2cell builds upon the function gseapy.prerank().

All outputs will be saved in the output_folder. Make sure that this path exist before running this function.

[10]:

%%time
pvals, scores, gsea_df = c2c.external.run_gsea(loadings=lr_loadings,
                                               lr_set=lr_set,
                                               output_folder=output_folder,
                                               weight=1,
                                               min_size=15,
                                               permutations=999,
                                               processes=6,
                                               random_state=6,
                                               significance_threshold=0.05,
                                              )

../../data/tc2c-outputs/ already exists.

100%|███████████████████████████████████████████| 10/10 [00:08<00:00,  1.11it/s]

CPU times: user 9.86 s, sys: 132 ms, total: 9.99 s
Wall time: 9.05 s

Visualization

We can use the pvals and scoresoutputs to visualize the results as a Dot Plot:

[11]:

with sns.axes_style("darkgrid"):
    dotplot = c2c.plotting.pval_plot.generate_dot_plot(pval_df=pvals,
                                                       score_df=scores,
                                                       significance=0.05,
                                                       xlabel='',
                                                       ylabel='{} Annotations'.format(pathwaydb),
                                                       cbar_title='NES',
                                                       cmap='PuOr',
                                                       figsize=(5, 12),
                                                       label_size=20,
                                                       title_size=20,
                                                       tick_size=12,
                                                       filename=output_folder + '/GSEA-Dotplot.pdf'
                                                      )

../../_images/notebooks_ccc_python_06-Functional-Footprinting_23_0.png

Similarly, we can use gsea_df to easily see the Enriched Pathways

[12]:

gsea_df.loc[(gsea_df['Adj. P-value'] < 0.05) & (gsea_df['NES'] > 0.)]

[12]:

	Factor	Term	NES	P-value	Adj. P-value
13	Factor 2	ANTIGEN PROCESSING AND PRESENTATION	1.622506	0.001000	0.016266
14	Factor 2	CELL ADHESION MOLECULES CAMS	1.428369	0.001000	0.016266
26	Factor 3	ANTIGEN PROCESSING AND PRESENTATION	1.370556	0.001001	0.016266
39	Factor 4	ANTIGEN PROCESSING AND PRESENTATION	1.556417	0.001000	0.016266
40	Factor 4	NATURAL KILLER CELL MEDIATED CYTOTOXICITY	1.374844	0.002002	0.026026
42	Factor 4	CELL ADHESION MOLECULES CAMS	1.220381	0.002002	0.026026
54	Factor 5	ANTIGEN PROCESSING AND PRESENTATION	1.320129	0.001000	0.016266
55	Factor 5	CELL ADHESION MOLECULES CAMS	1.277904	0.001000	0.016266
91	Factor 8	ANTIGEN PROCESSING AND PRESENTATION	1.424629	0.001000	0.016266
93	Factor 8	CELL ADHESION MOLECULES CAMS	1.243899	0.003003	0.035490

And Depleted Pathways

[13]:

gsea_df.loc[(gsea_df['Adj. P-value'] < 0.05) & (gsea_df['NES'] < 0.)]

[13]:

	Factor	Term	NES	P-value	Adj. P-value
41	Factor 4	APOPTOSIS	-1.260816	0.001	0.016266

Footprint Enrichment

Footprint enrichment analysis build upon classic geneset enrichment analysis, as instead of considering the genes involved in a biological activity, they consider the genes affected by the activity, or in other words the genes that change downstream of said activity (Dugourd and Saez-Rodriguez, 2019). This is particularly useful when working with transcriptomics data, as it allows to identify the genes that are regulated by a biological process, instead of the genes that are part of the process. In this case, we will use the PROGENy pathway resource to perform footprint enrichment analysis. PROGENy was built in a data-driven manner using perturbation and cancer lineage data (Schubert et al, 2019), as a consequence it also assigns different importances or weights to each gene in its pathway genesets.

To this end, we need an enrichment method that can take weights into account, and here we will use multi-variate linear models from the decoupler-py package to perform this analysis (Badia-i-Mompel et al., 2022).

load the PROGENy pathway resource:

[14]:

net = dc.get_progeny(organism='human', top=5000)

Now we will use the PROGENy genesets to assign pathways and weights to the ligand-receptor pairs in liana’s consensus resource. Specifically, we will use the weighted bipartite networks from PROGENy, where the weight represents the importance of the genes to a given geneset, and assign a weight to each ligand-receptor interaction, based on the mean of the weights of the proteins in that ligand-receptor interaction. We keep only ligand-receptor weights only if all the protein in the ligand-receptor interaction are present for a given pathway, and are sign-coherent.

[15]:

lr_progeny = li.rs.generate_lr_geneset(lr_pairs, net, lr_sep="^")
lr_progeny.head()

[15]:

	source	interaction	weight
1	JAK-STAT	LGALS9^PTPRC	1.307807
74	Androgen	SEMA4D^MET	-0.831693
90	Androgen	HGF^MET	-1.288956
106	Androgen	TIMP3^MET	-1.122612
151	NFkB	SELE^CD44	3.332552

Run footprint enrichment analysis

We will now use the decoupler-py package to perform the analysis.

[16]:

estimate, pvals =  dc.run_mlm(lr_loadings.transpose(),
                              lr_progeny,
                              source="source",
                              target="interaction",
                              use_raw=False)

Visualize Results

[17]:

sns.clustermap(estimate, xticklabels=estimate.columns, cmap='coolwarm', z_score=4)
plt.savefig(output_folder + '/PROGENy.pdf', dpi=300, bbox_inches='tight')

../../_images/notebooks_ccc_python_06-Functional-Footprinting_38_0.png

Let’s zoom in on Factor 5, which is highly correlated with COVID-19 severity.

[18]:

selected_factor = 'Factor 5'

dc.plot_barplot(estimate,
                selected_factor,
                vertical=True,
                cmap='coolwarm',
                save=output_folder + '/PROGENy-{}.pdf'.format(selected_factor.replace(' ', '-')))

../../_images/notebooks_ccc_python_06-Functional-Footprinting_40_0.png

From the plots above, we can see that the JAK-STAT pathway is active and highly active in Factor 5. Let’s see which ligand-receptor pairs are enriched in this pathway.

[19]:

pathway = 'JAK-STAT'
# loadings to long format
lr_loadings_long = lr_loadings.reset_index().melt(id_vars='index', var_name="Factor", value_name="Loadings").rename(columns={'index':'interaction'})
# join progeny weights and keep only NFkB pathway
lr_loadings_long = lr_loadings_long.merge(lr_progeny, on='interaction').query("source == '{}' and Factor == '{}'".format(pathway, selected_factor))
# add sign to the weights
lr_loadings_long['sign'] = lr_loadings_long['weight'].apply(lambda x: 'positive' if x > 0 else 'negative')
# keep only relevant interactions for labels
lr_loadings_long['relevant_interactions'] = lr_loadings_long.apply(lambda x: x['interaction'] if (x['Loadings'] > 0.025) and (x['weight'] > 2.5) else None, axis=1)
lr_loadings_long.head()

[19]:

	interaction	Factor	Loadings	source	weight	sign	relevant_interactions
818	B2M^LILRB2	Factor 5	1.086240e-01	JAK-STAT	1.428233	positive	None
1024	CCL18^CCR1	Factor 5	2.542308e-03	JAK-STAT	1.096684	positive	None
1058	CCL2^CCR1	Factor 5	7.912275e-02	JAK-STAT	2.103921	positive	None
1138	CCL7^CCR1	Factor 5	2.964358e-02	JAK-STAT	1.145112	positive	None
1184	CCL7^CXCR3	Factor 5	7.615716e-11	JAK-STAT	0.682378	positive	None

Plot the loadings for Factor 5 and weights of the JAK-STAT pathway

[20]:

(p9.ggplot(lr_loadings_long,
           p9.aes(x='weight', y='Loadings')) +
 p9.geom_smooth(method='lm') +
 p9.geom_point(p9.aes(colour='sign')) +
 p9.theme_bw() +
 p9.theme(legend_position='none') +
 p9.labs(title="{} | {}".format(pathway, selected_factor), x="PROGENy Weights", y="Loadings") +
 p9.scale_colour_manual(values=["royalblue", "red"]) +
 p9.geom_label(p9.aes(label='relevant_interactions'), size=8, nudge_y=0.01, nudge_x=0.01) +
 p9.xlim(-1, 8.5)
 )

../../_images/notebooks_ccc_python_06-Functional-Footprinting_44_0.png

[20]:

<Figure Size: (640 x 480)>

Here, we can see that for example the CCL8 -> CCR1 is an interaction found with high importances in the JAK-STAT pathway, and high loadings. This interaction is proinflammatory (https://doi.org/10.1016%2Fj.cytogfr.2020.12.005), which could be one reason why this factor is highly correlated with the disease severity.