{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d94b8e4a",
   "metadata": {},
   "source": [
    "# S1. Batch Correction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fbfcfc4",
   "metadata": {},
   "source": [
    "[Batch correction](https://www.nature.com/articles/s41592-018-0254-1) removes technical variation while preserving biological variation between samples. This is a processing step that can be applied to our dataset after normalization (see [Tutorial 01](./01-Preprocess-Expression.ipynb) for processing steps up to normalization). Additional considerations for expression data to be compatible with CCC inference and Tensor-cell2cell is that counts should  be non-negative.  \n",
    "\n",
    "Here, we demonstrate batch correction with [scVI](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/harmonization.html), a method that yield non-negative corrected counts and also has been [benchmarked](https://www.nature.com/articles/s41592-021-01336-8) to work well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c3f00e7c",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.\n",
      "  self.seed = seed\n",
      "/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.\n",
      "  self.dl_pin_memory_gpu_training = (\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import scanpy as sc\n",
    "import scvi\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c913eb28",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Global seed set to 888\n"
     ]
    }
   ],
   "source": [
    "data_path = '../../data/'\n",
    "\n",
    "seed = 888\n",
    "scvi.settings.seed = seed\n",
    "\n",
    "\n",
    "use_gpu = True\n",
    "if not use_gpu:\n",
    "    n_cores = 1\n",
    "    scvi.settings.num_threads = 1\n",
    "    scvi.settings.dl_num_workers = n_cores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c5b2586",
   "metadata": {},
   "source": [
    "Load the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fb87101b",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = sc.read_h5ad(os.path.join(data_path, 'processed.h5ad'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adf764ff",
   "metadata": {},
   "source": [
    "According to the [scVI tutorial](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/api_overview.html), scVI models expect raw UMI counts rather than log-normalized counts as input. Keep in mind this is tool specific, and the version of the expression dataset you input will depending on the batch correction method you employ. \n",
    "\n",
    "From the above cell, we stored this in the `\"counts\"` layer. Let's set up the batch correction model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2596c52f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.\n",
      "\n",
      "For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.\n",
      "\n",
      "For creation, use `anndata.experimental.sparse_dataset(X)` instead.\n",
      "\n",
      "  return _abc_instancecheck(cls, instance)\n",
      "An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.\n"
     ]
    }
   ],
   "source": [
    "raw_counts_layer = 'counts' # raw UMI counts layer\n",
    "batch_col = 'sample' # metadata colum specifying batches\n",
    "scvi.model.SCVI.setup_anndata(adata, layer = raw_counts_layer, batch_key = batch_col)\n",
    "model = scvi.model.SCVI(adata, n_layers = 2, n_latent = 30, gene_likelihood= \"nb\") "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40c106db",
   "metadata": {},
   "source": [
    "Now, we can run teh batch correction and transform it similar to how we did the normalization. Note, this is slightly different than the typical latent representation that is output by scVI. Many batch correction tools output the counts in a reduced latent space. However, for CCC analysis, we need tools that can output corrected counts for each gene in order to score the communication between ligands and receptors. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b0c0deac",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/128:   0%|                                      | 0/128 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/hratch/miniconda3/envs/ccc_protocols/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.\n",
      "\n",
      "For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.\n",
      "\n",
      "For creation, use `anndata.experimental.sparse_dataset(X)` instead.\n",
      "\n",
      "  return _abc_instancecheck(cls, instance)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 128/128: 100%|█| 128/128 [11:11<00:00,  5.25s/it, v_num=1, train_loss_step"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_epochs=128` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 128/128: 100%|█| 128/128 [11:11<00:00,  5.24s/it, v_num=1, train_loss_step\n"
     ]
    }
   ],
   "source": [
    "model.train()\n",
    "corrected_data = model.get_normalized_expression(transform_batch = sorted(adata.obs[batch_col].unique()),\n",
    "                                                 library_size = 1e4)\n",
    "corrected_data.iloc[:,:] = np.log1p(corrected_data.values)\n",
    "adata.layers['batch_corrected_counts'] = corrected_data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:ccc_protocols]",
   "language": "python",
   "name": "conda-env-ccc_protocols-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "vscode": {
   "interpreter": {
    "hash": "a89d9df9e41c144bbb86b791904f32fb0efeb7b488a88d676a8bce57017c9696"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}