Alignment of Tumors and Cell Lines
In this tutorial, we show how to apply CellMirror to align bulk tumor samples and cancer cell lines. As an example, we directly downloaded the RNA-seq data of bulk tumor and cell line samples from Warren, et al. 2021., including 12,236 cancer samples with 33 cancer types, and 1,249 cell line samples with 34 cancer type. We directly utilized the normalized RNA-seq data by the previous study.
Step0: Loading packages (Python)
1import warnings
2warnings.filterwarnings("ignore")
3
4import os
5os.environ['R_HOME'] = '/sibcb2/chenluonanlab7/zuochunman/anaconda3/envs/r4.0/lib/R'
6os.environ['R_USER'] = '/sibcb2/chenluonanlab7/zuochunman/anaconda3/envs/CellMirror/lib/python3.8/site-packages/rpy2'
7
8import random
9import numpy as np
10import scanpy as sc
11
12from CellMirror_utils.utilities import *
13from CellMirror_utils.layers import *
14from CellMirror_utils.cLDVAE_torch import *
15import torch.utils.data as data_utils
16
17parser = parameter_setting()
18args = parser.parse_known_args()[0]
19
20# Set seed
21np.random.seed(args.seed)
22random.seed(args.seed)
23torch.manual_seed(args.seed)
24torch.cuda.manual_seed(args.seed)
25
26# Changeable parameters
27args.n_hidden_en = [1000]; args.n_hidden_de = [1000]
28args.lr_cLDVAE = 3e-6; args.beta = 1; args.gamma = -100
29args.n_latent_s = 2; args.n_latent_z = 100
30
31# Set workpath
32Save_path = '/sibcb2/chenluonanlab7/zuochunman/Share_data/xiajunjie/TCGA_CCLE/'
Step1: Reading and preprocessing data (Python)
The necessary input files includes: 1. The gene expression matrix; 2. Samples’ attributes; 3. Genes’ information.
First, we need to read the RNA-seq data of tumors and cell lines into AnnData objects. See AnnData for more details if you are unfamiliar, including how to construct AnnData objects from scratch, and how to read data in other formats(.csv, .mtx, etc.) into AnnData objects. Here, we compile a function Load_TCGA_CCLE_data() for loading all the aforementioned 3 types of files.
Then, we follow the scanpy pipeline for data preprocessing. See scanpy for more details if you are unfamiliar. In this scenario, we select 5,000 highly variable genes for tumors and cell lines, respectively.
To align the number of samples for tumors and cell lines, which will be used in Step2 for training model, we generate random cell line samples from a multivariate normal distribution with the mean vector and covariance matrix of cell line gene expression via the pseudo_data_padding() function.
1# load data & preprocessing
2result = load_TCGA_CCLE_data()
3TCGA_obj = sc.AnnData(X=result[1], obs=result[3], var=result[0])
4CCLE_obj = sc.AnnData(X=result[2], obs=result[4], var=result[0])
5
6sc.pp.highly_variable_genes(TCGA_obj, n_top_genes=5000)
7sc.pp.highly_variable_genes(CCLE_obj, n_top_genes=5000)
8common_HVGs=np.intersect1d(list(TCGA_obj.var.index[TCGA_obj.var['highly_variable']]),list(CCLE_obj.var.index[CCLE_obj.var['highly_variable']])).tolist()
9
10TCGA_obj, CCLE_obj = TCGA_obj[:,common_HVGs], CCLE_obj[:,common_HVGs]
11print(len(common_HVGs))
12
13print('\nShape of target object: ', TCGA_obj.shape, '\tShape of background object: ', CCLE_obj.shape)
14
15# Last batch setting
16args = set_last_batchsize(args, TCGA_obj, CCLE_obj)
17
18# Pseudo-data padding
19TCGA_obj.X = TCGA_obj.X - np.mean(TCGA_obj.X, axis=0); CCLE_obj.X = CCLE_obj.X - np.mean(CCLE_obj.X, axis=0)
20pseudo_TCGA, pseudo_CCLE = pseudo_data_padding(TCGA_obj, CCLE_obj)
Hyperparameters
n_top_genes: Number of highly variable genes to keep.
Step2: Training cLDVAE model (Python)
Next, we train a cLDVAE model for aligning bulk tumor samples and cancer cell lines.
1# Dataloader preparation
2train = data_utils.TensorDataset(torch.from_numpy(pseudo_TCGA),torch.from_numpy(pseudo_CCLE))
3train_loader = data_utils.DataLoader(train, batch_size=args.batch_size, shuffle=True)
4
5total = data_utils.TensorDataset(torch.from_numpy(pseudo_TCGA),torch.from_numpy(pseudo_CCLE))
6total_loader = data_utils.DataLoader(total, batch_size=args.batch_size, shuffle=False)
7
8# Run cLDVAE
9model_cLDVAE = cLDVAE(args=args, n_input=TCGA_obj.shape[1]).cuda()
10history = model_cLDVAE.fit(train_loader, total_loader)
After convergence, the trained model can be used for predicting aligned outputs.
Hyperparameters
batch_size: The batch size for training cLDVAE model. The default value is 128. You can modify it based on your memory size. The larger the parameter, the less time.
n_hidden_en / n_hidden_de: Number of nodes in the hidden layer of encoder / decoder.
n_latent_s / n_latent_z: The dimensionality of salient / shared representation vector.
lr_cLDVAE: Learning rate parameter for training cLDVAE. The default value of the parameters is 3e-6.
beta: The penalty for the KL divergence. The default value is 1. You can adjust it from 0 to 1 by 0.1.
gamma: The penalty for the Total Correlation loss.
Step3: Saving cLDVAE outputs (Python)
To retain the sample size of the processed data, we use the pseudo_data_deparser() function for recovering the original number of samples for tumors and cell lines.
1# Pseudo-data deparsing
2outputs = model_cLDVAE.predict(total_loader)
3TCGA_obj.obsm['cLDVAE'], CCLE_obj.obsm['cLDVAE'] = pseudo_data_deparser(TCGA_obj, outputs['tg_z_outputs'], CCLE_obj, outputs['bg_z_outputs'])
4
5TCGA_obj.obsm['salient_features'], _ = pseudo_data_deparser(TCGA_obj, outputs['tg_s_outputs'], CCLE_obj, outputs['bg_s_outputs'])
6
7TCGA_obj.uns['s_loadings'] = model_cLDVAE.get_loadings()[:,-(args.n_latent_s):]
Output
This step generates results including shared features of tumors / cell lines and salient features that are specific to tumors stored in AnnData.obsm, linear decoder weights related to salient features stored in AnnData.uns.
You can save these output results in the following format for further downstream analyses:
── Your work path
├─ TCGA_CCLE_data_tumor_X_cLDVAE_only.csv
├─ TCGA_CCLE_data_CL_X_cLDVAE_only.csv
├─ TCGA_CCLE_data_tumor_salient_features.csv
└─ TCGA_CCLE_data_salient_loadings_matrix.csv
Step4: Implementing MNN on the processed data (Python)
1# Run MNN
2TCGA_obj.obsm['CellMirror'], CCLE_obj.obsm['CellMirror'] = mnn_correct(TCGA_obj.obsm['cLDVAE'], CCLE_obj.obsm['cLDVAE'])
3
4# UMAP embeddings
5adata_concat = sc.concat([TCGA_obj, CCLE_obj], label="type")
6sc.pp.neighbors(adata_concat, n_neighbors=10, metric='correlation',use_rep='CellMirror')
7sc.tl.umap(adata_concat,min_dist=0.5)
8adata_concat.obs = adata_concat.obs.merge(adata_concat.obsm['X_umap'].to_df(), how='inner', left_index=True, right_index=True)
9adata_concat.obs.to_csv(f"TCGA_CCLE_CellMirror_comb_Ann.csv")
Hyperparameters
k1: The number of nearest neighbors of target data in the reference data.
k2: The number of nearest neighbors of reference data in the target data.
ndist: The ndist parameter used for MNN. The default value is 3.
Output
This step generates files including common features of tumors / cell lines extracted by CellMirror and corresponding UMAP coordinates.
Save these output results in the following format for running downstream analyses:
── Your work path
├─ TCGA_CCLE_CellMirror_comb_Ann.csv
├─ TCGA_CCLE_data_tumor_X_CellMirror.csv
└─ TCGA_CCLE_data_CL_X_CellMirror.csv
Step5: Visualization (R)
1library(here)
2library(magrittr)
3library(tidyverse)
4source(here::here('CellMirror_utils','CellMirror_methods.R'))
5
6alignment <- read.csv('C:\\Users\\我的电脑\\Desktop\\待办\\TCGA_CCLE_CellMirror_comb_Ann.csv')
7
8p8 <- ggplot2::ggplot(alignment,
9 ggplot2::aes(X_umap1, X_umap2, fill=lineage, size=type, color = type)) +
10ggplot2::geom_point(pch=21, alpha=0.7) +
11ggplot2::scale_color_manual(values=c(`CL`='black', `tumor`='white')) +
12ggplot2::scale_size_manual(values=c(`CL`=1, `tumor`=0.75)) +
13ggplot2::theme_void() +
14ggplot2::theme(legend.position = 'right',
15 text=ggplot2::element_text(size=8),
16 legend.margin =ggplot2::margin(0,0,0,0),
17 plot.title = ggplot2::element_text(hjust=0.5),
18 legend.key.size = unit(10,'point')) +
19ggplot2::guides(fill=guide_legend(order=1,ncol=3,override.aes = list(size = 5)),size='none', color='none') +
20ggplot2::scale_fill_manual(values=tissue_colors) +
21ggplot2::xlab("UMAP 1") +
22ggplot2::ylab("UMAP 2")