Pan-Cancer Tumour Classification

Feb 18, 2018 · 2059 words · 10 minutes read ML bioinformatics cancer R

For someone who has spent the better part of the past two years dabbling in cancer genomics, I have remarkably little experience with penalized regression techniques. Bioinformatics is brimming with \(p \gg N\) problems that require subset selection, but somehow my work always seems to involve watching other people do penalized regression rather than doing it myself.

To rectify this situation, I recently spent a Sunday afternoon playing with data from The Cancer Genome Atlas. This blog post details that adventure.

Introduction to Penalized Regression

Penalized regression is a class of techniques for dealing with \(p \gg N\) problems. When there are more predictors than samples in a regression model, the model matrix \(\mathbf{X}\) contains more columns than rows. Consequently, \(\mathbf{X}^T\mathbf{X}\) is singular, and the least squares coefficients \(\hat{\beta}\) are not uniquely defined.

Ridge regression gets around the singularity problem by adding a constant \(\lambda\) to the diagonal of the matrix \(\mathbf{X}^T\mathbf{X}\). This corresponds to finding a set of coefficients that minimize a penalized residual sum of squares.

\[\begin{equation} \hat{\beta}^{RIDGE} = \text{argmin}_\beta \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} \end{equation}\]

The penalization term \(\lambda \sum_{j=1}^p \beta_j^2\) shrinks the coefficients towards zero, and \(\lambda\) is a hyperparameter that controls the degree of shrinkage.

An alternative to ridge regression is the LASSO, which penalizes the absolute value rather than the square of the parameters.

\[\begin{equation} \hat{\beta}^{LASSO} = \text{argmin}_\beta \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert \right\} \end{equation}\]

The absolute value penalization causes the LASSO to come up with a sparse solution. Unlike ridge regression, which retains all variables, the LASSO solution can set coefficients to zero. The degree of penalization, and hence the number of variables retained, is controlled by the penalization parameter \(\lambda\).

In fields that value model interpretability, the sparse LASSO solution is an extremely valuauble property. For example, when designing genomic biomarkers, sparse predictive models reduce costs. A biomarker that only includes a handful of genes can be implemented as a targeted panel rather than an assay for all 20,000 human genes.

The degree of sparsity can be guided by either model interpretability or performance. To assess the model performance for different values of \(\lambda\), we apply nested cross-validation. Before fitting the model, we split our data into a training set we will use to estimate the parameters and a test set we will use to evaluate model performance. To find the best value of \(\lambda\) without overfitting the data, we use a further \(k\)-fold cross-validation loop on the training dataset. For a range of values of \(\lambda\), the model is fit on \(k-1\) folds and model performance is assessed on the \(k\)th fold. The prediction error is then averaged across the \(k\) folds.

Classifying Tumour vs. Normal

To experiment with some penalized regression models of my own, I downloaded RNA-seq data from TCGA Firehose. RNA-seq is a next-generation sequencing technique for quantifying mRNA abundance. The mRNA abundance for a gene tells us something about the expression level of that gene, which in turn says something about what’s going on in the cell. Skin cells will express different genes from blood cells, and cells that are dividing express other genes than cells that can no longer divide.

Gene expression signatures have been a major area of cancer research since the early 2000s, and gene expression biomarkers are in routine clinical use for several cancer types. By evaluating the expression of a set of genes, we can get a sense of how quickly the cancer cell is dividing. More rapid division is a sign of a more aggressive cancer, and the patient is likely to need more intensive therapy.

Instead of focusing on signatures of cell proliferation, I decided to develop a classifier for distinguishing tumour cells from benign tissue. This is an easy problem from both a clinical and a machine learning perspective. Cancer cells are typically so different from normal tissue that pathologists can identify them by their shape, and gene expression patterns are different enough that I knew I would be able to find genes with high predictive power.

Most of the TCGA samples are from cancer tissues. In addition, there are some normal (non-cancerous) tissues, either blood or from the same tissue as the cancer. I focused on cancer types for which RNA-seq was available for at least 20 same tissue non-cancerous samples.1

Abbreviation Name Cancer samples Normal samples
BRCA Breast invasive carcinoma 1093 112
COADREAD Colorectal adenocarcinoma 623 51
HNSC Head and neck squamous cell carcinoma 520 44
LUAD Lung adenocarcinoma 515 59
PRAD Prostate adenocarcinoma 497 52
STAD Stomach adenocarcinoma 415 36
THCA Thyroid carcinoma 501 59

RNA-seq data from TCGA alreaady comes in matrix form and needs minimal pre-processing. To prepare the data I read it into R, extracted gene symbols, and parsed the sample IDs to select the relevant samples.


# path to file downloaded from TCGA
# using head and neck cancer as an example
rseq.file <- '~/tcga/data/HNSC/';

# there are two header lines, only want the first of these <- read.table(
    skip = 2, 
    header = FALSE, 
    stringsAsFactors = FALSE
names( <- read.header(rseq.file);

# remove rows without proper gene names <-[ !grepl('\\?',$Hybridization ), ];

# extract gene symbols
gene.symbols <- sapply(
    strsplit($Hybridization, split = '\\|'),
    function(x) x[1]

# transpose data
rseq.matrix <- t([, grepl('TCGA', names( ) ]);
colnames(rseq.matrix) <- gene.symbols;

# get information about the type of sample
# this can be extracted from the TCGA ID, see:
# I have already implemented this in the function of the hedgehog package
sample.ids <- rownames(rseq.matrix);
sample.type <-, 'sample');

# make sure only tumour and solid normal is included (codes 1 and 11)
rseq.matrix <- rseq.matrix[ sample.type %in% c('01', '11'), ];
sample.type <- sample.type[ sample.type %in% c('01', '11') ];

# create response variable
is.tumour <- ifelse(sample.type == '01', 1, 0);

After pre-processing, we’re left with a 564x20502 numeric matrix of predictors and a binary-coded 564-dimensional response vector. I further split the data into a training set for fitting the model and a test set for model evaluation. The createDataPartition function in the caret package ensures the class labels are balanced between the training and test set.

Penalized regression is implemented in the R package glmnet. The main function for model fitting is glmnet, and the alpha argument controls which type of penalization is applied. To simultaneously choose an optimal penalization parameter \(\lambda\), we use the cv.glmnet function. By default, 10-fold crossvalidation is used to assess model performance for different values of \(\lambda\).

I picked the value of \(\lambda\) that resulted in the lowest prediction error, and extracted the genes included in the classifier by selecting the non-zero coefficients.

# create train/ test split (70/30)
# want this balanced on class labels
train.index <- createDataPartition(
    p = 0.7,
    list = FALSE,
    times = 1

rseq.train <- rseq.matrix[ train.index, ];
rseq.test <- rseq.matrix[ -train.index, ];

label.train <- is.tumour[ train.index ];
label.test <- is.tumour[ -train.index ];

# run LASSO <- cv.glmnet(
    x = rseq.train, 
    y = label.train, 
    family = 'binomial', # logistic regression
    alpha = 1 # lasso penalty

# pick best-performing value of lambda
lasso.coefficients <- coef(, s = 'lambda.min');
nonzero.indices <- which(lasso.coefficients != 0);

selected.genes <- rownames(lasso.coefficients)[ nonzero.indices ];

# remove intercept, not interesting
selected.genes <- selected.genes[ '(Intercept)' != selected.genes ];

##  [1] "ACER1"    "ADH4"     "BARX2"    "C2orf40"  "C2orf54"  "CAB39L"  
##  [7] "CCL14"    "CEACAM1"  "CYP2F1"   "CYP3A4"   "CYP4B1"   "DBC1"    
## [13] "DNASE1L3" "FAM107A"  "FAM3D"    "FLJ44635" "GPX3"     "HS3ST1"  
## [19] "MAPKAPK3" "MGLL"     "MIDN"     "MYH8"     "NRG2"     "PLCXD3"  
## [25] "PLG"      "PPARG"    "QARS"     "SLC19A2"  "SLC27A6"  "SUCLG2"  
## [31] "THSD4"    "ZFP36"

I ran this code for each of the seven cancer types. The figure below shows the genes selected by each classifier (if you need a primer on UpSet plots, check out the vignette). The size of the classifier ranges from 19 to 33 genes, and there is very little ovelap between the gene sets. Only three genes were selected by more than one classifier: DNASE1L3, DPT, and METTL7A.

Gene sets

Cross-Cancer Performance

To investigate whether these largely disjoint gene sets meant the gene signatures identifying tumours were fundamentally different, I applied each classifier in each other cancer type. To make the comparison as fair as possible, I evaluated everything on the test set.

The AUC curves below show the performance when applying each classifier in all cancers. Each colour represents a cancer type, and the black line in each plot shows the performance when evaluating the classifier on the same cancer type it was trained on. As expected in this easy problem, all classifiers do well when applied in this fashion.

Applying each classifier in all other cancers
Even though the same-cancer classifier wins out, in general all classifiers perform well in other cancer types too. One exception is prostate cancer (PRAD). A classifier that performs no better than random chance would achieve an AUC of 0.5. The prostate cancer classifier achieves an AUC of 0.512 and 0.526 when applied to colorectal cancer data and head and neck cancer data, respectively – hardly a large improvement over random chance.

We can also flip this figure to make it clearer which cancers are the easiest to classify. In this figure, each panel is a cancer type and the lines show the performance of the different classifiers.

Classifying all cancers
The easiest cancers to distinguish from normal tissue seem to be breast cancer (BRCA) and lung cancer (LUAD). By contrast, the most difficult classification problems are the head and neck cancer (HNSC) and prostate cancer ones. Interestingly, the previous plot showed the head and neck-trained classifier to generally perform well in other cancer types, while the prostate-trained classifier generally performs badly.

Correlated Genes

One possible explanation for the non-overlapping gene sets resulting in classifiers that still perform well across tasks stems from the properties of the LASSO. When faced with correlated variables with predictive power, the LASSO will include one or more in the model, and leave the rest out. Many genes have highly correlated expression, typically if they are involved in the same cellular processes or regulate each other, making it likely that this could have happened.

To explore whether this could have been the case in the TCGA data, I looked at the colorectal cancer (COADREAD) and lung cancer (LUAD) classifiers. The two classifiers do not share any genes, yet perform very well when applied to each other’s test data (colorectal in lung AUC = 0.986, lung in colorectal AUC = 0.961).

The heatmap shows the correlation in the LUAD samples between all genes included in either the LUAD or the COADREAD classifier. The top right corner of the plot shows a group of highly correlated genes, some of which were selected by the LUAD classifier and some of which were selected by the COADREAD classifier. With high correlation values, including all of the genes in a single classifier is unlikely to add more information and improve the predictive performance.

Correlation heatmap

Pan-Cancer Classifier

Finally, I tried training a tumour/ normal classifier on the full dataset. To ensure the same balance of cancer types in the training data and the test data, I merged the data from the cancer-level training/ test split, and applied cv.glmnet on the merged training data.

The resulting classifier contains 127 genes, far more than any of the cancer-specific classifiers. 97 of these are unique, while the remaining 30 are shared with one or more cancer-specific classifiers.

Pan-cancer gene set
As expected, this pan-cancer classifier performs well in all cancer types. The plot below shows the AUC when evaluating the classifier on each cancer-specific test set, as well as the combined pan-cancer test set.

Pan-cancer AUC


Identifying gene expression signatures that can distinguish tumours from normal tissue is an easy problem. There are a large number of genes that can be used to identify cancerous tissue, many of which are shared between different cancer types. When faced with correlated predictors, the LASSO will exclude some of them from the model by setting the coefficients equal to zero. In our case, this means classifiers with different features can still achieve good performance on the same classification problems.


Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer, 2009.

  1. The KIPAN and STES datasets also have more than 20 normal samples available, but were left out due to numerical issues in model fitting.