Lazy TCGA Copy Number Processing

May 31, 2017 · bioinformatics · R

I feel a certain affection towards copy number alterations. Back in my cancer research days I mostly worked on prostate cancer, which is predominantly driven by copy number changes rather than point mutations. Analyzing CNAs meant better power than analyzing SNVs, and consequently a happier Erle.

Thankfully, The Cancer Genome Atlas has plenty of copy number data that I can play with now that I no longer have access to private datasets.

Their public CNA files consist of segments per patient. Each row gives a region of the genome that appears to be affected by a copy number change, along with the mean intensity of the probes in that region. For most analyses this is too low-level, and we’re instead interested in the absolute number of copies of each gene.
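To make the format concrete, here is a minimal sketch of what a few rows of such a file might look like, and how a segment mean could be turned into a rough copy number. The column names are the ones I would expect from a TCGA segment file, the values are made up, and the conversion assumes the segment mean is a log2 ratio against a diploid baseline. Treat all of that as assumptions rather than a description of the actual files.

```r
# Toy segment table. Column names are assumed and the values are invented.
segments <- data.frame(
  Sample       = c("patient_A", "patient_A", "patient_B"),
  Chromosome   = c("1", "8", "8"),
  Start        = c(1000, 5000, 4000),
  End          = c(9000, 20000, 12000),
  Num_Probes   = c(40, 75, 30),
  Segment_Mean = c(0, 1, -1),
  stringsAsFactors = FALSE
)

# Assumption: Segment_Mean = log2(copies / 2), so copies = 2 * 2^Segment_Mean.
# This ignores tumour purity and ploidy, so it is only a rough estimate.
segment_to_copies <- function(segment_mean) {
  2 * 2^segment_mean
}

segments$Copies <- segment_to_copies(segments$Segment_Mean)
print(segments[, c("Sample", "Chromosome", "Segment_Mean", "Copies")])
```

Under that assumption a segment mean of 0 corresponds to two copies, 1 to four copies, and -1 to a single copy.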

I wrote a quick R script to get the data into a more useful format. And when I say quick, I mean writing it was quick. Running it is anything but: processing a single sample takes somewhere between 10 and 20 seconds. But it does the trick, and returns a gene-by-sample CNA matrix with minimal effort.
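For a feel of what going from segments to a gene-by-sample matrix involves, here is a small base R sketch. It is not the script itself: the toy tables, the column names, and the helper names `gene_value` and `build_cna_matrix` are all made up for illustration, and it uses one possible approach, an overlap-weighted average of the segments covering each gene.

```r
# Hypothetical inputs: invented segments and gene coordinates.
segments <- data.frame(
  Sample       = c("patient_A", "patient_A", "patient_B"),
  Chromosome   = c("8", "8", "8"),
  Start        = c(1, 5001, 1),
  End          = c(5000, 20000, 20000),
  Segment_Mean = c(0, 1, -1),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  Gene       = c("GENE1", "GENE2", "GENE3"),
  Chromosome = c("8", "8", "1"),
  Start      = c(1000, 4001, 100),
  End        = c(2000, 6000, 900),
  stringsAsFactors = FALSE
)

# Overlap-weighted mean of the segment values across one gene.
# Returns NA when no segment on the same chromosome overlaps the gene.
gene_value <- function(gene, segs) {
  segs <- segs[segs$Chromosome == gene$Chromosome, , drop = FALSE]
  overlap <- pmin(segs$End, gene$End) - pmax(segs$Start, gene$Start) + 1
  keep <- overlap > 0
  if (!any(keep)) return(NA_real_)
  sum(segs$Segment_Mean[keep] * overlap[keep]) / sum(overlap[keep])
}

# Fill a gene-by-sample matrix one sample and one gene at a time.
build_cna_matrix <- function(segments, genes) {
  samples <- unique(segments$Sample)
  cna <- matrix(NA_real_, nrow = nrow(genes), ncol = length(samples),
                dimnames = list(genes$Gene, samples))
  for (s in samples) {
    segs <- segments[segments$Sample == s, , drop = FALSE]
    for (i in seq_len(nrow(genes))) {
      cna[i, s] <- gene_value(genes[i, ], segs)
    }
  }
  cna
}

cna <- build_cna_matrix(segments, genes)
print(cna)
```

In this toy example GENE2 straddles two segments in patient_A and gets the average of the two, while GENE3 sits on a chromosome with no segments and stays NA. Nested loops like these are easy to write but scale poorly; an interval-overlap tool such as `findOverlaps` from the Bioconductor GenomicRanges package would likely be the faster route.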

If you’re lazy too, you can check out the code on GitHub.