
Copy Number

This page outlines raw data and the subsequent processing for use within the Cancer Dependency Map at Sanger.

Raw Data

Dataset	Origin	Data Type	Model Type	Details	Link
SNP6 Cell Lines	Sanger	CEL	Cell Line	Affymetrix SNP6	EGAS00001000978
Whole Exome Sequencing	Sanger	BAM	Cell Line	Illumina HiSeq 2000	EGAS00001000978
Whole Exome Sequencing	Broad (Remapped to GRCh38 at Sanger)	BAM	Cell Line	Illumina HiSeq 2000 or Illumina GAIIX	PRJNA523380

Processed Data

This section describes how the raw data were processed, including the algorithms and filtering applied. Processed datasets are available for download; the active dataset can also be accessed through the DepMap web resources and API.

Whole Exome Copy Number Data (Cell Lines)

Somatic copy number alterations were detected by integrating the output of the GATK CNV pipeline (version 4.1.8.0) and the PureCN software (version 1.22) (https://github.com/lima1/PureCN) [PMID: 27999612]. Both gene-level and segment-level data are available for download; however, only gene-level data are available via the web portal and API.

The GATK4 Somatic CNV workflow (GATK, version 4.1.8.0) was used to normalise read counts, to calculate allelic counts at potential germline sites (gnomAD SNPs with population allele frequencies greater than 0.1%) and to perform the subsequent genome segmentation (https://gatk.broadinstitute.org/hc/en-us/articles/360035531092--How-to-part-I-Sensitively-detect-copy-ratio-alterations-and-allelic-segments). Normal (non-cancer) cell lines in the dataset were used to build the panel of normals (PoN), with a separate PoN built for male and for female samples. For samples with no sex information, a PoN integrating all normal cell lines was used and the sex chromosomes were ignored in downstream analysis. Finally, PureCN was used to integrate the output of GATK4 (read counts, segmentation and allelic count files) to estimate the allele-specific consensus copy number profile, purity, and ploidy of each sample (as suggested in the PureCN best practices guide, https://bioconductor.org/packages/release/bioc/vignettes/PureCN/inst/doc/Quick.html). As PureCN does not provide calls for sex chromosomes in male samples, categorical calls for the sex chromosomes of these samples were obtained from the segments and log2 ratios produced by the GATK4 pipeline.

To integrate as many samples as possible across the two datasets, the target regions of all three BED files were intersected and the overlapping regions were selected for CNA calling. The following BED files were integrated:

agilent_v1.1_grch38_liftover.bed
SureSelect_Human_All_Exon_50Mb_95_ucsc_liftover_grch38.bed
SureSelect_Whole_Human_Exome_v5_GRCh38_liftover_160.bed
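
The intersection of target regions described above can be illustrated with a minimal pure-Python sketch. The source does not state which tool performed the intersection, so this is not the actual pipeline code; the function names (read_bed, merge, intersect, intersect_beds) are hypothetical, and intervals are assumed to be standard half-open BED coordinates.

```python
from collections import defaultdict
from functools import reduce


def read_bed(path):
    """Read the first three BED columns into {chrom: [(start, end), ...]}."""
    regions = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            regions[chrom].append((int(start), int(end)))
    return regions


def merge(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the regions covered by both interval lists (half-open)."""
    a, b = merge(a), merge(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_beds(beds):
    """Keep only regions present in every BED dictionary."""
    shared = set.intersection(*(set(b) for b in beds))
    return {c: reduce(intersect, (b[c] for b in beds)) for c in sorted(shared)}
```

Under these assumptions, passing the three files listed above through read_bed and then intersect_beds would yield only the regions targeted by every capture kit, so that the same intervals are used for CNA calling in both datasets.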

Whole Genome Copy Number Data (Organoids)

Samples were analysed using HMF Tools. Somatic copy number alterations were detected using the PURPLE algorithm version 2.54.

Both gene-level and segment-level data are available for download; however, only gene-level data are available via the web portal and API.

Categorisation of Total Copy Number values

The total copy number values have been categorised (CNA Call) using the following calculation, where C is the total copy number and Ploidy is the estimated ploidy of the sample:

Val = round( 2 * 2^log2(C/Ploidy) )

if Val == 0: Category = 'Deletion'
if Val == 1: Category = 'Loss'
if Val == 2: Category = 'Neutral'
if Val == 3: Category = 'Gain'
if Val >= 4: Category = 'Amplification'

This has been applied to the following datasets:

Whole Exome Copy Number Data (Cell Lines)
Whole Genome Copy Number Data (Organoids)
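
The categorisation rule can be written as a short Python function. This is a sketch, not the production code: the function name cna_category is hypothetical, and the tie-breaking behaviour of round() is an assumption because the source does not say how exact halves are rounded. Note that 2 * 2^log2(C/Ploidy) simplifies to 2 * C / Ploidy, which also avoids taking log2 of zero when C is 0.

```python
def cna_category(total_cn: float, ploidy: float) -> str:
    """Categorise a total copy number relative to the sample ploidy."""
    if ploidy <= 0:
        raise ValueError("ploidy must be positive")
    if total_cn < 0:
        raise ValueError("total copy number cannot be negative")

    # Equivalent to round(2 * 2**log2(C / Ploidy)), but defined for C == 0.
    # Assumption: Python's built-in round() is used, which rounds exact
    # halves to the nearest even integer; the source does not specify this.
    val = round(2 * total_cn / ploidy)

    if val == 0:
        return "Deletion"
    if val == 1:
        return "Loss"
    if val == 2:
        return "Neutral"
    if val == 3:
        return "Gain"
    return "Amplification"  # val >= 4
```

For example, a gene with total copy number 6 in a sample of ploidy 3 gives Val = 4 and is called an Amplification, whereas the same copy number in a sample of ploidy 4 gives Val = 3 and is called a Gain.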

Affymetrix SNP6 Data (Cell Lines)

Segment copy number data were downloaded from the TCGA (Cancer Genome Atlas Research Network et al., 2013) (8,182 samples) and analysed with ADMIRE (van Dyk et al., 2013). The COAD and READ cohorts were merged due to their high similarity in tissue type and response profile. The ADMIRE analysis results comprised copy number segments statistically different from expectation.

Filter criteria were defined to focus the analysis on potential driver segments. The filters required each segment to include at least one protein coding or antisense gene, but no more than 100 of them. Deletions were required to include an exon (a proxy for gene disruption) and amplifications to span a gene (as sub-genic amplifications are unlikely to be functional). The false discovery rate (FDR) controlled p-value was required to be smaller than 0.05, and the segment was required to be at recursion level two or higher unless it was a top-level segment. To ensure clinical relevance, the identified segment needed to be affected in at least 2.5% of the subjects. The latter was evaluated on two levels, using the overall background variance, and using the local background variance. The first was calculated on the log2 values not part of any identified segment, regardless of filtering. The second was calculated on the recursion level below the identified segment.

Within each tumor type, the segments obtained after filtering (Table S2D) were further compacted by pruning all overlapping segments such that only the shortest were retained. This results in a fairly concise set of segments per tumor type. The pan-cancer set of segments was derived from the entire collection of filtered cancer-specific segments, but only the largest overlapping segment was retained (Table S2E).
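
The filter criteria and overlap pruning described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ADMIRE or publication code: the Segment fields, the function names and the greedy pruning strategy are all hypothetical, and the two-level background variance evaluation is collapsed into a single precomputed fraction_affected value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # All field names are hypothetical, chosen for this illustration.
    chrom: str
    start: int
    end: int
    kind: str                 # "amplification" or "deletion"
    n_genes: int              # protein coding or antisense genes in the segment
    hits_exon: bool           # segment includes at least one exon
    spans_gene: bool          # segment covers at least one whole gene
    fdr_p: float              # FDR-controlled p-value
    recursion_level: int
    is_top_level: bool
    fraction_affected: float  # fraction of subjects affected (assumed precomputed)


def passes_filters(s: Segment) -> bool:
    """Apply the driver-segment filters listed in the text."""
    if not 1 <= s.n_genes <= 100:
        return False
    if s.kind == "deletion" and not s.hits_exon:
        return False
    if s.kind == "amplification" and not s.spans_gene:
        return False
    if s.fdr_p >= 0.05:
        return False
    if s.recursion_level < 2 and not s.is_top_level:
        return False
    return s.fraction_affected >= 0.025


def overlaps(a: Segment, b: Segment) -> bool:
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def prune_overlaps(segments, keep="shortest"):
    """Greedily keep the shortest (or largest) of any overlapping segments.

    Assumption: a simple greedy pass by length; the source does not
    describe the exact pruning procedure.
    """
    ordered = sorted(segments, key=lambda s: s.end - s.start,
                     reverse=(keep == "largest"))
    kept = []
    for s in ordered:
        if not any(overlaps(s, k) for k in kept):
            kept.append(s)
    return kept
```

In this sketch, the per-tumor-type sets would correspond to prune_overlaps(filtered, keep="shortest") and the pan-cancer set to prune_overlaps(all_filtered, keep="largest").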

Publication reference: Iorio et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 2016.
