This page outlines the raw data and its subsequent processing for use within the Cancer Dependency Map at Sanger.

Raw Data

| Dataset | Origin | Data Type | Model Type | Details | Link |
|---|---|---|---|---|---|
| Whole Exome Sequencing | Sanger | BAM | Cell Line | Illumina HiSeq 2000 | EGAS00001000978 |
| Whole Exome Sequencing | Broad (Remapped to GRCh38 at Sanger) | BAM | Cell Line | Illumina HiSeq 2000 or Illumina GAIIX | PRJNA523380 |
| Targeted Gene Sequencing | Sanger | CRAM | Organoid | Illumina HiSeq 4000 | EGAS00001002221 |
| Whole Genome Sequencing | Sanger | CRAM | Organoid | Illumina HiSeq 4000 | EGAS00001002222 |

Processed Data

This section describes how the raw data was processed, including the algorithms and filters applied. Processed datasets can be accessed on the Downloads page, and the active dataset can be accessed using the DepMap web resources and API.

Raw Data Processing & Mutation Calling

Somatic mutations from whole genome sequencing are called using the CaVEMan and Pindel algorithms.

Cell Lines: Most cell lines do not have a matched normal, and instead use an in silico normal generated by Wgsim. A panel of unrelated normal samples is used to flag both polymorphisms and locations prone to aberrant mapping or systematic sequencing artifacts in genome sequencing.

Organoids: The sequencing data presented is from samples obtained from the established organoid model at the time of banking. Most organoid samples have a matched normal blood sample, which is used to discard germline variants. Where a blood sample was not available, the same in silico normal was used as for the cell lines. The normal used is indicated in the VCF file.

Variant consequences were annotated using VAGrENT.

Somatic variants reported by these algorithms are flagged by filters designed to detect common causes of false positives. For CaVEMan this filtering is performed by cgpCaVEManPostProcessing. Those variants which pass all filters are given a PASS flag in the filter column of the VCF file.
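As an illustration of how the PASS flag can be used downstream, the following is a minimal sketch that keeps only VCF data lines whose FILTER column is PASS. It relies only on the standard VCF column order (FILTER is the seventh tab-separated column). The function name `passing_records`, the example records and the filter name `EXAMPLE_FLAG` are hypothetical and are not taken from the pipeline.

```python
def passing_records(vcf_lines):
    """Yield the tab-split fields of VCF data lines whose FILTER column is PASS."""
    for line in vcf_lines:
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Standard VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if fields[6] == "PASS":
            yield fields


# Hypothetical example records; EXAMPLE_FLAG stands in for any failed filter.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tEXAMPLE_FLAG\t.",
]
kept = list(passing_records(example_vcf))
```

For real files, a dedicated VCF parser is usually a safer choice than splitting lines by hand.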

Variant Allele Frequency & Flagging

Mutant and wild-type reads at the loci of base substitutions and indels were assessed in an unbiased manner across related samples using vafCorrect (Yates et al., 2017), which assigns a VAF (Variant Allele Frequency) to each sample for each variant.

In cell lines: Only VCF records with PASS in the FILTER column and with a VAF >= 0.15 were considered for flagging as drivers.

In organoids: The normal blood control increases the specificity of variant calling, so a VAF cutoff is not used. For organoid models where a normal blood control is not available, the VAF >= 0.15 cutoff is applied. Only variants with the PASS flag present in at least one of the samples were considered for flagging as drivers.
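The eligibility rules above can be summarised in a short sketch. The function name, its parameters and the model type strings are hypothetical. `passes_filters` is assumed to mean PASS in the FILTER column for a cell line, and PASS in at least one of the related samples for an organoid.

```python
VAF_CUTOFF = 0.15  # cutoff stated in the text above


def eligible_for_driver_flagging(model_type, has_matched_normal, passes_filters, vaf):
    """Return True if a variant would be considered for driver flagging.

    model_type: "cell_line" or "organoid" (hypothetical labels).
    has_matched_normal: whether a matched normal blood sample was used.
    passes_filters: PASS status as described in the lead-in.
    vaf: variant allele frequency assigned to the sample.
    """
    if not passes_filters:
        return False
    # Organoids with a matched normal are not subject to a VAF cutoff.
    if model_type == "organoid" and has_matched_normal:
        return True
    # Cell lines, and organoids without a matched normal, use the cutoff.
    return vaf >= VAF_CUTOFF
```

For example, a passing organoid variant with a matched normal and a VAF of 0.05 is eligible, while the same variant in an organoid without a matched normal is not.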

Flagging Driver Variants with DRV and CPV

DRV: Mutations are flagged with a DRV tag in the INFO column if they either:

  1. Overlap the genomic coordinates of a mutation with the same effect (e.g. missense, nonsense, inframe, or silent) in the library of mutations.
  2. Occur in a LoF or ambiguous gene and have a loss-of-function effect, i.e. one of frameshift, nonsense, ess_splice, start_lost or stop_lost.

CPV: Mutations are flagged with a CPV tag in the INFO column if they overlap the genomic coordinates of a mutation with the same effect in the library of cancer predisposition mutations.
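The DRV and CPV rules can be sketched as follows. This is an illustrative simplification, not the production implementation: "overlap" is reduced to an exact match on chromosome and position, each library is assumed to be a set of (chromosome, position, effect) tuples, and the gene role table, gene names and role label "other" are hypothetical.

```python
# Loss-of-function effects listed in the text above.
LOF_EFFECTS = {"frameshift", "nonsense", "ess_splice", "start_lost", "stop_lost"}


def driver_tags(variant, driver_library, predisposition_library, gene_roles):
    """Return the INFO tags (DRV and/or CPV) that apply to a variant.

    variant: dict with "chrom", "pos", "effect" and "gene" keys (assumed layout).
    driver_library / predisposition_library: sets of (chrom, pos, effect) tuples.
    gene_roles: dict mapping gene name to a role such as "LoF" or "ambiguous".
    """
    key = (variant["chrom"], variant["pos"], variant["effect"])
    tags = []

    # Rule 1: same position and same effect as a mutation in the library.
    in_library = key in driver_library
    # Rule 2: loss-of-function effect in a LoF or ambiguous gene.
    lof_in_lof_gene = (
        gene_roles.get(variant["gene"]) in {"LoF", "ambiguous"}
        and variant["effect"] in LOF_EFFECTS
    )
    if in_library or lof_in_lof_gene:
        tags.append("DRV")

    # CPV: same position and effect as a cancer predisposition mutation.
    if key in predisposition_library:
        tags.append("CPV")
    return tags


# Hypothetical libraries and gene roles, for illustration only.
driver_library = {("chr1", 1000, "missense")}
predisposition_library = {("chr2", 2000, "nonsense")}
gene_roles = {"GENE_A": "other", "GENE_B": "LoF"}
```

Matching on effect as well as position means that a silent change at the site of a known missense driver is not tagged.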

Normal Panel Germ Line Data & NPGL Flagging

The union of two large cohorts available on the GRCh38 assembly was used to build a panel of germ line variants:

  • gnomAD v3.1.1: Contains 76,156 genomes from unrelated individuals in 9 ethnic populations sequenced in both population genetics and disease specific studies. Variants with an allele frequency (AF_popmax) greater than 0.001 in any of the represented populations within gnomAD were included.
  • 1000 Genomes Phase 3: Contains 2,504 individuals from 26 ethnic populations. Variants with an allele frequency (AF) greater than 0.001 in the full dataset (not per population) were included.

Any variant not flagged with DRV or CPV, but which exactly matches the REF allele and one of the ALT alleles at the same location in the germ line panel, is flagged with NPGL in the INFO column. For sequencing data without a matched normal, this flag can be used to exclude variants which are unlikely to be somatic mutations.
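A minimal sketch of the panel construction and NPGL check described above. The function names and the record layout are hypothetical. The `af` value stands for whichever frequency applies to the source cohort (AF_popmax for gnomAD, AF for 1000 Genomes).

```python
AF_CUTOFF = 0.001  # inclusion threshold stated in the text above


def build_panel(records):
    """Build a germ line panel from (chrom, pos, ref, alt, af) tuples.

    Returns a dict mapping (chrom, pos, ref) to the set of ALT alleles
    whose allele frequency is greater than AF_CUTOFF.
    """
    panel = {}
    for chrom, pos, ref, alt, af in records:
        if af > AF_CUTOFF:
            panel.setdefault((chrom, pos, ref), set()).add(alt)
    return panel


def is_npgl(chrom, pos, ref, alt, info_tags, panel):
    """Return True if the variant would receive the NPGL flag."""
    # Variants already flagged as DRV or CPV are never flagged NPGL.
    if {"DRV", "CPV"} & set(info_tags):
        return False
    # Require an exact match on REF and on one of the panel ALT alleles.
    return alt in panel.get((chrom, pos, ref), set())


# Hypothetical records pooled from both cohorts (the union of the two).
panel = build_panel([
    ("chr1", 100, "A", "T", 0.02),
    ("chr1", 100, "A", "G", 0.0005),  # below the cutoff, so excluded
    ("chr2", 500, "C", "CT", 0.3),
])
```

Because both cohorts feed into a single dictionary, a variant included from either source ends up in the panel.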

Dataset Annotation & Integration

For further documentation on the annotation of genomic datasets and on model authentication, see the links below.