Mutation

This page outlines the raw mutation data and its subsequent processing for use within the Cancer Dependency Map at Sanger.


Raw Data

Dataset | Origin | Data Type | Model Type | Details | Link
Whole Exome Sequencing | Sanger | BAM | Cell Line | Illumina HiSeq 2000 | EGAS00001000978
Whole Exome Sequencing | Broad (Remapped to GRCh38 at Sanger) | BAM | Cell Line | Illumina HiSeq 2000 or Illumina GAIIX | PRJNA523380
Targeted Gene Sequencing | Sanger | CRAM | Organoid | Illumina HiSeq 4000 | EGAS00001002221
Whole Genome Sequencing | Sanger | CRAM | Organoid | Illumina HiSeq 4000 | EGAS00001002222

Processed Data

This section describes how the raw data was processed, including algorithms and filtering. Processed datasets can be downloaded here; the active dataset can also be accessed using the DepMap web resources and API.


Whole Genome/Exome Sequencing Data (Cell Lines)

Raw Data Processing & Mutation Calling

Following whole exome sequencing (WES), variants were called using the CaVEMan and Pindel algorithms. CaVEMan was used to identify single nucleotide variants (SNVs) and Pindel to identify insertions and deletions (INDELs). For each model, separate VCF files were generated for SNVs and INDELs. Because no matched normal tissue was available, both programs used an in silico normal generated using Wgsim for the GRCh38 reference genome, along with an in-house panel of 100 unrelated normal samples mapped to GRCh38. Variant consequences were annotated using VAGrENT.

Germline Filtering Set

The resulting variants were then screened against normal samples to remove sequencing artefacts and germline variants. Because the data are mapped to the GRCh38 human reference sequence, large cohorts of normal samples generated using the GRCh38 reference genome were selected. Two large cohorts were used for the filtering:

  • gnomAD v3.1.1: This dataset contains 76,156 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies, aligned against the GRCh38 reference. The dataset is derived from 9 different ethnic populations.
  • 1000 Genomes Phase 3: This dataset contains 2,504 individuals from 26 ethnic populations. It was processed using the new variant set released by the 1000G Consortium, generated using the GRCh38 human reference genome.

In gnomAD, all variants with an allele frequency (AF_popmax) greater than 0.001 in any of the represented populations are filtered out as germline mutations. In the 1000 Genomes dataset, all variants with an allele frequency (AF) greater than 0.001 in the full dataset (not per population) are filtered out as germline mutations.
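The two frequency thresholds can be sketched as a single predicate. This is a minimal illustration rather than the pipeline's own code: the function name and its parameters are hypothetical, and a value of None is assumed to mean that the variant was not observed in that cohort.

```python
from typing import Optional

AF_THRESHOLD = 0.001  # cut-off stated in the text for both cohorts


def is_germline(gnomad_af_popmax: Optional[float], kg_af: Optional[float]) -> bool:
    """Return True if either cohort frequency exceeds the threshold.

    gnomad_af_popmax: highest per-population allele frequency in gnomAD (AF_popmax).
    kg_af: allele frequency across the full 1000 Genomes dataset (AF).
    None means the variant is absent from that cohort (an assumption of this sketch).
    """
    in_gnomad = gnomad_af_popmax is not None and gnomad_af_popmax > AF_THRESHOLD
    in_1000g = kg_af is not None and kg_af > AF_THRESHOLD
    return in_gnomad or in_1000g
```

Note that the comparison is strict, so a variant with a frequency of exactly 0.001 is not treated as germline in this sketch.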

The germline filtering set then went through an additional filtering step to remove variants overlapping Cancer Predisposition Gene variants based on 10,224 TCGA samples (Huang et al., 2018), germline variants overlapping known driver mutations, and LoF (Loss of Function) variants from known LoF genes.

Normal panel generation workflow

VCF Annotation & Cancer Driver Flagging

Following germline filtering, the VCFs were annotated and cancer drivers were flagged. The DRV and CPV fields in the INFO column can be used to access driver-only records.
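As an illustration of selecting driver-only records, the sketch below reads the INFO column of a tab-separated VCF record and checks for a DRV or CPV key. The helper names are hypothetical; only the standard VCF layout (INFO as the eighth column, semicolon-separated key=value pairs or bare flags) is assumed.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_driver_record(vcf_line: str) -> bool:
    """True if the record's INFO column carries a DRV or CPV tag."""
    if vcf_line.startswith("#"):  # header and meta lines are not records
        return False
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    return "DRV" in info or "CPV" in info
```

Applied line by line to an uncompressed VCF file, `is_driver_record` keeps only the records that carry one of the two driver tags.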

Flagging germline variants:

Germline variants were flagged with the soft flag ‘NPGL’ (Normal Panel GermLine) in the VCF INFO field.

Flagging Cancer Drivers:

Only VCF records passing all filters, with VAF >= 0.15 and without an NPGL flag, were considered for driver gene annotations.
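The eligibility rule can be expressed as a short check. This is a sketch under stated assumptions: the function and parameter names are hypothetical, "passing all filters" is taken to mean a FILTER column value of PASS, and the VAF is assumed to have been extracted from the record already.

```python
MIN_VAF = 0.15  # threshold stated in the text


def eligible_for_driver_annotation(filter_field: str, vaf: float, info_keys: set) -> bool:
    """Apply the three conditions: PASS filter, VAF >= 0.15, no NPGL flag."""
    passes_filters = filter_field == "PASS"
    not_germline = "NPGL" not in info_keys
    return passes_filters and vaf >= MIN_VAF and not_germline
```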

Somatic drivers: mutations were flagged with a DRV=<variant effect(s)> tag in the INFO column if they either:

  1. Overlap the genomic coordinates of a mutation with the same effect (e.g. missense, nonsense, inframe, or silent) in the table of mutations
  2. Are in a LoF or ambiguous gene, and the mutation effect is one of frameshift, nonsense, ess_splice, start_lost or stop_lost.
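The two somatic driver rules can be sketched as follows. The `Mutation` class, the gene role labels ("LoF", "ambiguous", "oncogene") and the in-memory driver table are hypothetical stand-ins for whatever representation the pipeline uses; only the effect names are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable

LOF_EFFECTS = {"frameshift", "nonsense", "ess_splice", "start_lost", "stop_lost"}
LOF_GENE_ROLES = {"LoF", "ambiguous"}  # hypothetical role labels


@dataclass(frozen=True)
class Mutation:
    chrom: str
    start: int   # inclusive
    end: int     # inclusive
    effect: str  # e.g. "missense", "nonsense", "inframe", "silent"


def overlaps(a: Mutation, b: Mutation) -> bool:
    """True if the two mutations share at least one genomic position."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def is_somatic_driver(variant: Mutation, gene_role: str,
                      driver_table: Iterable[Mutation]) -> bool:
    # Rule 1: overlaps a known mutation with the same effect.
    matches_table = any(
        overlaps(variant, known) and variant.effect == known.effect
        for known in driver_table
    )
    # Rule 2: a loss-of-function effect in a LoF or ambiguous gene.
    lof_hit = gene_role in LOF_GENE_ROLES and variant.effect in LOF_EFFECTS
    return matches_table or lof_hit
```

A linear scan of the table is used for clarity; an interval index would be a natural substitute for a large table.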

Germline drivers: mutations were flagged as cancer predisposition variants with a CPV=<variant effect(s)> tag in the INFO column if they overlap the genomic coordinates of a mutation with the same effect (e.g. missense, nonsense, inframe, or silent) in the table of germline driver mutations. If a germline driver overlaps with a somatic driver mutation, preference is given to the somatic mutation call.
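The precedence between the two tags can be sketched as below. The function name and boolean inputs are hypothetical; they stand for the outcomes of the somatic and germline table lookups described above.

```python
from typing import Optional


def driver_tag(effect: str, somatic_match: bool, germline_match: bool) -> Optional[str]:
    """Return the INFO tag for a variant, preferring the somatic call."""
    if somatic_match:
        return f"DRV={effect}"  # somatic call wins when both tables match
    if germline_match:
        return f"CPV={effect}"
    return None  # not a flagged driver
```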


Whole Genome Sequencing Data (Organoids)

Raw Data Processing & Mutation Calling

Sequencing is performed on samples obtained from the established organoid model at the time of banking.

Somatic mutations from whole genome sequencing are called using the CaVEMan and Pindel algorithms. Sequencing data from a matched normal sample and a panel of unrelated normal samples are used as references (GRCh38) to discard germline variants and technology-specific artefacts. Variant consequences were annotated using VAGrENT.

Somatic variants reported by these algorithms are flagged by filters designed to detect common causes of false positives. This filtering was performed using cgpCaVEManPostProcessing. Those variants which pass all filters are given a PASS flag.

Mutant and wild-type reads at the loci of the base substitutions and indels were assessed in an unbiased manner across the related samples using vafCorrect (Yates et al., 2017).

Only variants that carry the PASS flag, or that have VAF > 0.05 (based on vafCorrect) and carry the ‘PASS’ flag in at least one of the related samples, are uploaded to the DepMap API.
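One reading of this upload rule is sketched below. The function and parameter names are hypothetical, and the grouping of the conditions is an interpretation: a variant qualifies if it is PASS in the sample itself, or if its vafCorrect VAF exceeds 0.05 and it is PASS in at least one related sample.

```python
MIN_RELATED_VAF = 0.05  # threshold stated in the text


def should_upload(is_pass: bool, vaf: float, pass_in_related: bool) -> bool:
    """Decide whether a variant is uploaded, under the reading described above."""
    rescued = vaf > MIN_RELATED_VAF and pass_in_related
    return is_pass or rescued
```

If the sample itself counts among the related samples, this gives the same result as grouping the conditions as (PASS or VAF > 0.05) and PASS in a related sample.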

VCF Annotation & Cancer Driver Flagging

Somatic drivers: mutations were flagged with a DRV=<variant effect(s)> tag in the INFO column if they either:

  1. Overlap the genomic coordinates of a mutation with the same effect (e.g. missense, nonsense, inframe, or silent) in the table of mutations
  2. Are in a LoF or ambiguous gene, and the mutation effect is one of frameshift, nonsense, ess_splice, start_lost or stop_lost.

Targeted Gene Sequencing Data (Organoids)

Sequencing is performed on samples obtained from the established organoid model at the time of banking.

Somatic mutations from targeted pulldown sequencing of our v4 panel are called using the CaVEMan and Pindel algorithms. Sequencing data from a matched normal sample and a panel of unrelated normal samples are used as references (GRCh38) to discard germline variants and technology-specific artefacts. Variant consequences were annotated using VAGrENT.

Somatic variants reported by these algorithms are flagged by filters designed to detect common causes of false positives. This filtering was performed using cgpCaVEManPostProcessing. Those variants which pass all filters are given a PASS flag.

VCF Annotation & Cancer Driver Flagging

Somatic drivers: mutations were flagged with a DRV=<variant effect(s)> tag in the INFO column if they either:

  1. Overlap the genomic coordinates of a mutation with the same effect (e.g. missense, nonsense, inframe, or silent) in the table of mutations
  2. Are in a LoF or ambiguous gene, and the mutation effect is one of frameshift, nonsense, ess_splice, start_lost or stop_lost.

Dataset Annotation & Integration

For further documentation on the annotation of genomic datasets and on model authentication, see the links below.