Gene Mapping & Cancer Driver Annotation

Gene Mapping & SIDGs

All genes have an internal ID that is mapped directly to an HGNC ID, allowing mapping to current HGNC gene symbols, Ensembl gene IDs and other external gene identifiers. Datasets are uploaded using Ensembl gene identifiers wherever possible and, failing that, using gene symbols. Any dataset values mapped to genes without a current ID or symbol are discarded from processing, but remain available in raw data downloads.

All genes have an internal identifier (SIDG) that is mapped directly to an HGNC ID, allowing mapping to current HGNC gene symbols. For a current HGNC release, all IDs map 1:1 to Ensembl gene identifiers. Datasets are uploaded using Ensembl gene IDs wherever possible. In other cases, e.g. when only Ensembl transcript identifiers are supplied, BioMart is used to map to the SIDG records. Only in the absence of other identifiers are datasets imported using gene symbols.

When updating to a later HGNC release, withdrawn and merged/split HGNC entries are marked as not current. Any previously uploaded dataset values mapped to a gene whose SIDG has an HGNC status that is not current are removed from downloads and webpages, but continue to be available in the API and raw data downloads.


Cancer Driver Annotation: Genes

The list of driver genes is the union of two complementary gene sets: intOGen and COSMIC Tier 1.

intOGen Genes

The intOGen analysis pipeline uses seven different methods to identify cancer driver genes:

COSMIC Tier 1 Genes

COSMIC Tier 1 gene requirements:

  • Mutation patterns which strongly support involvement in cancer aetiology.
  • Evidence of how the gene’s dysfunction impacts hallmarks of cancer.
  • Publications from two independent groups describing mutations in at least one type of cancer.

Method of Action (Cancer Driver Genes)

The method of action of a gene can be one of four values:

Act (Activating)

Oncogene

LoF (Loss of Function)

Tumour Suppressor Gene

ambiguous

Evidence of both Activating and Loss of Function events.

fusion

Only seen as a partner in a translocation + gene fusion.

Given its roles assigned by intOGen and COSMIC, the method of action for each gene was chosen using the following rules:

  1. Ignore fusion unless it is the only method seen.
  2. If there’s only one method of action across both intOGen and COSMIC, use it.
  3. Otherwise choose ambiguous.

The original methods of action assigned by intOGen (intogen_moa) and COSMIC (cosmic_moa) are also available.
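
The three rules above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline's actual code: the function name combine_moa is hypothetical, and the sketch assumes each source supplies at most one of the four values, or None when the gene is absent from that source.

```python
FUSION = "fusion"
AMBIGUOUS = "ambiguous"


def combine_moa(intogen_moa=None, cosmic_moa=None):
    """Combine the intOGen and COSMIC methods of action for one gene.

    Each argument is one of "Act", "LoF", "ambiguous", "fusion",
    or None if the source does not list the gene (an assumption).
    """
    seen = {moa for moa in (intogen_moa, cosmic_moa) if moa is not None}
    if not seen:
        return None  # gene is in neither source

    # Rule 1: ignore fusion unless it is the only method seen.
    non_fusion = seen - {FUSION}
    if not non_fusion:
        return FUSION

    # Rule 2: a single remaining method of action is used as-is.
    if len(non_fusion) == 1:
        return next(iter(non_fusion))

    # Rule 3: anything else is ambiguous.
    return AMBIGUOUS
```

For example, a gene that intOGen labels Act and COSMIC labels fusion resolves to Act, while Act in one source and LoF in the other resolves to ambiguous.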

Gene contributions by intOGen and COSMIC

Method of Action  COSMIC Only  intOGen Only  Both COSMIC and intOGen  Total
Act                        85            66                      117    268
LoF                        60            89                      108    257
ambiguous                  15            52                      134    201
fusion                     57             0                        0     57
Total                     217           207                      359    783


Cancer Predisposition Annotation: Genes

The list of Cancer Predisposition Genes (CPGs) was derived from the 'S1A.Cancer_predisposition_genes' and 'S2A.Pathogenic_variants' sheets of Huang et al., Cell, 2018.

Genes listed in 'S1A.Cancer_predisposition_genes' were flagged as CPGs unless:

  1. The gene is annotated in 'S1A.Cancer_predisposition_genes: Gene_Classification' as 'Not classified' AND the gene is either not present in the CMP Driver Genes list or annotated as a fusion in that list.
  2. The gene listed in 'S1A.Cancer_predisposition_genes' has no variants present in 'S2A.Pathogenic_variants' AND the 'S1A.Cancer_predisposition_genes: Gene_Classification' is Oncogene OR CMP:MoA = Act.

The two genes present in 'S2A.Pathogenic_variants' but not in 'S1A.Cancer_predisposition_genes', AR and BARD1, were also included in the Cancer Predisposition Genes list.
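
The two exclusion rules for genes listed in S1A can be sketched as a single predicate. This is an illustration under stated assumptions, not the actual implementation: the function and argument names are hypothetical, a gene absent from the Driver Genes list is represented by driver_moa=None, and rule 2 is read as "no pathogenic variants AND (classification is Oncogene OR driver MoA is Act)", which is one possible reading of the wording above.

```python
def is_cpg(gene_classification, has_pathogenic_variant, driver_moa=None):
    """Return True if a gene listed in S1A should be flagged as a CPG.

    gene_classification    -- value of the S1A Gene_Classification column
    has_pathogenic_variant -- True if the gene has a variant in S2A
    driver_moa             -- CMP driver MoA, or None if not a driver gene
    """
    # Rule 1: unclassified genes that are not drivers, or are fusion-only drivers.
    if gene_classification == "Not classified" and driver_moa in (None, "fusion"):
        return False

    # Rule 2 (assumed grouping): no S2A variants AND (Oncogene OR MoA = Act).
    if not has_pathogenic_variant and (
        gene_classification == "Oncogene" or driver_moa == "Act"
    ):
        return False

    return True
```

AR and BARD1 are handled separately, since they are not listed in S1A and so never reach this check.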

Method of Action (Cancer Predisposition Genes)

For Cancer Predisposition Genes that are also Cancer Driver Genes, the driver Method of Action is used. There are no direct conflicts between the Driver Gene MoA and the 'S1A.Cancer_predisposition_genes: Gene_Classification'; however, some genes defined there as a Tumor Suppressor Gene or Oncogene are listed as ambiguous in the Driver Gene MoA.

For those Cancer Predisposition Genes which are not also Cancer Driver Genes, the 'S1A.Cancer_predisposition_genes: Gene_Classification' is used and translated into the matching Act and LoF terms.
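
A minimal sketch of this choice is shown below. The function name cpg_moa is hypothetical, and the exact spelling of the Gene_Classification labels in the S1A sheet is an assumption; the mapping itself follows the definitions given earlier (Oncogene corresponds to Act, Tumour Suppressor Gene to LoF).

```python
# Assumed label spellings, compared case-insensitively; both the US and
# UK spellings of "tumour" are accepted because the source spelling is unknown.
CLASSIFICATION_TO_MOA = {
    "oncogene": "Act",
    "tumor suppressor gene": "LoF",
    "tumour suppressor gene": "LoF",
}


def cpg_moa(gene_classification, driver_moa=None):
    """Method of action for a Cancer Predisposition Gene.

    The driver MoA takes precedence when the gene is also a driver gene;
    otherwise the S1A classification is translated. Labels with no
    matching term yield None.
    """
    if driver_moa is not None:
        return driver_moa
    return CLASSIFICATION_TO_MOA.get(gene_classification.strip().lower())
```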

The Huang et al., Cell, 2018 Gene Classification is provided in the Cancer Driver and Predisposition Gene List.


Cancer Driver & Predisposition Annotation: Mutations

The cancer driver mutation list is the union of three datasets:

  • CGI catalog of validated oncogenic mutations
  • MSKCC Hotspots version 2
    • Includes a set of in-frame indels unique to this dataset
  • BoostDM dataset
    • A new machine-learning-based classification of mutations in the intOGen cohorts

The cancer predisposition variant list uses the variants from Huang et al., Cell, 2018 for the selected Cancer Predisposition Genes.

Processing of Mutation Datasets

The CGI, MSKCC and CPV datasets have chromosome coordinates recorded on GRCh37, and were remapped to GRCh38 using UCSC’s hg19 to hg38 chain file and liftOver utility. BoostDM is provided on GRCh38.
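
The liftOver utility reads BED input, which uses 0-based, half-open intervals and UCSC-style chromosome names with a chr prefix. A minimal sketch of the conversion to and from that format is given below. The helper names are hypothetical, and it is an assumption that the source coordinates are 1-based and inclusive, like the chr_start and chr_end fields of the final table.

```python
def to_bed_line(chr_name, chr_start, chr_end, name="."):
    """Format a 1-based inclusive interval as a BED line for liftOver."""
    chrom = chr_name if chr_name.startswith("chr") else "chr" + chr_name
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based end.
    return f"{chrom}\t{chr_start - 1}\t{chr_end}\t{name}"


def from_bed_line(line):
    """Parse a lifted BED line back into (chr_name, chr_start, chr_end)."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    chr_name = chrom[3:] if chrom.startswith("chr") else chrom
    return chr_name, int(start) + 1, int(end)


# The remapping itself is run outside Python, for example:
#   liftOver grch37.bed hg19ToHg38.over.chain.gz grch38.bed unmapped.bed
# where the .bed file names are placeholders.
```

Intervals that liftOver cannot map are written to the unmapped file and would need to be reviewed separately.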

The files for each dataset were processed to extract the information required for a row in the table:

CGI catalog

The catalog_of_validated_oncogenic_mutations.tsv file contains a separate row for each nucleotide mutation. Mutations with the same effect (e.g. "missense", "ess_splice") at the same location are merged together.
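
The merge step can be sketched as a group-by on gene, location and effect. The sketch below is illustrative only: the dictionary keys are hypothetical rather than the catalog's actual column names, and "same location" is taken to mean identical genomic coordinates.

```python
def merge_same_effect(rows):
    """Merge per-nucleotide rows sharing gene, location and effect.

    Each input row is a dict with the (hypothetical) keys
    gene, chr_name, chr_start, chr_end, effect and alt_aa.
    """
    merged = {}
    for row in rows:
        key = (row["gene"], row["chr_name"], row["chr_start"],
               row["chr_end"], row["effect"])
        entry = merged.setdefault(key, {"alt_aa": [], "count": 0})
        if row["alt_aa"] not in entry["alt_aa"]:
            entry["alt_aa"].append(row["alt_aa"])
        entry["count"] += 1  # number of source mutations behind this row

    return [
        {
            "gene": gene, "chr_name": chr_name, "chr_start": start,
            "chr_end": end, "effect": effect,
            "alt_aa_list": ",".join(entry["alt_aa"]),
            "count": entry["count"],
        }
        for (gene, chr_name, start, end, effect), entry in merged.items()
    ]
```

The count kept per merged row corresponds to the idea behind the cgi field described in the field definitions: the number of catalog mutations that contributed to a row.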

MSKCC Hotspots

Each entry in the mskcc_hotspots.json file (downloaded from the MSKCC API) groups mutations at an amino acid position in a gene. Missense and nonsense mutations at the same amino acid are split into separate rows in the table. Splice site mutations are grouped by amino acid position, i.e. by intron; these were split into separate donor and acceptor entries by grouping contiguous ranges of coordinates into new rows. The SNV-hotspots and INDEL-hotspots sheets of hotspots_v2.xls were parsed in parallel to add required information, such as genomic coordinates, that is not present in the mskcc_hotspots.json file. (Other information, such as EnsEMBL transcript IDs, is only found in mskcc_hotspots.json.)
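
Grouping contiguous coordinates into ranges, as used to separate donor and acceptor splice sites, can be sketched as follows. The function name is hypothetical and the input is assumed to be a plain collection of integer genomic positions for one intron.

```python
def contiguous_ranges(positions):
    """Group integer positions into sorted (start, end) inclusive ranges."""
    ranges = []
    for pos in sorted(set(positions)):
        if ranges and pos == ranges[-1][1] + 1:
            # Extends the current run by one base.
            ranges[-1] = (ranges[-1][0], pos)
        else:
            # A gap was found, so a new run (new row) starts here.
            ranges.append((pos, pos))
    return ranges
```

Positions at the two ends of an intron are separated by a large gap, so they fall into different ranges and become separate rows.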

BoostDM dataset

The BoostDM data consists of a separate .tsv file for each cancer type. Any mutation which has a “boostDM class” of “True” (which is a score > 0.5) in any tissue is loaded.
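
A minimal sketch of this filter is shown below. The column names (gene, chr, pos, alt, boostDM_class) and the function name are assumptions made for illustration and should be checked against the actual BoostDM files.

```python
import csv


def load_boostdm_drivers(tsv_by_tissue):
    """Collect mutations classed as drivers in at least one tissue.

    tsv_by_tissue maps a cancer type name to an open TSV file handle.
    Returns a dict from mutation key to the set of tissues in which
    the mutation has a boostDM class of "True".
    """
    kept = {}
    for tissue, handle in tsv_by_tissue.items():
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["boostDM_class"] == "True":
                key = (row["gene"], row["chr"], int(row["pos"]), row["alt"])
                kept.setdefault(key, set()).add(tissue)
    return kept
```

A mutation that is "True" in one tissue and "False" in another is still loaded, because a single positive classification is sufficient.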

CPV dataset

Germline predisposition variants were taken from the S2A.Pathogenic_variants sheet in the supplementary data file 'Germline predisposition variants' from Huang et al.

Post Processing and QC

Overlapping mutations with the same effect were merged together. The EnsEMBL transcript IDs were checked and updated where they were no longer current, and the genomic strand was added. Amino acid coordinates were mapped back to genomic coordinates to check that they match the recorded coordinates, and mutations had their footprint on the genomic sequence expanded to span whole codons. Mutations in codons which straddle an intron were split into two separate rows with separate genomic coordinates.
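
Mapping an amino acid coordinate back to whole-codon genomic coordinates, including the case where a codon straddles an intron, can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: the function name is hypothetical, and the coding segments of the transcript are assumed to be supplied as 1-based inclusive (start, end) genomic intervals with untranslated regions already trimmed.

```python
def codon_genomic_ranges(cds_exons, pep_coord, strand=1):
    """Genomic range(s) covered by codon number pep_coord (1-based).

    cds_exons -- list of (start, end) coding segments, 1-based inclusive
    strand    -- 1 for forward, -1 for reverse
    Returns one range, or two when the codon straddles an intron.
    """
    # Walk the segments in transcript order.
    exons = sorted(cds_exons, reverse=(strand == -1))
    first = 3 * (pep_coord - 1)  # 0-based CDS offset of the codon's first base
    last = first + 2
    ranges = []
    offset = 0  # CDS offset of the first base of the current segment
    for start, end in exons:
        length = end - start + 1
        lo = max(first, offset)
        hi = min(last, offset + length - 1)
        if lo <= hi:  # the codon overlaps this segment
            if strand == 1:
                ranges.append((start + lo - offset, start + hi - offset))
            else:
                ranges.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    if sum(e - s + 1 for s, e in ranges) != 3:
        raise ValueError("codon lies outside the coding sequence")
    return ranges
```

With coding segments (100, 104) and (200, 210) on the forward strand, codon 2 covers bases 103 to 104 and base 200, so it is returned as two ranges and would become two rows.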

Driver Mutation Data Field Definitions

symbol

The current HGNC gene symbol for the gene.

chr_name

The chromosome name, without a chr prefix for the main chromosomes.

chr_start

Start position of the mutation on the chromosome, in 1-based coordinates.

chr_end

End position of the mutation on the chromosome, in 1-based coordinates.

strand

1 for the forward strand of the chromosome, and -1 for the reverse strand.

effect

The effect (a.k.a. consequence) of the mutation, as it would be annotated by VAGrENT.

ref_aa

The reference amino acid(s) in this span of the chromosome.

pep_coord

The coordinate of the first amino acid in the peptide of this transcript.

alt_aa_list

A comma separated list of alternate amino acids observed in mutations with this effect at this position.

ensembl_transcript

The EnsEMBL stable ID of this transcript.

cgi

The number of mutations from the CGI catalog which contributed to this row, 0 if none.

mskcc

The number of mutations from the MSKCC Hotspots which contributed to this row, 0 if none.

boost

The number of mutations from the BoostDM dataset which contributed to this row, 0 if none.

cpv

The number of germline predisposition variants from the CPV dataset which contributed to this row, 0 if none. All variants where this column is non-zero are used to flag CPV variants, regardless of whether or not the gene is considered a cancer driver in the genes dataset.
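
The fields above can be represented as a typed record, sketched below. The class name, the field types and the helper methods are assumptions made for illustration; only the field names come from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class DriverMutationRow:
    symbol: str
    chr_name: str
    chr_start: int  # 1-based
    chr_end: int    # 1-based
    strand: int     # 1 (forward) or -1 (reverse)
    effect: str
    ref_aa: str
    pep_coord: int
    alt_aa_list: str  # comma separated
    ensembl_transcript: str
    cgi: int
    mskcc: int
    boost: int
    cpv: int

    def alt_aas(self):
        """Alternate amino acids as a list."""
        return [aa.strip() for aa in self.alt_aa_list.split(",") if aa.strip()]

    def sources(self):
        """Names of the datasets that contributed to this row."""
        return [name for name in ("cgi", "mskcc", "boost", "cpv")
                if getattr(self, name) > 0]

    def is_cpv(self):
        """True if any germline predisposition variant contributed."""
        return self.cpv > 0
```

Because the cpv check is independent of the gene-level driver annotation, is_cpv mirrors the rule that any row with a non-zero cpv count flags a predisposition variant.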