Gene Mapping & Cancer Driver Annotation

Gene Mapping & SIDGs

All genes have an internal ID, allowing mapping to current and previous HGNC gene symbols, Ensembl Gene IDs (v91) and other external gene identifiers. All genes with an HGNC-approved symbol as of April 2018 are currently included, including those without a protein product.

Any dataset values that were mapped to genes without an official gene symbol have been discarded from processing, but continue to be available in raw data downloads.


Cancer Driver Annotation: Genes

The list of driver genes is the union of two complementary gene sets: intOGen and COSMIC Tier 1.

intOGen Genes

The intOGen analysis pipeline uses seven different methods to identify cancer driver genes:

COSMIC Tier 1 Genes

COSMIC Tier 1 gene requirements:

  • Mutation patterns which strongly support involvement in cancer aetiology.
  • Evidence of how the gene’s dysfunction impacts the hallmarks of cancer.
  • Publications from two independent groups describing mutations in at least one type of cancer.

Method of Action

The method of action of a gene can be one of four values:

Act (Activating)

Oncogene

LoF (Loss of Function)

Tumour Suppressor Gene

ambiguous

Evidence of both Activating and Loss of Function events.

fusion

Only seen as a partner in a translocation + gene fusion.

Given the roles assigned to a gene by intOGen and COSMIC, its method of action was chosen according to the following rules:

  1. Ignore fusion unless it is the only method seen.
  2. If there’s only one method of action across both intOGen and COSMIC, use that provided.
  3. Otherwise choose ambiguous.

The original methods of action assigned by intOGen (intogen_moa) and COSMIC (cosmic_moa) are also available.
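The three rules above can be sketched as a small function. This is a minimal illustration rather than the pipeline's actual code: the function name combined_moa and the choice to pass each source's roles as a list of strings are assumptions made for the example.

```python
def combined_moa(intogen_moa, cosmic_moa):
    """Combine per-source methods of action into a single value.

    Each argument is an iterable of values drawn from
    {"Act", "LoF", "ambiguous", "fusion"}; pass an empty iterable when
    a source does not list the gene. (Hypothetical interface.)
    """
    seen = set(intogen_moa) | set(cosmic_moa)
    if not seen:
        return None  # gene is not annotated by either source
    # Rule 1: ignore fusion unless it is the only method seen.
    if seen != {"fusion"}:
        seen.discard("fusion")
    # Rule 2: a single remaining method is used as provided.
    if len(seen) == 1:
        return next(iter(seen))
    # Rule 3: otherwise the gene is ambiguous.
    return "ambiguous"
```

For example, a gene listed as Act by intOGen and as Act plus fusion by COSMIC resolves to Act, while a gene listed as LoF by one source and Act by the other resolves to ambiguous.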

Gene contributions by intOGen and COSMIC

Method of Action   COSMIC only   intOGen only   Both COSMIC and intOGen   Total
Act                85            66             117                       268
LoF                60            89             108                       257
ambiguous          15            52             134                       201
fusion             57            0              0                         57
Total              217           207            359                       783


Cancer Driver Annotation: Mutations

The driver mutations table is the union of four datasets:

  • CGI catalog of validated oncogenic mutations
  • MSKCC Hotspots version 2
    • Includes a set of in-frame indels unique to this dataset
  • BoostDM dataset
    • A new machine-learning-based classification of mutations in the intOGen cohorts
  • CPV dataset

Processing of Mutation Datasets

The CGI, MSKCC and CPV datasets have chromosome coordinates recorded on GRCh37, and were remapped to GRCh38 using UCSC’s hg19 to hg38 chain file and liftOver utility. BoostDM is provided on GRCh38.
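As an illustration of the remapping step, the sketch below converts 1-based inclusive coordinates to and from BED lines, one of the input formats liftOver accepts. The source does not say which input format was used, so the BED round trip, the function names and the file names in the comment are assumptions.

```python
def to_bed_line(chr_name, start, end, name):
    """Convert a 1-based inclusive interval to a 0-based half-open BED line."""
    chrom = chr_name if chr_name.startswith("chr") else "chr" + chr_name
    return f"{chrom}\t{start - 1}\t{end}\t{name}"


def from_bed_line(line):
    """Convert a BED line back to (chr_name, start, end, name), 1-based inclusive."""
    chrom, bed_start, bed_end, name = line.rstrip("\n").split("\t")[:4]
    chr_name = chrom[3:] if chrom.startswith("chr") else chrom
    return chr_name, int(bed_start) + 1, int(bed_end), name


# The BED file would then be remapped on the command line, for example:
#   liftOver grch37.bed hg19ToHg38.over.chain.gz grch38.bed unmapped.bed
# (grch37.bed, grch38.bed and unmapped.bed are placeholder file names.)
line = to_bed_line("7", 1000, 1002, "mutation_1")
```

Entries that liftOver writes to the unmapped file have no GRCh38 equivalent and would need to be handled separately.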

The files for each dataset were processed to extract the information required for a row in the table:

CGI catalog

The catalog_of_validated_oncogenic_mutations.tsv file contains a separate row for each nucleotide mutation. Mutations with the same effect (e.g. “missense”, “ess_splice”) at the same location are merged together.
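The merge can be pictured as grouping rows on location and effect. The sketch below assumes each nucleotide mutation has already been parsed into a dict; the key names (gene, chr_start, chr_end, effect, alt_aa) are hypothetical and are not the column names of the CGI file.

```python
from collections import defaultdict


def merge_by_location_and_effect(mutations):
    """Merge mutations that share a gene, genomic span and effect into one row."""
    groups = defaultdict(list)
    for m in mutations:
        key = (m["gene"], m["chr_start"], m["chr_end"], m["effect"])
        groups[key].append(m["alt_aa"])
    merged = []
    for (gene, start, end, effect), alts in groups.items():
        merged.append({
            "gene": gene,
            "chr_start": start,
            "chr_end": end,
            "effect": effect,
            # Distinct alternate amino acids seen at this position.
            "alt_aa_list": ",".join(sorted(set(alts))),
            # Number of source mutations that contributed to this row.
            "cgi": len(alts),
        })
    return merged


rows = merge_by_location_and_effect([
    {"gene": "GENE1", "chr_start": 100, "chr_end": 102, "effect": "missense", "alt_aa": "K"},
    {"gene": "GENE1", "chr_start": 100, "chr_end": 102, "effect": "missense", "alt_aa": "E"},
    {"gene": "GENE1", "chr_start": 100, "chr_end": 102, "effect": "nonsense", "alt_aa": "*"},
])
```

Here the two missense mutations collapse into one row with two contributing mutations, while the nonsense mutation keeps its own row.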

MSKCC Hotspots

Each entry in the mskcc_hotspots.json file (downloaded from the MSKCC API) groups mutations at an amino acid position in a gene. Missense and nonsense mutations at the same amino acid are split into separate rows in the table. Splice site mutations are grouped by amino acid position, i.e. by intron; these were split into separate donor and acceptor entries by grouping contiguous ranges of coordinates into new rows. The SNV-hotspots and INDEL-hotspots sheets of hotspots_v2.xls were parsed in parallel to add required information, such as genomic coordinates, that is not present in the mskcc_hotspots.json file. (Other information, such as EnsEMBL transcript IDs, is only found in mskcc_hotspots.json.)
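Grouping contiguous ranges of coordinates, as used to separate the donor and acceptor sites of an intron, might look like the following. The function name and the example positions are illustrative only.

```python
def contiguous_ranges(positions):
    """Group integer positions into sorted (start, end) runs of adjacent values."""
    ranges = []
    for pos in sorted(set(positions)):
        if ranges and pos == ranges[-1][1] + 1:
            ranges[-1][1] = pos  # extend the current run
        else:
            ranges.append([pos, pos])  # start a new run
    return [tuple(r) for r in ranges]


# Splice site positions at either end of one intron fall into two runs.
sites = contiguous_ranges([101, 100, 205, 204, 206])
```

Each resulting run would become its own row, so an intron whose donor and acceptor sites are both mutated yields two entries.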

BoostDM dataset

The BoostDM data consists of a separate .tsv file for each cancer type. Any mutation which has a “boostDM class” of “True” (which is a score > 0.5) in any tissue is loaded.
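A minimal sketch of this filter is shown below. The column names (gene, chr, pos, alt, boostDM_class) and the way files are passed in are assumptions made for the example; the real BoostDM files may be laid out differently.

```python
import csv
import io


def load_boostdm_drivers(tsv_files):
    """Collect mutations classed as drivers in at least one cancer type.

    tsv_files maps a cancer type label to an open text handle for that
    cancer type's .tsv file. Returns a dict mapping each mutation key to
    the set of cancer types in which it was classed "True".
    """
    loaded = {}
    for cancer_type, handle in tsv_files.items():
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["boostDM_class"] == "True":
                key = (row["gene"], row["chr"], row["pos"], row["alt"])
                loaded.setdefault(key, set()).add(cancer_type)
    return loaded


header = "gene\tchr\tpos\talt\tboostDM_class\n"
drivers = load_boostdm_drivers({
    "type_a": io.StringIO(header + "GENE1\t1\t100\tA\tTrue\nGENE1\t1\t101\tC\tFalse\n"),
    "type_b": io.StringIO(header + "GENE1\t1\t100\tA\tFalse\nGENE2\t2\t50\tT\tTrue\n"),
})
```

A mutation classed as a driver in only one cancer type is still loaded, which matches the "in any tissue" rule.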

CPV dataset

Germline predisposition variants were taken from the S2A.Pathogenic_variants sheet in the supplementary data file Germline predisposition variants from Huang et al.

Post Processing and QC

Overlapping mutations with the same effect were merged together. The EnsEMBL transcript IDs were checked and updated where they were no longer current, and the genomic strand was added. Amino acid coordinates were mapped back to genomic coordinates to check that they match the recorded coordinates, and mutations had their footprint on the genomic sequence expanded to span whole codons. Mutations in codons which straddle an intron were split into two separate rows with separate genomic coordinates.
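To illustrate how an amino acid coordinate maps to whole-codon genomic coordinates, and why a codon that straddles an intron produces two rows, here is a sketch under stated assumptions. It assumes the coding segments of a transcript are available as 1-based inclusive (start, end) pairs listed in transcript order; the function name and this input layout are hypothetical.

```python
def codon_to_genomic(cds_segments, strand, pep_coord):
    """Return the genomic span(s) covered by the codon for one amino acid.

    cds_segments: coding (start, end) pairs, 1-based inclusive, in
    transcript order (so descending genomic order on the reverse strand).
    strand: 1 for forward, -1 for reverse.
    pep_coord: 1-based amino acid position.
    """
    first = 3 * (pep_coord - 1)  # 0-based CDS offset of the codon's first base
    last = first + 2
    pieces = []
    offset = 0  # CDS offset of the first base of the current segment
    for start, end in cds_segments:
        length = end - start + 1
        lo = max(first, offset)
        hi = min(last, offset + length - 1)
        if lo <= hi:  # the codon overlaps this segment
            if strand == 1:
                pieces.append((start + (lo - offset), start + (hi - offset)))
            else:
                pieces.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return pieces


# Forward strand: the first segment holds 5 coding bases, so codon 2 is split.
split_codon = codon_to_genomic([(100, 104), (200, 210)], 1, 2)
```

A codon that lies within one segment returns a single span, while one that crosses a segment boundary returns two spans, corresponding to the two separate rows described above.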

Driver Mutation Data Field Definitions

symbol

The current HGNC gene symbol for the gene.

chr_name

The chromosome name, without a chr prefix for the main chromosomes.

chr_start

Start position of the mutation on the chromosome, in 1-based coordinates.

chr_end

End position of the mutation on the chromosome, in 1-based coordinates.

strand

1 for the forward strand of the chromosome, and -1 for the reverse strand.

effect

The effect (a.k.a. consequence) of the mutation, as it would be annotated by VAGrENT.

ref_aa

The reference amino acid(s) in this span of the chromosome.

pep_coord

The coordinate of the first reference amino acid within the peptide of this transcript.

alt_aa_list

A comma-separated list of alternate amino acids observed in mutations with this effect at this position.

ensembl_transcript

The EnsEMBL stable ID of this transcript.

cgi

The number of mutations from the CGI catalog which contributed to this row, 0 if none.

mskcc

The number of mutations from the MSKCC Hotspots which contributed to this row, 0 if none.

boost

The number of mutations from the BoostDM dataset which contributed to this row, 0 if none.

cpv

The number of germ line predisposition variants from the CPV dataset which contributed to this row, 0 if none. All variants where this column is non-zero are used to flag CPV variants, regardless of whether or not the gene is considered a cancer driver in the genes dataset.
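The fields above could be represented in code roughly as follows. The field names come from the definitions; the Python types, the class name and the helper methods are assumptions, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass
class DriverMutation:
    symbol: str
    chr_name: str
    chr_start: int           # 1-based
    chr_end: int             # 1-based
    strand: int              # 1 (forward) or -1 (reverse)
    effect: str
    ref_aa: str
    pep_coord: int
    alt_aa_list: str         # comma-separated
    ensembl_transcript: str
    cgi: int = 0
    mskcc: int = 0
    boost: int = 0
    cpv: int = 0

    def alt_aas(self):
        """Split alt_aa_list into individual alternate amino acids."""
        return self.alt_aa_list.split(",") if self.alt_aa_list else []

    def sources(self):
        """Names of the datasets that contributed at least one mutation."""
        counts = {"cgi": self.cgi, "mskcc": self.mskcc,
                  "boost": self.boost, "cpv": self.cpv}
        return [name for name, n in counts.items() if n > 0]

    def is_cpv(self):
        """True when the row is flagged as a predisposition variant."""
        return self.cpv > 0


# Placeholder values for illustration only.
row = DriverMutation("GENE1", "1", 100, 102, 1, "missense", "V", 34,
                     "E,K", "ENST00000000000", cgi=2, boost=1)
```

In this sketch the example row reports cgi and boost as its sources and is not flagged as a CPV variant.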