Gene Mapping & Cancer Driver Annotation
Gene Mapping & SIDGs
All genes have an internal ID, allowing mapping to current and previous HGNC gene symbols, Ensembl Gene IDs (v91) and other external gene identifiers. All genes with an HGNC-approved symbol as of April 2018 are currently included, including those without a protein product.
Any dataset values that were mapped to genes without an official gene symbol have been discarded from processing, but continue to be available in raw data downloads.
Cancer Driver Annotation: Genes
COSMIC Tier 1 Genes
COSMIC Tier 1 gene requirements:
- Mutation patterns which strongly support involvement in cancer aetiology.
- Evidence of how the gene’s dysfunction impacts the hallmarks of cancer.
- Publications from two independent groups describing mutations in at least one type of cancer.
Method of Action
The method of action of a gene can be one of four values:
|LoF (Loss of Function)||Tumour Suppressor Gene|
|ambiguous||Evidence of both Activating and Loss of Function events.|
|fusion||Only seen as a partner in a translocation + gene fusion.|
Given the roles assigned to it by intOGen and COSMIC, the method of action for each gene was chosen according to the following rules:
- Ignore fusion unless it is the only method seen.
- If there’s only one method of action across both intOGen and COSMIC, use that one.
- Otherwise choose ambiguous.
The original methods of action assigned by intOGen (intogen_moa) and COSMIC (cosmic_moa) are also available.
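The selection rules listed above can be sketched as a small function. This is a minimal illustration, not the pipeline's actual code: the function name `choose_moa`, the representation of each source's roles as a collection of strings, and the label `"Act"` for an activating role are all assumptions made for the example. The labels `"LoF"`, `"fusion"` and `"ambiguous"` follow the values described in this section.

```python
def choose_moa(intogen_roles, cosmic_roles):
    """Pick one method of action from the roles given by both sources.

    Hypothetical helper: role labels such as "Act" are assumed, not
    taken from the original pipeline.
    """
    # Pool the roles from both sources, dropping empty values.
    seen = {role for role in (*intogen_roles, *cosmic_roles) if role}
    if not seen:
        return None  # gene not annotated by either source

    # Rule 1: ignore fusion unless it is the only method seen.
    non_fusion = seen - {"fusion"}
    if not non_fusion:
        return "fusion"

    # Rule 2: a single remaining method is used as provided.
    if len(non_fusion) == 1:
        return next(iter(non_fusion))

    # Rule 3: anything else is ambiguous.
    return "ambiguous"
```

For instance, a gene annotated as `"LoF"` by one source and as both `"LoF"` and `"fusion"` by the other would resolve to `"LoF"` under this sketch, since the fusion role is ignored when another method is present.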
|Method of Action||COSMIC Only||intOGen Only||Both COSMIC and intOGen||Total|
Cancer Driver Annotation: Mutations
The driver mutations table is the union of three datasets:
- CGI catalog of validated oncogenic mutations
- MSKCC Hotspots version 2
  - Includes a set of in-frame indels unique to this dataset
- BoostDM dataset
  - A new machine-learning-based classification of mutations in the intOGen cohorts
Processing of Mutation Datasets
The files for each dataset were processed to extract the information required for a row in the table, and to merge mutations from different datasets at the same position.
The catalog_of_validated_oncogenic_mutations.tsv file contains a separate row for each nucleotide mutation. Mutations with the same effect (e.g. "missense", "ess_splice") at the same location are merged together.
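As a rough illustration of this merge step, the sketch below groups per-nucleotide rows by gene, location and effect, collecting the alternate amino acids and counting the contributing mutations. The function name and the row keys (`gene`, `aa_pos`, `effect`, `alt_aa`) are hypothetical and do not reflect the actual column names of the CGI file; using the amino acid position as the "location" is also an assumption.

```python
from collections import defaultdict


def merge_by_effect_and_location(rows):
    """Merge rows that share gene, amino acid position and effect.

    `rows` is an iterable of dicts with the hypothetical keys
    "gene", "aa_pos", "effect" and "alt_aa".
    """
    merged = defaultdict(lambda: {"alts": set(), "count": 0})
    for row in rows:
        key = (row["gene"], row["aa_pos"], row["effect"])
        merged[key]["alts"].add(row["alt_aa"])  # distinct alternate residues
        merged[key]["count"] += 1               # nucleotide mutations merged
    # Flatten each set into a comma-separated, sorted string.
    return {
        key: {"alts": ",".join(sorted(value["alts"])), "count": value["count"]}
        for key, value in merged.items()
    }
```

Keeping a per-group count mirrors the per-dataset contribution counts recorded for each row of the driver mutations table.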
Each entry in the mskcc_hotspots.json file (downloaded from the MSKCC API) groups mutations at an amino acid position in a gene. Missense and nonsense mutations at the same amino acid are split into separate rows in the table. Splice site mutations are grouped by amino acid position, i.e. by intron; these were split into separate donor and acceptor entries by grouping contiguous ranges of coordinates into new rows. The SNV-hotspots and INDEL-hotspots sheets of hotspots_v2.xls were parsed in parallel to add required information, such as genomic coordinates, that is not present in the mskcc_hotspots.json file. (Other information, such as EnsEMBL transcript IDs, is only found in mskcc_hotspots.json.)
The BoostDM data consists of a separate .tsv file for each cancer type. Any mutation that has a “boostDM class” of “True” (i.e. a score > 0.5) in any tissue is loaded.
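The "True in any tissue" selection might look like the sketch below. It is an illustration under stated assumptions, not the loader actually used: the function name, the column names (`gene`, `chr`, `pos`, `alt`, `boostDM_score`) and the choice to test the score rather than the class column are all hypothetical.

```python
import csv


def load_boostdm_drivers(tsv_by_cancer_type, threshold=0.5):
    """Collect mutations scored above `threshold` in at least one file.

    `tsv_by_cancer_type` maps a cancer type name to an open text handle
    for that cancer type's .tsv file. Column names are assumed.
    """
    drivers = {}
    for cancer_type, handle in tsv_by_cancer_type.items():
        for row in csv.DictReader(handle, delimiter="\t"):
            # A score above 0.5 corresponds to a boostDM class of "True".
            if float(row["boostDM_score"]) > threshold:
                key = (row["gene"], row["chr"], int(row["pos"]), row["alt"])
                # Remember every cancer type in which the mutation passed.
                drivers.setdefault(key, set()).add(cancer_type)
    return drivers
```

A mutation that passes in one cancer type is kept even if it scores below the threshold in every other file.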
Post Processing and QC
Overlapping mutations with the same effect were merged together. The EnsEMBL transcript IDs were checked and updated where they were no longer current, and the genomic strand was added. Amino acid coordinates were mapped back to genomic coordinates to check that they match the recorded coordinates, and mutations had their footprint on the genomic sequence expanded to span whole codons. Mutations in codons which straddle an intron were split into two separate rows with separate genomic coordinates.
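The codon expansion and intron-straddling split can be illustrated with a short sketch that maps an amino acid position to the genomic range(s) of its codon. The function name and input layout are assumptions for the example: coding segments are given as 1-based inclusive `(start, end)` pairs in transcript order, and the coding sequence is assumed to begin at the first base of the first segment.

```python
def codon_genomic_ranges(cds_exons, strand, aa_pos):
    """Return the genomic range(s) covering the codon of `aa_pos`.

    cds_exons: (start, end) coding segments, 1-based inclusive, listed in
               transcript order (descending on the reverse strand).
    strand:    1 for the forward strand, -1 for the reverse strand.
    aa_pos:    1-based amino acid position in the peptide.
    """
    wanted = range(3 * (aa_pos - 1), 3 * aa_pos)  # 0-based CDS offsets
    positions = []
    consumed = 0  # CDS bases covered by earlier segments
    for start, end in cds_exons:
        length = end - start + 1
        for offset in wanted:
            if consumed <= offset < consumed + length:
                within = offset - consumed
                positions.append(start + within if strand == 1 else end - within)
        consumed += length
    if len(positions) != 3:
        raise ValueError("amino acid position lies outside the coding sequence")

    # Group the three bases into contiguous genomic ranges; a codon that
    # straddles an intron yields two ranges.
    positions.sort()
    ranges = [[positions[0], positions[0]]]
    for pos in positions[1:]:
        if pos == ranges[-1][1] + 1:
            ranges[-1][1] = pos
        else:
            ranges.append([pos, pos])
    return [tuple(r) for r in ranges]
```

With hypothetical forward-strand segments `[(100, 104), (200, 210)]`, amino acid 1 maps to a single range, while amino acid 2 spans the intron and yields two ranges, which would correspond to two rows in the table.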
Driver Mutation Data Field Definitions
The current HGNC gene symbol for the gene.
The chromosome name, without a chr prefix for the main chromosomes.
Start position on the chromosome in 1-based coordinates.
End position on the chromosome in 1-based coordinates.
1 for the forward strand of the chromosome, and -1 for the reverse strand.
The effect (a.k.a. consequence) of the mutation, as it would be annotated by VAGrENT.
The reference amino acid(s) in this span of the chromosome.
The coordinate of the first amino acid in the peptide of this transcript.
A comma separated list of alternate amino acids observed in mutations with this effect at this position.
The EnsEMBL stable ID of this transcript.
The number of mutations from the CGI catalog which contributed to this row, 0 if none.
The number of mutations from the MSKCC Hotspots which contributed to this row, 0 if none.
The number of mutations from the BoostDM dataset which contributed to this row, 0 if none.
Germline Cancer Predisposition Variants
- Take the list of germline predisposition variants from the "S2A.Pathogenic_variants" tab of the supplementary file (Germline predisposition variants).
- Select the columns chr, start, stop, effect and gene to generate a BED file.
- Lift over to GRCh38 and modify the annotations to match VAGrENT annotations.
- Generate the final TSV file with the header "#CHR FROM TO INFO/CPV GENE".
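The final step of the list above could be sketched as follows. This covers only the writing of the output file; the liftover and annotation steps are not shown. The function name, the tab-separated header layout, and the placement of the effect in the INFO/CPV column are assumptions made for illustration.

```python
import csv

# Header fields as given in the text, assumed to be tab-separated.
HEADER = ["#CHR", "FROM", "TO", "INFO/CPV", "GENE"]


def write_cpv_tsv(variants, out):
    """Write variants to `out` as a tab-separated file with the header.

    `variants` is an iterable of dicts with the keys chr, start, stop,
    effect and gene, already lifted over to GRCh38.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for v in variants:
        # Assumption: the effect annotation fills the INFO/CPV column.
        writer.writerow([v["chr"], v["start"], v["stop"], v["effect"], v["gene"]])
```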