Gene Mapping & Cancer Driver Annotation
Gene Mapping & SIDGs
All genes have an internal ID, allowing mapping to current and previous HGNC gene symbols, Ensembl Gene IDs (v91) and other external gene identifiers. All genes with an HGNC-approved symbol as of April 2018 are currently included, including those without a protein product.
Any dataset values that were mapped to genes without an official gene symbol have been discarded from processing, but continue to be available in raw data downloads.
Cancer Driver Annotation: Genes
The list of driver genes is the union of two complementary gene sets: intOGen and COSMIC Tier 1.
intOGen Genes
The intOGen analysis pipeline uses seven different methods to identify cancer driver genes:
- Mutation count bias: dNdScv, CBaSE & MutPanning.
- Mutations in protein:
  - sequence: OncodriveCLUSTL
  - structure: HotMAPS
  - functional domains: smRegions
- Functional impact of mutations: OncodriveFML
COSMIC Tier 1 Genes
COSMIC Tier 1 gene requirements:
- Mutation patterns which strongly support involvement in cancer aetiology.
- Evidence of how the gene’s dysfunction impacts the hallmarks of cancer.
- Publications from two independent groups describing mutations in at least one type of cancer.
Method of Action
The method of action of a gene can be one of four values:
Act (Activating) | Oncogene |
LoF (Loss of Function) | Tumour Suppressor Gene |
ambiguous | Evidence of both Activating and Loss of Function events. |
fusion | Only seen as a partner in a translocation + gene fusion. |
Given the roles assigned to it by intOGen and COSMIC, the method of action for each gene was chosen according to the following rules:
- Ignore fusion unless it is the only method seen.
- If there’s only one method of action across both intOGen and COSMIC, use that provided.
- Otherwise choose ambiguous.
The original methods of action assigned by intOGen (intogen_moa) and COSMIC (cosmic_moa) are also available.
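The rules above can be sketched as a small function. This is only an illustration, not the pipeline's actual code: the function name `combine_moa` is hypothetical, and it assumes each source supplies at most one of the four values (or nothing when the gene is absent from that source).

```python
from typing import Optional, Set


def _methods(value: Optional[str]) -> Set[str]:
    """Turn one source's assignment into a set (empty if the source has none)."""
    return {value} if value else set()


def combine_moa(intogen_moa: Optional[str], cosmic_moa: Optional[str]) -> Optional[str]:
    """Combine the intOGen and COSMIC methods of action (hypothetical helper)."""
    methods = _methods(intogen_moa) | _methods(cosmic_moa)
    if not methods:
        return None  # gene has no assignment in either source

    # Rule 1: ignore fusion unless it is the only method seen.
    if methods != {"fusion"}:
        methods.discard("fusion")

    # Rule 2: a single remaining method is used as provided.
    if len(methods) == 1:
        return next(iter(methods))

    # Rule 3: anything else is ambiguous.
    return "ambiguous"
```

For example, under these assumptions a gene marked `Act` by intOGen and `fusion` by COSMIC resolves to `Act`, while `Act` paired with `LoF` resolves to `ambiguous`.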
Method of Action | COSMIC Only | intOGen Only | Both COSMIC and intOGen | Total |
---|---|---|---|---|
Act | 85 | 66 | 117 | 268 |
LoF | 60 | 89 | 108 | 257 |
ambiguous | 15 | 52 | 134 | 201 |
fusion | 57 | 0 | 0 | 57 |
Total | 217 | 207 | 359 | 783 |
Cancer Driver Annotation: Mutations
The driver mutations table is the union of four datasets:
- CGI catalog of validated oncogenic mutations
- MSKCC Hotspots version 2
  - Includes a set of in-frame indels unique to this dataset
- BoostDM dataset
  - A new machine-learning-based classification of mutations in the intOGen cohorts
- CPV dataset
  - A catalog of germline cancer predisposition variants from Huang et al., Cell, 2018
Processing of Mutation Datasets
The CGI, MSKCC and CPV datasets have chromosome coordinates recorded on GRCh37, and were remapped to GRCh38 using UCSC’s hg19 to hg38 chain file and liftOver utility. BoostDM is provided on GRCh38.
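The liftOver utility reads BED intervals, which are 0-based and half-open, and the UCSC chain files use `chr`-prefixed chromosome names. The sketch below shows one way to prepare input for it. It assumes the source coordinates are 1-based and inclusive, and the helper name `to_bed_line` and the file names in the comment are hypothetical.

```python
def to_bed_line(chr_name: str, start_1based: int, end_1based: int, name: str) -> str:
    """Format a 1-based inclusive interval as a BED line (0-based, half-open)."""
    # UCSC chain files use names such as "chr7", so add the prefix if it is missing.
    chrom = chr_name if chr_name.startswith("chr") else f"chr{chr_name}"
    # Only the start shifts: 1-based inclusive [s, e] becomes 0-based half-open [s-1, e).
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


# The resulting file could then be remapped on the command line, for example:
#   liftOver grch37.bed hg19ToHg38.over.chain.gz grch38.bed unmapped.bed
# (input and output file names here are placeholders)
```

Intervals that liftOver cannot place are written to the unmapped file, so they would need to be reviewed separately rather than silently dropped.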
The files for each dataset were processed to extract the information required for a row in the table:
The catalog_of_validated_oncogenic_mutations.tsv file contains a separate row for each nucleotide mutation. Mutations with the same effect (e.g. “missense”, “ess_splice”) at the same location are merged together.
Each entry in the mskcc_hotspots.json file (downloaded from the MSKCC API) groups mutations at an amino acid position in a gene. Missense and nonsense mutations at the same amino acid are split into separate rows in the table. Splice site mutations are grouped in the file by amino acid position, i.e. by intron; these were split into separate donor and acceptor entries by grouping contiguous ranges of coordinates into new rows. The SNV-hotspots and INDEL-hotspots sheets of hotspots_v2.xls were parsed in parallel to add required information, such as genomic coordinates, that is not present in the mskcc_hotspots.json file. (Other information, such as EnsEMBL transcript IDs, is only found in mskcc_hotspots.json.)
The BoostDM data consists of a separate .tsv file for each cancer type. Any mutation with a “boostDM class” of “True” (i.e. a score > 0.5) in any tissue is loaded.
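A minimal sketch of the "any tissue" selection follows. It assumes the per-cancer-type files have already been parsed into dictionaries, and the column names (`chr`, `pos`, `alt`, `boostDM_score`) and the function name are assumptions made for illustration, not a confirmed description of the file layout.

```python
from typing import Dict, Iterable, Mapping, Set, Tuple

MutationKey = Tuple[str, int, str]


def select_boostdm_drivers(
    rows_by_cancer_type: Mapping[str, Iterable[Mapping[str, str]]],
    threshold: float = 0.5,
) -> Dict[MutationKey, Set[str]]:
    """Return mutations scoring above the threshold in at least one cancer type.

    The result maps (chromosome, position, alternate allele) to the set of
    cancer types in which the mutation passed the threshold.
    """
    selected: Dict[MutationKey, Set[str]] = {}
    for cancer_type, rows in rows_by_cancer_type.items():
        for row in rows:
            # Strictly greater than the threshold, matching "score > 0.5".
            if float(row["boostDM_score"]) > threshold:
                key = (row["chr"], int(row["pos"]), row["alt"])
                selected.setdefault(key, set()).add(cancer_type)
    return selected
```

Keeping the set of cancer types alongside each mutation is a design choice of this sketch; a single pass that only records membership would be enough to decide whether a mutation is loaded.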
CPV dataset
The germline predisposition variants were taken from the S2A.Pathogenic_variants sheet of the supplementary data file from Huang et al.
Post Processing and QC
Overlapping mutations with the same effect were merged together. The EnsEMBL transcript IDs were checked and updated where they were no longer current, and the genomic strand was added. Amino acid coordinates were mapped back to genomic coordinates to check that they match the recorded coordinates, and mutations had their footprint on the genomic sequence expanded to span whole codons. Mutations in codons which straddle an intron were split into two separate rows with separate genomic coordinates.
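The merging step can be illustrated with a standard interval merge applied per chromosome and effect. This is a sketch under stated assumptions: coordinates are 1-based and inclusive, "overlapping" means sharing at least one base (adjacent intervals are left separate), and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Mutation = Tuple[str, int, int, str]  # (chr_name, chr_start, chr_end, effect)


def merge_overlapping(mutations: Iterable[Mutation]) -> List[Mutation]:
    """Merge overlapping intervals that share a chromosome and an effect."""
    groups: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for chr_name, start, end, effect in mutations:
        groups[(chr_name, effect)].append((start, end))

    merged: List[Mutation] = []
    for (chr_name, effect), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Shares at least one base with the current span: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chr_name, cur_start, cur_end, effect))
                cur_start, cur_end = start, end
        merged.append((chr_name, cur_start, cur_end, effect))
    return merged
```

Grouping by effect before merging keeps, for instance, a missense and a nonsense mutation in the same codon as separate rows even though their coordinates coincide.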
Driver Mutation Data Field Definitions
symbol | The current HGNC gene symbol for the gene |
chr_name | The chromosome name, without a chr prefix for the main chromosomes. |
chr_start | Start position on the chromosome, in 1-based coordinates. |
chr_end | End position on the chromosome, in 1-based coordinates. |
strand | 1 for the forward strand of the chromosome, and -1 for the reverse strand. |
effect | The effect (a.k.a. consequence) of the mutation, as it would be annotated by VAGrENT. |
ref_aa | The reference amino acid(s) in this span of the chromosome. |
pep_coord | The coordinate of the first amino acid in the peptide of this transcript. |
alt_aa_list | A comma separated list of alternate amino acids observed in mutations with this effect at this position. |
ensembl_transcript | The EnsEMBL stable ID of this transcript. |
cgi | The number of mutations from the CGI catalog which contributed to this row, 0 if none. |
mskcc | The number of mutations from the MSKCC Hotspots which contributed to this row, 0 if none. |
boost | The number of mutations from the BoostDM dataset which contributed to this row, 0 if none. |
cpv | The number of germ line predisposition variants from the CPV dataset which contributed to this row, 0 if none. All variants where this column is non-zero are used to flag CPV variants, regardless of whether or not the gene is considered a cancer driver in the genes dataset. |
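As a usage sketch, the four source-count columns can be read to see which datasets support a row and whether it should be flagged as a CPV variant. This assumes the table is available as tab-separated text with a header row using the field names above; the helper names are hypothetical, and the two example rows are made-up placeholders, not real data.

```python
import csv
import io
from typing import Dict, List

SOURCE_COLUMNS = ("cgi", "mskcc", "boost", "cpv")


def load_rows(tsv_text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def contributing_sources(row: Dict[str, str]) -> List[str]:
    """Names of the datasets with a non-zero count for this row."""
    return [name for name in SOURCE_COLUMNS if int(row[name]) > 0]


def is_cpv_variant(row: Dict[str, str]) -> bool:
    """True when at least one CPV variant contributed to this row."""
    return int(row["cpv"]) > 0


# Placeholder rows for illustration only (a subset of the columns).
EXAMPLE_TSV = (
    "symbol\tchr_name\tchr_start\tchr_end\tstrand\teffect\tcgi\tmskcc\tboost\tcpv\n"
    "GENE1\t1\t100\t102\t1\tmissense\t2\t1\t0\t0\n"
    "GENE2\t2\t200\t202\t-1\tnonsense\t0\t0\t0\t3\n"
)
```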