Gene Mapping & Cancer Driver Annotation
Gene Mapping & SIDGs
All genes have an internal ID that is mapped directly to an HGNC ID, allowing mapping to current HGNC gene symbols, Ensembl gene IDs and other external gene identifiers. Datasets are uploaded using Ensembl gene identifiers wherever possible and, if not, using gene symbols. Any dataset values that are mapped to genes without a current ID or symbol are discarded from processing, but continue to be available in raw data downloads.
All genes have an internal identifier (SIDG) that is mapped directly to an HGNC ID, allowing mapping to current HGNC gene symbols. For the current HGNC release, all IDs map 1:1 to Ensembl gene identifiers. Datasets are uploaded using Ensembl gene IDs wherever possible. In other cases, e.g. when only Ensembl transcript identifiers are available, Biomart is used to map to the SIDG records. Only in the absence of other identifiers are datasets imported using gene symbols.
When updating to a later HGNC release, withdrawn and merged/split HGNC entries are marked as not current. Any previously uploaded dataset values that are mapped to genes where the SIDG has an HGNC status that is not current are removed from downloads and webpages, but continue to be available in the API and raw data downloads.
Cancer Driver Annotation: Genes
The list of driver genes is the union of two complementary gene sets: intOGen & COSMIC Tier 1.
intOGen Genes
The intOGen analysis pipeline uses seven different methods to identify cancer driver genes:
- Mutation count bias: dNdScv, CBaSE & MutPanning.
- Mutations in protein:
  - sequence: OncodriveCLUSTL
  - structure: HotMAPS
  - functional domains: smRegions
- Functional impact of mutations: OncodriveFML
COSMIC Tier 1 Genes
COSMIC Tier 1 gene requirements:
- Mutation patterns which strongly support involvement in cancer aetiology.
- Evidence of how the gene’s dysfunction impacts hallmarks of cancer.
- Publications from two independent groups describing mutations in at least one type of cancer.
Method of Action (Cancer Driver Genes)
The method of action of a gene can be one of four values:
Act (Activating) | Oncogene |
LoF (Loss of Function) | Tumour Suppressor Gene |
ambiguous | Evidence of both Activating and Loss of Function events. |
fusion | Only seen as a partner in a translocation + gene fusion. |
Given its roles assigned by intOGen and COSMIC, the method of action for each gene was chosen using the following rules:
- Ignore fusion unless it is the only method seen.
- If there’s only one method of action across both intOGen and COSMIC, use that provided.
- Otherwise choose ambiguous.
The original methods of action assigned by intOGen (intogen_moa) and COSMIC (cosmic_moa) are also available.
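The three rules above can be sketched as a small function. This is an illustration only, not the pipeline's actual code: the function name `resolve_moa` and the representation of each source's roles as a collection of strings are assumptions.

```python
from typing import Iterable, Optional


def resolve_moa(intogen_moa: Iterable[str], cosmic_moa: Iterable[str]) -> Optional[str]:
    """Combine the roles from intOGen and COSMIC into one method of action.

    Hypothetical helper: each argument holds the values assigned by one
    source, drawn from "Act", "LoF", "ambiguous" and "fusion".
    """
    methods = set(intogen_moa) | set(cosmic_moa)
    if not methods:
        return None  # neither source assigns a role

    # Rule 1: ignore fusion unless it is the only method seen.
    non_fusion = methods - {"fusion"}
    if not non_fusion:
        return "fusion"

    # Rule 2: a single method across both sources is used as provided.
    if len(non_fusion) == 1:
        return next(iter(non_fusion))

    # Rule 3: otherwise the gene is ambiguous.
    return "ambiguous"
```

For example, a gene that is `Act` in intOGen and `Act` plus `fusion` in COSMIC resolves to `Act`, while a gene that is `Act` in one source and `LoF` in the other resolves to `ambiguous`.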
Method of Action | COSMIC Only | intOGen Only | Both COSMIC and intOGen | Total |
---|---|---|---|---|
Act | 85 | 66 | 117 | 268 |
LoF | 60 | 89 | 108 | 257 |
ambiguous | 15 | 52 | 134 | 201 |
fusion | 57 | 0 | 0 | 57 |
Total | 217 | 207 | 359 | 783 |
Cancer Predisposition Annotation: Genes
The list of Cancer Predisposition Genes (CPGs) was developed from the 'S1A.Cancer_predisposition_genes' and 'S2A.Pathogenic_variants' sheets from Huang et al., Cell, 2018.
Genes listed in 'S1A.Cancer_predisposition_genes' were flagged as CPGs unless:
- The gene is annotated in 'S1A.Cancer_predisposition_genes: Gene_Classification' as 'Not classified' AND the gene is either not present in the Driver Genes list or is annotated as a fusion in the CMP Driver Genes list.
- The gene listed in 'S1A.Cancer_predisposition_genes' has no variants present in 'S2A.Pathogenic_variants' AND the 'S1A.Cancer_predisposition_genes: Gene_Classification' is Oncogene OR CMP:MoA = Act.
Genes present in 'S2A.Pathogenic_variants' but not in 'S1A.Cancer_predisposition_genes' (AR and BARD1) were included in the Cancer Predisposition Genes list.
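As a rough sketch, the inclusion rules could be expressed as a single predicate. The function name and arguments are hypothetical, and the second exclusion is read here as "no S2A variants AND (classification is Oncogene OR driver MoA is Act)", which is an assumption about how the AND/OR in that rule groups.

```python
from typing import Optional


def is_cpg(in_s1a: bool,
           classification: Optional[str],
           has_s2a_variants: bool,
           driver_moa: Optional[str]) -> bool:
    """Decide whether a gene is flagged as a Cancer Predisposition Gene.

    driver_moa is None when the gene is not in the Driver Genes list.
    """
    if not in_s1a:
        # Genes seen only in S2A.Pathogenic_variants are included.
        return has_s2a_variants

    # Exclusion 1: 'Not classified' and either not a driver or fusion-only.
    if classification == "Not classified" and driver_moa in (None, "fusion"):
        return False

    # Exclusion 2 (assumed grouping): no pathogenic variants and an
    # activating role in either the sheet or the driver list.
    if not has_s2a_variants and (classification == "Oncogene" or driver_moa == "Act"):
        return False

    return True
```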
Method of Action (Cancer Predisposition Genes)
For Cancer Predisposition Genes that are also Cancer Driver Genes, the driver Method of Action is used. There are no direct conflicts between the Driver Gene MoA and the 'S1A.Cancer_predisposition_genes: Gene_Classification'; however, some genes defined as a Tumor Suppressor Gene or Oncogene are listed as ambiguous.
For Cancer Predisposition Genes that are not also Cancer Driver Genes, the 'S1A.Cancer_predisposition_genes: Gene_Classification' is used and translated into the matching Act and LoF terms.
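A minimal sketch of this fallback is shown below. The mapping assumes the classification strings are exactly "Oncogene" and "Tumor Suppressor Gene"; the actual spellings in the sheet may differ, and the function name is hypothetical.

```python
from typing import Optional

# Assumed classification strings; Oncogene pairs with Act and
# Tumour Suppressor Gene with LoF, as in the method of action table.
CLASSIFICATION_TO_MOA = {
    "Oncogene": "Act",
    "Tumor Suppressor Gene": "LoF",
}


def cpg_moa(driver_moa: Optional[str], classification: Optional[str]) -> Optional[str]:
    """Return the method of action for a Cancer Predisposition Gene."""
    if driver_moa is not None:
        # The gene is also a driver gene: the driver MoA takes priority.
        return driver_moa
    # Otherwise translate the Huang et al. classification, if it maps.
    return CLASSIFICATION_TO_MOA.get(classification)
```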
The Huang et al., Cell, 2018 Gene Classification is provided in the Cancer Driver and Predisposition Gene List.
Cancer Driver & Predisposition Annotation: Mutations
The cancer driver mutation list is the union of three datasets:
- CGI catalog of validated oncogenic mutations
- MSKCC Hotspots version 2
  - Includes a set of in-frame indels unique to this dataset
- BoostDM dataset
  - A new machine-learning-based classification of mutations in the intOGen cohorts
The cancer predisposition variant list uses the variants from Huang et al., Cell, 2018 for the selected Cancer Predisposition Genes.
Processing of Mutation Datasets
The CGI, MSKCC and CPV datasets have chromosome coordinates recorded on GRCh37, and were remapped to GRCh38 using UCSC’s hg19 to hg38 chain file and liftOver utility. BoostDM is provided on GRCh38.
The files for each dataset were processed to extract the information required for a row in the table:
The catalog_of_validated_oncogenic_mutations.tsv file contains a separate row for each nucleotide mutation. Mutations with the same effect (e.g. "missense", "ess_splice") at the same location are merged together.
Each entry in the mskcc_hotspots.json file (downloaded from the MSKCC API) groups mutations at an amino acid position in a gene. Missense and nonsense mutations in the same amino acid are split into separate rows in the table. Splice site mutations are grouped by amino acid position, i.e. by intron; these were split into separate donor and acceptor entries by grouping contiguous ranges of coordinates into new rows. The SNV-hotspots and INDEL-hotspots sheets of hotspots_v2.xls were parsed in parallel to add required information, such as genomic coordinates, that is not present in the mskcc_hotspots.json file. (Other information, such as EnsEMBL transcript IDs, is only found in mskcc_hotspots.json.)
The BoostDM data consists of a separate .tsv file for each cancer type. Any mutation which has a "boostDM class" of "True" (i.e. a score > 0.5) in any tissue is loaded.
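The "any tissue" rule for BoostDM can be illustrated as follows. The column names (`gene`, `aachange`, `boostDM_class`) and the in-memory TSV input are assumptions made for this sketch and may not match the real files.

```python
import csv
import io


def boostdm_drivers(tsv_by_tissue: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Map (gene, amino acid change) to the tissues where it is a driver.

    tsv_by_tissue maps a cancer type to the text of its .tsv file.
    A mutation is kept if its class is "True" in at least one tissue.
    """
    loaded: dict[tuple[str, str], set[str]] = {}
    for tissue, text in tsv_by_tissue.items():
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            if row["boostDM_class"] == "True":
                key = (row["gene"], row["aachange"])
                loaded.setdefault(key, set()).add(tissue)
    return loaded


# Toy input with made-up gene names, for illustration only.
example = {
    "tissue_a": "gene\taachange\tboostDM_class\nGENE1\tA10V\tTrue\nGENE2\tR5Q\tFalse\n",
    "tissue_b": "gene\taachange\tboostDM_class\nGENE1\tA10V\tFalse\nGENE2\tR5Q\tFalse\n",
}
drivers = boostdm_drivers(example)
```

Here `GENE1 A10V` is loaded because it is classed as a driver in one tissue, even though it is not in the other; `GENE2 R5Q` is never loaded.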
CPV dataset
Germline predisposition variants were taken from the S2A.Pathogenic_variants sheet in the supplementary data of Huang et al., Cell, 2018.
Post Processing and QC
Overlapping mutations with the same effect were merged together. The EnsEMBL transcript IDs were checked and updated where they were no longer current, and the genomic strand was added. Amino acid coordinates were mapped back to genomic coordinates to check that they match the recorded coordinates, and mutations had their footprint on the genomic sequence expanded to span whole codons. Mutations in codons which straddle an intron were split into two separate rows with separate genomic coordinates.
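The merging step might look something like the sketch below, which combines overlapping 1-based inclusive spans. Grouping by transcript and effect is an assumption; the text only states that overlapping mutations with the same effect were merged.

```python
from collections import defaultdict


def merge_overlapping(rows):
    """Merge overlapping spans that share a transcript and effect.

    rows: iterable of (transcript, effect, start, end) tuples using
    1-based inclusive coordinates. Returns a sorted list of merged tuples.
    """
    groups = defaultdict(list)
    for transcript, effect, start, end in rows:
        groups[(transcript, effect)].append((start, end))

    merged = []
    for (transcript, effect), spans in groups.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the span being built: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((transcript, effect, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((transcript, effect, cur_start, cur_end))
    return sorted(merged)
```

Spans that merely touch (one ends at 102, the next starts at 103) are left separate in this sketch; whether adjacent spans should also be merged is not stated in the text.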
Driver Mutation Data Field Definitions
symbol | The current HGNC gene symbol for the gene |
chr_name | The chromosome name, without a chr prefix for the main chromosomes. |
chr_start | Start position of the span on the chromosome, in 1-based coordinates. |
chr_end | End position of the span on the chromosome, in 1-based coordinates. |
strand | 1 for the forward strand of the chromosome, and -1 for the reverse strand. |
effect | The effect (a.k.a. consequence) of the mutation, as it would be annotated by VAGrENT. |
ref_aa | The reference amino acid(s) in this span of the chromosome. |
pep_coord | The coordinate of the first amino acid in the peptide of this transcript. |
alt_aa_list | A comma separated list of alternate amino acids observed in mutations with this effect at this position. |
ensembl_transcript | The EnsEMBL stable ID of this transcript. |
cgi | The number of mutations from the CGI catalog which contributed to this row, 0 if none. |
mskcc | The number of mutations from the MSKCC Hotspots which contributed to this row, 0 if none. |
boost | The number of mutations from the BoostDM dataset which contributed to this row, 0 if none. |
cpv | The number of germline predisposition variants from the CPV dataset which contributed to this row, 0 if none. All variants where this column is non-zero are used to flag CPV variants, regardless of whether or not the gene is considered a cancer driver in the genes dataset. |
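As an illustration of how the four count columns can be read, the sketch below lists the datasets that contributed to a row and applies the CPV flag. It assumes each row has been parsed into a dictionary keyed by the field names above; the helper names and the example row are hypothetical.

```python
SOURCE_COLUMNS = ("cgi", "mskcc", "boost", "cpv")


def contributing_sources(row: dict) -> list[str]:
    """Return the datasets with a non-zero count for this row."""
    return [name for name in SOURCE_COLUMNS if int(row[name]) > 0]


def is_cpv(row: dict) -> bool:
    """A row flags a CPV variant whenever its cpv count is non-zero."""
    return int(row["cpv"]) > 0


# Made-up row for illustration; counts arrive as strings when read from TSV.
example_row = {"symbol": "GENE1", "cgi": "2", "mskcc": "0", "boost": "1", "cpv": "0"}
sources = contributing_sources(example_row)
```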