This dataset was generated by performing systematic genome-scale CRISPR-Cas9 knockout screens in a large number of highly annotated cancer models to identify genes required for cell fitness in defined molecular contexts. These results can be used to identify dependencies in cancer cells and could inform the development of new precision cancer medicines. Furthermore, they are a rich resource with applications in basic cell biology, genome engineering and human genetics.
This data can be explored using Project Score.
Data download: Genome-wide CRISPR KO Data
Experimental Method (Sanger Data)
The CRISPR library used for this screen (Tzelepis et al., 2016; available from Addgene, Cat. no. 67989) contains 90,709 sgRNAs targeting 18,009 genes (~5 sgRNAs/gene). All pooled screens were completed in technical triplicate at 100x coverage of the library (i.e. ~100 cells per sgRNA were transduced). Stringent quality control is applied at all stages of the experimental pipeline, including:
- every screened cell line has >75% Cas9 activity;
- every cell line is transduced with the library at >15% efficiency;
- cells are monitored for changes in morphology or growth rates following lentiviral transduction;
- a DNA yield of >72 µg is required to maintain library coverage;
- quality and size of all PCR products are checked.
Further rigorous quality control assessment of the data is also completed, as described in the associated manuscript. Only data satisfying all quality control measures are included in this dataset.
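The library figures quoted above imply a simple piece of arithmetic for the average number of sgRNAs per gene and the number of transduced cells needed for 100x coverage. The following is a minimal sketch of that calculation only; the function names are illustrative and are not part of any published pipeline.

```python
# Library figures quoted in the text above.
N_SGRNAS = 90_709
N_GENES = 18_009
COVERAGE = 100  # ~100 transduced cells per sgRNA


def sgrnas_per_gene(n_sgrnas: int, n_genes: int) -> float:
    """Average number of sgRNAs targeting each gene."""
    return n_sgrnas / n_genes


def cells_required(n_sgrnas: int, coverage: int) -> int:
    """Transduced cells needed so that each sgRNA is represented `coverage` times."""
    return n_sgrnas * coverage


print(f"{sgrnas_per_gene(N_SGRNAS, N_GENES):.2f} sgRNAs per gene")
print(f"{cells_required(N_SGRNAS, COVERAGE):,} transduced cells for 100x coverage")
```

This gives roughly 5 sgRNAs per gene and about 9.1 million transduced cells, consistent with the "~5 sgRNAs/gene" and "100x coverage" figures stated above.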
Defining fitness genes: Gene-independent responses to CRISPR-Cas9 (e.g. those driven by copy number) are corrected using CRISPRcleanR. Loss-of-fitness scores are generated from the corrected fold changes (FCs) through an in-house R implementation of the BAGEL method to call significantly depleted genes (code publicly available at https://github.com/francescojm/BAGELR). Our BAGEL implementation computes gene-level Bayes factors (BFs) by averaging the sgRNA-level BFs for each targeted gene, instead of summing them. Additionally, it uses reference sets of predefined essential and non-essential genes that are further curated to exclude high-confidence cancer driver genes. A statistical significance threshold for gene-level BFs is determined for each cell line. Each gene is assigned a scaled BF, computed by subtracting from the original BF the BF at the 5% FDR threshold defined for that cell line (obtained from classifying reference essential/non-essential genes based on BF rankings). For consistency of visualisation, all scaled BF values are multiplied by -1, so that significantly depleted genes have a loss-of-fitness score <0.
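The scoring steps described above (average sgRNA-level BFs per gene, subtract the per-cell-line BF at the 5% FDR threshold, then flip the sign) can be sketched as follows. This is an illustrative Python sketch, not the BAGELR code: the function names, the toy gene names and the exact way the FDR is computed from the reference gene ranking (non-essential calls divided by all reference calls so far) are assumptions.

```python
from __future__ import annotations

from statistics import mean


def gene_level_bf(sgrna_bfs: dict[str, list[float]]) -> dict[str, float]:
    """Average the sgRNA-level BFs of each gene (rather than summing them)."""
    return {gene: mean(bfs) for gene, bfs in sgrna_bfs.items()}


def bf_threshold_at_fdr(
    gene_bf: dict[str, float],
    essential: set[str],
    non_essential: set[str],
    fdr: float = 0.05,
) -> float:
    """BF of the lowest-ranked reference gene at which the cumulative FDR <= fdr.

    Assumption: reference genes are ranked by decreasing BF, and the FDR at a
    rank is (non-essential genes seen so far) / (reference genes seen so far).
    """
    reference = [g for g in gene_bf if g in essential or g in non_essential]
    reference.sort(key=lambda g: gene_bf[g], reverse=True)
    threshold = None
    false_calls = 0
    for rank, gene in enumerate(reference, start=1):
        if gene in non_essential:
            false_calls += 1
        if false_calls / rank <= fdr:
            threshold = gene_bf[gene]
    if threshold is None:
        raise ValueError("no BF threshold satisfies the requested FDR")
    return threshold


def fitness_scores(gene_bf: dict[str, float], threshold: float) -> dict[str, float]:
    """Scaled BF (BF minus threshold), sign-flipped so depleted genes score < 0."""
    return {gene: -(bf - threshold) for gene, bf in gene_bf.items()}


# Toy data with hypothetical gene names and BF values.
sgrna_bfs = {
    "ESS1": [30.0, 50.0],
    "ESS2": [20.0, 20.0],
    "GENE_X": [12.0, 8.0],
    "GENE_Y": [-5.0, -5.0],
    "NON1": [-10.0, -30.0],
    "NON2": [-40.0, -40.0],
}
bfs = gene_level_bf(sgrna_bfs)
threshold = bf_threshold_at_fdr(bfs, {"ESS1", "ESS2"}, {"NON1", "NON2"})
scores = fitness_scores(bfs, threshold)
```

In this toy example the threshold is the BF of ESS2, the last reference gene ranked before any non-essential gene appears, so ESS1 receives a negative score while GENE_X, which falls below the threshold, receives a positive one.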
Gene fitness metrics used:
- Fitness Score: based on scaled BF from BAGEL. A score <0 indicates a statistically significant effect on cell fitness.
- Corrected fold change: copy-number-bias-corrected gene depletion fold change, computed between the average representation of targeting sgRNAs 14 days post-transfection and that in the plasmid library.
Core fitness genes: Fitness genes common to the majority of cell lines tested, or common within a cancer type, may be involved in essential cellular processes; we refer to these as core fitness genes. To identify core fitness genes, we developed a statistical method, ADaM (Adaptive Daisy Model), which adaptively determines the minimum number of dependent cell models required for a gene to be classified as a core fitness gene. ADaM was implemented at both a cancer-type-specific level and a pan-cancer level (code publicly available at https://github.com/francescojm/ADAM).
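To make the quantity that ADaM thresholds concrete, the sketch below counts, for a fixed minimum number of cell lines n, which genes are fitness genes in at least n lines. It does not implement ADaM itself: the adaptive choice of n is not reproduced here, and the function names, cell line names and gene names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def genes_dependent_in_at_least(
    fitness_calls: dict[str, set[str]], min_lines: int
) -> set[str]:
    """Genes called as fitness genes in at least `min_lines` cell lines.

    `fitness_calls` maps each cell line to the set of its fitness genes.
    """
    counts = Counter(gene for genes in fitness_calls.values() for gene in genes)
    return {gene for gene, n in counts.items() if n >= min_lines}


def candidate_profile(fitness_calls: dict[str, set[str]]) -> dict[int, int]:
    """Number of candidate genes for every possible value of `min_lines`."""
    return {
        n: len(genes_dependent_in_at_least(fitness_calls, n))
        for n in range(1, len(fitness_calls) + 1)
    }


# Toy fitness calls for three hypothetical cell lines.
calls = {
    "LINE_A": {"G1", "G2", "G3"},
    "LINE_B": {"G1", "G2"},
    "LINE_C": {"G1"},
}
print(genes_dependent_in_at_least(calls, 2))
print(candidate_profile(calls))
```

A fixed cut-off such as "at least 2 of 3 lines" is arbitrary; the point of ADaM, as described above, is to choose this minimum number from the data rather than by hand.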