CRISPR: Knockout

This page outlines the raw data and their subsequent processing for use within the Cancer Dependency Map at Sanger. These data can be accessed through the Cell Model Passport and explored using Project Score.


Raw Data

Dataset Name | Origin | Data Type | Link
Project Score v1.0 | Sanger | Unprocessed sgRNA read counts | https://cog.sanger.ac.uk/cmp/download/raw_sgrnas_counts.zip
CRISPR (Avana) Public 20Q2 | Broad | Unprocessed sgRNA read counts | Achilles_gene_effect_unscaled.csv

Processed Data

This section describes how the raw data were processed, including the algorithms and filtering applied. Processed datasets can be downloaded here; the active dataset can also be accessed using the DepMap web resources and API.


Corrected Fold Changes

Gene-independent responses to CRISPR-Cas9 targeting (e.g. those driven by copy number) were corrected using CRISPRcleanR.

Independent datasets generated at Sanger and Broad showed batch effects due to differences in library design and assay length (Dempster et al. Nat Commun 2019). To allow combined analysis, the two datasets were batch corrected to account for these technical differences (Pacini et al. Nat Commun 2021). Batch correction was performed with ComBat, using a set of overlapping cell lines screened at both institutes. Where a cell line was screened at both institutes, the Sanger data were preferentially selected for inclusion in the final combined dataset. The batch-corrected data were then quantile normalised to adjust for differences in screen quality. Finally, the first principal component, which associates with differences in media, was removed from the data.
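The quantile normalisation step can be illustrated with a minimal sketch. It assumes each screen is a list of gene-level fold changes over the same genes, and it breaks ties by position; the function name and the input layout are hypothetical and are not taken from the published pipeline. ComBat correction and principal component removal are not shown.

```python
def quantile_normalise(columns):
    """Quantile-normalise equal-length columns (one list of values per screen)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")

    # Reference distribution: mean of the k-th smallest value across all columns.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]

    normalised = []
    for col in columns:
        # Rank the values within this column (ties broken by position),
        # then replace each value with the reference value of the same rank.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised


screens = [[-1.0, 0.0, -3.0], [-2.0, -0.5, -0.1]]
normalised = quantile_normalise(screens)
```

After normalisation every screen shares the same set of values, so differences in the overall spread of fold changes between screens no longer affect the comparison.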

Defining Fitness Genes

Loss-of-fitness scores are generated from the corrected fold changes (FCs) for the individual Sanger and Broad datasets and for the combined CRISPR-Cas9 dataset, using BAGEL2 (Kim et al. Genome Med 2021) to call significantly depleted genes. Gene-level Bayes factors (BFs) are calculated as the average of the BFs of the sgRNAs targeting each gene. As input to BAGEL2 we used reference sets of predefined essential and non-essential genes, further curated to exclude high-confidence cancer driver genes. A statistical significance threshold for gene-level BFs is determined for each cell line: the BF at the 5% FDR threshold, obtained by classifying the reference essential and non-essential genes based on their BF rankings. Each gene is then assigned a scaled BF, computed by subtracting this cell-line-specific threshold from the original BF. For consistency of visualisation, all scaled BF values are multiplied by -1, so that significantly depleted genes have a loss-of-fitness score < 0. Archived Project Score v1 data was processed using an implementation of BAGEL in R as described (Behan F, et al. Nature. 2019).
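The scaling described above can be sketched as follows. This is a minimal illustration, not the BAGEL2 implementation: the function names and the dictionary inputs are hypothetical, and the threshold rule (the lowest BF in the reference ranking at which the cumulative false discovery rate is still at most 5%) is one plausible reading of the description.

```python
def gene_level_bf(sgrna_bfs):
    """Average sgRNA-level BFs per gene. sgrna_bfs maps gene -> list of BFs."""
    return {gene: sum(bfs) / len(bfs) for gene, bfs in sgrna_bfs.items()}


def bf_threshold_at_fdr(gene_bf, essential, nonessential, fdr=0.05):
    """Lowest BF at which the ranked reference genes still satisfy FDR <= fdr."""
    reference = [(bf, gene in essential) for gene, bf in gene_bf.items()
                 if gene in essential or gene in nonessential]
    reference.sort(key=lambda item: item[0], reverse=True)

    threshold = None
    true_pos = false_pos = 0
    for bf, is_essential in reference:
        if is_essential:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / (true_pos + false_pos) <= fdr:
            threshold = bf
    if threshold is None:
        raise ValueError("no BF cut-off satisfies the requested FDR")
    return threshold


def fitness_scores(gene_bf, threshold):
    """Scaled BF multiplied by -1, so significant depletion gives a score < 0."""
    return {gene: -1 * (bf - threshold) for gene, bf in gene_bf.items()}


# Toy cell line: two reference essential genes, two non-essential, two others.
bfs = gene_level_bf({"E1": [10, 12], "E2": [6, 8], "N1": [-4, -6],
                     "N2": [-10, -8], "X": [9, 9], "Y": [0, 2]})
cutoff = bf_threshold_at_fdr(bfs, essential={"E1", "E2"}, nonessential={"N1", "N2"})
scores = fitness_scores(bfs, cutoff)
```

In this toy example gene X has a BF above the cut-off and so receives a negative score, while gene Y falls below the cut-off and receives a positive one.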

Gene Fitness Metrics

  • Fitness Score: based on scaled BF from BAGEL. A score < 0 indicates a statistically significant effect on cell fitness.
  • Corrected fold change: copy-number-bias-corrected gene depletion fold change, computed between the average representation of the gene's targeting sgRNAs at the end of the experiment (post-transfection) and in the plasmid library.
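As a rough illustration of the fold change underlying the second metric, the sketch below computes a gene-level log2 fold change from read counts. The log2 scale, the reads-per-million scaling, the pseudocount and all names are assumptions made for this example; the copy-number bias correction performed by CRISPRcleanR is not shown.

```python
import math


def sgrna_log2_fc(plasmid_counts, final_counts, pseudocount=0.5):
    """log2 fold change per sgRNA, end of experiment versus plasmid library."""
    total_plasmid = sum(plasmid_counts.values())
    total_final = sum(final_counts.values())
    lfc = {}
    for sgrna, plasmid in plasmid_counts.items():
        # Scale to reads per million so that sequencing depth cancels out.
        plasmid_rpm = plasmid / total_plasmid * 1e6 + pseudocount
        final_rpm = final_counts[sgrna] / total_final * 1e6 + pseudocount
        lfc[sgrna] = math.log2(final_rpm / plasmid_rpm)
    return lfc


def gene_log2_fc(sgrna_lfc, sgrna_to_gene):
    """Average the sgRNA-level fold changes over the sgRNAs targeting each gene."""
    per_gene = {}
    for sgrna, value in sgrna_lfc.items():
        per_gene.setdefault(sgrna_to_gene[sgrna], []).append(value)
    return {gene: sum(values) / len(values) for gene, values in per_gene.items()}


plasmid = {"g1_a": 100, "g1_b": 100, "g2_a": 100, "g2_b": 100}
final = {"g1_a": 25, "g1_b": 25, "g2_a": 175, "g2_b": 175}
mapping = {"g1_a": "GENE1", "g1_b": "GENE1", "g2_a": "GENE2", "g2_b": "GENE2"}
gene_fc = gene_log2_fc(sgrna_log2_fc(plasmid, final, pseudocount=0), mapping)
```

A negative value indicates that the sgRNAs targeting the gene are depleted at the end of the experiment relative to the plasmid library.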

Core Fitness Genes

Fitness genes common to the majority of cell lines tested, or common within a cancer type, may be involved in essential cellular processes; we refer to these as core fitness genes. To identify core fitness genes, we developed a statistical method, ADaM (Adaptive Daisy Model), which adaptively determines the minimum number of dependent cell models required for a gene to be classified as a core fitness gene. ADaM was applied at both the cancer-type-specific level and the pan-cancer level (code publicly available at https://github.com/francescojm/ADAM).
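The classification step that follows once a minimum number of dependent cell models is known can be sketched as below. This is not ADaM itself: the adaptive selection of the threshold is not reproduced, and min_lines is a hypothetical fixed input standing in for the value ADaM would determine. The nested dictionary layout is also an assumption.

```python
def dependent_line_counts(fitness_scores):
    """Count, per gene, the cell lines with a fitness score < 0 (a dependency).

    fitness_scores maps gene -> {cell line -> fitness score}.
    """
    return {gene: sum(1 for score in by_line.values() if score < 0)
            for gene, by_line in fitness_scores.items()}


def core_fitness_genes(fitness_scores, min_lines):
    """Genes that are dependencies in at least min_lines cell lines."""
    counts = dependent_line_counts(fitness_scores)
    return sorted(gene for gene, n in counts.items() if n >= min_lines)


scores = {
    "A": {"c1": -1.0, "c2": -2.0, "c3": -0.5},
    "B": {"c1": -1.0, "c2": 3.0, "c3": 2.0},
    "C": {"c1": -1.0, "c2": -2.0, "c3": 1.0},
}
core = core_fitness_genes(scores, min_lines=2)
```

Running the same function on the cell lines of a single cancer type, or on all cell lines, mirrors the cancer-type-specific and pan-cancer levels described above.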


Experimental Method (Sanger Data)

The CRISPR library used for this screen (Tzelepis et al., 2016; available from Addgene, Cat. no. 67989) contains 90,709 sgRNAs targeting 18,009 genes (~5 sgRNAs/gene). All pooled screens were completed in technical triplicate at 100x coverage of the library (i.e. ~100 cells per sgRNA were transduced). Stringent quality controls are applied at all stages of the experimental pipeline, including:

  • every screened cell line has >75% Cas9 activity;
  • every cell line is transduced with the library at >15% efficiency;
  • cells are monitored for changes in morphology or growth rates following lentiviral transduction;
  • a DNA yield of >72 µg is required to maintain library coverage;
  • quality and size of all PCR products are checked.

Further rigorous quality control assessments of the data were also completed, as described in the associated manuscript. Only data satisfying all quality control measures are included in this dataset.
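The library figures and the quantitative thresholds listed above can be checked with a short sketch. The class, field and function names are hypothetical, and only the three numeric criteria are encoded; the morphology, growth rate and PCR product checks are qualitative and are not represented.

```python
from dataclasses import dataclass

LIBRARY_SGRNAS = 90_709   # sgRNAs in the library
TARGETED_GENES = 18_009   # genes targeted by the library
COVERAGE = 100            # transduced cells per sgRNA


def cells_needed(n_sgrnas=LIBRARY_SGRNAS, coverage=COVERAGE):
    """Transduced cells required to represent each sgRNA `coverage` times."""
    return n_sgrnas * coverage


@dataclass
class ScreenQC:
    cas9_activity_pct: float
    transduction_efficiency_pct: float
    dna_yield_ug: float


def failed_checks(qc):
    """Return the names of the numeric criteria that a screen does not meet."""
    failures = []
    if not qc.cas9_activity_pct > 75:
        failures.append("Cas9 activity")
    if not qc.transduction_efficiency_pct > 15:
        failures.append("transduction efficiency")
    if not qc.dna_yield_ug > 72:
        failures.append("DNA yield")
    return failures


sgrnas_per_gene = LIBRARY_SGRNAS / TARGETED_GENES
example = failed_checks(ScreenQC(cas9_activity_pct=80,
                                 transduction_efficiency_pct=10,
                                 dna_yield_ug=72))
```

The thresholds are strict inequalities, so a screen with exactly 72 µg of DNA does not pass the yield criterion in this sketch.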

Publication reference: Behan, F.M., Iorio, F., Picco, G. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature, 2019.