CRISPR: Dependencies

This page outlines raw data and the subsequent processing for use within the Cancer Dependency Map at Sanger. This data can be accessed through the Cell Model Passport and explored using DepMap Miner.

Raw Data

Dataset Name	Origin	Data Type	Link
Sanger CRISPR Dependencies 2024	Sanger	Unprocessed sgRNA read counts	Sanger sgRNA Raw Counts
CRISPR (Avana) Public 21Q2	Broad	Unprocessed sgRNA read counts	Achilles_gene_effect_unscaled.csv

Processed Data (Cell Lines)

This section describes how the raw data were processed, including the algorithms and filtering applied. Processed datasets are available for download, and the active dataset can also be accessed through the DepMap web resources and API.

Corrected Fold Changes

Gene-independent responses to CRISPR-Cas9 targeting (e.g. those driven by copy number) were corrected using CRISPRcleanR.

Sixteen additional screens generated at Sanger were added to the Project Score data. The screens were combined using ComBat, with correction vectors estimated from eight models present in both datasets to account for differences in assay length. The Sanger day 14 screens were used as the baseline, and the day 18 screens were batch corrected using the overlapping cell lines. Once the overlapping cell lines had been used to estimate the batch effects, only the day 14 screens were retained for further analysis (Pacini et al., Cancer Cell 2024).
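The idea of estimating a correction from models screened in both batches can be illustrated with a deliberately simplified sketch. This is not ComBat: ComBat also adjusts scale and applies empirical Bayes shrinkage, whereas the code below only subtracts a per-gene mean shift. All function names, gene names, model names and values are hypothetical.

```python
from statistics import mean

def estimate_shift(baseline, other, shared_models):
    """Per-gene mean difference (other - baseline) across models present in both batches."""
    genes = baseline.keys() & other.keys()
    return {
        gene: mean(other[gene][m] - baseline[gene][m] for m in shared_models)
        for gene in genes
    }

def apply_shift(other, shift):
    """Subtract the estimated per-gene shift from every model in the non-baseline batch."""
    return {
        gene: {model: fc - shift[gene] for model, fc in per_model.items()}
        for gene, per_model in other.items()
        if gene in shift
    }

# Toy fold changes: gene -> model -> value (hypothetical numbers).
day14 = {
    "GENE_A": {"M1": -1.0, "M2": -0.8},
    "GENE_B": {"M1": 0.1, "M2": 0.0},
}
day18 = {
    "GENE_A": {"M1": -1.5, "M2": -1.3, "M3": -1.6},
    "GENE_B": {"M1": 0.2, "M2": 0.1, "M3": 0.3},
}

shift = estimate_shift(day14, day18, shared_models=["M1", "M2"])
corrected_day18 = apply_shift(day18, shift)
```

In this toy case the day 18 values for GENE_A sit about 0.5 below the day 14 baseline in the shared models, so the same offset is removed from M3, which was screened in only one batch.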

Independent datasets generated at Sanger and Broad showed batch effects due to differences in library design and assay length (Dempster et al. Nat Commun 2019). To allow combined analysis, the two datasets were batch corrected to account for these technical differences (Pacini et al. Nat Commun 2021). The correction was estimated with ComBat on a set of overlapping cell lines screened at both institutes. Where overlapping cell lines were present, the Sanger data were preferentially selected for inclusion in the final combined datasets. Finally, the first principal component, which is associated with differences in media, was removed from the data.
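Removing a first principal component can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: rows are taken to be cell lines and columns genes, the component is found by power iteration, and column means are added back after the projection is subtracted. The function name and toy matrix are hypothetical.

```python
import math

def remove_first_pc(matrix, n_iter=500):
    """Return (residual, pc1): the matrix with its first principal component removed."""
    n, p = len(matrix), len(matrix[0])
    col_means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - col_means[j] for j in range(p)] for row in matrix]

    # Power iteration on X^T X from a fixed, non-uniform start vector.
    # (A start vector exactly orthogonal to PC1 would not converge to it.)
    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(w * w for w in start))
    v = [w / norm for w in start]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:
            break
        v = [w / norm for w in new_v]

    # Subtract each row's projection onto PC1, then restore the column means.
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    residual = [
        [centered[i][j] - scores[i] * v[j] + col_means[j] for j in range(p)]
        for i in range(n)
    ]
    return residual, v

# Toy matrix: 4 cell lines x 3 genes (hypothetical values).
toy = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, -0.1],
    [4.0, 8.0, 0.0],
]
residual, pc1 = remove_first_pc(toy)
```

For real data a library routine such as a singular value decomposition would normally replace the hand-written power iteration; the pure-Python version is used here only to keep the sketch self-contained.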

Defining Fitness Genes

Loss of fitness scores are generated from corrected fold changes (FCs) for the individual Sanger and Broad datasets and for the combined CRISPR-Cas9 dataset, using BAGEL2 (Kim et al. Genome Med 2021) to call significantly depleted genes. Gene-level Bayes factors (BFs) are calculated as the average of the BFs of the sgRNAs targeting each gene. As input to BAGEL2 we used reference sets of predefined essential and non-essential genes, further curated to exclude high-confidence cancer driver genes. A statistical significance threshold for gene-level BFs is determined for each cell line: the BF at the 5% FDR threshold, obtained by classifying the reference essential and non-essential genes based on their BF rankings. Each gene is then assigned a scaled BF, computed by subtracting that cell line's threshold BF from the gene's original BF. For consistency of visualisation, all scaled BF values are multiplied by -1, so that significantly depleted genes have a loss of fitness score < 0. Archived Project Score v1 data was processed using an implementation of BAGEL in R, as described (Behan F, et al. Nature. 2019).
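The scaling step described above can be sketched in a few lines. This is one plausible reading of the procedure, not the BAGEL2 implementation: the threshold is taken as the lowest BF in the ranked reference genes at which the fraction of non-essential genes called so far is still at or below 5%, and ties are not handled specially. All function names, gene names and BF values are hypothetical.

```python
from statistics import mean

def gene_level_bf(sgrna_bfs):
    """Average the sgRNA-level BFs for each targeted gene."""
    return {gene: mean(values) for gene, values in sgrna_bfs.items()}

def bf_threshold(gene_bf, essential, nonessential, fdr=0.05):
    """Lowest BF in the ranked reference genes at which FDR is still <= fdr."""
    reference = sorted(
        ((bf, gene) for gene, bf in gene_bf.items()
         if gene in essential or gene in nonessential),
        reverse=True,
    )
    true_pos = false_pos = 0
    threshold = None
    for bf, gene in reference:
        if gene in essential:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / (true_pos + false_pos) <= fdr:
            threshold = bf
    if threshold is None:
        raise ValueError("no BF cut-off satisfies the requested FDR")
    return threshold

def loss_of_fitness(gene_bf, threshold):
    """Scaled BF (BF minus threshold) multiplied by -1."""
    return {gene: -(bf - threshold) for gene, bf in gene_bf.items()}

# Toy sgRNA-level BFs for one cell line (hypothetical values).
sgrna_bfs = {
    "ESS_1": [11.0, 9.0], "ESS_2": [8.0, 8.0], "ESS_3": [5.0, 7.0],
    "NON_1": [1.0, 1.0], "NON_2": [-6.0, -4.0],
    "GENE_X": [7.0, 8.0], "GENE_Y": [3.0, 1.0],
}
essential = {"ESS_1", "ESS_2", "ESS_3"}
nonessential = {"NON_1", "NON_2"}

bfs = gene_level_bf(sgrna_bfs)
threshold = bf_threshold(bfs, essential, nonessential)
scores = loss_of_fitness(bfs, threshold)
```

With these toy values the threshold falls at the BF of the lowest-ranked reference essential gene, so GENE_X (BF above the threshold) receives a negative loss of fitness score and GENE_Y a positive one.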

Gene Fitness Metrics

Loss of Fitness Score: Scaled Bayes factor values multiplied by -1. A score < 0 indicates a statistically significant effect on cell fitness.
Corrected Fold Change: Copy-number-bias-corrected gene depletion fold change, computed between the average representation of targeting sgRNAs post-transfection at the end of the experiment and their representation in the plasmid library.
Binary Gene Essentiality Score: Genes with a scaled Bayes factor > 0 (equivalently, a loss of fitness score < 0) are assigned 1 as significantly depleted, and 0 otherwise.
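As a rough illustration of the fold change metric, the sketch below computes an uncorrected gene-level log2 fold change from read counts. Several details are assumptions rather than the documented pipeline: counts are normalised to reads per million, a pseudocount of 0.5 is added, replicates are averaged before taking the ratio, and the copy-number correction performed by CRISPRcleanR is not included. All names and counts are hypothetical.

```python
import math
from statistics import mean

def normalise(counts, scale=1_000_000):
    """Scale raw sgRNA counts to reads per million."""
    total = sum(counts.values())
    return {sgrna: count / total * scale for sgrna, count in counts.items()}

def gene_log2_fc(plasmid, end_replicates, sgrna_to_gene, pseudocount=0.5):
    """Mean log2(end / plasmid) across the sgRNAs targeting each gene."""
    plasmid_norm = normalise(plasmid)
    reps_norm = [normalise(rep) for rep in end_replicates]
    per_gene = {}
    for sgrna, gene in sgrna_to_gene.items():
        end_avg = mean(rep[sgrna] for rep in reps_norm)
        lfc = math.log2((end_avg + pseudocount) / (plasmid_norm[sgrna] + pseudocount))
        per_gene.setdefault(gene, []).append(lfc)
    return {gene: mean(values) for gene, values in per_gene.items()}

# Toy counts (hypothetical): GENE_A sgRNAs drop out, GENE_B sgRNAs do not.
sgrna_to_gene = {"sgA1": "GENE_A", "sgA2": "GENE_A", "sgB1": "GENE_B", "sgB2": "GENE_B"}
plasmid = {"sgA1": 100, "sgA2": 100, "sgB1": 100, "sgB2": 100}
end_replicates = [
    {"sgA1": 25, "sgA2": 25, "sgB1": 175, "sgB2": 175},
    {"sgA1": 25, "sgA2": 25, "sgB1": 175, "sgB2": 175},
]

fold_changes = gene_log2_fc(plasmid, end_replicates, sgrna_to_gene)
```

Here the GENE_A sgRNAs fall to a quarter of their plasmid representation, giving a log2 fold change close to -2, while GENE_B is mildly enriched.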

Core Fitness Genes

Fitness genes common to the majority of cell lines tested, or common within a cancer type, may be involved in cell-essential processes; we refer to these as core fitness genes. To identify core fitness genes, we developed a statistical method, ADaM (Adaptive Daisy Model), which adaptively determines the minimum number of dependent cell models required for a gene to be classified as a core fitness gene. ADaM was applied at both a cancer-type-specific level and a pan-cancer level (code publicly available at https://github.com/DepMap-Analytics/ADAM2).

Experimental Method (Sanger Data - Cell Lines)

The CRISPR library used for the Project Score screen (Tzelepis et al., 2016; available from Addgene, Cat. no. 67989) contains 90,709 sgRNAs targeting 18,009 genes (~5 sgRNAs/gene). All pooled screens were completed in technical triplicate at 100x coverage of the library (i.e. ~100 cells were transduced per sgRNA). Stringent quality controls are applied at all stages of the experimental pipeline, including:

every screened cell line has >75% Cas9 activity;
every cell line is transduced with the library at >15% efficiency;
cells are monitored for changes in morphology or growth rates following lentiviral transduction;
a DNA yield of >72 µg is required to maintain library coverage;
quality and size of all PCR products are checked.
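The numeric thresholds in the list above, together with the coverage arithmetic, can be expressed as a small check. This is an illustrative sketch only: the class, field and function names are hypothetical, and the morphology, growth-rate and PCR product checks are manual assessments that are not modelled here.

```python
from dataclasses import dataclass

LIBRARY_SGRNAS = 90_709   # sgRNAs in the library
TARGETED_GENES = 18_009   # genes targeted by the library
COVERAGE = 100            # transduced cells per sgRNA

@dataclass
class ScreenQC:
    cas9_activity_pct: float
    transduction_efficiency_pct: float
    dna_yield_ug: float

def qc_failures(qc):
    """Return a list of failed numeric QC criteria (empty if all pass)."""
    failures = []
    if qc.cas9_activity_pct <= 75:
        failures.append("Cas9 activity must exceed 75%")
    if qc.transduction_efficiency_pct <= 15:
        failures.append("transduction efficiency must exceed 15%")
    if qc.dna_yield_ug <= 72:
        failures.append("DNA yield must exceed 72 ug")
    return failures

# Transduced cells needed to hold 100x coverage: 90,709 * 100 = 9,070,900.
cells_needed = LIBRARY_SGRNAS * COVERAGE
# Average library density: about 5.04 sgRNAs per gene.
sgrnas_per_gene = LIBRARY_SGRNAS / TARGETED_GENES

passing = qc_failures(ScreenQC(cas9_activity_pct=80, transduction_efficiency_pct=20, dna_yield_ug=75))
failing = qc_failures(ScreenQC(cas9_activity_pct=70, transduction_efficiency_pct=10, dna_yield_ug=72))
```

Returning the list of failed criteria, rather than a single pass/fail flag, makes it easier to see why a given screen would be excluded.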

Further rigorous quality control assessments of the data are also completed, as described in the associated manuscript. Only data satisfying all quality control measures are included in this dataset.

Publication reference: Behan, F.M., Iorio, F., Picco, G. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature, 2019.

Additional data generated in head and neck squamous cell carcinoma models has been incorporated.

Publication reference: Annie Wai Yeeng Chai et al. Genome-wide CRISPR screens of oral squamous cell carcinoma reveal fitness genes in the Hippo pathway. eLife. 2020
