Expression

This page describes the raw expression data and its subsequent processing for use within the Cancer Dependency Map at Sanger.

Raw Data

Dataset	Origin	Data Type	Model Type	Details	Link
RNA-Seq	Sanger	BAM	Cell Line	Illumina HiSeq 2000	EGAS00001000828
RNA-Seq	Broad	BAM	Cell Line	Illumina HiSeq 2000 or HiSeq 2500	PRJNA169425
RNA-Seq	Sanger	BAM	Organoid	Illumina HiSeq 4000	To be published.

Processed Data

This section describes how the raw data were processed, including the algorithms and filtering applied. Processed datasets can be downloaded here; the active dataset can also be accessed through the DepMap web resources and API.

RNA-Seq (Cell Lines)

RNA-Seq data were collated from the Wellcome Sanger Institute and the Broad Institute (https://depmap.org/portal/data_page/?tab=allData). The Broad Institute data (release 24Q2, files OmicsExpressionGenesExpectedCountProfile.csv and OmicsExpressionAllGenesTPMLogp1Profile.csv) were processed using the Broad Institute's DepMap Omics pipeline (https://github.com/broadinstitute/depmap_omics#rnaseq).

Read counts and TPM (transcripts per million) values for both data sources were inferred using the RSEM tool (https://doi.org/10.1186/1471-2105-12-323). TPM values are reported after log2 transformation with a pseudo-count of 1, that is, log2(TPM+1). FPKM values are also available for the Sanger dataset.
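
As an illustration of the log2(TPM+1) transformation described above, the following minimal Python sketch applies the transform and its inverse. The function names and the gene identifiers are hypothetical and are not part of the DepMap or RSEM tooling; only the formula itself comes from this page.

```python
import math


def log2_tpm_plus_one(tpm: float) -> float:
    """Return log2(TPM + 1) for a non-negative TPM value."""
    if tpm < 0:
        raise ValueError("TPM values must be non-negative")
    return math.log2(tpm + 1.0)


def tpm_from_log2(value: float) -> float:
    """Invert the transform: recover TPM from a log2(TPM + 1) value."""
    return 2.0 ** value - 1.0


# Hypothetical TPM values for three genes in a single sample.
example_tpm = {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 7.0}

# The pseudo-count of 1 keeps unexpressed genes (TPM = 0) at exactly 0.
transformed = {gene: log2_tpm_plus_one(v) for gene, v in example_tpm.items()}
```

The pseudo-count avoids taking the logarithm of zero, so a gene with a TPM of 0 maps to 0 rather than to negative infinity. Downloaded values that are already log-transformed can be converted back to TPM with the inverse function if the linear scale is needed.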

Sanger data were originally processed using HTSeq (https://doi.org/10.1093/bioinformatics/btu638). Read counts and FPKM values from HTSeq are still available in the long-format file.

The data presented through the API and the Cell Model Passports website combine the Sanger and Broad datasets. Where a cell model has been sequenced at both institutes, the Sanger data have been selected for the merged dataset. The merged dataset is available for download either as separate wide-format files (genes by row, samples by column) for read counts, TPM and FPKM values, or as a full dataset in long format.
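
The merging rule and the two file layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the model and gene identifiers, and the function names are hypothetical, and the sketch assumes that the Sanger preference is applied per cell model, as the paragraph above suggests. It does not reproduce the actual pipeline.

```python
from typing import Dict, List, Tuple

# One long-format record: (model_id, gene, value). All identifiers are hypothetical.
Record = Tuple[str, str, float]


def merge_prefer_sanger(sanger: List[Record], broad: List[Record]) -> List[Record]:
    """Combine both sources, dropping Broad records for models present in the Sanger data."""
    sanger_models = {model for model, _, _ in sanger}
    kept_broad = [rec for rec in broad if rec[0] not in sanger_models]
    return list(sanger) + kept_broad


def long_to_wide(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Pivot long records into gene -> {model: value} (genes by row, samples by column)."""
    wide: Dict[str, Dict[str, float]] = {}
    for model, gene, value in records:
        wide.setdefault(gene, {})[model] = value
    return wide


# MODEL_1 appears in both sources, so only its Sanger values are kept.
sanger = [("MODEL_1", "GENE_A", 2.0), ("MODEL_1", "GENE_B", 0.5)]
broad = [("MODEL_1", "GENE_A", 2.4), ("MODEL_2", "GENE_A", 1.1)]

merged = merge_prefer_sanger(sanger, broad)
wide = long_to_wide(merged)
```

In this toy example the merged long-format list holds three records, and the wide layout has one row per gene with a column for each model that has a value for that gene.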

RNA-Seq (Organoids)

Read counts, FPKM values and TPM (transcripts per million) values were inferred using the RSEM tool (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323). TPM values are reported after log2 transformation with a pseudo-count of 1, that is, log2(TPM+1).

Experimental Method (Cell Lines - Sanger Data)

For sequencing performed at the Sanger Institute, cell line pellets were collected during exponential growth in RPMI or Dulbecco’s Modified Eagle’s Medium/F12, lysed with TRIzol (Life Technologies) and stored at −70 °C. Following chloroform extraction, total RNA was isolated using the RNeasy Mini Kit (Qiagen). DNase digestion was followed by the RNAClean Kit (Agencourt Bioscience). RNA integrity was confirmed on a Bioanalyzer 2100 (Agilent Technologies) prior to labeling using 3′ IVT Express (Affymetrix). Sequence libraries were prepared in an automated fashion on the Agilent Bravo platform using the stranded mRNA Library Prep Kit from KAPA Biosystems. Processing steps were unchanged from those specified in the KAPA manual, except for the use of an in-house indexing set.

Publication reference: Picco, G., Chen, E.D., Alonso, L.G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nature Communications (2019).