1 Running PinAPL-Py
1.1 QUICK START:
Step 1: SET UP A RUN
Enter a project name for your analysis run. This name will help you identify your results in case you do multiple runs in a row. Provision of an email address is optional, but will let you safely close the browser during the analysis and receive a notification after completion.
Step 2: UPLOAD DATA
Upload your files via the drag-and-drop frame. Uncompressed format (.fastq) is supported, but compressed (.fastq.gz) is recommended.
Step 3: ENTER SAMPLE INFORMATION
Enter the name of the condition each file represents. Files representing replicates of the same condition have to be given the same name. Do not number your replicates. Numbering is done automatically by the program and displayed on the results page after completion of the analysis.
Please mark all control replicates with the checkbox to the right.
Step 4: CONFIGURE YOUR ANALYSIS RUN
First, choose the screen type. Choose between “enrichment” (e.g. a drug resistance screen) or “depletion” (e.g. a gene-essentiality screen), depending on whether your screen aims at finding sgRNAs of high or low abundance, respectively.
Next, choose the sgRNA library used in your screen from the dropdown menu. If your screen uses a library not present in the list or a custom library, see “Uploading a custom library” in the Advanced Options below.
Optional: If you would like to edit the default parameter settings, click Advanced Options. For instructions on these parameters, see “Parameter description” in the Advanced Options section.
Step 5: RUNNING AND COMPLETION
You can follow the program’s execution log by refreshing the page repeatedly. If another run was started shortly before yours, your run will be queued and will start once the previous run has completed.
If you provided an email address, you can close the browser; you will be notified by email and sent a link to the results after completion. Otherwise, please leave the progress screen open.
The results will remain on the server for 5 days. You can download all content shown on the results page in a single ZIP archive.
1.2 ADVANCED OPTIONS:
sgRNA Sequence Length (default = 20)
The length of your sgRNA sequence in the reads.
Adapter error rate (default = 0.1)
Error rate (mismatches and indels) allowed for the identification of the 5’ adapter (refer to the cutadapt manual for more details). Increasing this rate can help to compensate for poor sequence quality.
Matching threshold (default = 40)
Minimal alignment score required to consider a read successfully matched. To require a perfect match, this must be double the sgRNA sequence length (refer to the Bowtie2 manual for more details on the calculation of the alignment score). Decreasing this threshold will include reads with a less than optimal match to a library entry, which can help to increase sensitivity or to compensate for poor sequence quality.
Ambiguity threshold (default = 2):
Minimum tolerated difference between primary (best) and secondary (second-best) alignment to consider a read successfully matched. Reads with a difference lower than this threshold will be considered ambiguous and discarded. Increasing this threshold increases stringency. Decreasing this threshold increases sensitivity. With a threshold of 0, the program will accept reads even if they match multiple library entries equally well.
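The interplay of the matching threshold and the ambiguity threshold can be sketched as follows. This is a minimal illustration, not PinAPL-Py's actual code: the function name `classify_read` and the category labels are hypothetical, alignment scores are treated as plain integers with 0 meaning "no alignment", and treating a read below the matching threshold as "failed" is an assumption.

```python
def classify_read(primary, secondary, match_threshold=40, ambiguity_threshold=2):
    """Decide whether a read is kept, based on its two best alignment scores.

    primary / secondary: best and second-best alignment score (0 = no alignment).
    The defaults mirror the documented defaults (40 and 2).
    """
    if primary < match_threshold:
        return "failed"      # best alignment does not reach the matching threshold
    if secondary == 0:
        return "unique"      # only one library entry matched
    if primary - secondary >= ambiguity_threshold:
        return "tolerated"   # several hits, but the best one is clearly better
    return "ambiguous"       # hits are too similar: the read is discarded


print(classify_read(40, 38))   # difference of 2 meets the default threshold
print(classify_read(40, 39))   # difference of 1 is below the default threshold
```

With an ambiguity threshold of 0, the difference test always passes, so reads matching several library entries equally well are accepted, as described above.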
Seed length (default = 11):
Seed length parameter for Bowtie2 alignment (-L, refer to the Bowtie2 manual for more details). Changing this parameter is generally not required.
Seed number (default = 1):
Number of allowed mismatches for Bowtie2 seed alignment (-N, refer to the Bowtie2 manual for more details). Changing this parameter is generally not required.
Seed interval function (default = ‘S,1,0.75’):
Bowtie2 seed interval function (-i, refer to the Bowtie2 manual for more details). Changing this parameter is generally not required.
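To get a feeling for the default value, the interval function string can be evaluated by hand. According to the Bowtie2 manual, a string of the form `S,a,b` stands for f(x) = a + b * sqrt(x), where x is the read length, and results below 1 are rounded up to 1. The helper below is a hypothetical sketch (the name `seed_interval` is not part of Bowtie2 or PinAPL-Py) covering the linear (L), square-root (S) and natural-log (G) function types.

```python
import math


def seed_interval(func, read_length):
    """Evaluate a Bowtie2-style interval function string such as 'S,1,0.75'."""
    kind, const, coeff = func.split(",")
    const, coeff = float(const), float(coeff)
    transforms = {
        "L": lambda x: x,  # linear
        "S": math.sqrt,    # square root
        "G": math.log,     # natural logarithm
    }
    value = const + coeff * transforms[kind](read_length)
    return max(1.0, value)  # results below 1 are rounded up to 1


# Default setting, evaluated for a 20 bp read (roughly 4.35)
print(round(seed_interval("S,1,0.75", 20), 2))
```

Assuming the trimmed read is 20 bp long, the default therefore places seeds a little more than four positions apart.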
Normalization (default = ‘cpm’):
Method of read count normalization.
- cpm: Counts per million. Read counts are divided by the number of total read counts in the sample and multiplied by 1,000,000.
- total: Read counts are divided by the number of total read counts in the sample and multiplied by the mean total read count across all samples.
- size: Read counts are normalized using median ratios and the “size-factor” method, as described in (Li et al., 2014; Anders and Huber, 2010).
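The three normalization methods can be sketched in a few lines. This is an illustration under stated assumptions, not the PinAPL-Py source: the function names are hypothetical, counts are plain Python lists (one value per sgRNA), and the size-factor variant skips sgRNAs with a zero count in any sample when computing the median of ratios.

```python
import math
from statistics import median


def normalize_cpm(counts):
    """Counts per million for one sample."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


def normalize_total(samples):
    """Scale every sample to the mean total read count across all samples."""
    totals = {name: sum(counts) for name, counts in samples.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {name: [c / totals[name] * mean_total for c in counts]
            for name, counts in samples.items()}


def size_factors(samples):
    """Median-of-ratios size factor per sample (Anders and Huber, 2010)."""
    names = list(samples)
    n_sgrnas = len(samples[names[0]])
    # Assumption: sgRNAs with a zero count in any sample are left out
    usable = [i for i in range(n_sgrnas)
              if all(samples[n][i] > 0 for n in names)]
    # Log of the geometric mean of each usable sgRNA across samples
    log_geo_mean = {i: sum(math.log(samples[n][i]) for n in names) / len(names)
                    for i in usable}
    return {n: math.exp(median([math.log(samples[n][i]) - log_geo_mean[i]
                                for i in usable]))
            for n in names}


# Toy data: sample "b" was sequenced exactly twice as deeply as sample "a"
toy = {"a": [1, 3], "b": [2, 6]}
print(normalize_cpm(toy["a"]))
print(normalize_total(toy))
print(size_factors(toy))
```

Dividing each sample's counts by its size factor yields the size-normalized counts.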
Cutoff (default = 0):
Cutoff threshold (given in cpm) to filter out low sgRNA counts. sgRNAs with counts lower than the cutoff will be set to 0 counts. If low counts are of minor interest for the experiment (e.g. in an enrichment screen), this can be helpful to reduce noise in the data.
Round counts (default = No):
Round counts after normalization to avoid fractional counts. Rounding only affects visualization, but not significance analysis.
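The effect of the cutoff and rounding options on a list of normalized counts can be illustrated as follows. The function names are hypothetical and the sketch assumes counts given in cpm. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding used by the program.

```python
def apply_cutoff(cpm_counts, cutoff=0):
    """Set every count below the cutoff to 0; other counts are unchanged."""
    return [c if c >= cutoff else 0 for c in cpm_counts]


def round_counts(counts):
    """Round normalized counts to whole numbers."""
    return [round(c) for c in counts]


counts = [0.4, 3.2, 10.7]
filtered = apply_cutoff(counts, cutoff=1)
print(filtered)                # the first entry drops to 0
print(round_counts(filtered))
```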
Gene Metric (default = “αRRA”):
Method to combine the sgRNA enrichment/depletion data for ranking of genes:
- αRRA: Adjusted robust rank aggregation (Li et al., 2014). This method ranks genes based on a Beta model of the aggregation of sgRNAs. It requires an sgRNA to achieve at least a certain critical p-value (see “P0” parameter below) to be taken into account.
- STARS: STARS score (Doench et al., 2016). This method ranks genes based on a binomial model. It requires a gene to have at least two sgRNAs ranked among the top x% (see “sgRNA percentage” parameter below).
For more details on these methods, please refer to the original publications.
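As a rough intuition for the αRRA idea, the sketch below computes a simplified RRA-style score for a single gene. It is not the PinAPL-Py or MAGeCK implementation: the function names are hypothetical, the rank percentiles (rank of each sgRNA divided by the library size) are assumed to be given, and the subsequent permutation step that turns the score into a p-value is omitted. The Beta model enters through the distribution of the k-th smallest of n uniform values, which is Beta(k, n-k+1).

```python
from math import comb


def order_stat_cdf(u, k, n):
    """P(k-th smallest of n uniform(0,1) values <= u), the Beta(k, n-k+1) CDF."""
    return sum(comb(n, j) * u**j * (1 - u)**(n - j) for j in range(k, n + 1))


def rra_score(rank_percentiles, sgrna_pvalues, p0=0.0005):
    """Simplified alpha-RRA-style score for one gene (lower = stronger signal).

    Only sgRNAs whose p-value is below the critical value p0 may define the score.
    """
    n = len(rank_percentiles)
    passing = sorted(u for u, p in zip(rank_percentiles, sgrna_pvalues) if p < p0)
    if not passing:
        return 1.0  # no sgRNA passes P0: the gene gets the weakest score
    return min(order_stat_cdf(u, k, n) for k, u in enumerate(passing, start=1))


# Hypothetical gene with four sgRNAs, two of which rank near the top
print(rra_score([0.001, 0.002, 0.5, 0.9], [1e-5, 1e-4, 0.3, 0.8]))
```

In this toy case the score is driven by the second-best sgRNA: two sgRNAs both ranking in the top 0.2% is much less likely by chance than a single one doing so.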
Number of permutations (default = 1000):
Number of permutations for p-value estimation of the gene ranking score. CAUTION: STARS is more computationally demanding than aRRA, so reducing the number of permutations is recommended in this case.
sgRNA percentage (STARS only) (default = 10):
Percentage of sgRNAs to be included in the ranking analysis. Only relevant if “STARS” method is chosen.
P0 (aRRA only) (default = 0.0005):
Critical p-value for individual sgRNAs to be included in the ranking analysis. Only relevant if “aRRA” method is chosen.
Significance level (sgRNAs) (default = 0.001)
Significance threshold for the fold-change enrichment/depletion of sgRNAs.
Significance level (genes) (default = 0.01)
Significance threshold for the gene ranking score.
p-value adjustment (default = ‘fdr_bh’):
Method for p-value adjustment for multiple tests.
- fdr_bh: Benjamini-Hochberg method.
- fdr_tsbh: Two-stage Benjamini-Hochberg method.
- sidak: Sidak correction method.
- bonferroni: Bonferroni correction method.
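The option names coincide with the `method` values accepted by `statsmodels.stats.multitest.multipletests`, although the text does not state which library is used. To show what the adjustments do, here is a dependency-free sketch of three of them (the two-stage variant is omitted). The function name `adjust_pvalues` is hypothetical.

```python
def adjust_pvalues(pvalues, method="fdr_bh"):
    """Adjust a list of p-values for multiple testing."""
    m = len(pvalues)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvalues]
    if method == "sidak":
        return [1.0 - (1.0 - p) ** m for p in pvalues]
    if method == "fdr_bh":
        # Benjamini-Hochberg: scale the p-value of rank r by m / r, then
        # enforce monotonicity by walking from the largest p-value down
        order = sorted(range(m), key=lambda i: pvalues[i])
        adjusted = [0.0] * m
        running_min = 1.0
        for rank in range(m, 0, -1):
            i = order[rank - 1]
            running_min = min(running_min, pvalues[i] * m / rank)
            adjusted[i] = running_min
        return adjusted
    raise ValueError(f"unsupported method: {method}")


print(adjust_pvalues([0.01, 0.04, 0.03, 0.5]))
print(adjust_pvalues([0.01, 0.5], method="bonferroni"))
```

An sgRNA or gene is then called significant if its adjusted value is smaller than the corresponding significance level.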
Cluster by… (default = ‘variance’):
Criterion for sample clustering.
- variance: Clustering of the samples is based on the sgRNAs with the highest read count variance across all samples.
- counts: Clustering of the samples is based on the sgRNAs with the highest/lowest abundance (depending on whether the screen type is “enrichment” or “depletion”).
Number of sgRNAs for clustering (default = 25):
Specify how many sgRNAs are used for clustering with the method selected above. In case of clustering by counts, the top x sgRNAs from each sample are combined.
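The two selection criteria can be sketched as follows. This is an illustration only: the function names are hypothetical, counts are assumed to be a dictionary mapping each sgRNA ID to its normalized counts (one per sample), and the population variance is used for ranking.

```python
from statistics import pvariance


def top_variance_sgrnas(counts, n=25):
    """The n sgRNAs with the highest read count variance across all samples."""
    ranked = sorted(counts, key=lambda sg: pvariance(counts[sg]), reverse=True)
    return ranked[:n]


def top_count_sgrnas(counts, n=25, screen_type="enrichment"):
    """Union of the n most (or least) abundant sgRNAs of every sample."""
    n_samples = len(next(iter(counts.values())))
    highest_first = screen_type == "enrichment"
    selected = set()
    for s in range(n_samples):
        ranked = sorted(counts, key=lambda sg: counts[sg][s], reverse=highest_first)
        selected.update(ranked[:n])
    return selected


# Toy data: three sgRNAs, three samples
toy = {"sg1": [1, 1, 1], "sg2": [0, 10, 20], "sg3": [5, 6, 7]}
print(top_variance_sgrnas(toy, n=1))
print(sorted(top_count_sgrnas(toy, n=1)))
```

Because the top sgRNAs of each sample are combined, clustering by counts can end up using more than the specified number of sgRNAs in total.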
Dotsize (default = 10):
Size of dots in replicate scatterplots.
Transparency level (default = 0.1):
Transparency of points in scatterplots. A low level is helpful to visualize density.
sgRNA annotation (default = No):
Annotate sgRNAs with their IDs when highlighting individual genes in scatterplots.
Highlight non-targeting controls (default = No):
Highlight non-targeting control sgRNAs in scatterplots.
Table format (default = Text only):
File format for sgRNA and gene tables in the download archive. Use “Text only” for optimal workflow speed. Text files (.tsv) can be manually opened and converted with Excel. Use “Excel” to have the workflow automatically convert all text tables into .xlsx format (WARNING: This increases computation time).
PNG resolution (default = 300):
Resolution for PNG output (dpi).
1.2.2 Uploading a custom library:
Prepare your library file (e.g. in Excel) as a spreadsheet with 3 columns (with headers):
- gene: This column contains an identifier of the gene that is targeted by the sgRNA
- sgRNA_ID: This column contains an identifier of the sgRNA
- sequence: This column contains the 20bp sequence of the sgRNA
You can choose other header names for these columns. See example below.
Example:
Save the spreadsheet as either tab-separated format (.tsv) or comma-separated format (.csv). You can use the "Save As" menu item in Excel to do this.
Use the file browser to select and upload your library file.
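Before uploading, it can be useful to check the library file for formatting problems. The sketch below reads a tab-separated table with a header row and three columns and verifies the sequences. It is not part of PinAPL-Py: the function name `read_library` and the two library entries are hypothetical and for illustration only.

```python
import csv
import io


def read_library(handle, delimiter="\t", sgrna_length=20):
    """Read a 3-column library table (gene, sgRNA ID, sequence) with a header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    if len(header) != 3:
        raise ValueError(f"expected 3 columns, found {len(header)}")
    rows = []
    for gene, sgrna_id, sequence in reader:
        sequence = sequence.strip().upper()
        if len(sequence) != sgrna_length or set(sequence) - set("ACGT"):
            raise ValueError(f"unexpected sequence for {sgrna_id}: {sequence}")
        rows.append((gene, sgrna_id, sequence))
    return rows


# Hypothetical two-entry library (made-up names and sequences)
example = io.StringIO(
    "gene\tsgRNA_ID\tsequence\n"
    "GENE_A\tGENE_A_sg1\tACGGAGGCTAAGCGTCGCAA\n"
    "Non_Target\tNon_Target_sg1\tGTAGCGAACGTGTCCGGCGT\n"
)
library = read_library(example)
print(len(library), "sgRNAs read")
```

For a comma-separated file, pass `delimiter=","`; for a real file, pass an open file handle instead of the `io.StringIO` object.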
Next, specify the following parameters:
Enter the sequence of the 5’-adapter. Adapters are simply sequences lying 5’ or 3’ of the 20bp sgRNA. There is no restriction on the length of your adapter definition, but it is generally recommended to define the 20-25 bp immediately 5’ of the sgRNA sequence (see image below). It is also recommended to let the adapter sequence end in an ‘N’ to allow for possible mismatches (see example below). A sequence mapping program like SnapGene Viewer is helpful for defining the adapter. Definition of the 3’ adapter is not necessary.
Example: If your reads have the following structure
TCGAATCTTGTGGAAAGGACGAAACACCG ACGGAGGCTAAGCGTCGCAA GTTTTAG
you can, for example, define TCTTGTGGAAAGGACGAAACACCN as the 5’-adapter.
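To check an adapter definition against one of your reads, the example above can be reproduced with a short script. This sketch is a simplification: it matches the adapter exactly (with ‘N’ standing for any base), whereas cutadapt also tolerates mismatches and indels up to the adapter error rate. The function name `extract_sgrna` is hypothetical.

```python
import re


def extract_sgrna(read, adapter, sgrna_length=20):
    """Return the bases directly 3' of the adapter, or None if not found."""
    # 'N' in the adapter matches any base; all other positions must match exactly
    pattern = "".join("[ACGT]" if base == "N" else base for base in adapter)
    match = re.search(pattern, read)
    if match is None:
        return None
    start = match.end()
    sgrna = read[start:start + sgrna_length]
    return sgrna if len(sgrna) == sgrna_length else None


# Read structure from the example above: 5' flank, sgRNA, 3' flank
read = "TCGAATCTTGTGGAAAGGACGAAACACCG" "ACGGAGGCTAAGCGTCGCAA" "GTTTTAG"
adapter = "TCTTGTGGAAAGGACGAAACACCN"
print(extract_sgrna(read, adapter))
```

If the function returns the expected 20 bp sgRNA sequence, the adapter ends exactly where the sgRNA begins.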
Identifier for non-targeting controls:
If your library contains non-targeting controls, enter the identifier that marks them in the library spreadsheet. The identifier is a part of the gene_ID that is unique to the non-targeting controls (see example below). If your library does not contain non-targeting controls, enter “none”.
Example: An identifier in this case would be “Non_Target”.
Number of sgRNAs per gene:
Specifies the number of sgRNAs targeting a single gene (excluding non-targeting controls, miRNAs and other non-genes in your library).
2 Description of the PinAPL-Py Analysis output
The PinAPL-Py output is structured in logical order into tabs and subtabs on the results page. In addition, all output can be downloaded via the “Download Results Archive” button as a single .zip file. Images are saved both as high-resolution .png and as .svg vector graphics, which can be further processed in Adobe Illustrator or similar image processing software. Tables are saved as raw text (.tsv), but can be manually opened with Excel and saved as Excel spreadsheets. For convenience, PinAPL-Py can convert tables on the fly (see the “Table Format” parameter on the configuration page), at the cost of additional computation time.
NOTE for Windows users: To view text files (.txt/.tsv/.csv), Notepad++ is recommended.
This tab contains the results of the gene ranking analysis in a sortable table. The columns are:
- Gene: Name of gene (as defined in the library file)
- Gene Score: Value of the computed gene metric score. The gene metric is either the aRRA or the STARS score, as chosen on the configuration page
- Gene Score p-value: Estimated (one-sided) p-value of the gene score
- Gene Score FDR: Estimated false discovery rate of the achieved gene score
- significant: Statistical significance of the obtained gene metric score. Declared “True” if the FDR is smaller than the significance threshold, defined on the configuration page
- # sgRNAs: Number of sgRNAs targeting the particular gene
- # Signif. sgRNAs: Number of sgRNAs targeting the particular gene that reached statistical significance in the sgRNA ranking
- Avg. log FC: Average log10 fold-change of all sgRNAs targeting the particular gene
Results are sorted by number of significant sgRNAs by default.
This tab contains the results of the sgRNA enrichment/depletion analysis. The columns are:
- sgRNA: Identifier of sgRNA
- Gene: Name of target gene
- Counts: Normalized Read count
- Control mean: Average normalized read count in the control samples
- Control StDev: Standard deviation of normalized read counts in the control samples
- Fold Change: The ratio of normalized read counts in the sample to the control average
- p-value: p-value (one-sided) of the normalized read count
- FDR: False discovery rate of the normalized read count
- Significant: Statistical significance of the normalized read count. Declared “True” if the FDR is smaller than the significance threshold, defined on the configuration page
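The descriptive columns of this table follow directly from the normalized counts. The sketch below shows how one row could be assembled. It is an illustration, not the program's code: the function name `sgrna_row` is hypothetical, the FDR is taken as given, the sample standard deviation is used, and reporting an undefined ratio (control mean of 0) as NaN is an assumption.

```python
import math
from statistics import mean, stdev


def sgrna_row(sample_count, control_counts, fdr, alpha=0.001):
    """Assemble the descriptive columns for one sgRNA."""
    control_mean = mean(control_counts)
    control_sd = stdev(control_counts) if len(control_counts) > 1 else 0.0
    # Assumption: a control mean of 0 yields an undefined fold change (NaN)
    fold_change = sample_count / control_mean if control_mean > 0 else math.nan
    return {
        "Counts": sample_count,
        "Control mean": control_mean,
        "Control StDev": control_sd,
        "Fold Change": fold_change,
        "FDR": fdr,
        "Significant": fdr < alpha,  # True if the FDR is below the threshold
    }


print(sgrna_row(200, [90, 110], fdr=0.0004))
```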
This plot shows information about the overall efficacy of sgRNAs targeting the same gene. Genes are categorized by the number of targeting sgRNAs reaching statistical significance. Genes having no significant sgRNAs are omitted.
This tab contains various plots visualizing the fraction of sgRNAs and genes that reached statistical significance in the ranking.
- Gene Significance: The plot shows the distribution of p-values obtained in the gene ranking analysis, both before and after adjustment for multiple tests. In order for low p-values to be credible, this distribution should be noticeably different from a uniform distribution.
- sgRNA Significance: The plot shows the distribution of p-values obtained in the sgRNA ranking analysis, both before and after adjustment for multiple tests. In order for low p-values to be credible, this distribution should be noticeably different from a uniform distribution.
- sgRNA Volcano: The plot visualizes the fraction of sgRNAs whose fold change compared to the control yielded statistical significance. One-sided p-values are shown. p-values are capped at 1e-16 for technical purposes.
- sgRNA QQ: The plot visualizes the degree by which the p-values obtained from the sgRNA ranking analysis differ from a uniform distribution (=“expected p-values”). In order for low p-values to be credible, they should show noticeable distance from the dashed line. p-values are capped at 1e-16 for technical purposes.
- sgRNA z-Scores: The plot visualizes the fraction of sgRNAs whose z-Score (=normalized deviation from the mean read count) yielded statistical significance.
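The coordinates of a QQ plot like the one described above can be computed as in the following sketch. The function name `qq_points` is hypothetical, and the choice of (i - 0.5) / n as expected quantiles of the uniform distribution is an assumption; other conventions exist. The cap at 1e-16 prevents taking the logarithm of 0.

```python
import math


def qq_points(pvalues, floor=1e-16):
    """Return (expected, observed) pairs on the -log10 scale."""
    n = len(pvalues)
    observed = sorted(max(p, floor) for p in pvalues)  # cap tiny p-values
    expected = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


for exp_p, obs_p in qq_points([0.0, 0.2, 0.5, 0.9]):
    print(round(exp_p, 3), round(obs_p, 3))
```

Points lying far above the diagonal (observed much larger than expected on the -log10 scale) correspond to p-values that are smaller than a uniform distribution would produce.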
Read Count Distribution:
This tab contains information about the statistical distribution of sgRNA read counts.
- Lorenz curves and Gini coefficients: The Lorenz curve visualizes the distribution of reads, showing what fraction of sgRNAs/genes is represented by what fraction of reads. The Gini coefficient quantifies the difference of this distribution from a perfectly even distribution. A perfectly even distribution results in a diagonal curve (Gini coefficient = 0). An extremely uneven distribution (only a single sgRNA/gene is represented by all reads) results in a flat curve (Gini coefficient = 1). These statistics can serve as an indicator of the strength of selection in a sample.
- Boxplots, histograms and descriptive statistics: Boxplots and histograms for the read counts per sgRNA or gene, respectively. Outliers are omitted for visualization purposes. Descriptive statistics are summarized below. sgRNA/Gene Representation measures the number of sgRNAs/genes detected by at least one count in the sample (as percentage of the full library).
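The statistics in this tab can be reproduced from a list of read counts. The following is a minimal sketch with hypothetical function names, not the program's own code. It uses the standard formula for the Gini coefficient of a finite sample, whose maximum for n entries is (n - 1) / n and therefore only approaches 1 for large libraries.

```python
def lorenz_curve(counts):
    """Points (fraction of sgRNAs, cumulative fraction of reads), sorted by count."""
    ordered = sorted(counts)
    total = sum(ordered)
    points, cumulative = [(0.0, 0.0)], 0
    for i, c in enumerate(ordered, start=1):
        cumulative += c
        points.append((i / len(ordered), cumulative / total))
    return points


def gini(counts):
    """Gini coefficient: 0 for a perfectly even distribution."""
    ordered = sorted(counts)
    n, total = len(ordered), sum(ordered)
    weighted = sum(i * c for i, c in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def representation(counts):
    """Percentage of sgRNAs (or genes) detected by at least one count."""
    return 100 * sum(1 for c in counts if c >= 1) / len(counts)


print(gini([5, 5, 5, 5]))         # perfectly even
print(gini([0, 0, 0, 10]))        # all reads on a single sgRNA
print(representation([0, 3, 1, 0]))
```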
Read Count Dispersion:
This tab shows the distribution of read counts in the control samples. The data shown is used to estimate the parameters for the negative binomial distribution describing the read counts of each sgRNA.
- Read Count Overdispersion: This plot visualizes the degree of overdispersion in the data, i.e. the degree by which the variance of read counts exceeds the mean (as typically seen in next-generation sequencing datasets).
- Mean/Variance Model: This plot shows the computed regression line, which is used to estimate the dispersion, i.e. the relationship between read count variance and mean. The dispersion is needed to estimate the parameters for the negative binomial distributions of each sgRNA.
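Once a mean and a model-based variance are available for an sgRNA, the two parameters of its negative binomial distribution follow by the method of moments. The sketch below uses the (n, p) parametrization, in which the mean is n(1-p)/p and the variance is n(1-p)/p² (the same convention as `scipy.stats.nbinom`). The function name is hypothetical, and the text does not state which parametrization the program uses internally.

```python
def negative_binomial_params(mean, variance):
    """Method-of-moments parameters (n, p) of a negative binomial distribution."""
    if variance <= mean:
        # The negative binomial requires overdispersion (variance > mean)
        raise ValueError("variance must exceed the mean")
    p = mean / variance
    n = mean * mean / (variance - mean)
    return n, p


n, p = negative_binomial_params(mean=10, variance=20)
print(n, p)
# Check: the parameters reproduce the requested mean and variance
print(n * (1 - p) / p, n * (1 - p) / p**2)
```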
This tab summarizes the read alignment process.
- Mapping Quality: Histogram of the overall quality by which the reads mapped to the library. Reads that uniquely align to a single library sequence yield a high mapping quality score. Reads that ambiguously align to multiple library sequences or that do not align to any library sequence yield a low mapping quality score. For more detailed information about computation of the mapping quality score, please refer to the Bowtie2 manual.
- Alignment Analysis: Barplot showing the primary (best) and secondary (second-best) alignment scores achieved for each read. If a read uniquely aligns to only one library sequence, its primary alignment score will be high, and its secondary alignment score will be 0. If a read aligns ambiguously to multiple library sequences, its secondary alignment score will be close to its primary alignment score. If a read does not align to any library sequence, both its primary and secondary alignment scores will be 0. The fraction of reads marked in red is discarded.
This text file provides information about the success of the alignment, i.e. about the number of reads in each of the following fractions:
- Unique Alignments: The read aligns to only one library sequence.
- Alignments above ambiguity tolerance: The read aligns to more than one library sequence, but the difference between best and second-best alignment score is high enough to accept the best score.
- Alignments below ambiguity tolerance: The read aligns to more than one library sequence, but the difference between best and second-best alignment score is not high enough to safely assign the read to one particular library sequence.
- Failed Alignments: The read does not align to any library sequence.
This tab shows the log of the adapter trimming process, as reported by cutadapt. The output is explained in detail in the cutadapt manual.
This tab contains graphs for sequence quality control (produced by fastqc). For the full fastqc output, click the “See full report” link.
- Per Base Quality (upper left): This plot shows the quality distribution for every base position in the read. y-axis is sequence quality score (Phred).
- Per Sequence Quality (upper right): This plot shows a sequence quality histogram. y-axis shows number of reads. Preferably, sequence quality should peak at a score >= 35.
- GC Content (lower left): This plot shows a histogram of the GC content. y-axis shows number of reads.
- Per Base Sequence (lower right): This plot shows the fractions of T, C, A and G for every base position in the read. A balanced mix is typically only seen in the 20 bp sgRNA sequence.
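To interpret the quality scores in these plots: a Phred score Q corresponds to a base-calling error probability of 10^(-Q/10), so a score of 35 means roughly 3 errors in 10,000 bases. The helper names below are hypothetical, and decoding a FASTQ quality string with an offset of 33 assumes the common Phred+33 encoding.

```python
def phred_to_error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def quality_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


print(phred_to_error_probability(35))
print(quality_scores("II5"))
```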
This tab shows the sequencing depth (number of total reads) per sample. Results from the alignment analysis are superimposed on each bar.
2.3 Scatter Plots
Treatment vs Control:
Scatterplots of normalized sgRNA counts in the sample versus the average normalized count in the controls. The fraction reaching significant enrichment/depletion (dependent on screen type) compared to the control is marked in green.
Scatterplots showing the normalized sgRNA counts in one replicate of each condition versus another. Pearson and Spearman correlation coefficients are reported.
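The two correlation coefficients measure different things: Pearson quantifies linear agreement of the counts themselves, while Spearman only compares their rank order. The dependency-free sketch below illustrates the difference. The function names are hypothetical, and ties receive average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


rep1 = [1, 2, 3, 4]
rep2 = [1, 4, 9, 16]   # same rank order, but not a linear relationship
print(pearson(rep1, rep2), spearman(rep1, rep2))
```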
Clustering of all samples in the dataset, based on the most variable or most abundant/depleted sgRNAs (as set up on the configuration page). Log10 normalized read counts are color-coded from lowest (yellow) to highest (red).
2.5 Run Info
This shows the program execution log.
This file shows the parameter settings used in the run.
This table links file names and sample names (replicates of the same condition are automatically numbered).
Anders,S. and Huber,W. (2010) Differential expression analysis for sequence count data. Genome Biol., 11, R106.
Doench,J.G. et al. (2016) Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol., 34, 184–191.
Li,W. et al. (2014) MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol., 1–12.