For CRISPR Sequencing, the NGS data are analyzed as described here. Below we introduce the format of the delivered files. For demonstration, we use results for a subset of the public dataset "Danio rerio CRISPR amplicon sequencing set1 MiSeq" (SRR1586614). We will pretend these data come from CRISPR Sequencing order 1586614r in our system, with the tube ID assigned as SRR, and that the sample was processed in our run 987654w with internal ID ZA12. Three files are delivered for each sample. They are placed in a folder, in this case named 1586614r_year_month_day_CRISPR_Sequencing.
a FASTQ file:
This file, named SSR_987654w_ZA12.fastq, represents the raw data.
You can analyze FASTQ data using any compatible NGS data software. Please note that paired-end reads can be read from a single FASTQ file in which the entries for the first and second read from each pair alternate. The first read in each pair comes before the second.
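The alternating layout described above can be read pair by pair with a few lines of Python. This is only a minimal sketch: it assumes standard four-line FASTQ records without line wrapping, and the function name `read_interleaved_pairs` and the toy reads are made up for illustration.

```python
import io
from itertools import islice

def read_interleaved_pairs(handle):
    """Yield (read1, read2) from an interleaved FASTQ stream.

    Each read is a (header, sequence, quality) tuple. Assumes four-line
    records in which the first read of a pair precedes the second.
    """
    # Drop blank lines so a trailing newline does not look like a record.
    records = (line.rstrip("\r\n") for line in handle if line.strip())
    while True:
        lines = list(islice(records, 8))  # 4 lines per read, 2 reads per pair
        if not lines:
            return
        if len(lines) != 8:
            raise ValueError("stream ends in the middle of a read pair")
        yield (lines[0], lines[1], lines[3]), (lines[4], lines[5], lines[7])

# Toy input (not real data): two pairs, first and second reads alternating.
example = io.StringIO(
    "@pair1/1\nACGT\n+\nIIII\n"
    "@pair1/2\nTTGA\n+\nIIII\n"
    "@pair2/1\nGGCC\n+\nIIII\n"
    "@pair2/2\nAATT\n+\nIIII\n"
)
pairs = list(read_interleaved_pairs(example))
for read1, read2 in pairs:
    print(read1[0], read2[0])
```

To use it on a delivered file, pass an open file handle instead of the `io.StringIO` object, for example `open("SSR_987654w_ZA12.fastq")`, provided that file is in fact interleaved.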
a FASTA file:
This file, named SSR_987654w_ZA12_CRISPR_variants.seq, is a text file.
The first few lines are displayed here:
>NCBI_gi_157310886_270_p1
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA
GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
TAGCCTGAATGCAAGACCCA
>SSR_ZA12_CONTIG_270_p1 26739 pairs of NGS reads, 50.23%
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGAGAGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgCTCCAAA
GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
TAGCCTGAATGCAAGACCCA
>SSR_ZA12_CONTIG_268_p2 7334 pairs of NGS reads, 13.77%
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGAGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgcTCCAAAGA
GCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACAAG
CACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAATA
GCCTGAATGCAAGACCCA
>SSR_ZA12_CONTIG_266_p3 6678 pairs of NGS reads, 12.54%
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgctcCAAAGAGC
TCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACAAGCA
CTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAATAGC
CTGAATGCAAGACCCA
>SSR_ZA12_CONTIG_227_p4 3061 pairs of NGS reads, 5.75%
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCgtgagtttctggccc
tgagtttcagctcaacctggtgctggacgaaatcaagagagcCATCGCTG
AGAAACAAGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGT
GAGAGAATAGCCTGAATGCAAGACCCA
When we can find a high-quality genomic sequence matching the CRISPR sequences, it is included as the first entry. Each of the other sequences in this file is an algorithmically called CRISPR variant. The number of NGS read pairs matched to each variant is given in the corresponding FASTA header. If the amplicon is short enough, the middle region of the sequence is covered by both the forward and the reverse reads. This overlap region is shown in lower-case letters (colored red in the above display but not in the text files).
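The read-pair counts in the headers and the lower-case overlap region can also be pulled out programmatically. The following Python sketch assumes the headers follow exactly the pattern shown in the example above. The names `parse_variants` and `overlap_span` are hypothetical helpers, not part of our analysis pipeline.

```python
import re

# Matches the entry name and, when present, "<N> pairs of NGS reads, <P>%".
HEADER_RE = re.compile(r"^>(\S+)(?:\s+(\d+) pairs of NGS reads, ([\d.]+)%)?")

def parse_variants(text):
    """Parse FASTA text into a list of dicts (name, pairs, percent, seq)."""
    variants = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            m = HEADER_RE.match(line)
            variants.append({
                "name": m.group(1),
                "pairs": int(m.group(2)) if m.group(2) else None,
                "percent": float(m.group(3)) if m.group(3) else None,
                "seq": "",
            })
        elif variants:
            variants[-1]["seq"] += line  # join wrapped sequence lines
    return variants

def overlap_span(seq):
    """Return (start, end) of the first lower-case run, 0-based, end exclusive."""
    m = re.search(r"[acgtn]+", seq)
    return (m.start(), m.end()) if m else None

# The first two entries of the example file shown above.
fasta_text = """>NCBI_gi_157310886_270_p1
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA
GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
TAGCCTGAATGCAAGACCCA
>SSR_ZA12_CONTIG_270_p1 26739 pairs of NGS reads, 50.23%
TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
TGAGAGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgCTCCAAA
GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
TAGCCTGAATGCAAGACCCA
"""

variants = parse_variants(fasta_text)
for v in variants:
    print(v["name"], v["pairs"], v["percent"], len(v["seq"]), overlap_span(v["seq"]))
```

The reference entry carries no read count, so its `pairs` and `percent` fields are left as `None`.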
a Multiple Alignment file:
This file, named SSR_ZA12.seq_aln.txt, is a text file, but you need to open it in a whitespace-friendly editor (such as Microsoft WordPad or any popular web browser). Don't open it in Notepad or Microsoft Word.
Again, only a part of the file is displayed below:
CLUSTAL 2.1 multiple sequence alignment

NCBI_gi_157310886_270_p1    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_ZA12_CONTIG_270_p1      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_ZA12_CONTIG_268_p2      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_ZA12_CONTIG_266_p3      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_ZA12_CONTIG_227_p4      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
                            **************************************************

NCBI_gi_157310886_270_p1    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_ZA12_CONTIG_270_p1      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_ZA12_CONTIG_268_p2      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_ZA12_CONTIG_266_p3      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_ZA12_CONTIG_227_p4      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
                            **************************************************

NCBI_gi_157310886_270_p1    TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
SSR_ZA12_CONTIG_270_p1      TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
SSR_ZA12_CONTIG_268_p2      TGA--GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 148
SSR_ZA12_CONTIG_266_p3      T----GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 146
SSR_ZA12_CONTIG_227_p4      T------------------------------GAGTTTC------------ 108
                            *                              **   **

NCBI_gi_157310886_270_p1    GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
SSR_ZA12_CONTIG_270_p1      GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
SSR_ZA12_CONTIG_268_p2      GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 198
SSR_ZA12_CONTIG_266_p3      GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 196
SSR_ZA12_CONTIG_227_p4      -AGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 157
                             *************************************************

NCBI_gi_157310886_270_p1    AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
SSR_ZA12_CONTIG_270_p1      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
SSR_ZA12_CONTIG_268_p2      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 248
SSR_ZA12_CONTIG_266_p3      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 246
SSR_ZA12_CONTIG_227_p4      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 207
                            **************************************************

NCBI_gi_157310886_270_p1    TAGCCTGAATGCAAGACCCA 270
SSR_ZA12_CONTIG_270_p1      TAGCCTGAATGCAAGACCCA 270
SSR_ZA12_CONTIG_268_p2      TAGCCTGAATGCAAGACCCA 268
SSR_ZA12_CONTIG_266_p3      TAGCCTGAATGCAAGACCCA 266
SSR_ZA12_CONTIG_227_p4      TAGCCTGAATGCAAGACCCA 227
                            ********************
CRISPR/Cas9 system-induced indel mutations, if any, are quite easy to spot in the multiple alignments. They are shown in red in the above example (but not highlighted in the text file).
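Because the text file carries no highlighting, it can help to locate the gap characters with a script. The sketch below uses plain Python rather than a dedicated alignment library, and it assumes the layout shown above: a name, an aligned chunk, and a running count on each sequence line. The helper names `parse_clustal` and `gap_runs` are hypothetical.

```python
import re

def parse_clustal(text):
    """Join the aligned chunks of each sequence across alignment blocks."""
    rows = {}
    for line in text.splitlines():
        # Skip the title, blank lines, and conservation lines (only * : . and spaces).
        if line.startswith("CLUSTAL") or set(line.strip()) <= set("*:. "):
            continue
        parts = line.split()
        if len(parts) >= 2:
            name, chunk = parts[0], parts[1]  # parts[2], if present, is the count
            rows[name] = rows.get(name, "") + chunk
    return rows

def gap_runs(aligned):
    """Return (start_column, length) for every run of '-' in an aligned row."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned)]

# Excerpt: the third block of the alignment above, three rows only.
excerpt = """CLUSTAL 2.1 multiple sequence alignment

NCBI_gi_157310886_270_p1    TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
SSR_ZA12_CONTIG_268_p2      TGA--GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 148
SSR_ZA12_CONTIG_266_p3      T----GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 146
"""

rows = parse_clustal(excerpt)
gaps = {name: gap_runs(seq) for name, seq in rows.items()}
for name, runs in gaps.items():
    print(name, runs)
```

Gaps in a variant row correspond to deletions relative to the reference, and gaps in the reference row would correspond to insertions. Column numbers here are 0-based and relative to the start of the parsed text, so for this excerpt column 0 is alignment column 101.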
The example above assumes amplicons that are shorter than twice the read length. When the amplicons are longer, the middle of the amplicon is not covered by NGS reads. Instead of an overlap region in lower-case letters, five Ns (NNNNN) are inserted to mark the uncertain sequence. In general, "indels" next to the NNNNN should be ignored, since sequence quality in this region is suboptimal. For an example, see SSR_987654w_ZA11_CRISPR_variants.seq and SSR_ZA11.seq_aln.txt, which contain the same data with the reads trimmed to 130 bp. The "indels" (in red) in contig SSR_11_CONTIG_265_p1 are not real.
CLUSTAL 2.1 multiple sequence alignment

NCBI_gi_157310886_270_p1    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_265_p1        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_227_p2        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_220_p3        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
                            **************************************************

NCBI_gi_157310886_270_p1    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_265_p1        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_227_p2        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_220_p3        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
                            **************************************************

NCBI_gi_157310886_270_p1    TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
SSR_11_CONTIG_265_p1        TGAGAGACCGTCTGCACTCCGCTGAGCAAG-----NNNNNACGCTCCAAA 145
SSR_11_CONTIG_227_p2        TGAG------------------------------TTTC------------ 108
SSR_11_CONTIG_220_p3        TGA----------------------------------C------------ 104
                            ***

NCBI_gi_157310886_270_p1    GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
SSR_11_CONTIG_265_p1        GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 195
SSR_11_CONTIG_227_p2        -AGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 157
SSR_11_CONTIG_220_p3        ----TCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 150
                                **********************************************

NCBI_gi_157310886_270_p1    AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
SSR_11_CONTIG_265_p1        AGCACTGCGGGACACCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 245
SSR_11_CONTIG_227_p2        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 207
SSR_11_CONTIG_220_p3        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 200
                            ************** ***********************************

NCBI_gi_157310886_270_p1    TAGCCTGAATGCAAGACCCA 270
SSR_11_CONTIG_265_p1        TAGCCTGAATGCAAGACCCA 265
SSR_11_CONTIG_227_p2        TAGCCTGAATGCAAGACCCA 227
SSR_11_CONTIG_220_p3        TAGCCTGAATGCAAGACCCA 220
                            ********************
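The arithmetic behind this case can be checked quickly: two 130 bp reads cover at most 260 bp of the 270 bp reference, so some bases in the middle are not covered. This is consistent with a contig of 130 + 5 + 130 = 265 bp once NNNNN is inserted. The Python sketch below works through that calculation and flags gap runs that sit next to a run of Ns. The helper names and the 5-column `window` are assumptions made for illustration, not parameters of our pipeline.

```python
import re

def read_overlap(amplicon_len, read_len):
    """Bases covered by both reads of a pair; a negative value is an uncovered gap."""
    return 2 * read_len - amplicon_len

def unreliable_gaps(aligned_row, window=5):
    """Return (start_column, length) of gap runs within `window` columns of an N run."""
    n_runs = [(m.start(), m.end()) for m in re.finditer(r"N+", aligned_row)]
    flagged = []
    for gap in re.finditer(r"-+", aligned_row):
        for n_start, n_end in n_runs:
            # Flag the gap if it lies within `window` columns of the N run.
            if gap.start() <= n_end + window and gap.end() >= n_start - window:
                flagged.append((gap.start(), len(gap.group())))
                break
    return flagged

# 270 bp reference, reads trimmed to 130 bp: 10 bases are covered by neither read.
print(read_overlap(270, 130))

# Third-block rows of SSR_11_CONTIG_265_p1 and SSR_11_CONTIG_227_p2 from above.
row_p1 = "TGAGAGACCGTCTGCACTCCGCTGAGCAAG-----NNNNNACGCTCCAAA"
row_p2 = "TGAG------------------------------TTTC------------"
print(unreliable_gaps(row_p1))
print(unreliable_gaps(row_p2))
```

Only the gap in SSR_11_CONTIG_265_p1 is flagged, because it directly precedes the NNNNN. The gaps in the other contig are left for normal interpretation.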