MGH   
CCIB
 
 
CRISPR Sequencing Results: File Format

For CRISPR sequencing, the NGS data are analyzed as described here. This page introduces the format of the files that are delivered. For demonstration, we use results for a subset of the public dataset "Danio rerio CRISPR amplicon sequencing set1 MiSeq" (SRR1586614). We pretend that these data come from CRISPR Sequencing order 1586614r in our system, that the tube ID is SSR, and that we processed the sample in our run 987654w with internal ID ZA12. For each sample, three files are delivered in a folder, in this case named 1586614r_year_month_day_CRISPR_Sequencing.

  • a FASTQ file:
    This file, named SSR_987654w_ZA12.fastq, contains the raw data. You can analyze FASTQ data using any compatible NGS software. Please note that the paired-end reads are stored in this single FASTQ file in interleaved order: the entries for the first and second read of each pair alternate, and the first read of each pair comes before the second.

  • a FASTA file:
    This file, named SSR_987654w_ZA12_CRISPR_variants.seq, is a text file. The first few lines are displayed here:

    
    >NCBI_gi_157310886_270_p1
    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
    TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA
    GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
    AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
    TAGCCTGAATGCAAGACCCA
    >SSR_ZA12_CONTIG_270_p1    26739 pairs of NGS reads, 50.23%
    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
    TGAGAGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgCTCCAAA
    GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA
    AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA
    TAGCCTGAATGCAAGACCCA
    >SSR_ZA12_CONTIG_268_p2    7334 pairs of NGS reads, 13.77%
    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
    TGAGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgcTCCAAAGA
    GCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACAAG
    CACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAATA
    GCCTGAATGCAAGACCCA
    >SSR_ZA12_CONTIG_266_p3    6678 pairs of NGS reads, 12.54%
    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC
    TGACCGTCTGCACTCCGCTGAGCAagagaacctcaaacgctcCAAAGAGC
    TCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACAAGCA
    CTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAATAGC
    CTGAATGCAAGACCCA
    >SSR_ZA12_CONTIG_227_p4    3061 pairs of NGS reads, 5.75%
    TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT
    TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCgtgagtttctggccc
    tgagtttcagctcaacctggtgctggacgaaatcaagagagcCATCGCTG
    AGAAACAAGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGT
    GAGAGAATAGCCTGAATGCAAGACCCA
    
    

    When we can find a high-quality genomic sequence matching the CRISPR sequences, it is included as the first entry. Each of the other sequences in this file is an algorithmically called CRISPR variant. The number of NGS read pairs matched to each variant is given in the corresponding FASTA header. If the amplicon is short enough, the middle region of the sequence is covered by both the forward and the reverse reads. This overlap region is shown in lower-case letters (colored red in the above display but not in the text files).

  • a Multiple Alignment file:
    This file, named SSR_ZA12.seq_aln.txt, is a text file, but you need to open it in a whitespace-friendly editor (such as Microsoft WordPad or any popular web browser) so that the columns stay aligned. Don't open it in Notepad or Microsoft Word. Again, only a part of the file is displayed below:

    
    CLUSTAL 2.1 multiple sequence alignment
    
    
    NCBI_gi_157310886_270_p1      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
    SSR_ZA12_CONTIG_270_p1        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
    SSR_ZA12_CONTIG_268_p2        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
    SSR_ZA12_CONTIG_266_p3        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
    SSR_ZA12_CONTIG_227_p4        TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
                                  **************************************************
    
    NCBI_gi_157310886_270_p1      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
    SSR_ZA12_CONTIG_270_p1        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
    SSR_ZA12_CONTIG_268_p2        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
    SSR_ZA12_CONTIG_266_p3        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
    SSR_ZA12_CONTIG_227_p4        TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
                                  **************************************************
    
    NCBI_gi_157310886_270_p1      TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
    SSR_ZA12_CONTIG_270_p1        TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
    SSR_ZA12_CONTIG_268_p2        TGA--GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 148
    SSR_ZA12_CONTIG_266_p3        T----GACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 146
    SSR_ZA12_CONTIG_227_p4        T------------------------------GAGTTTC------------ 108
                                  *                              **   **            
    
    NCBI_gi_157310886_270_p1      GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
    SSR_ZA12_CONTIG_270_p1        GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
    SSR_ZA12_CONTIG_268_p2        GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 198
    SSR_ZA12_CONTIG_266_p3        GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 196
    SSR_ZA12_CONTIG_227_p4        -AGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 157
                                   *************************************************
    
    NCBI_gi_157310886_270_p1      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
    SSR_ZA12_CONTIG_270_p1        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
    SSR_ZA12_CONTIG_268_p2        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 248
    SSR_ZA12_CONTIG_266_p3        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 246
    SSR_ZA12_CONTIG_227_p4        AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 207
                                  **************************************************
    
    NCBI_gi_157310886_270_p1      TAGCCTGAATGCAAGACCCA 270
    SSR_ZA12_CONTIG_270_p1        TAGCCTGAATGCAAGACCCA 270
    SSR_ZA12_CONTIG_268_p2        TAGCCTGAATGCAAGACCCA 268
    SSR_ZA12_CONTIG_266_p3        TAGCCTGAATGCAAGACCCA 266
    SSR_ZA12_CONTIG_227_p4        TAGCCTGAATGCAAGACCCA 227
                                  ********************
                                  
    
    

    CRISPR/Cas9 system-induced indel mutations, if any, are quite easy to spot in the multiple alignments. They are shown in red in the above example (but not highlighted in the text file).
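
The variant FASTA file described in the list above is simple enough to read with a few lines of code. The following is a minimal Python sketch, not the code used in our pipeline: the names parse_variants and Variant are hypothetical, and the header pattern is assumed from the example headers shown above, so adjust it if your headers differ.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Header layout assumed from the example above, e.g.
#   >SSR_ZA12_CONTIG_270_p1    26739 pairs of NGS reads, 50.23%
# The genomic reference entry has a name only, so the rest is optional.
HEADER_RE = re.compile(r"^>(\S+)(?:\s+(\d+) pairs of NGS reads, ([\d.]+)%)?")


class Variant(NamedTuple):
    name: str
    read_pairs: Optional[int]            # None for the reference entry
    percent: Optional[float]             # None for the reference entry
    sequence: str                        # case preserved
    overlap: Optional[Tuple[int, int]]   # 0-based, half-open lower-case span


def _build(match, chunks):
    """Assemble one record from a matched header and its sequence lines."""
    seq = "".join(chunks)
    lower = [i for i, base in enumerate(seq) if base.islower()]
    pairs, pct = match.group(2), match.group(3)
    return Variant(
        name=match.group(1),
        read_pairs=int(pairs) if pairs else None,
        percent=float(pct) if pct else None,
        sequence=seq,
        overlap=(lower[0], lower[-1] + 1) if lower else None,
    )


def parse_variants(text):
    """Parse the text of a *_CRISPR_variants.seq file into Variant records."""
    variants, match, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if match:
                variants.append(_build(match, chunks))
            match, chunks = HEADER_RE.match(line), []
        elif match:
            chunks.append(line)
    if match:
        variants.append(_build(match, chunks))
    return variants
```

Passing the contents of SSR_987654w_ZA12_CRISPR_variants.seq to parse_variants would give one record per entry. The overlap field is None when a sequence contains no lower-case letters, as is the case for the genomic reference entry.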


The above case assumes that the amplicons are shorter than twice the read length. When the amplicons are longer, the middle of the amplicons is not covered by NGS reads. Instead of an overlap region in lower-case letters, five Ns (NNNNN) are inserted to mark the uncertainty of the sequence. In general, "indels" next to the NNNNN should be ignored, since sequence quality in this region is suboptimal. You can take a look at these: SSR_987654w_ZA11_CRISPR_variants.seq and SSR_ZA11.seq_aln.txt. These are the same data, but trimmed to 130 bp. The "indels" (in red) in contig SSR_11_CONTIG_265_p1 are not real.


CLUSTAL 2.1 multiple sequence alignment


NCBI_gi_157310886_270_p1      TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_265_p1          TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_227_p2          TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
SSR_11_CONTIG_220_p3          TCTATACAAGTGTGGGTTTTAAAACCAACAATTAATAGAGTTCTGTGTCT 50
                              **************************************************

NCBI_gi_157310886_270_p1      TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_265_p1          TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_227_p2          TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
SSR_11_CONTIG_220_p3          TCTGCTTGTAGGTAATGTGGTGGACATTTACCAGCGTGAGTTTCTGGCCC 100
                              **************************************************

NCBI_gi_157310886_270_p1      TGAGAGACCGTCTGCACTCCGCTGAGCAAGAGAACCTCAAACGCTCCAAA 150
SSR_11_CONTIG_265_p1          TGAGAGACCGTCTGCACTCCGCTGAGCAAG-----NNNNNACGCTCCAAA 145
SSR_11_CONTIG_227_p2          TGAG------------------------------TTTC------------ 108
SSR_11_CONTIG_220_p3          TGA----------------------------------C------------ 104
                              ***                                               

NCBI_gi_157310886_270_p1      GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 200
SSR_11_CONTIG_265_p1          GAGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 195
SSR_11_CONTIG_227_p2          -AGCTCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 157
SSR_11_CONTIG_220_p3          ----TCAACCTGGTGCTGGACGAAATCAAGAGAGCCATCGCTGAGAAACA 150
                                  **********************************************

NCBI_gi_157310886_270_p1      AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 250
SSR_11_CONTIG_265_p1          AGCACTGCGGGACACCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 245
SSR_11_CONTIG_227_p2          AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 207
SSR_11_CONTIG_220_p3          AGCACTGCGGGACATCAACCGTACCTGGAGCAGCCTGTCAGGTGAGAGAA 200
                              ************** ***********************************

NCBI_gi_157310886_270_p1      TAGCCTGAATGCAAGACCCA 270
SSR_11_CONTIG_265_p1          TAGCCTGAATGCAAGACCCA 265
SSR_11_CONTIG_227_p2          TAGCCTGAATGCAAGACCCA 227
SSR_11_CONTIG_220_p3          TAGCCTGAATGCAAGACCCA 220
                              ********************
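
To apply the rule of thumb above programmatically, one option is to concatenate the alignment blocks for each sequence and list the gap runs in each contig, marking those that lie close to an N as unreliable. The Python sketch below does this under stated assumptions: the names read_clustal and gap_runs are hypothetical, the flank distance of 5 columns is an arbitrary illustrative choice rather than a value used by our pipeline, and only deletions (gaps in the contig row) are reported, not insertions or substitutions.

```python
import re

# Characters expected in an aligned nucleotide row (an assumption based on
# the examples above: bases, N, lower-case overlap letters, and gaps).
ALPHABET = set("ACGTNacgtn-")


def read_clustal(text):
    """Return {name: aligned sequence} by concatenating the alignment blocks.

    A data row is recognized by its second whitespace-separated field
    consisting only of alignment characters; the CLUSTAL title line and the
    '*' conservation rows do not satisfy this and are skipped.
    """
    aligned = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and set(parts[1]) <= ALPHABET:
            aligned[parts[0]] = aligned.get(parts[0], "") + parts[1]
    return aligned


def gap_runs(aligned_seq, flank=5):
    """List (start, end, reliable) for each run of '-' in an aligned row.

    start and end are 0-based, half-open alignment columns. A run is marked
    unreliable (False) when an N occurs within `flank` columns of it.
    """
    runs = []
    for m in re.finditer(r"-+", aligned_seq):
        window = aligned_seq[max(0, m.start() - flank): m.end() + flank]
        runs.append((m.start(), m.end(), "N" not in window.upper()))
    return runs
```

For the third block of the alignment above, the row for SSR_11_CONTIG_265_p1 contains a single gap run directly next to the NNNNN, which this sketch would mark as unreliable, whereas the gap runs in SSR_11_CONTIG_227_p2 have no N nearby and would be kept.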