MGH   
CCIB
 
Complete Amplicon Sequencing Results: File Format

Please note that the file format depends on the length of the submitted PCR amplicons (or DNA fragments).

For PCR amplicons longer than 600bp, results are provided in exactly the same format as described for our Complete Plasmid Sequencing service. Assuming that the submitted sample contains only one highly abundant PCR amplicon, we try to assemble the NGS reads into a single contig. If the submitted sample contains multiple amplicons, they must not share significant sequence similarity, as this could make the assembly results extremely difficult to interpret.

For PCR amplicons shorter than 600bp, we try to detect all variants present at a frequency above 1%. The results are presented in a format similar to our CRISPR Sequencing results, with slight modifications.

The following introduces the format of the delivered files. For demonstration, we use results for a subset of the public dataset "Danio rerio CRISPR amplicon sequencing set1 MiSeq" (SRR1586614). We will pretend these data come from a Complete Amplicon Sequencing order, named 1586614a in our system, with the tube ID SSR. We assume that we processed the sample in our run 987654w, with the internal ID ZA12. For each sample, the following files will be delivered in a folder, in this case named 1586614a_year_month_day_Complete_Amplicon_Sequencing.
  • a FASTQ file:
    This file, named SSR_987654w_ZA12.fastq, contains the raw data. You can analyze FASTQ data using any compatible NGS data software. Please note that paired-end reads can be read from a single FASTQ file in which the entries for the first and second read of each pair alternate, with the first read of each pair always preceding the second.

  • a FASTA file:
    This file, named SSR_987654w_ZA12_Amplicons.seq, is a text file. The first few lines are displayed here:

    
    
    >SSR_ZA12_CONTIG_271_p1    24410 pairs of NGS reads, 47.61%  plus 32784 pairs with SNPs
    TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTAC
    GGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGATTTCG
    TCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGC
    GGAGTGCAGACGGTCTCTCAGGGCCAGAAACTCACGCTGGTAAATGTCCA
    CCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTT
    AAAACCCACACTTGTATAGAA
    >SSR_ZA12_CONTIG_269_p2    7069 pairs of NGS reads, 13.78%  plus 11520 pairs with SNPs
    TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTAC
    GGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGATTTCG
    TCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGC
    GGAGTGCAGACGGTCTCAGGGCCAGAAACTCACGCTGGTAAATGTCCACC
    ACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAA
    AACCCACACTTGTATAGAA
    >SSR_ZA12_CONTIG_267_p3    6482 pairs of NGS reads, 12.64%  plus 9740 pairs with SNPs
    TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTAC
    GGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGATTTCG
    TCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGC
    GGAGTGCAGACGGTCAGGGCCAGAAACTCACGCTGGTAAATGTCCACCAC
    ATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAA
    CCCACACTTGTATAGAA
    >SSR_ZA12_CONTIG_261_p4    2873 pairs of NGS reads, 5.6%  plus 3931 pairs with SNPs
    TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTAC
    GGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGATTTCG
    TCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGC
    GGAGTGCAGACGGCCAGAAACTCACGCTGGTAAATGTCCACCACATTACC
    TACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACA
    CTTGTATAGAA
    
    

  • a Multiple Alignment file:
    This file, named SSR_987654w_ZA12.seq_aln.txt, is a text file, but it should be opened in an editor that preserves whitespace (such as Microsoft WordPad or any popular web browser) so that the alignment columns stay lined up. Do not open it in Notepad or Microsoft Word. Again, only part of the file is displayed below:

    
    CLUSTAL O(1.2.0) multiple sequence alignment
    
    
    SSR_ZA12_CONTIG_271_p1  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_269_p2  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_267_p3  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_261_p4  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_269_p5  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_228_p6  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_268_p7  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_259_p8  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_270_p9  TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
    SSR_ZA12_CONTIG_273_p10 TGGGTCTTGCATTCAGGCTATTCTCTCACCTGACAGGCTGCTCCAGGTACGGTTGATGTCCCGCAGTGCTTGTTTCTCAGCGATGGCTCTCTTGA	95
                            ***********************************************************************************************
    
    SSR_ZA12_CONTIG_271_p1  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGTCTCTC--AGGGCCAGAAACTCACGCT	188
    SSR_ZA12_CONTIG_269_p2  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGTCTCA----GGGCCAGAAACTCACGCT	186
    SSR_ZA12_CONTIG_267_p3  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGT----C--AGGGCCAGAAACTCACGCT	184
    SSR_ZA12_CONTIG_261_p4  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGAC------------GGCCAGAAACTCACGCT	178
    SSR_ZA12_CONTIG_269_p5  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGTCTCA----AGGCCAGAAACTCACGCT	186
    SSR_ZA12_CONTIG_228_p6  TTTCGTCCAGCACCAGGTTGAGCTGAA---------------------------------------------ACTCAGGGCCAGAAACTCACGCT	145
    SSR_ZA12_CONTIG_268_p7  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGT---GC--AGGGCCAGAAACTCACGCT	185
    SSR_ZA12_CONTIG_259_p8  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCA--------------GGGCCAGAAACTCACGCT	176
    SSR_ZA12_CONTIG_270_p9  TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGTCTT-C--AGGGCCAGAAACTCACGCT	187
    SSR_ZA12_CONTIG_273_p10 TTTCGTCCAGCACCAGGTTGAGCTCTTTGGAGCGTTTGAGGTTCTCTTGCTCAGCGGAGTGCAGACGGTCTCAGAAAGGGCCAGAAACTCACGCT	190
                            ************************ ::                                                   *****************
    
    SSR_ZA12_CONTIG_271_p1  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	271
    SSR_ZA12_CONTIG_269_p2  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	269
    SSR_ZA12_CONTIG_267_p3  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	267
    SSR_ZA12_CONTIG_261_p4  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	261
    SSR_ZA12_CONTIG_269_p5  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	269
    SSR_ZA12_CONTIG_228_p6  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	228
    SSR_ZA12_CONTIG_268_p7  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	268
    SSR_ZA12_CONTIG_259_p8  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	259
    SSR_ZA12_CONTIG_270_p9  GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	270
    SSR_ZA12_CONTIG_273_p10 GGTAAATGTCCACCACATTACCTACAAGCAGAAGACACAGAACTCTATTAATTGTTGGTTTTAAAACCCACACTTGTATAGAA	273
                            ***********************************************************************************
    
    
  • a Coverage file:
    This file, named SSR_987654w_ZA12_coverage.xlsx, is an Excel file providing depth and coverage information for each individual base, which is valuable for judging the sequence quality at that position. If two potential contigs differ by only 1 to 3 SNPs, the two sequences are consolidated. The positions of the possible SNPs are indicated in the Excel file, though most of these "SNPs" are evidently due to sequencing errors. A descriptive example of a coverage file can be reviewed here. That example is specific to our Complete Plasmid Sequencing service; the corresponding Complete Amplicon Sequencing file will look very similar.
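
The two plain-text sequence files described above can be read with a few lines of Python. The following is a minimal sketch, not part of the delivered software. It assumes that every header in the _Amplicons.seq file follows the layout seen in the example, that the number after CONTIG_ is the contig length (inferred from the example alignment, not stated by the service), and that each FASTQ record occupies exactly four lines. The names Variant, parse_header, read_variants and interleaved_pairs are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Variant(NamedTuple):
    name: str        # e.g. SSR_ZA12_CONTIG_271_p1
    length: int      # number after CONTIG_ (assumed to be the contig length)
    rank: int        # number after _p
    read_pairs: int  # "N pairs of NGS reads"
    percent: float   # percentage printed after the read-pair count
    snp_pairs: int   # "plus N pairs with SNPs"


# Header layout inferred from the example file shown earlier.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+_CONTIG_(?P<length>\d+)_p(?P<rank>\d+))\s+"
    r"(?P<pairs>\d+) pairs of NGS reads, (?P<pct>[\d.]+)%\s+"
    r"plus (?P<snp>\d+) pairs with SNPs\s*$"
)


def parse_header(line: str) -> Variant:
    """Split one FASTA header line into its fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unexpected header layout: {line!r}")
    return Variant(
        name=m.group("name"),
        length=int(m.group("length")),
        rank=int(m.group("rank")),
        read_pairs=int(m.group("pairs")),
        percent=float(m.group("pct")),
        snp_pairs=int(m.group("snp")),
    )


def read_variants(lines: Iterable[str]) -> List[Tuple[Variant, str]]:
    """Return (header fields, joined sequence) for every FASTA entry."""
    variants: List[Tuple[Variant, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                variants.append((header, "".join(chunks)))
            header, chunks = parse_header(line), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        variants.append((header, "".join(chunks)))
    return variants


Record = Tuple[str, str, str, str]  # the four lines of one FASTQ entry


def interleaved_pairs(handle: Iterable[str]) -> Iterator[Tuple[Record, Record]]:
    """Yield (first read, second read) from an interleaved FASTQ file.

    Assumes four lines per record; an incomplete trailing record or an
    unpaired final read is silently dropped.
    """
    lines = (line.rstrip("\n") for line in handle)
    records = zip(lines, lines, lines, lines)  # group lines in fours
    return zip(records, records)               # group records in twos


if __name__ == "__main__":
    with open("SSR_987654w_ZA12_Amplicons.seq") as fasta:
        for variant, sequence in read_variants(fasta):
            print(variant.name, variant.percent, len(sequence))
```

Comparing len(sequence) with variant.length for each entry is a quick sanity check that a file was read completely.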