MGH DNA Core

Research Services


Viral Genome Sequencing: Data Retrieval

Upon completion of the NGS run, data are analyzed, demultiplexed, and entered into our automated de novo assembly pipeline. Viral genomes are assembled using MGH CCIB's de novo assembler UltraCycler v1.0 (Brian Seed and Huajun Wang, unpublished). Once the assembly output has been manually inspected and has passed our QC standards, results are made available through our secure data server. Researchers will receive an automated email notification as soon as their data can be accessed through our website.

Please note that your data will only be available for three months after they are released! We strongly encourage our users to download their data as soon as they are available. Our data server is a temporary storage site only and does not allow long-term archiving of NGS data. All Viral Genome Sequencing data generated at our core facility are subject to deletion without notice after three months.

Accessing Your Data:

To access our data server, please log into your account and click on the My Results button. You have the following options to download your data files:

Download a single uncompressed text file (*.seq) containing, in FASTA format, all nucleotide sequences generated for the corresponding order.
Please Note: For a fully finished genome, the FASTA file header will say "CONTIG". If we have sequenced a circular genome, the FASTA file header will say "CIRCLE". In the latter case, the first 52 base pairs of the sequence are repeated at the end of the sequence, a result of our final assembly quality control. Prior to importing the complete genome sequence into another sequence analysis program, please remove the last 52 base pairs at the end of the sequence.
Download a compressed file (*.sit) containing one FASTA format sequence file (*.seq) and one EXCEL file (with coverage information) for each sample of the corresponding order. For each sample, the raw NGS data (in FASTQ format) are also provided. Please note that there will be one FASTQ file for each sample. Paired-end reads (2 x 150 bases) are stored in this single FASTQ file, in which the entries for the first read (1) and second read (2) of each pair alternate. The first read in each pair comes before the second.
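
Example: The two notes above (removing the repeated 52 base pairs from a "CIRCLE" sequence, and separating the alternating first and second reads in the FASTQ file) can be sketched in Python as follows. This is a minimal sketch and not part of our pipeline. The function names are hypothetical, the "CIRCLE" check assumes the word appears literally in the FASTA header, and the FASTQ splitting assumes each entry occupies exactly four lines.

```python
from typing import Iterator, List, Tuple

OVERLAP = 52  # length of the repeated region stated above for "CIRCLE" records


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_circular(header: str, seq: str, overlap: int = OVERLAP) -> str:
    """Drop the trailing repeat from a CIRCLE record; leave other records alone."""
    # Assumption: the header contains the literal word "CIRCLE".
    if "CIRCLE" not in header:
        return seq
    # Sanity check: the tail should duplicate the head before it is cut off.
    if len(seq) <= overlap or seq[:overlap].upper() != seq[-overlap:].upper():
        raise ValueError(
            f"{header}: last {overlap} bases do not repeat the first {overlap}"
        )
    return seq[:-overlap]


def split_interleaved_fastq(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split interleaved FASTQ lines (read 1, read 2, read 1, ...) into two lists."""
    # Assumption: four lines per entry, so eight lines per read pair.
    if len(lines) % 8 != 0:
        raise ValueError("expected a whole number of read pairs (8 lines per pair)")
    read1, read2 = [], []
    for i in range(0, len(lines), 8):
        read1.extend(lines[i:i + 4])      # first read of the pair
        read2.extend(lines[i + 4:i + 8])  # second read of the pair
    return read1, read2
```

The overlap check raises an error instead of trimming silently, so a sequence that has already been trimmed is not shortened a second time.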

IMPORTANT:

The EXCEL file provides depth and coverage information for each individual base of the genome, which is valuable for assessing the sequence quality at that position. A descriptive example of a coverage file can be downloaded here.
Due to the short read lengths typical of the current generation of high-throughput sequencing instruments, long repeat structures are challenging for de novo sequence assembly and are resolved with mixed success. In these cases, the availability of a reference sequence can aid the generation of a fully assembled viral genome sequence. If a complete assembly cannot be produced, sequence and coverage information for the individual contigs will be provided as multiple entries in the FASTA format sequence file (*.seq) and multiple sheets in the respective EXCEL file.
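
Example: To check whether a sample assembled into a single sequence or into several contigs, you can count the entries in its FASTA file. The Python sketch below is one possible way to do this and uses hypothetical function names. It assumes you are looking at the per-sample *.seq file from the compressed download. The concatenated *.seq file for a whole order contains entries for all samples, so several entries there do not by themselves indicate an incomplete assembly.

```python
from typing import List, Tuple


def summarize_fasta(text: str) -> List[Tuple[str, int]]:
    """Return (header, sequence length) for every entry in FASTA text."""
    records: List[Tuple[str, int]] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new entry starts; its length is accumulated below.
            records.append((line[1:], 0))
        elif line and records:
            header, length = records[-1]
            records[-1] = (header, length + len(line))
    return records


def is_single_sequence(text: str) -> bool:
    """True when the FASTA text holds exactly one sequence entry."""
    return len(summarize_fasta(text)) == 1
```

Printing the result of summarize_fasta for a sample file lists each contig header with its length, which can then be compared against the sheets of the corresponding EXCEL file.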

Decompression Software:
To open compressed *.sit files (which are actually .zip files) on your computer, you can use Stuffit Expander (free) or WinZip (trial version). To download, please select the appropriate link below. You can also use the Linux unzip command.

Stuffit Expander
WinZip for Windows
WinZip for Mac
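
Example: Because the *.sit files are actually .zip files, they can also be unpacked with a script. The following Python sketch uses the standard zipfile module. The function name and the file paths you pass to it are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_sit(archive: str, dest: str) -> List[str]:
    """Extract a zip-format *.sit archive into dest and return the member names."""
    # zipfile inspects the file content, not the extension, so *.sit works as-is.
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a zip archive")
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

For example, extract_sit("order.sit", "order_data") would unpack a downloaded archive named order.sit into a folder named order_data and return the list of files it contained.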

Please note:
*.seq files are plain text files containing your sequence in FASTA format and can be opened with any software capable of viewing plain text or FASTA format files (for example a text editor such as Notepad, or a word processor such as Word). You may also change the file extension from *.seq to *.txt.