MGH   
CCIB
 
AAV Genome Sequencing: Data Retrieval Details

Explanation of Delivered Data Files

For each sample you will receive nine data files, all beginning with the TUBE_ID specified in your sample submission form (shown as "prefix" in the file names below). Together these files provide raw data, processed results, and interpretive summaries to support rAAV quality control and production assessment. Below is an overview of each file, its contents, and recommended uses.
  1. prefix_combined_reference.fasta
    Multi-FASTA file that includes all reference sequences associated with the production of the AAV vector:
    • transgene plasmid
    • Rep-Cap plasmid
    • helper plasmid
    • host genome

    Suggested Use:
    Load into a genome viewer such as IGV alongside the tagged BAM files to visualize how reads align across all reference components.
  2. prefix_output_cat.fastq
    Raw FASTQ reads after demultiplexing
    No trimming or filtering has been applied
  3. prefix_tagged_bams
    BAM files containing reads mapped to the combined reference and classified by AAV subgenome type
    • Only primary alignments are shown (produced by minimap2).
    • Supplementary alignments indicate split reads (single reads mapping to multiple, non-contiguous regions of the genome).

    Suggested Use:
    Open in IGV along with the combined reference to:
    • Visualize alignment by subgenome class
    • Inspect coverage, variants and structural features
  4. prefix_trimmed_aav_per_read_info.tsv
    TSV file listing each read and its classified AAV subgenome type
    • Subgenome types include:
      • Full ssAAV
      • Partial ssAAV, including:
        • Genome Duplication Mutants (GDM)
        • 3' Incomplete Genome types (3' ICG)
        • 5' Incomplete Genome types (5' ICG)
        • Partial ICG (lacking ITRs)
      • Full scAAV
      • Partial scAAV, including:
        • 3' Snapback Genomes (3' SBG)
        • 5' Snapback Genomes (5' SBG)
        • SBG (unresolved orientation)
      • Backbone Contamination
      • Complex
      • Unknown

      Reference diagrams are available in the EPI2ME AAV Workflow GitHub repository.

    Suggested Use:
    • Assess rAAV quality and subgenomic structures
    • Filter or summarize with spreadsheet tools or programmatically (e.g. Python/R)
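
For the programmatic route, a minimal Python sketch is given here: it tallies reads per subgenome class and reports each class as a fraction of all classified reads. The column name `Assigned_genome_type` is an assumption made for illustration, not a confirmed header; check the first line of your own TSV and pass the actual column name.

```python
import csv
from collections import Counter


def summarize_subgenome_types(tsv_path, type_column="Assigned_genome_type"):
    """Count reads per subgenome class in a per-read info TSV.

    `type_column` is a hypothetical column name; replace it with the
    header that actually appears in your file.
    Returns {class_name: (read_count, fraction_of_all_reads)}.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or type_column not in reader.fieldnames:
            raise KeyError(
                f"column {type_column!r} not found; available: {reader.fieldnames}"
            )
        counts = Counter(row[type_column] for row in reader)

    total = sum(counts.values())
    # most_common() orders classes from most to least frequent
    return {name: (n, n / total) for name, n in counts.most_common()}
```

Calling `summarize_subgenome_types("prefix_trimmed_aav_per_read_info.tsv")` returns a dictionary that can be printed directly or written to a spreadsheet for comparison across samples.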
  5. prefix_trimmed_bam_info.tsv
    Per-read alignment summary produced by seqkit bam. Includes:
    • Read mapping reference, position, orientation
    • Mapping quality, read length, alignment accuracy
    • Clipping information and alignment flags

    Suggested Use:
    • Detect low-quality or split reads
    • Analyze structural features or mapping artifacts
    • Group by reference to compare across subgenomes
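
One way to group this table by reference is sketched below in Python: for each reference it reports the number of alignments, the mean mapping quality, and how many alignments fall below a chosen mapping-quality cutoff. The column names `Ref` and `MapQual` and the cutoff of 20 are assumptions for illustration; confirm the headers in your file and choose a threshold that suits your analysis.

```python
import csv
from collections import defaultdict


def summarize_by_reference(tsv_path, ref_column="Ref",
                           mapq_column="MapQual", min_mapq=20):
    """Group per-read alignment rows by reference sequence.

    `ref_column`, `mapq_column`, and `min_mapq` are assumed defaults;
    adjust them to match the header of your TSV.
    """
    stats = defaultdict(lambda: {"reads": 0, "mapq_sum": 0.0, "low_mapq": 0})

    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            mapq = float(row[mapq_column])
            entry = stats[row[ref_column]]
            entry["reads"] += 1
            entry["mapq_sum"] += mapq
            if mapq < min_mapq:
                entry["low_mapq"] += 1  # candidate low-quality alignment

    return {
        ref: {
            "reads": s["reads"],
            "mean_mapq": s["mapq_sum"] / s["reads"],
            "low_mapq": s["low_mapq"],
        }
        for ref, s in stats.items()
    }
```

The same pattern extends to other columns in the table, such as read length or clipping, by accumulating additional fields per reference.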
  6. prefix_trimmed_nanostat_output.txt
    Read quality summary after trimming, including:
    • Total reads and bases
    • Mean read length and quality
    • N50 statistics
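
For readers unfamiliar with the N50 statistic: it is the read length L such that reads of length L or longer together contain at least half of all sequenced bases. The Python sketch below illustrates that definition; it is not NanoStat's own code, and the FASTQ helper assumes plain four-line records. Note that running it on prefix_output_cat.fastq gives the untrimmed N50, which may differ from the trimmed value in this report.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    lengths = list(lengths)
    if not lengths:
        return 0
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half of all bases are covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def fastq_read_lengths(fastq_path):
    """Yield sequence lengths from an uncompressed FASTQ file.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record
                yield len(line.strip())
```

For example, reads of length 6, 5, 4, 3, and 2 total 20 bases; the two longest reads cover 11 bases, which passes the halfway mark of 10, so the N50 is 5.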
  7. prefix_trimmed.transgene_plasmid_consensus.fasta
    Consensus sequence of the transgene plasmid, polished using medaka
  8. prefix_trimmed.transgene_plasmid_sorted.vcf
    Variant calls showing differences between the input transgene plasmid and the transgene consensus sequence

    Suggested Use:
    • Validate plasmid integrity of the transgene
    • Identify mutations, insertions, or deletions
    • Visualize alongside tagged BAM file(s) and combined reference in IGV
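
To get a quick count of mutations, insertions, and deletions before opening IGV, the VCF can be tallied with a short script. The Python sketch below relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT) and classifies each alternate allele by comparing its length with the reference allele. The function names are illustrative, and the length-based classification is a simplification that does not decompose complex events.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by comparing allele lengths."""
    if alt.startswith("<") or alt in {"*", "."}:
        return "other"  # symbolic, spanning-deletion, or missing allele
    if len(ref) == len(alt):
        return "substitution"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variant_types(vcf_path):
    """Count substitutions, insertions, and deletions in an uncompressed VCF."""
    counts = Counter()
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            ref, alts = fields[3], fields[4]
            # A record may list several comma-separated alternate alleles
            for alt in alts.split(","):
                counts[classify_variant(ref, alt)] += 1
    return counts
```

An empty result (no variant records) suggests the consensus matches the submitted transgene plasmid reference at every called position.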
  9. prefix_wf-aav-qc-report.html
    Comprehensive interactive HTML report summarizing the AAV analysis workflow.

    Includes:
    • Read quality: yield, length, and quality scores
    • Contamination assessment: mapped vs. unmapped reads; breakdown by reference (host, helper, Rep-Cap, transgene)
    • Truncation analysis: start/end mapping positions within the ITR-to-ITR region
    • AAV subgenome summary: frequency of each subgenome class (from the per-read classification)