Here I outline a simple algorithm that will detect possible CRISPR variants from NGS reads. The algorithm is developed for paired-end NGS sequencing of a CRISPR sample that might contain several variants. The amplicons are between 200 bp and 300 bp, and we typically generate between 70,000 and 100,000 NGS reads for each sample (assuming the sample is of high quality and has the recommended DNA concentration). The following figure shows the results for a typical sample.
(The number of NGS reads matched to the reference sequences. Black dots represent reads that match in the forward direction while red dots represent those that match in the reverse direction).
The key point to notice is that, for such a short amplicon, the data is highly redundant. Although we have several hundred thousand reads, the number of unique sequences is actually much smaller. As the above figure demonstrates, most of the NGS reads have hundreds, if not thousands of exact copies. To make the following description simple, we have to make a clear distinction between "sequences" and "reads". Sequences are some abstract mathematical entities. "ATCGATCG"" is a sequence. "ATCGATCC" is another, a different sequence. There is only one sequence that is "ATCGATCG". Reads are what we get from the NGS process. For each read, there is exactly one corresponding sequence. Many different reads, however, can correspond to the same sequence.
The algorithm is as follows:
ATCGATCG TCGATCGC CGATCGCCOr two sequences that are different at only one position, while the differences of their frequencies suggest that the low frequency one is the result of a sequencing error. We can cluster these sequences into an even smaller number of clusters, the HFSC (High Frequency Sequence Clusters). Each of these clusters contains one or more HFS. All the sequences are very similar to each other within the cluster.
Questions to hwang12@mgh.harvard.edu