Seminar: Resolving Real Biological Sequences with Accurate Abundance Estimation from Noisy Illumina Amplicon Data

Aug 19, 2021 - 1:00 PM

Abstract: Amplicon sequencing is widely used to explore heterogeneity and rare variants in genetic populations. Resolving the true biological variants and accurately quantifying their abundances from noisy amplicon sequence data are crucial for downstream analyses, but measured abundances are distorted by stochasticity and bias in amplification, as well as by errors introduced during polymerase chain reaction (PCR) and sequencing.

In the first half of the presentation, we address error correction. We introduce a reference-free, model-based clustering method that rapidly resolves the number, abundance, and identity of real biological sequences in massive Illumina amplicon datasets. The method estimates a mixture model, using a greedy strategy to gradually select error-free sequences while approximately maximizing the likelihood. In the second half of the presentation, we address amplification bias. We propose a deduplication method that estimates absolute molecular counts from amplicon sequence data tagged with Unique Molecular Identifiers (UMIs). Errors in both the UMIs and the sampled sequences can be detected and corrected, and the method can recognize UMI collisions. We benchmark both approaches and show that they outperform competing methods.
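To make the idea of UMI-based deduplication concrete, here is a minimal Python sketch. It is not the method presented in the seminar: it is a simple abundance-ordered merge of UMIs by Hamming distance, and the function names (`hamming`, `dedup_umis`) and the parameters `max_dist` and `min_ratio` are hypothetical choices made for illustration only.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def dedup_umis(reads, max_dist=1, min_ratio=2):
    """Collapse reads into estimated molecules (illustrative heuristic only).

    reads: iterable of (umi, sequence) pairs.
    A UMI is treated as an error copy of a more abundant UMI when the two
    differ in at most `max_dist` positions and the abundant one has at least
    `min_ratio` times as many reads. Both thresholds are assumptions.
    Returns {representative_umi: (majority_sequence, read_count)}.
    """
    reads = list(reads)
    umi_counts = Counter(umi for umi, _ in reads)
    # Visit UMIs from most to least abundant; ties are broken alphabetically.
    ordered = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))

    parent = {}   # maps every observed UMI to its representative UMI
    centers = []  # UMIs accepted as distinct molecules so far
    for umi in ordered:
        for c in centers:
            close = hamming(umi, c) <= max_dist
            dominant = umi_counts[c] >= min_ratio * umi_counts[umi]
            if close and dominant:
                parent[umi] = c
                break
        else:
            parent[umi] = umi
            centers.append(umi)

    # Pool the sequences of each molecule and keep the majority sequence.
    pooled = {}
    for umi, seq in reads:
        pooled.setdefault(parent[umi], Counter())[seq] += 1
    return {
        c: (cnt.most_common(1)[0][0], sum(cnt.values()))
        for c, cnt in pooled.items()
    }
```

The number of keys in the returned dictionary is the estimated molecule count. This sketch assumes each UMI tags a single molecule, so it cannot recognize UMI collisions, and its majority vote is a much cruder form of sequence error correction than a model-based approach.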