Ph.D. Seminar: Yudi Zhang, "Unsupervised learning with high-throughput sequencing data"
Title: Unsupervised learning with high-throughput sequencing data
Developments in high-throughput next-generation sequencing (NGS) technologies have made it possible to produce massive amounts of nucleotide sequence data quickly and at affordable cost. Such massive, high-dimensional data pose great computational and statistical challenges. This presentation takes on two data analysis challenges related to high-throughput sequencing data: unsupervised clustering of amplicon sequences, and genotyping and phasing in allopolyploids. It uses a spectrum of statistical techniques, from a classic, "model-free" algorithmic approach to a traditional Hidden Markov Model (HMM).
Abstract: Amplicon sequencing, where one or a few small fragments of genomes are amplified and sequenced from complex populations, is used in many applications. A common goal is to cluster the reads and extract a representative sequence for each cluster, at a minimum to eliminate ubiquitous sequencing errors, but sometimes to identify biologically meaningful groups. We adapt the k-modes method to optimize an uncertainty-aware objective function. The new method can use the quality scores provided with NGS data, or other measures of observation uncertainty, and it includes the regular k-modes method for categorical data without uncertainty measures as a special case.
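To make the idea of an uncertainty-aware k-modes objective concrete, here is a minimal sketch in Python. It is not the method presented in the seminar: the weighting scheme (each base call weighted by the probability that it is correct, derived from its Phred quality score) and the names `phred_to_weight` and `weighted_kmodes` are assumptions made for illustration. With all weights equal to 1 the sketch reduces to ordinary k-modes with Hamming distance.

```python
import numpy as np

def phred_to_weight(quals):
    """Probability that each base call is correct, from Phred scores Q: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-np.asarray(quals, dtype=float) / 10.0)

def weighted_kmodes(X, W, init_modes, n_iter=20):
    """Illustrative uncertainty-weighted k-modes (an assumed objective, not the speaker's).

    X          : (n, p) integer-coded reads, e.g. A=0, C=1, G=2, T=3
    W          : (n, p) non-negative weights, one per observed base
    init_modes : (k, p) integer-coded initial cluster representatives
    """
    X = np.asarray(X)
    W = np.asarray(W, dtype=float)
    modes = np.array(init_modes, copy=True)
    n_cat = int(max(X.max(), modes.max())) + 1
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Assignment step: weighted mismatch cost of every read against every mode.
        mismatch = X[:, None, :] != modes[None, :, :]   # shape (n, k, p)
        cost = (mismatch * W[:, None, :]).sum(axis=2)   # shape (n, k)
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: per cluster and position, keep the category with the
        # largest total weight among the cluster's members.
        for c in range(modes.shape[0]):
            members = labels == c
            if not members.any():
                continue  # leave the mode of an empty cluster unchanged
            for j in range(X.shape[1]):
                totals = np.bincount(X[members, j], weights=W[members, j],
                                     minlength=n_cat)
                modes[c, j] = totals.argmax()
    return labels, modes
```

Under this weighting, a low-quality base that disagrees with a cluster's representative contributes little to the cost, so a read carrying a likely sequencing error stays with its true cluster instead of seeding a spurious one.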
High-throughput sequencing technology also allows genome-wide analysis to be performed at a much finer resolution. Genotyping and phasing are important for investigating fine-scale genetic variation in diploids and polyploids. Although allopolyploid plants, such as peanut and cotton, are common and often economically important, available genotyping tools perform poorly on them because it is difficult to distinguish allelic single-nucleotide polymorphisms (SNPs) from homoeologous SNPs. We propose an inhomogeneous Hidden Markov Model in which the hidden variables are the four underlying haplotypes. These haplotypes emit entire reads, whose bases are modeled as conditionally independent and regressed on covariates via multinomial logistic regression. This joint genotyping and phasing model accounts for linkage structure and successfully recovers complicated allopolyploid structures.
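As a generic illustration of what "inhomogeneous" means for an HMM, the sketch below computes the log-likelihood of an observation sequence with the forward algorithm when the transition matrix is allowed to differ at every step. It is a textbook recursion, not the genotyping and phasing model from the talk: the function name `forward_loglik`, the array layout, and the assumption that emission log-probabilities are supplied precomputed (for example from a multinomial logistic regression) are all choices made for this example.

```python
import numpy as np

def forward_loglik(log_init, log_trans, log_emit):
    """Log-likelihood of an inhomogeneous HMM via the forward algorithm.

    log_init  : (S,)        log P(z_0 = i)
    log_trans : (T-1, S, S) log_trans[t, i, j] = log P(z_{t+1} = j | z_t = i);
                            a separate matrix per step makes the chain inhomogeneous
    log_emit  : (T, S)      log P(x_t | z_t = i), assumed precomputed
    """
    log_alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        # Sum over the previous state i (in log space) for each current state j,
        # then add the emission term for step t.
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + log_trans[t - 1], axis=0
        ) + log_emit[t]
    # Marginalise over the final hidden state.
    return np.logaddexp.reduce(log_alpha)
```

Working in log space avoids numerical underflow when the sequence is long, which matters whenever many sites or reads are chained together.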