Ph.D. Seminar: Yichuan Bai, A Graph-based Approach to Estimating the Number of Clusters
Speaker: Yichuan Bai, Ph.D. Candidate, Department of Statistics, Iowa State University
Title: A Graph-based Approach to Estimating the Number of Clusters
Abstract: Clustering is a fundamental unsupervised learning technique and a critical component of many statistical and machine learning pipelines. Many clustering approaches require the number of groups k to be pre-specified, which can be challenging when the true number of groups is unknown. We consider the problem of estimating the number of clusters in a dataset and propose a non-parametric approach that uses similarity graphs to construct a robust statistic that effectively captures similarity information among observations. This graph-based statistic is applicable to datasets of any dimension, is computationally efficient to obtain, and can be paired with any clustering technique. Asymptotic theory is developed to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating the number of clusters, especially in the high-dimensional setting. We illustrate its utility on an imaging dataset and an RNA-seq dataset.