Wen Zhou, Colorado State: Integrative Group Factor Model for Variable Clustering on Temporally Dependent Data: Optimality and Algorithm

Apr 18, 2022 - 11:00 AM

Presenter: Dr. Wen Zhou, Colorado State University

Time: 11:00 AM Central Time, Monday, April 18, 2022

Title: Integrative Group Factor Model for Variable Clustering on Temporally Dependent Data: Optimality and Algorithm

Abstract: Clustering a large number of variables is emerging rapidly in a variety of areas and has become a fundamental problem in statistics and machine learning. Although many algorithmic approaches are scattered across the literature, their interpretation is limited and their outputs usually lack guarantees. Furthermore, their explicit and implicit assumptions, such as the independence of data and the well-separation between clusters, are rather restrictive if not unrealistic. In this work, we take the view of model-based clustering, in which the population-level clusters are clearly interpreted statistically, to cluster a large number of variables. The proposed integrative group factor model (iGFM) is compatible with temporally dependent data and allows connections across the variable clusters. In this model, two types of latent factors, the common and unique factors, are introduced to model the cross-cluster connection and the within-cluster similarity among variables. We quantify the difficulty of clustering variables based on the iGFM in terms of a permutation-invariant clustering risk and derive the minimax signal threshold, below which no algorithm can cluster variables successfully. This threshold is driven by the competition between common and unique factors in the model and does not require the well-separation of clusters to guarantee perfect recovery. Based on the spectral decomposition and the idea of linear search, we develop a fast and minimax-optimal algorithm to cluster variables. An interesting phase transition in the clustering performance has been discovered, in which the model parameter space is partitioned into three regions where perfect clustering is, respectively, impossible, possible with guarantees on optimality, and possible with no provable guarantees. In addition, we compare our method with another popular model-based method, the G-block model and its associated COD algorithm. Extensive simulation studies, as well as careful data analyses on macroeconomic index data, confirm the advantage of our approach. Finally, we also discuss how to characterize the unknown number of clusters and the extension of our method to a diverging number of clusters.

Zoom link:

Please click this URL to start or join. https://iastate.zoom.us/j/98782130243?pwd=ZlppMEpHZGZ1VEErZTdvL3F5OXVaZz09

Or, go to https://iastate.zoom.us/join and enter meeting ID: 987 8213 0243 and password: 2022

Join from dial-in phone line:

Dial: +1 312 626 6799 or +1 646 876 9923

Meeting ID: 987 8213 0243