Principal Component Analysis of Discrete Datasets

Aug 12, 2021 - 2:00 PM

Abstract: We propose a Gaussian copula based method for performing principal component analysis on discrete data. Assuming the data come from a discrete distribution in the Gaussian copula family, we can regard the discrete random vectors as generated from a latent multivariate normal random vector. We therefore first estimate the correlation matrix of the latent multivariate normal distribution, and then use the estimated latent correlation matrix to estimate the principal components. To estimate the correlation matrix, our method uses the generalized distributional transform to convert the discrete data to continuous data; from Kendall's tau of the transformed data, we obtain the estimate of the correlation matrix by solving an equation involving Kendall's tau. The method is easily parallelized, and the convergence rate of the estimate is discussed along with a simulation study. Although the method is developed for numeric discrete data, we show that it also works for ordered categorical data when every category is mapped to an integer. We also consider categorical sequence data, where each observation is a random sequence whose terms each have a categorical marginal distribution. In this case the marginal distribution is in fact not univariate, so the usual Gaussian copula does not apply. We propose an optimal mapping method that converts such data to data with univariate marginals by mapping each category to an integer. The usual Gaussian copula can then be used to model the mapped data, and the established discrete principal component analysis can be applied to them. Senators' voting data are used as an example in the experiment.
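
The pipeline described in the abstract (transform, Kendall's tau, latent correlation, principal components) can be illustrated with a rough Python sketch. This is not the speaker's implementation. The function names are hypothetical, the margins are estimated with the empirical CDF, and the step from Kendall's tau to the latent correlation uses the relation rho = sin(pi * tau / 2), which holds for a Gaussian copula with continuous margins. The abstract does not state the equation actually solved for the transformed discrete data, so that relation is only a stand-in assumption here.

```python
import numpy as np
from scipy.stats import kendalltau


def distributional_transform(x, rng):
    """Randomized distributional transform based on the empirical CDF.

    Each discrete value x is replaced by F(x-) + V * (F(x) - F(x-)),
    with V ~ Uniform(0, 1), which spreads ties over an interval.
    """
    x = np.asarray(x)
    n = len(x)
    values, counts = np.unique(x, return_counts=True)
    cdf = np.cumsum(counts) / n        # F(x) at each distinct value
    cdf_left = cdf - counts / n        # F(x-) at each distinct value
    idx = np.searchsorted(values, x)   # position of each observation's value
    v = rng.uniform(size=n)
    return cdf_left[idx] + v * (cdf[idx] - cdf_left[idx])


def latent_correlation(X, rng):
    """Estimate a latent correlation matrix from pairwise Kendall's tau.

    ASSUMPTION: uses rho = sin(pi * tau / 2), the continuous-margin
    Gaussian copula relation, in place of the talk's (unstated) equation.
    """
    n, d = X.shape
    U = np.column_stack(
        [distributional_transform(X[:, j], rng) for j in range(d)]
    )
    R = np.eye(d)
    # Each pair (i, j) is independent of the others, so this loop
    # could be distributed across workers.
    for i in range(d):
        for j in range(i + 1, d):
            tau, _ = kendalltau(U[:, i], U[:, j])
            R[i, j] = R[j, i] = np.sin(np.pi * tau / 2)
    return R, U


def principal_components(R):
    """Eigen-decomposition of R, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Synthetic demo: discretize a latent trivariate normal into 4 ordered levels.
rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.3],
                  [0.3, 0.3, 1.0]])
Z = rng.multivariate_normal(np.zeros(3), sigma, size=300)
X = np.digitize(Z, [-1.0, 0.0, 1.0])   # integer codes 0..3

R, U_demo = latent_correlation(X, rng)
eigvals, eigvecs = principal_components(R)
print(np.round(R, 2))
print(np.round(eigvals, 2))
```

Because the randomized transform adds independent noise within ties, the tau of the transformed data is generally attenuated relative to the latent continuous case, which is presumably why the method solves a dedicated equation rather than applying the sine relation directly. For ordered categorical data, the same sketch applies once each category is mapped to an integer code.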