Statistical analysis using finite mixtures of normal linear models
Jianlin Cheng
Iowa State University
Finite mixture models are often used in statistical applications when the
population under study is believed to consist of a number of
heterogeneous subpopulations, but it is not possible to identify the
subpopulation to which an individual belongs. Finite mixtures of normal
linear regression models are explored as a class of models for relating a
response variable to a set of predictor variables. We consider two
classes of mixture models: one in which the proportion of the
population in each subpopulation is independent of the measured predictor
variables, and a second in which the mixture proportions are allowed to
depend on the predictor variables.
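
For concreteness, the two classes might be written as below. The notation is an assumption made for illustration and is not taken from the text above: k components, regression coefficients \beta_j, variances \sigma_j^2, the normal density \phi(y; \mu, \sigma^2), and, for the second class, a multinomial-logit form for the proportions with coefficients \gamma_j. Other link functions for the proportions are equally possible.

```latex
% Class 1: constant mixing proportions
f(y \mid x) = \sum_{j=1}^{k} \pi_j \,
              \phi\!\left(y;\, x^{\top}\beta_j,\, \sigma_j^{2}\right),
\qquad \pi_j > 0, \quad \sum_{j=1}^{k} \pi_j = 1 .

% Class 2: mixing proportions depend on the predictors
% (multinomial-logit form assumed here for illustration)
f(y \mid x) = \sum_{j=1}^{k} \pi_j(x) \,
              \phi\!\left(y;\, x^{\top}\beta_j,\, \sigma_j^{2}\right),
\qquad
\pi_j(x) = \frac{\exp\!\left(x^{\top}\gamma_j\right)}
                {\sum_{l=1}^{k} \exp\!\left(x^{\top}\gamma_l\right)},
\quad \gamma_k = 0 .
```

Under this assumed parameterization, the first class is the special case of the second in which every \gamma_j has zero slope coefficients, leaving only intercepts.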
Conditions are determined under which the parameters of the finite
mixture model are identifiable. Two approaches to statistical inference
for the model parameters are reviewed: maximum likelihood estimation and
the associated large sample theory, and Bayesian inference. Several
complications arise in practice when analyzing data with finite mixture
models, including multiple modes of the likelihood function, degenerate
modes corresponding to small subpopulations with apparently zero
variance, and the failure of traditional large sample results.
Simulations are used to investigate the performance of the two
approaches to inference. It is important that a statistical analysis go
beyond just fitting a model to data and include some model assessment.
We explore the use of posterior predictive model checks for this purpose.
In particular, a posterior predictive method is proposed for comparing the
mixture of regressions with constant proportions to the mixture of
regressions with nonconstant proportions.
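
The degenerate modes mentioned above can be illustrated with a small numerical sketch. It is not the analysis reported here: the data are synthetic and deterministic, the function name `mixture_loglik` and all parameter values are hypothetical, and the model is a two-component mixture with constant proportions. One component is set to pass exactly through a single observation, and its standard deviation is then shrunk toward zero.

```python
import numpy as np

def mixture_loglik(y, X, pis, betas, sigmas):
    """Log-likelihood of a mixture of normal linear regressions
    with constant mixing proportions (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    log_terms = []
    for p, beta, sigma in zip(pis, betas, sigmas):
        resid = y - X @ np.asarray(beta, dtype=float)
        # log of (proportion * normal density) for each observation
        log_terms.append(np.log(p)
                         - 0.5 * np.log(2.0 * np.pi * sigma**2)
                         - 0.5 * (resid / sigma) ** 2)
    log_terms = np.vstack(log_terms)   # shape (k, n)
    m = log_terms.max(axis=0)          # log-sum-exp for numerical stability
    return float(np.sum(m + np.log(np.exp(log_terms - m).sum(axis=0))))

# Synthetic data around the line y = 1 + 2x (made up for illustration).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.3 * np.sin(7.0 * np.arange(20))

beta_main = [1.0, 2.0]     # component 1: the line the data follow
beta_spike = [y[0], 0.0]   # component 2: fits the first observation exactly

spike_sigmas = [1e-2, 1e-4, 1e-8]
lls = [mixture_loglik(y, X, [0.5, 0.5], [beta_main, beta_spike], [0.3, s])
       for s in spike_sigmas]
for s, ll in zip(spike_sigmas, lls):
    print(f"sigma_2 = {s:.0e}: log-likelihood = {ll:.3f}")
```

The log-likelihood keeps increasing as the second standard deviation shrinks, so in this sketch the likelihood is unbounded and the "best" fit is a one-point subpopulation with apparently zero variance.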
The various approaches to inference and model assessment are applied
to an example concerning household expenditures in Bangladesh. An
economic hypothesis suggests that households there spend more resources
ensuring the health of male children than of female children. A simple linear
regression explaining the difference between male and female child health
finds no significant predictors. One plausible explanation is that the
population consists of two types of households: those that do not
discriminate based on gender and those that do. The finite mixture of
regressions allows us to address this hypothesis.