Statistical analysis using finite mixtures of normal linear models

                         Jianlin Cheng 
                      Iowa State University 


Finite mixture models are often used in statistical applications when the 
population under study is believed to consist of a number of 
heterogeneous subpopulations, but it is not possible to identify the 
subpopulation to which an individual belongs. Finite mixtures of normal 
linear regression models are explored as a class of models for relating a 
response variable to a set of predictor variables. We consider two
classes of mixture models: one in which the proportion of the
population in each subpopulation is independent of the measured predictor
variables, and a second in which the mixture proportions are allowed to
depend on the predictor variables.
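
As an illustration, the two classes can be written as follows. The
notation is assumed here and not taken from the text: K components,
mixture proportions \pi_k, regression coefficients \beta_k, variances
\sigma_k^2, and \phi(y; \mu, \sigma^2) for the normal density. The
multinomial logit form with coefficients \gamma_k in the second class is
one common choice and is likewise an assumption.

```latex
% Class 1: mixture proportions do not depend on the predictors
f(y_i \mid x_i) = \sum_{k=1}^{K} \pi_k \,
    \phi\!\left(y_i ;\, x_i^\top \beta_k,\, \sigma_k^2\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Class 2: mixture proportions depend on the predictors
% (multinomial logit link assumed for illustration)
f(y_i \mid x_i) = \sum_{k=1}^{K} \pi_k(x_i) \,
    \phi\!\left(y_i ;\, x_i^\top \beta_k,\, \sigma_k^2\right),
\qquad
\pi_k(x_i) = \frac{\exp\!\left(x_i^\top \gamma_k\right)}
                  {\sum_{j=1}^{K} \exp\!\left(x_i^\top \gamma_j\right)},
\quad \gamma_K = 0
```

Under this sketch, the first class is the special case of the second in
which every \gamma_k has zero coefficients on all predictors other than
the intercept.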

Conditions are determined under which the parameters of the finite
mixture model are identifiable.  Two approaches to statistical inference 
for the model parameters are reviewed: maximum likelihood estimation and 
the associated large sample theory, and Bayesian inference. Several
complications arise in practice when analyzing data with finite mixture
models, including multiple modes of the likelihood function, degenerate
modes corresponding to small subpopulations with apparently zero
variance, and the failure of traditional large sample results.
Simulations are used to investigate the performance of the two
approaches to inference. It is important that a statistical analysis go 
beyond just fitting a model to data and include some model assessment.
We explore the use of posterior predictive model checks for this purpose. 
In particular, a posterior predictive method is proposed for comparing the
mixture of regressions with constant proportions to the mixture of 
regressions with nonconstant proportions.
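
To make the maximum likelihood approach concrete, the following is a
minimal sketch, not the method used in this work. It assumes the EM
algorithm (the text names maximum likelihood estimation but no particular
algorithm), two components, a single predictor, and constant mixture
proportions. The function names wls_line, em_two_lines and
best_of_restarts are hypothetical. The starting values, which split the
data on the sign of the pooled least-squares residual and then flip a
random fraction of the labels, are a heuristic chosen for illustration.
The variance floor is a crude guard against the degenerate modes
described above, and the restarts are one simple response to multiple
modes of the likelihood.

```python
import math
import random


def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def em_two_lines(x, y, n_iter=50, seed=0, flip_prob=0.1, var_floor=1e-6):
    """EM for a two-component mixture of simple linear regressions.

    Returns (components, loglik_trace). Each component is a tuple
    (proportion, intercept, slope, variance).
    """
    rng = random.Random(seed)
    n = len(x)
    # Starting values (heuristic, an assumption of this sketch): label each
    # point by the sign of its pooled least-squares residual, then flip a
    # random fraction of labels so different seeds give different starts.
    a0, b0 = wls_line(x, y, [1.0] * n)
    r1 = []  # responsibility of component 1 for each observation
    for xi, yi in zip(x, y):
        label = 1.0 if yi - a0 - b0 * xi > 0 else 0.0
        if rng.random() < flip_prob:
            label = 1.0 - label
        r1.append(min(max(label, 1e-12), 1.0 - 1e-12))
    trace = []
    for _ in range(n_iter):
        # M-step: weighted least squares, weighted variance, and
        # proportion for each component.
        comps = []
        for w in (r1, [1.0 - r for r in r1]):
            a, b = wls_line(x, y, w)
            sw = sum(w)
            rss = sum(wi * (yi - a - b * xi) ** 2
                      for wi, xi, yi in zip(w, x, y))
            comps.append((sw / n, a, b, max(rss / sw, var_floor)))
        # E-step: update responsibilities and accumulate the observed-data
        # log-likelihood, using log-sum-exp for numerical stability.
        loglik, r1 = 0.0, []
        for xi, yi in zip(x, y):
            lw = [math.log(p) - 0.5 * math.log(2 * math.pi * s2)
                  - 0.5 * (yi - a - b * xi) ** 2 / s2
                  for p, a, b, s2 in comps]
            m = max(lw)
            log_norm = m + math.log(sum(math.exp(v - m) for v in lw))
            loglik += log_norm
            r = math.exp(lw[0] - log_norm)
            # Keep responsibilities strictly inside (0, 1) so that neither
            # component ever has zero total weight.
            r1.append(min(max(r, 1e-12), 1.0 - 1e-12))
        trace.append(loglik)
    return comps, trace


def best_of_restarts(x, y, n_starts=5):
    """Run EM from several seeds; keep the fit with the highest final
    log-likelihood."""
    fits = [em_two_lines(x, y, seed=s) for s in range(n_starts)]
    return max(fits, key=lambda fit: fit[1][-1])
```

The log-likelihood trace returned by em_two_lines should be nondecreasing
up to rounding, which gives a quick check on the implementation.
Restarting from several seeds does not guarantee that the global mode is
found. It only reduces the chance of reporting a poor local mode.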

The various approaches to inference and model assessment are applied
to an example concerning household expenditures in Bangladesh. An
economic hypothesis suggests that more resources are spent there on
ensuring the health of male children than of female children. A simple
linear regression explaining the difference between male and female child
health finds no significant predictors. One plausible explanation is that
the population consists of two types of households: those that do not
discriminate based on gender and those that do. The finite mixture of
regressions allows us to address this hypothesis.