Design-unbiased statistical learning in survey sampling - Li-Chun Zhang
Dr. Li-Chun Zhang
University of Southampton
A basic problem with supervised machine learning (ML) is that the model learned from the available sample must be ‘extrapolated’ to the out-of-sample units for the learning to have any value at all. No matter how the learning is organized within the sample, one cannot ensure that it is valid for the out-of-sample units unless the sample is selected from the entire reference set of units (i.e. the population) in some controlled manner. This well-known problem in statistical inference is sometimes recast as the problem of concept drift in the ML literature.
We develop a subsampling Rao-Blackwell method. Under the combined probability sampling-subsampling (pq-design), exactly pq-unbiased estimation can be achieved at the population level using any chosen ML technique. Our approach makes use of three classic ideas from ML and Statistical Science: the training-test split of the sample, Rao-Blackwellisation and model-assisted sampling estimation.
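To make the three ingredients concrete, the following is a minimal sketch of one possible construction, not necessarily the estimator presented in the talk. It assumes simple random sampling without replacement for both the sample (p) and the training subsample (q), auxiliary variables known for every population unit, and a Monte Carlo average over random splits as an approximation to full Rao-Blackwellisation over all splits. The names `fit_ols` and `rb_total` are hypothetical, and ordinary least squares merely stands in for "any chosen ML technique".

```python
import numpy as np

def fit_ols(X, y):
    """Stand-in learner (assumption): least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def rb_total(X_pop, sample_idx, y_sample, n_train, fit=fit_ols,
             n_splits=200, seed=0):
    """Estimate the population total of y by averaging split-based estimates.

    X_pop      : auxiliary variables for all N population units
    sample_idx : population indices of the n sampled units (assumed SRS)
    y_sample   : observed y values, aligned with sample_idx
    n_train    : size of the training subsample (assumed SRS from the sample)
    """
    rng = np.random.default_rng(seed)
    sample_idx = np.asarray(sample_idx)
    y_sample = np.asarray(y_sample, dtype=float)
    N, n = len(X_pop), len(sample_idx)
    n_test = n - n_train
    estimates = []
    for _ in range(n_splits):
        # Training-test split of the sample.
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        # Learn on the training part only, then predict for every unit.
        predict = fit(X_pop[sample_idx[tr]], y_sample[tr])
        mu = predict(X_pop)
        out_of_train = np.ones(N, dtype=bool)
        out_of_train[sample_idx[tr]] = False
        # Model-assisted step: given the training set, the test set is an
        # SRS of size n_test from the N - n_train remaining units, so the
        # weighted test residuals correct the prediction sum without bias.
        resid = y_sample[te] - mu[sample_idx[te]]
        est = (y_sample[tr].sum() + mu[out_of_train].sum()
               + (N - n_train) / n_test * resid.sum())
        estimates.append(est)
    # Averaging over splits approximates the Rao-Blackwell step.
    return float(np.mean(estimates))
```

Each split-based estimate in this sketch is unbiased given the training set, whatever the learner, because the test units never influence the fitted model. Averaging over splits leaves the expectation unchanged and can only reduce the variance due to the random split.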