Random forests for spatially or serially correlated data - Abhirup Datta
Speaker: Dr. Abhirup Datta
Johns Hopkins University, Department of Biostatistics
Random forests for spatially or serially correlated data
Random forests (RF) are widely used for estimating regression functions, but little attention has been paid to the impact of spatial or serial correlation in the data on RF. RF builds decision trees from intra-node means and variances, ignoring the dependence of data across nodes. Moreover, under correlation, the resampling used in RF violates the principles of the bootstrap. These shortcomings degrade the performance of RF for dependent data.
We propose RF-GLS, a novel and well-principled extension of RF for dependent data, in the same way that generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) for linear models. Exploiting the representation of regression trees as a recursive OLS optimization, we propose using a GLS loss that explicitly accounts for spatial/serial autocorrelation in estimation. The GLS loss also ensures resampling of uncorrelated contrasts. RF becomes a special case of RF-GLS with an identity working covariance matrix.
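The OLS-to-GLS idea for a single candidate split can be illustrated with a small sketch. This is not the authors' implementation: the function names, the AR(1) working covariance, and the simulated data are all assumptions made for illustration. The point it shows is that node means are the OLS solution on a node-membership matrix, and that replacing the squared-error loss with a GLS quadratic form reduces to the usual case when the working covariance is the identity.

```python
import numpy as np

def membership_matrix(x, threshold):
    """Binary n x 2 matrix: column 0 marks the left node, column 1 the right node."""
    left = x <= threshold
    return np.column_stack([left, ~left]).astype(float)

def gls_fit_and_loss(Z, y, Sigma):
    """GLS node estimates and GLS loss for a working covariance Sigma (hypothetical helper).

    beta = (Z' Sigma^{-1} Z)^{-1} Z' Sigma^{-1} y
    loss = (y - Z beta)' Sigma^{-1} (y - Z beta)
    """
    Q_Z = np.linalg.solve(Sigma, Z)               # Sigma^{-1} Z
    beta = np.linalg.solve(Z.T @ Q_Z, Q_Z.T @ y)  # Sigma is symmetric
    resid = y - Z @ beta
    loss = resid @ np.linalg.solve(Sigma, resid)
    return beta, loss

# Illustrative data (assumed): a step function plus noise on a regular grid.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 1.0, n)
y = (x > 0.5).astype(float) + 0.3 * rng.standard_normal(n)

# Assumed AR(1) working covariance: Sigma[i, j] = rho ** |i - j|.
rho = 0.7
idx = np.arange(n)
Sigma_ar1 = rho ** np.abs(idx[:, None] - idx[None, :])

Z = membership_matrix(x, threshold=0.5)
beta_gls, loss_gls = gls_fit_and_loss(Z, y, Sigma_ar1)

# With an identity working covariance the estimates are the plain node means.
beta_ols, loss_ols = gls_fit_and_loss(Z, y, np.eye(n))
```

In a tree-growing loop, a loss of this kind would be evaluated for each candidate threshold and the split with the largest reduction kept; that loop is omitted here.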
For spatial data, RF-GLS combines with Gaussian processes (GP) for spatial prediction (kriging) and avoids large GP computations by using a sparse GP to ensure linear time complexity. We demonstrate, using extensive numerical experiments, the benefits of RF-GLS over RF in both estimation and prediction under dependence. We also establish consistency of RF-GLS under beta-mixing dependence, which subsumes spatial Matérn GPs and autoregressive time series.
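As a rough illustration of the spatial prediction step, the sketch below kriges the residuals of a fitted mean function using a dense GP solve. It is a simplification under stated assumptions: the exponential covariance, its parameters, and the helper names are hypothetical, and the dense solve costs cubic time in the number of locations, so it does not reproduce the sparse GP approach the abstract refers to.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Exponential covariance sigma2 * exp(-phi * distance) between two sets of locations."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def krige_residuals(S_train, resid, S_new, sigma2=1.0, phi=1.0, tau2=0.1):
    """Simple kriging of zero-mean residuals; tau2 is an assumed nugget (noise) variance."""
    C = exp_cov(S_train, S_train, sigma2, phi) + tau2 * np.eye(len(S_train))
    c0 = exp_cov(S_new, S_train, sigma2, phi)
    return c0 @ np.linalg.solve(C, resid)

# Illustrative use (assumed data): residuals from some fitted mean function m_hat.
rng = np.random.default_rng(1)
S_train = rng.uniform(size=(20, 2))   # training locations in the unit square
resid = rng.standard_normal(20)       # stand-in for y - m_hat(x) at those locations
S_new = rng.uniform(size=(5, 2))      # locations at which to predict

spatial_part = krige_residuals(S_train, resid, S_new)
# A spatial prediction would then be m_hat(x_new) + spatial_part.
```

With the nugget set to zero, the kriged value at a training location reproduces the observed residual there, which is a quick sanity check on the helper.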