Random forests for spatially or serially correlated data - Abhirup Datta

Oct 5, 2020 - 11:00 AM

Speaker:  Dr. Abhirup Datta

Johns Hopkins University, Department of Biostatistics 

Random forests for spatially or serially correlated data

Random forests (RF) are widely used for estimating regression functions, but little attention has been paid to the impact of spatial or serial correlation in the data on RF. RF builds decision trees from intra-node means and variances, ignoring dependence among observations across nodes. Moreover, under correlation, the resampling used in RF violates the principles of the bootstrap. These shortcomings degrade the performance of RF on dependent data.

We propose RF-GLS, a novel and well-principled extension of RF for dependent data, in the same way that generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) for linear models. Exploiting the representation of regression trees as a recursive OLS optimization, we propose using a GLS loss that explicitly accounts for spatial or serial autocorrelation in estimation. The GLS loss also ensures resampling of uncorrelated contrasts. RF becomes a special case of RF-GLS with an identity working covariance matrix.
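
The contrast between the OLS and GLS split criteria can be illustrated with a minimal sketch. It is not the speaker's implementation: it evaluates a single candidate split of one node, and the function names (ols_split_loss, gls_split_loss, ar1_precision), the AR(1) working correlation, and the simulated data are all assumptions made for illustration. The matrix Q stands for the inverse of the working covariance matrix.

```python
import numpy as np

def ols_split_loss(y, left):
    """CART-style loss: squared deviations from each child's own mean."""
    right = ~left
    return float(((y[left] - y[left].mean()) ** 2).sum()
                 + ((y[right] - y[right].mean()) ** 2).sum())

def gls_split_loss(y, left, Q):
    """GLS analogue: (y - Z b)' Q (y - Z b), where b is the GLS estimate."""
    # Z is the node-membership matrix: one indicator column per child node.
    Z = np.column_stack([left, ~left]).astype(float)
    b = np.linalg.solve(Z.T @ Q @ Z, Z.T @ Q @ y)
    r = y - Z @ b
    return float(r @ Q @ r)

def ar1_precision(n, rho):
    """Inverse of the AR(1) correlation matrix with entries rho**|i - j|."""
    idx = np.arange(n)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    return np.linalg.inv(corr)

# Simulated example (hypothetical data): a step function plus AR(1) noise.
rng = np.random.default_rng(0)
n = 60
x = np.linspace(0.0, 1.0, n)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(scale=0.3)
y = np.where(x < 0.5, 0.0, 1.0) + noise

left = x < 0.5  # candidate split at x = 0.5
loss_ols = ols_split_loss(y, left)
loss_gls_identity = gls_split_loss(y, left, np.eye(n))
loss_gls_ar1 = gls_split_loss(y, left, ar1_precision(n, 0.8))
```

With Q set to the identity, the GLS estimate reduces to the two child means, so loss_gls_identity matches loss_ols up to floating-point error. This mirrors the statement that RF is the special case with an identity working covariance matrix.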

For spatial data, RF-GLS combines with Gaussian Processes (GP) for spatial prediction (kriging) and avoids large-scale GP computations by using a sparse GP, which ensures linear time complexity. Using extensive numerical experiments, we demonstrate the benefits of RF-GLS over RF in both estimation and prediction under dependence. We also establish consistency of RF-GLS under beta-mixing dependence, which subsumes spatial Matérn GP and autoregressive time series.
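
The idea of combining a mean estimate with GP-based spatial prediction can be sketched as kriging of residuals. This is a minimal sketch under stated assumptions, not the method presented in the talk: mean_fn is a hypothetical stand-in for a fitted regression function, the covariance is exponential (the Matérn family with smoothness 1/2), the parameters sigma2, phi and tau2 are treated as known, and the dense solve costs cubic time instead of the linear time a sparse GP would give.

```python
import numpy as np

def exp_cov(a, b, sigma2, phi):
    """Exponential covariance sigma2 * exp(-phi * d) between two location sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def krige_with_mean(s_train, y_train, mean_train, s_new, mean_new,
                    sigma2, phi, tau2):
    """Predict as mean + kriged residual; tau2 is the nugget (noise) variance."""
    resid = y_train - mean_train
    C = exp_cov(s_train, s_train, sigma2, phi) + tau2 * np.eye(len(y_train))
    c0 = exp_cov(s_new, s_train, sigma2, phi)
    return mean_new + c0 @ np.linalg.solve(C, resid)

def mean_fn(x):
    """Hypothetical stand-in for an estimated regression function."""
    return np.sin(2.0 * np.pi * x)

# Simulated data (hypothetical): 30 locations in the unit square, one covariate.
rng = np.random.default_rng(1)
s_train = rng.uniform(size=(30, 2))
x_train = rng.uniform(size=30)
w = np.linalg.cholesky(exp_cov(s_train, s_train, 1.0, 5.0)) @ rng.normal(size=30)
y_train = mean_fn(x_train) + w + rng.normal(scale=0.3, size=30)

s_new = rng.uniform(size=(5, 2))
x_new = rng.uniform(size=5)
pred_new = krige_with_mean(s_train, y_train, mean_fn(x_train),
                           s_new, mean_fn(x_new), 1.0, 5.0, 0.09)

# Sanity check: with no nugget, kriging interpolates the training responses.
pred_at_train = krige_with_mean(s_train, y_train, mean_fn(x_train),
                                s_train, mean_fn(x_train), 1.0, 5.0, 0.0)
```

With tau2 set to zero, the predictor reproduces y_train at the training locations, which is a quick way to check that the residual-kriging step is wired correctly.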