Dept. Seminar - Giles Hooker

Oct 22, 2018 - 4:10 PM
to Oct 22, 2018 - 5:00 PM

Giles Hooker
Cornell University

 

Decision Trees and CLT's: Inference and Machine Learning

This talk develops methods of statistical inference for the popular machine learning methods of bagging and Random Forests. Our goal is to provide a limiting normal distribution for the predictions made by these methods. This result can then be used to provide a formal statistical test of the relevance of particular input features, or of the structure of the underlying relationship more generally. We show that when the bootstrap procedure in ensemble methods is replaced by sub-sampling, predictions from these methods can be analyzed using the theory of U-statistics. Moreover, the limiting normal distribution has a variance that can be estimated within the sub-sampling structure. Using this result, we can compare the predictions made by a model learned with a feature of interest to those made by a model learned without it, and ask whether the differences between them could have arisen by chance.
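The sub-sampling idea above can be illustrated with a minimal sketch: each base learner is fit on a subsample drawn without replacement (rather than a bootstrap resample), and the ensemble prediction is the average over learners. The base learner here is a one-split regression stump standing in for a full decision tree, and the reported spread is a crude Monte Carlo standard deviation of the subsample predictions, not the U-statistic variance estimator developed in the talk; all function names and parameters are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, y):
    # One-split regression stump on feature 0: a stand-in for a decision tree.
    t = np.median(X[:, 0])
    left_mask = X[:, 0] <= t
    left = y[left_mask].mean() if left_mask.any() else y.mean()
    right = y[~left_mask].mean() if (~left_mask).any() else y.mean()
    return t, left, right

def predict_stump(model, x):
    t, left, right = model
    return left if x[0] <= t else right

def subsampled_ensemble(X, y, x_test, n_learners=500, k=50, seed=None):
    # Each learner sees a subsample drawn WITHOUT replacement
    # (sub-sampling in place of the bootstrap, as in the abstract).
    rng = np.random.default_rng(seed)
    preds = np.empty(n_learners)
    for b in range(n_learners):
        idx = rng.choice(len(X), size=k, replace=False)
        preds[b] = predict_stump(fit_stump(X[idx], y[idx]), x_test)
    # Ensemble prediction, plus the raw spread of the subsample
    # predictions (the talk's estimator instead exploits the
    # U-statistic structure induced by shared observations).
    return preds.mean(), preds.std(ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] + 0.1 * rng.normal(size=400)
mean, spread = subsampled_ensemble(X, y, np.array([1.0, 0.0]), seed=1)
```

Because the subsamples are drawn without replacement, the averaged prediction is (for fixed subsample size) a Monte Carlo approximation of a U-statistic, which is what makes the limiting-normal theory in the talk applicable.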

By evaluating the model at a structured set of points, we can also ask whether it differs significantly from an additive model. We demonstrate these results in an application to citizen-science data collected by Cornell's Laboratory of Ornithology. Time permitting, extensions to gradient boosting will be discussed.

 


Refreshments at 3:45pm in Snedecor 2101.