Daniela Witten, University of Washington: Selective inference for trees
Abstract: As datasets grow in size, the focus of data collection has increasingly shifted away from testing pre-specified hypotheses and towards hypothesis generation. Researchers are often interested in performing an exploratory data analysis to generate hypotheses, and then testing those hypotheses on the same data. Unfortunately, this type of 'double dipping' can lead to highly inflated Type I error rates. In this talk, I will consider double dipping on trees. First, I will focus on trees generated via hierarchical clustering, and will consider testing the null hypothesis of equality of cluster means. I will propose a test for a difference in means between estimated clusters that accounts for the cluster estimation process, using a selective inference framework. Second, I will consider trees generated using the CART procedure, and will again use selective inference to conduct inference on the means of the terminal nodes. Applications include single-cell RNA-sequencing data and the Box Lunch Study. This is collaborative work with Lucy Gao (U. Waterloo), Anna Neufeld (U. Washington), and Jacob Bien (USC).
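To make the double-dipping problem concrete, here is a minimal simulation sketch. It is not the selective test proposed in the talk; it only illustrates the failure of the naive approach. The assumptions are mine: one-dimensional standard normal data with no true clusters, Ward linkage cut into two clusters, and an ordinary two-sample t-test applied to the estimated clusters. The function name `naive_rejection_rate` and all parameter values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import ttest_ind


def naive_rejection_rate(n_sims=200, n=30, alpha=0.05, seed=0):
    """Fraction of null datasets where a naive t-test rejects after clustering."""
    rng = np.random.default_rng(seed)
    rejections = 0
    valid = 0
    for _ in range(n_sims):
        # Null data: every observation has the same mean, so no true clusters exist.
        x = rng.normal(size=(n, 1))
        # Step 1 (exploration): estimate two clusters by hierarchical clustering.
        labels = fcluster(linkage(x, method="ward"), t=2, criterion="maxclust")
        a, b = x[labels == 1, 0], x[labels == 2, 0]
        # Skip degenerate splits where the t-test is undefined.
        if min(len(a), len(b)) < 2:
            continue
        valid += 1
        # Step 2 (testing on the same data): naive test that ignores
        # the fact that the groups were chosen by looking at the data.
        rejections += ttest_ind(a, b).pvalue < alpha
    return rejections / valid


if __name__ == "__main__":
    print(f"Naive rejection rate under the null: {naive_rejection_rate():.2f}")
```

A valid test would reject in roughly a fraction `alpha` of these null datasets. Under these assumptions the naive test rejects far more often, because the clusters were constructed to be well separated before the test was run.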