Seminar, Edgar Dobriban, Leveraging synthetic data in statistical inference

Speaker: Edgar Dobriban, Associate Professor of Statistics and Data Science, University of Pennsylvania
Title: Leveraging synthetic data in statistical inference
Abstract: Synthetic data, for instance data generated by foundation models, may offer great opportunities to boost sample sizes in statistical analysis. However, the distribution of synthetic data may differ from that of the real data, creating a risk of faulty inferences. Motivated by these observations, we study how to use synthetic or auxiliary data in statistical inference problems ranging from predictive inference (conformal prediction) to hypothesis testing. We develop methods that leverage synthetic or auxiliary data in addition to real data. If the synthetic data distribution is similar to that of the real data, our methods improve precision. At the same time, our methods maintain a guardrail level of coverage even if the synthetic data distribution is arbitrarily far from that of the real data. We illustrate our methods with a variety of examples ranging from AI to the medical domain.