Biostatistics Department Seminar
Title: Efficient Evaluation and Monitoring of Machine Learning Systems
Abstract: Rigorous evaluation is central to the safe and effective deployment of machine-learning (ML) systems in healthcare, but these evaluations are often time-intensive and expensive to conduct. In this talk, I will discuss recent methodological research aimed at making rigorous evaluations more efficient, without sacrificing statistical rigor.
First, I will discuss the challenge of scaling expert evaluation: "gold standard" labels are often required to monitor ML performance, but are expensive to obtain. A popular recent approach, "Power-Tuned Prediction-Powered Inference" (PPI++), tries to improve efficiency by combining gold-standard labels with "pseudo-labels" obtained from an automated system (e.g., an LLM). Prior research claims a kind of "free lunch", showing that PPI++ is asymptotically at least as efficient as relying solely on expert-provided labels. In recent work, we demystify this claim, showing that no such "free lunch" exists in finite samples. Rather, we derive a threshold of pseudo-label accuracy required for PPI++ to improve the efficiency of statistical estimation, providing clear guidance to practitioners on the assumptions required to use this approach.
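For intuition, here is a minimal sketch of the PPI++ estimator in the simplest setting of estimating a population mean. This illustrates the general technique rather than the specific analysis in the talk; the function name is hypothetical, and the power-tuning formula shown is the standard one for the mean case.

```python
import numpy as np

def ppi_plus_plus_mean(y_labeled, f_labeled, f_unlabeled):
    """Sketch of the PPI++ estimator for a population mean.

    y_labeled   : gold-standard labels on the small labeled set (size n)
    f_labeled   : pseudo-labels (e.g., LLM outputs) on that same labeled set
    f_unlabeled : pseudo-labels on the large unlabeled set (size N)
    """
    n, N = len(y_labeled), len(f_unlabeled)

    # Power-tuning parameter: shrinks toward 0 when the pseudo-labels are
    # only weakly correlated with the gold-standard labels, so inaccurate
    # pseudo-labels are down-weighted rather than trusted outright.
    cov = np.cov(y_labeled, f_labeled, ddof=1)[0, 1]
    var_f = np.var(f_labeled, ddof=1)
    lam = (N / (N + n)) * cov / var_f

    # Pseudo-label mean on the unlabeled data, plus a bias correction
    # estimated from the labeled data.
    return lam * np.mean(f_unlabeled) + np.mean(y_labeled - lam * f_labeled)
```

Whether this combined estimator actually beats the labeled-data-only mean in finite samples depends on how accurate the pseudo-labels are, which is the question the first paper addresses.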
Second, I will focus on the efficient re-evaluation of the causal impact of deploying ML systems. Randomized clinical trials (RCTs) remain the gold standard for assessing how ML systems affect clinical outcomes, but they often take substantial time and resources to conduct. Since ML models frequently need to be updated over time, iteratively validating each update via a new RCT is impractical. In recent work, we propose a framework that, under certain assumptions, allows data from a prior RCT to be re-used to estimate or bound the causal impact of updated models, even though those models were not included in the original trial. Our approach accounts for the deterministic nature of ML predictions and for the role of user trust in determining impact.
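To give a flavor of the idea, the sketch below is a simplified illustration under strong assumptions, not the exact estimator from the paper: because ML predictions are deterministic functions of baseline patient features, one might assume that an updated model which issues the same prediction as the trialed model for a given patient induces the same outcome, and fall back on worst- and best-case outcome bounds where the two models disagree. The column names and bounding logic are hypothetical.

```python
import pandas as pd

def bound_updated_effect(trial, y_min=0.0, y_max=1.0):
    """Illustrative bounds on the average effect of deploying an *updated*
    model, re-using data from an RCT that tested the *original* model.

    `trial` is assumed to have columns:
      arm      -- 'ml' (original model deployed) or 'control'
      outcome  -- observed clinical outcome, assumed to lie in [y_min, y_max]
      pred_old -- the trialed model's (deterministic) prediction
      pred_new -- the updated model's prediction on the same baseline features
    """
    ml = trial[trial.arm == "ml"]
    agree = ml.pred_old == ml.pred_new          # predictions unchanged
    p_agree = agree.mean()
    mean_agree = ml.loc[agree, "outcome"].mean()

    # Where predictions agree, assume the outcome under the updated deployment
    # matches the outcome observed under the original deployment; where they
    # disagree, use worst/best-case outcome bounds.
    lower = p_agree * mean_agree + (1 - p_agree) * y_min
    upper = p_agree * mean_agree + (1 - p_agree) * y_max

    control_mean = trial.loc[trial.arm == "control", "outcome"].mean()
    return lower - control_mean, upper - control_mean
```

The bounds tighten as the updated model agrees more often with the trialed model; the framework in the talk additionally addresses how user trust mediates the effect of the predictions on outcomes.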
This talk will be based on the following papers:
- No Free Lunch: Non-Asymptotic Analysis of Prediction Powered Inference
- Just Trial Once: Ongoing Causal Validation of Machine Learning Models
Speaker
Michael Oberst is an Assistant Professor of Computer Science at Johns Hopkins University. His research focuses on safe and effective machine learning in healthcare, using tools from causal inference and statistics. His work has been published at machine learning venues such as NeurIPS, ICML, EMNLP, AISTATS, and UAI. Prior to joining Johns Hopkins, he was a postdoctoral researcher at Carnegie Mellon University in the Machine Learning Department, and obtained his PhD in Computer Science at the Massachusetts Institute of Technology.
Zoom Registration
If you would like to join via Zoom, please register here.
2025-2026 Monday Seminar Series
All seminars are held at 12:05 PM, both onsite and via Zoom. View all seminar information here.