STATS 357 / MS&E 330: Reliability and Validity in Artificial Intelligence

Tijana Zrnic, Stanford University, Spring 2026


Announcements


Lectures

Mon/Wed 9:30am-10:50am, McCullough 122


Staff

Instructor: Tijana Zrnic

Teaching assistant: Reese Feldmeier


Course description

This course examines the principles and methods required to make artificial intelligence (AI) systems reliable and scientifically sound. Topics include evaluation and benchmarking, notions of validity, distribution shift, predictive inference, AI-assisted statistical inference, data attribution, and beyond. Problem sets will involve both mathematical components and coding projects.

Prerequisites include mathematical maturity in probability, statistics, and optimization, and proficiency in Python.
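To give a flavor of the coding components, here is a minimal, purely illustrative sketch of split conformal prediction, one of the listed topics. It is not taken from the course materials; the data, predictor, and all names are hypothetical, and it uses only the Python standard library.

```python
import math
import random

random.seed(1)

# Toy data: y = x + Gaussian noise. The "model" is the identity predictor.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, x + random.gauss(0, 1)) for x in xs]
predict = lambda x: x

# Split conformal prediction: hold out a calibration set and compute
# absolute residuals (the conformity scores) on it.
calib, test = data[:100], data[100:]
scores = sorted(abs(y - predict(x)) for x, y in calib)

# Quantile with the finite-sample correction ceil((n+1)(1-alpha)) / n.
alpha = 0.1
n = len(scores)
q = scores[min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)]

# The interval [predict(x) - q, predict(x) + q] covers y with probability
# at least 1 - alpha, marginally, assuming exchangeable data.
covered = sum(predict(x) - q <= y <= predict(x) + q for x, y in test)
print(f"empirical coverage on test set: {covered / len(test):.2f}")
```

The key design point is that the calibration set is disjoint from any data used to fit the predictor, which is what makes the coverage guarantee distribution-free.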


Syllabus

Lecture Date Topics Reading
1 Mar 30 Benchmarks; Holdout method [1, Ch. 3 & 4]
2 Apr 1 Cross-validation; Bootstrap [1, Ch. 4; 2; 3]
3 Apr 6 Model selection & selection bias; Overfitting pt. 1: reward hacking [1, Ch. 4 & 5; 4]
4 Apr 8 Overfitting pt. 2: benchmark contamination [5]
5 Apr 13 Internal, external, & construct validity [6; 7; 8]
6 Apr 15 Frontier lecture see signup sheet
7 Apr 20 Distribution shift [9, Ch. 1 & 6; 10]
8 Apr 22 Predictive inference; Conformal prediction [11]
9 Apr 27 Predictive inference under distribution shift TBD
10 Apr 29 Calibration TBD
11 May 4 Multicalibration TBD
12 May 6 Frontier lecture TBD
  May 11 No class; Extended office hours instead  
13 May 13 AI for science; Prediction-powered inference (PPI) TBD
14 May 18 AI-assisted annotation TBD
15 May 20 Data attribution TBD
  May 25 Memorial Day (no class)  
16 May 27 Frontier lecture TBD

“Frontier lectures” will consist of student presentations of frontier papers related to the class topics.
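The first two lectures cover the holdout method and cross-validation. As a hedged sketch of these ideas (hypothetical toy data and a deliberately simple least-squares "model", not taken from the course materials, standard library only):

```python
import random

random.seed(0)

# Toy regression data: y = 2x + Gaussian noise (purely illustrative).
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

def fit_slope(train):
    # "Model": least-squares slope for a line through the origin.
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mse(slope, test):
    return sum((y - slope * x) ** 2 for x, y in test) / len(test)

# Holdout method: estimate test error from a single train/test split.
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]
holdout_err = mse(fit_slope(train), holdout)

# K-fold cross-validation: average the test error over K disjoint folds,
# each point serving as test data exactly once.
def k_fold_cv(data, k=5):
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        test = folds[i]
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        errs.append(mse(fit_slope(train), test))
    return sum(errs) / k

cv_err = k_fold_cv(data)
print(f"holdout MSE: {holdout_err:.2f}, 5-fold CV MSE: {cv_err:.2f}")
```

Both estimates should hover near the noise variance of 1 here; the CV estimate averages over folds and is typically less variable than a single holdout split.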

[1] M. Hardt. The Emerging Science of Machine Learning Benchmarks. https://mlbenchmarks.org, 2025.
[2] S. Bates, T. Hastie, R. Tibshirani. Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 2024.
[3] S. Wager. Cross-Validation, Risk Estimation, and Model Selection: Comment on a Paper by Rosset and Tibshirani. Journal of the American Statistical Association, 2020.
[4] L. Gao, J. Schulman, J. Hilton. Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning (ICML), 2023.
[5] Y. Oren, N. Meister, N. Chatterji, F. Ladhak, T. B. Hashimoto. Proving Test Set Contamination in Black Box Language Models. International Conference on Learning Representations (ICLR), 2024.
[6] O. Salaudeen, A. Reuel, A. Ahmed, S. Bedi, Z. Robertson, S. Sundar, B. Domingue, A. Wang, S. Koyejo. Measurement to Meaning: A Validity-Centered Framework for AI Evaluation. 2025.
[7] M. Mancoridis, K. Vafa, B. Weeks, S. Mullainathan. Potemkin Understanding in Large Language Models. International Conference on Machine Learning (ICML), 2025.
[8] J. D. Gaebler, C. Isley, C. Avery, S. Goel. Reassessing the Role of Standardized Tests in University Admissions. 2026.
[9] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.
[10] Z. Lipton, Y.-X. Wang, A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. International Conference on Machine Learning (ICML), 2018.
[11] A. N. Angelopoulos, S. Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. Foundations and Trends in Machine Learning, 2023.


Grading

If you are taking the class for a letter grade:
Homework (40%): four biweekly problem sets with math and coding problems
Frontier lecture presentation (10%): one ~10-minute group presentation of a paper of your choice
Final project (50%): group final project on a topic broadly related to the class

If you are taking the class for CR/NC, you do not need to give a frontier lecture presentation; your grade is based only on the homework and the final project.


Logistics

Homework and lecture slides will be distributed on Canvas. For homework submissions, we will use Gradescope. For Q&A and class-related discussions, we will use Ed.