STATS 357 / MS&E 330: Reliability and Validity in Artificial Intelligence

Tijana Zrnic, Stanford University, Spring 2026


Announcements


Lectures

Mon/Wed 9:30am-10:50am, McCullough 122


Staff

Instructor: Tijana Zrnic

Teaching assistant: Reese Feldmeier


Course description

This course examines the principles and methods required to make artificial intelligence (AI) systems reliable and scientifically sound. Topics include evaluation and benchmarking, notions of validity, distribution shift, predictive inference, AI-assisted statistical inference, data attribution, and beyond. Problem sets will involve both mathematical components and coding projects.

Prerequisites include mathematical maturity in probability, statistics, and optimization, and proficiency in Python.
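To give a flavor of the coding components, here is a minimal, purely illustrative sketch of split conformal prediction, one of the listed topics. It is not taken from the course materials; the data, predictor, and all names are hypothetical, and it uses only the Python standard library.

```python
import math
import random

random.seed(1)

# Toy data: y = x + Gaussian noise. The "model" is the identity predictor.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, x + random.gauss(0, 1)) for x in xs]
predict = lambda x: x

# Split conformal prediction: hold out a calibration set and compute
# absolute residuals (the conformity scores) on it.
calib, test = data[:100], data[100:]
scores = sorted(abs(y - predict(x)) for x, y in calib)

# Quantile with the finite-sample correction ceil((n+1)(1-alpha)) / n.
alpha = 0.1
n = len(scores)
q = scores[min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)]

# The interval [predict(x) - q, predict(x) + q] covers y with probability
# at least 1 - alpha, marginally, assuming exchangeable data.
covered = sum(predict(x) - q <= y <= predict(x) + q for x, y in test)
print(f"empirical coverage on test set: {covered / len(test):.2f}")
```

The key design point is that the calibration set is disjoint from any data used to fit the predictor, which is what makes the coverage guarantee distribution-free.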


Syllabus

Lecture Date Topics Reading
1 Mar 30 Benchmarks; Holdout method [1, Ch. 3 & 4]
2 Apr 1 Cross-validation; Bootstrap [1, Ch. 4; 2; 3]
3 Apr 6 Model selection & selection bias; Overfitting pt. 1: reward hacking [1, Ch. 4 & 5; 4]
4 Apr 8 Overfitting pt. 2: benchmark contamination [5]
5 Apr 13 Internal, external, & construct validity [6; 7; 8]
6 Apr 15 Frontier lecture see signup sheet
7 Apr 20 Distribution shift [9, Ch. 1 & 6; 10]
8 Apr 22 Predictive inference; Conformal prediction [11]
9 Apr 27 Predictive inference under distribution shift TBD
10 Apr 29 Calibration TBD
11 May 4 Multicalibration TBD
12 May 6 Frontier lecture TBD
  May 11 No class; Extended office hours instead  
13 May 13 AI for science; Prediction-powered inference (PPI) TBD
14 May 18 AI-assisted annotation TBD
15 May 20 Data attribution TBD
  May 25 Memorial Day (no class)  
16 May 27 Frontier lecture TBD

“Frontier lectures” will consist of student presentations of frontier papers related to the class topics.
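The first two lectures cover the holdout method and cross-validation. As a hedged sketch of these ideas (hypothetical toy data and a deliberately simple least-squares "model", not taken from the course materials, standard library only):

```python
import random

random.seed(0)

# Toy regression data: y = 2x + Gaussian noise (purely illustrative).
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

def fit_slope(train):
    # "Model": least-squares slope for a line through the origin.
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mse(slope, test):
    return sum((y - slope * x) ** 2 for x, y in test) / len(test)

# Holdout method: estimate test error from a single train/test split.
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]
holdout_err = mse(fit_slope(train), holdout)

# K-fold cross-validation: average the test error over K disjoint folds,
# each point serving as test data exactly once.
def k_fold_cv(data, k=5):
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        test = folds[i]
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        errs.append(mse(fit_slope(train), test))
    return sum(errs) / k

cv_err = k_fold_cv(data)
print(f"holdout MSE: {holdout_err:.2f}, 5-fold CV MSE: {cv_err:.2f}")
```

Both estimates should hover near the noise variance of 1 here; the CV estimate averages over folds and is typically less variable than a single holdout split.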

[1] M. Hardt. The Emerging Science of Machine Learning Benchmarks. https://mlbenchmarks.org, 2025.
[2] S. Bates, T. Hastie, R. Tibshirani. Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 2024.
[3] S. Wager. Cross-Validation, Risk Estimation, and Model Selection: Comment on a Paper by Rosset and Tibshirani. Journal of the American Statistical Association, 2020.
[4] L. Gao, J. Schulman, J. Hilton. Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning (ICML), 2023.
[5] Y. Oren, N. Meister, N. Chatterji, F. Ladhak, T. B. Hashimoto. Proving Test Set Contamination in Black Box Language Models. International Conference on Learning Representations (ICLR), 2024.
[6] O. Salaudeen, A. Reuel, A. Ahmed, S. Bedi, Z. Robertson, S. Sundar, B. Domingue, A. Wang, S. Koyejo. Measurement to Meaning: A Validity-Centered Framework for AI Evaluation. 2025.
[7] M. Mancoridis, K. Vafa, B. Weeks, S. Mullainathan. Potemkin Understanding in Large Language Models. International Conference on Machine Learning (ICML), 2025.
[8] J. D. Gaebler, C. Isley, C. Avery, S. Goel. Reassessing the Role of Standardized Tests in University Admissions. 2026.
[9] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.
[10] Z. Lipton, Y.-X. Wang, A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. International Conference on Machine Learning (ICML), 2018.
[11] A. N. Angelopoulos, S. Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. Foundations and Trends in Machine Learning, 2023.


Grading

If you are taking the class for a letter grade:
Homework (40%): four biweekly problem sets with math and coding problems
Frontier lecture presentation (10%): one ~10-minute group presentation of a paper of your choice
Final project (50%): group final project on a topic broadly related to the class

If you are taking the class for CR/NC, you do not need to give a frontier lecture presentation; your grade is based only on the homework and the final project.


Logistics

Homework and lecture slides will be distributed on Canvas. For homework submissions, we will use Gradescope. For Q&A and class-related discussions, we will use Ed.