Tijana Zrnic, Stanford University, Spring 2026
Mon/Wed 9:30am-10:50am, McCullough 122
Instructor: Tijana Zrnic
Teaching assistant: Reese Feldmeier
This course examines the principles and methods required to make artificial intelligence (AI) systems reliable and scientifically sound. Topics include evaluation and benchmarking, notions of validity, distribution shift, predictive inference, AI-assisted statistical inference, data attribution, and beyond. Problem sets will involve both mathematical components and coding projects.
Prerequisites include mathematical maturity in probability, statistics, and optimization, as well as proficiency in Python.
| Lecture | Date | Topics | Reading |
|---|---|---|---|
| 1 | Mar 30 | Benchmarks; Holdout method | [1, Ch. 3 & 4] |
| 2 | Apr 1 | Cross-validation; Bootstrap | [1, Ch. 4; 2; 3] |
| 3 | Apr 6 | Model selection & selection bias; Overfitting pt. 1: reward hacking | [1, Ch. 4 & 5; 4] |
| 4 | Apr 8 | Overfitting pt. 2: benchmark contamination | [5] |
| 5 | Apr 13 | Internal, external, & construct validity | [6; 7; 8] |
| 6 | Apr 15 | Frontier lecture | see signup sheet |
| 7 | Apr 20 | Distribution shift | [9, Ch. 1 & 6; 10] |
| 8 | Apr 22 | Predictive inference; Conformal prediction | [11] |
| 9 | Apr 27 | Predictive inference under distribution shift | TBD |
| 10 | Apr 29 | Calibration | TBD |
| 11 | May 4 | Multicalibration | TBD |
| 12 | May 6 | Frontier lecture | TBD |
| – | May 11 | No class; extended office hours instead | – |
| 13 | May 13 | AI for science; Prediction-powered inference (PPI) | TBD |
| 14 | May 18 | AI-assisted annotation | TBD |
| 15 | May 20 | Data attribution | TBD |
| – | May 25 | Memorial Day (no class) | – |
| 16 | May 27 | Frontier lecture | TBD |
“Frontier lectures” will consist of student presentations of frontier papers related to the class topics.
[1] M. Hardt. The Emerging Science of Machine Learning Benchmarks. https://mlbenchmarks.org, 2025.
[2] S. Bates, T. Hastie, R. Tibshirani. Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 2024.
[3] S. Wager. Cross-Validation, Risk Estimation, and Model Selection: Comment on a Paper by Rosset and Tibshirani. Journal of the American Statistical Association, 2020.
[4] L. Gao, J. Schulman, J. Hilton. Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning (ICML), 2023.
[5] Y. Oren, N. Meister, N. Chatterji, F. Ladhak, T. B. Hashimoto. Proving Test Set Contamination in Black Box Language Models. International Conference on Learning Representations (ICLR), 2024.
[6] O. Salaudeen, A. Reuel, A. Ahmed, S. Bedi, Z. Robertson, S. Sundar, B. Domingue, A. Wang, S. Koyejo. Measurement to Meaning: A Validity-Centered Framework for AI Evaluation. 2025.
[7] M. Mancoridis, K. Vafa, B. Weeks, S. Mullainathan. Potemkin Understanding in Large Language Models. International Conference on Machine Learning (ICML), 2025.
[8] J. D. Gaebler, C. Isley, C. Avery, S. Goel. Reassessing the Role of Standardized Tests in University Admissions. 2026.
[9] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.
[10] Z. Lipton, Y.-X. Wang, A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. International Conference on Machine Learning (ICML), 2018.
[11] A. N. Angelopoulos, S. Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. Foundations and Trends in Machine Learning, 2023.
If you are taking the class for a letter grade:
Homework (40%): four biweekly problem sets with math and coding problems
Frontier lecture presentation (10%): one ~10-minute group presentation of a paper of your choice
Final project (50%): group final project on a topic broadly related to the class
If you are taking the class for CR/NC, you are not required to give a frontier lecture presentation; only the homework and the final project are required.
Homework and lecture slides will be distributed on Canvas. For homework submissions, we will use Gradescope. For Q&A and class-related discussions, we will use Ed.