# Reading List: Instructional Safety & LLM Evaluation
This list is curated to help you understand why evaluation matters, how safety is operationalized, and how rubrics fit into real LLM evaluation work.
## Core Concepts
1. **Anthropic — Helpful, Honest, and Harmless (HHH)**
   - *Why read this:* Foundational framework for thinking about AI safety. Explains the tradeoffs between being helpful vs. being safe, and why human judgment matters.
   - *Read before:* Human scoring (Step 3)
2. **OpenAI — Evals (Concepts & Examples)**
   - *Why read this:* See how a major AI lab structures evaluations. Focus on the pattern: tasks + criteria + judgment.
   - *Read before:* Reviewing tasks (Step 1)
3. **Microsoft — LLM-Rubric**
   - *Why read this:* The closest precedent to this project. Shows how to define and apply qualitative scoring dimensions.
   - *Read before:* Human scoring (Step 3)
4. **LLM-as-a-Judge**
   - *Why read this:* Understand why automated scoring has limitations. When do LLM judges work? Where do they fail? Why does disagreement matter?
   - *Read before:* Running the LLM judge (Step 4)
5. **Stanford HAI — HELM Evaluation Framework**
   - *Why read this:* See how professionals present multi-dimensional evaluation results. A good model for your analysis write-up.
   - *Read before:* Analysis and write-up (Step 5)
## API Documentation

If you need help with the test harness code:

- OpenAI Python SDK
- Anthropic Python SDK
- Google Generative AI (Gemini)
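Whichever SDK you pick, the harness follows the same basic loop: send each task prompt to the model, capture the response, and store it so scores can be attached later. Here is a minimal offline sketch of that loop; the `stub_model` function and the `EvalRecord` field names are illustrative stand-ins, not part of any of the SDKs above.

```python
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    """One task run: prompt in, response out, rubric scores attached later."""
    task_id: str
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)

def stub_model(prompt: str) -> str:
    # Stand-in for a real SDK call (e.g. a chat-completion request);
    # swap this out for your chosen provider's client.
    return f"(model response to: {prompt})"

def run_harness(tasks: list[tuple[str, str]]) -> list[EvalRecord]:
    """Run each (task_id, prompt) pair through the model and collect records."""
    return [
        EvalRecord(task_id=tid, prompt=p, response=stub_model(p))
        for tid, p in tasks
    ]

records = run_harness([("T1", "Explain photosynthesis to a 10-year-old.")])
print(records[0].task_id)  # -> T1
```

Keeping the model call behind a single function makes it easy to swap providers, and storing an empty `scores` dict up front means human scoring (Step 3) and the LLM judge (Step 4) can write into the same records.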
## Additional Reading (Optional)

For deeper background if you're interested:

- Khan Academy's 7-Step Approach to Prompt Engineering for Khanmigo — How Khan Academy designed their AI tutor's system prompt (this influenced our educational system prompt)
- Paper: An Evaluation of Khanmigo as a Computer-Assisted Language Learning App — A real-world evaluation of an instructional AI
- Video: Stanford CME295 - LLM Evaluation (Lecture 8) — An academic lecture on evaluation methods
- UK AI Safety Institute - Inspect Framework — How government safety researchers run evaluations (more advanced)
- DeepEval Documentation — A popular Python evaluation framework (if you want to explore tooling later)
## Glossary
Quick definitions for terms you'll encounter:
| Term | Definition |
|---|---|
| Epistemic calibration | How well a model's confidence matches its actual accuracy. A well-calibrated model says "I'm not sure" when it genuinely doesn't know. |
| Hallucination | When a model generates plausible-sounding but false information, often presented confidently. |
| Pedagogical harm | Teaching in ways that mislead, confuse, or endanger the learner—even if the model is trying to be helpful. |
| LLM-as-Judge | Using one language model to evaluate another's outputs. Useful for scale, but has known biases and limitations. |
| Rubric | A scoring guide that defines what "good" looks like on each dimension being evaluated. |
| Failure mode | A recurring pattern of mistakes. More useful than one-off errors because it reveals systematic issues. |
| HHH (Helpful, Honest, Harmless) | Anthropic's framework for AI assistant behavior. The three goals often create tradeoffs. |
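Several of these terms come together when you compare your human rubric scores against the LLM judge's (Steps 3 and 4). A simple starting point is the exact-agreement rate per rubric dimension: where it drops, look for a failure mode. The sketch below uses hypothetical scores on an assumed 1–5 rubric scale; the dimension names are illustrative.

```python
from collections import defaultdict

# Hypothetical scores keyed by (task_id, dimension), on a 1-5 rubric scale.
human_scores = {("T1", "accuracy"): 5, ("T1", "safety"): 4,
                ("T2", "accuracy"): 2, ("T2", "safety"): 5}
llm_scores   = {("T1", "accuracy"): 5, ("T1", "safety"): 5,
                ("T2", "accuracy"): 4, ("T2", "safety"): 5}

def agreement_by_dimension(human: dict, llm: dict) -> dict:
    """Fraction of tasks where the human and the LLM judge gave
    exactly the same score, computed separately per dimension."""
    hits, totals = defaultdict(int), defaultdict(int)
    for (task_id, dim), h in human.items():
        totals[dim] += 1
        if llm.get((task_id, dim)) == h:
            hits[dim] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}

print(agreement_by_dimension(human_scores, llm_scores))
# -> {'accuracy': 0.5, 'safety': 0.5}
```

In this toy data the judge inflated the accuracy score on T2 and the safety score on T1; spotting *patterns* of disagreement like these, rather than one-off mismatches, is what turns score tables into named failure modes.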