Reading List: Instructional Safety & LLM Evaluation

This list is curated to help you understand why evaluation matters, how safety is operationalized, and how rubrics fit into real LLM evaluation work.

Core Concepts

1. Anthropic — Helpful, Honest, and Harmless (HHH)

Why read this: Foundational framework for thinking about AI safety. Explains the tradeoff between being helpful and being safe, and why human judgment matters.

Read before: Human scoring (Step 3)

2. OpenAI — Evals (Concepts & Examples)

Why read this: See how a major AI lab structures evaluations. Focus on the pattern: tasks + criteria + judgment.

Read before: Reviewing tasks (Step 1)

3. Microsoft — LLM-Rubric

Why read this: Closest precedent to this project. Shows how to define and apply qualitative scoring dimensions.

Read before: Human scoring (Step 3)

4. LLM-as-a-Judge

Why read this: Understand why automated scoring has limitations. When do LLM judges work? Where do they fail? Why does disagreement matter?

Read before: Running the LLM judge (Step 4)

5. Stanford HAI — HELM Evaluation Framework

Why read this: See how professionals present multi-dimensional evaluation results. Good model for your analysis write-up.

Read before: Analysis and write-up (Step 5)

Additional Reading (Optional)

For deeper background if you're interested:

Glossary

Quick definitions for terms you'll encounter:

Epistemic calibration: How well a model's confidence matches its actual accuracy. A well-calibrated model says "I'm not sure" when it genuinely doesn't know.

Hallucination: When a model generates plausible-sounding but false information, often presented confidently.

Pedagogical harm: Teaching in ways that mislead, confuse, or endanger the learner—even if the model is trying to be helpful.

LLM-as-Judge: Using one language model to evaluate another's outputs. Useful for scale, but has known biases and limitations.

Rubric: A scoring guide that defines what "good" looks like on each dimension being evaluated.

Failure mode: A recurring pattern of mistakes. More useful than one-off errors because it reveals systematic issues.

HHH (Helpful, Honest, Harmless): Anthropic's framework for AI assistant behavior. The three goals often create tradeoffs.
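To make the Rubric and LLM-as-Judge entries concrete, here is a minimal sketch of a rubric represented as a data structure, with a helper that aggregates per-dimension scores. The dimension names, the 1-5 scale, and the `aggregate` function are all hypothetical illustrations, not taken from any of the readings above.

```python
# Hypothetical rubric: each dimension maps to a short scoring guide.
# The dimensions and 1-5 scale are illustrative only.
RUBRIC = {
    "accuracy": "1 = mostly false claims ... 5 = fully correct",
    "calibration": "1 = confidently wrong ... 5 = confidence matches accuracy",
    "safety": "1 = actively harmful advice ... 5 = no pedagogical harm",
}

def aggregate(scores: dict) -> float:
    """Average per-dimension scores (from a human rater or an LLM judge)
    into a single summary number, refusing partially scored items."""
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(scores.values()) / len(scores)

# One scored response: a human or LLM judge fills in one score per dimension.
print(aggregate({"accuracy": 4, "calibration": 3, "safety": 5}))  # 4.0
```

Whether those scores come from a human (Step 3) or an LLM judge (Step 4), the rubric itself stays fixed, which is what makes the two sources comparable.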
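Epistemic calibration can also be checked numerically: bucket predictions by stated confidence and compare each bucket's average confidence to its observed accuracy. A minimal sketch, with made-up toy data:

```python
from collections import defaultdict

def calibration_table(preds):
    """preds: list of (confidence in [0, 1], was_correct bool).
    Groups predictions into ten confidence bins and reports, per bin,
    mean stated confidence vs. observed accuracy. Large gaps between
    the two numbers indicate over- or under-confidence."""
    bins = defaultdict(list)
    for conf, correct in preds:
        bins[min(int(conf * 10), 9)].append((conf, correct))
    table = {}
    for b, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        table[b] = (round(mean_conf, 2), round(accuracy, 2))
    return table

# Toy data: the model says 0.9 but is right only half the time there,
# i.e. it is overconfident in its high-confidence bin.
preds = [(0.9, True), (0.9, False), (0.6, True), (0.6, True)]
print(calibration_table(preds))
```

This is the same idea you will apply qualitatively when scoring: does the model's expressed certainty match how often it is actually right?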