Reading List: Instructional Safety & LLM Evaluation

This list is curated to help you understand why evaluation matters, how safety is operationalized, and how rubrics fit into real LLM evaluation work.

Core Concepts

1. Anthropic — Helpful, Honest, and Harmless (HHH)

Why read this: Foundational framework for thinking about AI safety. Explains the tradeoff between being helpful and being safe, and why human judgment matters.

Read before: Human scoring (Step 3)

2. OpenAI — Evals (Concepts & Examples)

Why read this: See how a major AI lab structures evaluations. Focus on the pattern: tasks + criteria + judgment.

Read before: Reviewing tasks (Step 1)

3. Microsoft — LLM-Rubric

Why read this: Closest precedent to this project. Shows how to define and apply qualitative scoring dimensions.

Read before: Human scoring (Step 3)

4. LLM-as-a-Judge

Why read this: Understand why automated scoring has limitations. When do LLM judges work? Where do they fail? Why does disagreement matter?

Read before: Running the LLM judge (Step 4)

5. Stanford HAI — HELM Evaluation Framework

Why read this: See how professionals present multi-dimensional evaluation results. Good model for your analysis write-up.

Read before: Analysis and write-up (Step 5)

Additional Reading (Optional)

For deeper background if you're interested:

Glossary

Quick definitions for terms you'll encounter:

Epistemic calibration: How well a model's confidence matches its actual accuracy. A well-calibrated model says "I'm not sure" when it genuinely doesn't know.

Hallucination: When a model generates plausible-sounding but false information, often presented confidently.

Pedagogical harm: Teaching in ways that mislead, confuse, or endanger the learner—even if the model is trying to be helpful.

LLM-as-Judge: Using one language model to evaluate another's outputs. Useful for scale, but has known biases and limitations.

Rubric: A scoring guide that defines what "good" looks like on each dimension being evaluated.

Failure mode: A recurring pattern of mistakes. More useful than one-off errors because it reveals systematic issues.

HHH (Helpful, Honest, Harmless): Anthropic's framework for AI assistant behavior. The three goals often create tradeoffs.
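To make the Rubric and LLM-as-Judge entries concrete, here is a minimal sketch of a rubric represented as a data structure, with a helper that aggregates per-dimension scores. The dimension names, the 1-5 scale, and the `aggregate` function are all hypothetical illustrations, not taken from any of the readings above.

```python
# Hypothetical rubric: each dimension maps to a short scoring guide.
# The dimensions and 1-5 scale are illustrative only.
RUBRIC = {
    "accuracy": "1 = mostly false claims ... 5 = fully correct",
    "calibration": "1 = confidently wrong ... 5 = confidence matches accuracy",
    "safety": "1 = actively harmful advice ... 5 = no pedagogical harm",
}

def aggregate(scores: dict) -> float:
    """Average per-dimension scores (from a human rater or an LLM judge)
    into a single summary number, refusing partially scored items."""
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(scores.values()) / len(scores)

# One scored response: a human or LLM judge fills in one score per dimension.
print(aggregate({"accuracy": 4, "calibration": 3, "safety": 5}))  # 4.0
```

Whether those scores come from a human (Step 3) or an LLM judge (Step 4), the rubric itself stays fixed, which is what makes the two sources comparable.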
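Epistemic calibration can also be checked numerically: bucket predictions by stated confidence and compare each bucket's average confidence to its observed accuracy. A minimal sketch, with made-up toy data:

```python
from collections import defaultdict

def calibration_table(preds):
    """preds: list of (confidence in [0, 1], was_correct bool).
    Groups predictions into ten confidence bins and reports, per bin,
    mean stated confidence vs. observed accuracy. Large gaps between
    the two numbers indicate over- or under-confidence."""
    bins = defaultdict(list)
    for conf, correct in preds:
        bins[min(int(conf * 10), 9)].append((conf, correct))
    table = {}
    for b, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        table[b] = (round(mean_conf, 2), round(accuracy, 2))
    return table

# Toy data: the model says 0.9 but is right only half the time there,
# i.e. it is overconfident in its high-confidence bin.
preds = [(0.9, True), (0.9, False), (0.6, True), (0.6, True)]
print(calibration_table(preds))
```

This is the same idea you will apply qualitatively when scoring: does the model's expressed certainty match how often it is actually right?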