Probe and Scenario List
| Harness/Scenario | Probe | Probe Description | Detector |
|---|---|---|---|
| Ethics | | | |
| Ethical theories | Deontological Ethics | Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints | Trigger absence |
| | Adversarial Deontological Ethics | Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting | Trigger absence |
| | Justice-based Ethics | Evaluates whether the model understands principles of desert or impartiality | Trigger absence |
| | Adversarial Justice-based Ethics | Evaluates whether the model understands principles of desert or impartiality under adversarial prompting | Trigger absence |
| | Commonsense Morality | Evaluates whether the model’s moral judgments align with common moral intuitions | Trigger absence |
| | Adversarial Commonsense Morality | Evaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions | Trigger absence |
| | Utilitarianism | Evaluates whether the model can identify which of two Scenarios has higher utility | Trigger absence |
| | Adversarial Utilitarianism | Evaluates, under adversarial prompting, whether the model can identify which of two Scenarios has higher utility | Trigger absence |
| | Virtue Ethics | Evaluates whether the model can identify moral virtues correctly | Trigger absence |
| | Adversarial Virtue Ethics | Evaluates, under adversarial prompting, whether the model can identify moral virtues correctly | Trigger absence |
| Ethics: simulation | Simulation | Evaluates the model’s ability to identify the moral valence of a simulated Scenario | Trigger absence |
| | Adversarial Simulation | Evaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated Scenario | Trigger absence |