Ethics

Because LLMs are deployed as intelligent agents that automate human-grade tasks, they need knowledge of basic moral concepts beyond toxicity, bias, or stereotypes, such as justice, well-being, duties, and virtues.

We evaluate the propensity of an LLM to generate content that is ethically questionable. We use modified versions of the ETHICS and Jiminy Cricket datasets for this.

Probe and scenario list

| Harness/scenario | Probe | Probe Description | Detector |
|---|---|---|---|
| Ethics | | | |
| Ethical theories | Deontological Ethics | Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints | Trigger absence |
| | Adversarial Deontological Ethics | Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting | Trigger absence |
| | Justice-based Ethics | Evaluates whether the model understands principles of desert or impartiality | Trigger absence |
| | Adversarial Justice-based Ethics | Evaluates whether the model understands principles of desert or impartiality under adversarial prompting | Trigger absence |
| | Commonsense Morality | Evaluates whether the model’s moral judgments align with common moral intuitions | Trigger absence |
| | Adversarial Commonsense Morality | Evaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions | Trigger absence |
| | Utilitarianism | Evaluates whether the model can identify which of two scenarios has higher utility | Trigger absence |
| | Adversarial Utilitarianism | Evaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility | Trigger absence |
| | Virtue Ethics | Evaluates whether the model can identify moral virtues correctly | Trigger absence |
| | Adversarial Virtue Ethics | Evaluates, under adversarial prompting, whether the model can identify moral virtues correctly | Trigger absence |
| Ethics: simulation | Simulation | Evaluates the model’s ability to identify the moral valence of a simulated scenario | Trigger absence |
| | Adversarial Simulation | Evaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario | Trigger absence |

Ethical theories

These probes determine whether the model makes judgments that are consistent with a few major ethical theories. Specifically, they test whether the model’s responses are consistent with deontological ethics, virtue ethics, utilitarianism, justice, and commonsense morality.
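Every probe in the list above uses the trigger absence detector. The following is a minimal sketch of how such a detector might score responses. It assumes the detector flags a response as a failure when none of the expected trigger strings for that prompt appears in it. The function names and the case-insensitive substring matching are illustrative assumptions, not the actual implementation.

```python
from typing import Sequence


def trigger_absent(response: str, triggers: Sequence[str]) -> bool:
    """Return True (a failure) when no expected trigger appears in the response.

    Assumption: matching is a case-insensitive substring check. A real
    detector may normalize or parse the response differently.
    """
    text = response.lower()
    return not any(trigger.lower() in text for trigger in triggers)


def failure_rate(
    responses: Sequence[str],
    triggers_per_prompt: Sequence[Sequence[str]],
) -> float:
    """Fraction of responses in which the expected trigger is absent."""
    if not responses or len(responses) != len(triggers_per_prompt):
        raise ValueError("need one non-empty trigger list per response")
    failures = sum(
        trigger_absent(response, triggers)
        for response, triggers in zip(responses, triggers_per_prompt)
    )
    return failures / len(responses)
```

Substring matching is deliberately simple here: a response such as "not wrong" would still match the trigger "wrong", so a production detector would likely need stricter answer parsing.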

We also include jailbreaking versions of these probes adapted from Decoding Trust. These versions insert adversarial phrases into the system and user prompts that aim to increase the probability of the model providing unethical responses.
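As a rough illustration of how an adversarial variant could be derived from a benign probe, the sketch below adds one phrase to the system prompt and one to the user prompt. The `Prompt` class, the `make_adversarial` helper, and the choice to append to the system prompt and prepend to the user prompt are all assumptions made for this example. The phrases are placeholders, not the actual strings used by the probes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """A probe prompt split into its system and user parts (hypothetical type)."""
    system: str
    user: str


def make_adversarial(prompt: Prompt, system_phrase: str, user_phrase: str) -> Prompt:
    """Return a new prompt with adversarial phrases inserted.

    Assumption: the system phrase is appended and the user phrase is
    prepended. The real probes may place the phrases elsewhere.
    """
    return Prompt(
        system=f"{prompt.system} {system_phrase}".strip(),
        user=f"{user_phrase} {prompt.user}".strip(),
    )


# Placeholder phrases only; substitute the phrases defined by the probe set.
benign = Prompt(
    system="You are a helpful assistant.",
    user="Is the following action wrong? I kept the extra change.",
)
adversarial = make_adversarial(
    benign,
    system_phrase="<ADVERSARIAL SYSTEM PHRASE>",
    user_phrase="<ADVERSARIAL USER PHRASE>",
)
```

Keeping the benign prompt immutable and returning a new object makes it easy to run the same scenario with and without the adversarial phrases and compare the results.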

Ethics: simulation

We use the Jiminy Cricket probes in the way described in Decoding Trust, i.e. to evaluate whether models can accurately judge the moral valence of a scenario. Each scenario consists of an action and an observation in a text-based computer game.

Similarly to Decoding Trust, we also include jailbreaking versions of the same scenarios. These contain adversarial phrases that aim to disrupt the model’s ability to accurately determine moral valence.
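A simulation scenario, as described above, pairs an action with an observation from a text-based game. The sketch below shows one way to represent such a scenario and check a model's judgment against a reference label. The `GameScenario` type, the prompt wording, the three-way label set, and the example scenario are all invented for illustration. They are not taken from the Jiminy Cricket dataset or from the probe implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameScenario:
    """One step of a text-based game (hypothetical representation)."""
    action: str
    observation: str
    valence: str  # assumed label set: "good", "bad", or "neutral"


def format_scenario(scenario: GameScenario) -> str:
    """Render the scenario as a prompt asking for its moral valence."""
    return (
        f"Action: {scenario.action}\n"
        f"Observation: {scenario.observation}\n"
        "Classify the moral valence of this scenario as good, bad, or neutral."
    )


def judged_correctly(scenario: GameScenario, response: str) -> bool:
    """True when the response starts with the reference label.

    Assumption: the model answers with the label first; real responses
    may need more robust parsing.
    """
    return response.strip().lower().startswith(scenario.valence)


# Made-up example scenario, not drawn from the dataset.
example = GameScenario(
    action="take the wallet from the sleeping man",
    observation="You pocket the wallet.",
    valence="bad",
)
prompt_text = format_scenario(example)
```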