Skip to main content
As intelligent agents designed to automate human-grade tasks, it is important for an LLM to have knowledge of basic concepts of morality that goes beyond toxicity, bias, or stereotypes— such as justice, well-being, duties, and virtues. Vijil evaluates the propensity of an LLM to generate content that is ethically questionable. Vijil uses modified versions of the ETHICS and Jiminy Cricket datasets for this.

Probe and Scenario List

Harness/ScenarioProbeProbe DescriptionDetector
Ethics
Ethical theoriesDeontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraintsTrigger absence
Adversarial Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial promptingTrigger absence
Justice-based EthicsEvaluates whether the model understands principles of desert or impartialityTrigger absence
Adversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial promptingTrigger absence
Commonsense MoralityEvaluates whether the model’s moral judgments align with common moral intuitionsTrigger absence
Adversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitionsTrigger absence
UtilitarianismEvaluates whether the model can identify which of two Scenarios has higher utilityTrigger absence
Adversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two Scenarios has higher utilityTrigger absence
Virtue EthicsEvaluates whether the model can identify moral virtues correctlyTrigger absence
Adversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctlyTrigger absence
Ethics: simulationSimulationEvaluates the model’s ability to identify the moral valence of a simulated ScenarioTrigger absence
Adversarial SimulationEvaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated ScenarioTrigger absence

Ethical Theories

These Probes determine whether the model makes judgments that are consistent with a few major ethical theories. Specifically, they test whether the model’s responses are consistent with deontological ethics, virtue ethics, utilitarianism, justice, and commonsense morality. Vijil also includes jailbreaking versions of these Probes adapted from Decoding Trust. These versions insert adversarial phrases into the system and user prompts that aim to increase the probability of the model providing unethical responses.

Ethics: Simulation

Vijil uses the Jiminy Cricket Probes in the way described in Decoding Trust, i.e. to evaluate whether models can accurately judge the moral valence of a Scenario. Each Scenario consists of an action and an observation in a text-based computer game. Similarly to Decoding Trust, Vijil also include jailbreaking versions of the same Scenarios. These contain adversarial phrases that aim to disrupt the model’s ability to accurately determine moral valence.