Ethics

Because LLMs are intelligent agents designed to automate human-grade tasks, they need knowledge of basic moral concepts that goes beyond toxicity, bias, and stereotypes, such as justice, well-being, duties, and virtues. Vijil evaluates the propensity of an LLM to generate ethically questionable content, using modified versions of the ETHICS and Jiminy Cricket datasets.

Probe and Scenario List

Harness/Scenario	Probe	Probe Description	Detector
Ethics
Ethical Theories	Deontological Ethics	Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints	Trigger absence
	Adversarial Deontological Ethics	Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting	Trigger absence
	Justice-based Ethics	Evaluates whether the model understands principles of desert or impartiality	Trigger absence
	Adversarial Justice-based Ethics	Evaluates whether the model understands principles of desert or impartiality under adversarial prompting	Trigger absence
	Commonsense Morality	Evaluates whether the model’s moral judgments align with common moral intuitions	Trigger absence
	Adversarial Commonsense Morality	Evaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions	Trigger absence
	Utilitarianism	Evaluates whether the model can identify which of two Scenarios has higher utility	Trigger absence
	Adversarial Utilitarianism	Evaluates, under adversarial prompting, whether the model can identify which of two Scenarios has higher utility	Trigger absence
	Virtue Ethics	Evaluates whether the model can identify moral virtues correctly	Trigger absence
	Adversarial Virtue Ethics	Evaluates, under adversarial prompting, whether the model can identify moral virtues correctly	Trigger absence
Ethics: Simulation	Simulation	Evaluates the model’s ability to identify the moral valence of a simulated Scenario	Trigger absence
	Adversarial Simulation	Evaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated Scenario	Trigger absence

Ethical Theories

These Probes determine whether the model makes judgments that are consistent with several major ethical theories: deontological ethics, virtue ethics, utilitarianism, justice-based ethics, and commonsense morality. Vijil also includes jailbreaking versions of these Probes, adapted from Decoding Trust. These versions insert adversarial phrases into the system and user prompts to increase the probability that the model gives an unethical response.
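
As a rough illustration of how an adversarial variant of a Probe prompt could be derived, the sketch below appends one phrase to the system prompt and one to the user prompt. The `Prompt` class, the `make_adversarial` helper, the base prompt, and the placeholder phrases are all hypothetical and are not taken from Vijil or Decoding Trust; the actual phrases and their placement may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """A system/user prompt pair sent to the model under test (hypothetical type)."""
    system: str
    user: str


def make_adversarial(prompt: Prompt, system_phrase: str, user_phrase: str) -> Prompt:
    """Return a new Prompt with an adversarial phrase appended to each part.

    Appending is an assumption made for this sketch; a real harness may
    insert phrases at other positions.
    """
    return Prompt(
        system=f"{prompt.system} {system_phrase}".strip(),
        user=f"{prompt.user} {user_phrase}".strip(),
    )


# Hypothetical base prompt in the style of a commonsense-morality question.
base = Prompt(
    system="You are a helpful assistant.",
    user="Is this action wrong or not wrong? I told my friend the truth.",
)

# Placeholders stand in for the real adversarial phrases, which are not shown here.
adversarial = make_adversarial(
    base,
    system_phrase="<ADVERSARIAL SYSTEM PHRASE>",
    user_phrase="<ADVERSARIAL USER PHRASE>",
)

print(adversarial.system)
print(adversarial.user)
```

Keeping the base prompt immutable and returning a new object makes it easy to run the plain and adversarial versions of the same Probe side by side and compare the model's answers.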

Ethics: Simulation

Vijil uses the Jiminy Cricket Probes as described in Decoding Trust, that is, to evaluate whether models can accurately judge the moral valence of a Scenario. Each Scenario consists of an action and an observation in a text-based computer game. As in Decoding Trust, Vijil also includes jailbreaking versions of the same Scenarios. These contain adversarial phrases that aim to disrupt the model’s ability to accurately determine moral valence.
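
The action/observation structure of a Scenario can be sketched as a small data type plus a scoring loop. Everything here is an assumption for illustration: the `Scenario` class, the `good`/`bad`/`neutral` label set, the prompt wording, and the `judge` callable that stands in for a call to the model under test. It is not Vijil's implementation or its detector logic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Scenario:
    """One step of a text-based game with an assumed moral-valence label."""
    action: str
    observation: str
    label: str  # assumed label set: "good", "bad", or "neutral"


def render_prompt(scenario: Scenario) -> str:
    """Format a Scenario as a classification question (hypothetical wording)."""
    return (
        f"Action: {scenario.action}\n"
        f"Observation: {scenario.observation}\n"
        "Is the action good, bad, or neutral?"
    )


def accuracy(scenarios: Sequence[Scenario], judge: Callable[[str], str]) -> float:
    """Fraction of Scenarios where the judge's answer matches the label."""
    if not scenarios:
        return 0.0
    correct = sum(
        judge(render_prompt(s)).strip().lower() == s.label for s in scenarios
    )
    return correct / len(scenarios)


# Two made-up Scenarios; real ones come from the Jiminy Cricket dataset.
examples = [
    Scenario("take lamp from shop", "The shopkeeper shouts at you.", "bad"),
    Scenario("give bread to beggar", "The beggar thanks you.", "good"),
]


def always_bad(prompt: str) -> str:
    """Stub judge that ignores the prompt; replace with a real model call."""
    return "bad"


score = accuracy(examples, always_bad)
print(score)
```

The same `accuracy` function could be run over both the plain and the jailbreaking versions of the Scenarios to see how much the adversarial phrases degrade the model's judgments.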
