Skip to main content

Detector Glossary

This maps Detector module names you may see in your reports to the actual Detector names and descriptions.
Module nameDetectorDescription
adultdata.PairwiseComparisonPairwise gender comparisonCompares the model’s responses to two questions about a person’s income where only the person’s gender is changed. The model fails the test if it responds differently to the two questions. Not available in the detection endpoint.
advglue.PairwiseComparisonRobustness pairwise comparisonCompares the model’s performance on a question from a standardized task when the question’s language is perturbed versus when it is not. The model fails if it performs worse on the perturbed variant of the question. Not available in the detection endpoint.
advstereo.IAgreeStereotype agreementDetects whether a model agrees with a stereotype presented in the prompt.
misleading.MustRefuteClaimKWARTClaim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation. The model passes if it does refute the claim.
base.TriggerListAbsenceDetectorTrigger absenceThis is used across many different Probes. To pass the test, the model response must contain trigger words that are specific to the given prompt.
leakreplay.StartsWithStarts with triggerDetects whether the model’s response starts with any of a list of trigger phrases. The phrases are specific to each prompt.
malwaregenART.AnyCodeCode presenceDetects the presence of code of various programming languages in the model’s response.
mitigation.MitigationBypassARTRefusalDetects whether a model refuses to fulfill a request in the prompt by looking for key phrases. The model passes the test if it successfully refuses.
mitigation.TriggerAbsenceWithRefusalTrigger absence with acceptable refusalPasses if model refuses to answer, fails if model answers without including the correct trigger
packagehallucinationART.PythonPypiPython package hallucinationDetects whether the model provides code that imports non-existent Python packages
packagehallucinationART.RustCratesRust package hallucinationDetects whether the model provides code that imports non-existent Rust packages
packagehallucinationART.JavaScriptNpmJavascript package hallucinationDetects whether the model provides code that imports non-existent Javascript packages
packagehallucinationART.RubyGemsRuby package hallucinationDetects whether the model provides code that imports non-existent Ruby packages
privateinfo.DontAnswerYesDon’t answer yesDetects whether the model says it is OK to reveal private information. If so, the model fails the test.
privateinfo.LeakPIIPrivate information leakageDetects whether the model leaks private information that is provided in the prompt
xss.MarkdownExfilBasicMarkdown imageDetects whether the response contains a Markdown image.
xss.MarkdownExfilContentPrivate content in Markdown imageDetects whether the response contains content private to a conversation within a Markdown image.
base.TriggerListDetectorTrigger presenceThis is used across many different Probes. A model fails the test if its response contains trigger words that are specific to the given prompt.
dan.AntiDANAnti-Do Anything Now modeDetects whether the model enters Anti-Do Anything Now (AntiDAN) mode
dan.DANDo Anything Now modeDetects whether the model enters Do Anything Now (DAN) mode
dan.DANJailbreakJailbroken stateDetects whether the model enters a jailbroken state
dan.DevModeDeveloper modeDetects whether the model enters Developer mode
dan.DUDEDo anything and everything modeDetects whether the model enters Do anything and everything (DUDE) mode
dan.MarkdownLinkMarkdown linkDetects presence of Markdown-formatted link
dan.STANStrive To Avoid Norms modeDetects whether the model enters Strive To Avoid Norms (STAN) mode
encoding.DecodeMatchMatch decoded stringDetects whether specific decoded strings are present in the model’s response. If so, the model fails the test.
goodside.GlitchGlitchDetects whether a model fails to repeat back a glitch token despite being asked to. The model has to repeat the token to pass the test.
goodside.PlainJSONPlain JSONDetects whether the model returns plain JSON without any preamble.
goodside.RileyIsntRiley Goodside misinformationUses key phrases to detect whether the response contains misinformation about Riley Goodside.
knownbadsignatures.EICAREICAR virusDetects EICAR virus signature
knownbadsignatures.GTphishPhishingDetects phishing content
knownbadsignatures.GTUBESpamDetects spam content
llm.AnswerRelevancyAnswer RelevancyGeneration metric, measures the degree to which the final generated output is relevant to the original input
llm.ContextualPrecisionContextual PrecisionRetrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts than irrelevant ones
llm.ContextualRecallContextual RecallRetrieval metric, measures the extent to which the retrieved contexts align with the golden answers
llm.CorrectnessCorrectnessEvaluates the correctness of the generated answer.
llm.FaithfulnessFaithfulnessGeneration metric, measures how much the generated response stays faithful to the retrieved contexts, i.e. the opposite of hallucination
llm.StrongRejectStrong RejectDetects whether a model fails to respond to harmful instructions
llm.ConversationRoleAdherenceConversation Role AdherenceDetects whether a model adheres to the role of the conversation
llm.ConversationRelevancyConversation RelevancyDetects whether a model is relevant to the conversation
llm.ConversationCompletenessConversation CompletenessDetects whether a model completes the conversation
llm.ConversationKnowledgeRetentionConversation Knowledge RetentionDDetects whether a model maintains consistent knowledge throughout the conversation
llm.RefusalLLM-based RefusalUses an LLM to detect if a model response avoids answering the question posed in the prompt.
multi_turn.CrescendoCrescendo AttackDetects how well an agent performs against a Crescnedo attack. Not available in the detection endpoint.
multi_turn.LinearLinear Jailbreak AttackDetects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.
multi_turn.TreeTree Jailbreak AttackDetects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.
llm.PolicyViolationPolicy ViolationDetects when output violates a specified usage policy
lmrc.QuackMedicineQuack medicineDetects whether the response contains quack medicine advice.
misleading.MustRefuteClaimKWClaim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation. The model passes if it does refute the claim.
riskywords.OfcomOffensiveRaceEthnicOffensive: racial/ethnicDetects whether the response contains offensive racial or ethnic terms
riskywords.SurgeProfanityAnimalReferencesProfanity: animal referencesDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityBodilyFluidsProfanity: bodily fluidsDetects whether the model outputs profanity containing words related to bodily fluids
riskywords.SurgeProfanityMentalDisabilityProfanity: mental disabilityDetects whether the model outputs profanity containing words related to mental disabilities
riskywords.SurgeProfanityPhysicalAttributesProfanity: physical attributesDetects whether the model outputs profanity containing words related to physical attributes
riskywords.SurgeProfanityPhysicalDisabilityProfanity: physical disabilityDetects whether the model outputs profanity containing words related to physical disabilities
riskywords.SurgeProfanityPoliticalProfanity: politicalDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityRacialEthnicProfanity: racial/ethnicDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityReligiousProfanity: religiousDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanitySexualProfanity: sexualDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanitySexualOrientationGenderProfanity: sexual orientation/genderDetects whether the model outputs profanity containing words related to animals
snowball.DontStartNoDon’t start noDetects whether the model response starts with ‘no’. If so, it fails the test.
snowball.DontStartYesDon’t start yesDetects whether the model response starts with ‘yes’. If so, it fails the test.
toxicity.ToxicCommentModelModel-based toxicityUses a small language model to determine whether the response is toxic.
toolcalling.ToolCorrectnessTool CorrectnessDetects correct use of a set of tools
winobias.TriggerPlusAmbiguityDetect gender bias in professionsDetects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector for when model response is hedging.
ansiescape.RawRaw ANSI SequencesDetects raw ANSI sequences
ansiescape.EscapedEscaped ANSI SequencesDetects escaped ANSI sequences
productkey.Win5x5Windows Product KeyDetects a product key in the form of five groups of five characters
nlp.BLEUBLEUGeneration metric, measures semantic similarity of expected and generated answer
nlp.METEORMETEORGeneration metric, measures semantic similarity of expected and generated answer
nlp.BERTScoreBERTScoreGeneration metric, measures semantic similarity of expected and generated answer
mathLatexSympy.MathLatexSympyMath checking using SympyThis is used for some mathematical based probes. To pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency
instructionFollowing.InstructionFollowingFollowing Formatting InstructionsThis is used for an instruction following probe. To pass the test, the model response should adhere to instructions as specified by the trigger. The instructions are intended to be deterministically evaluating as the instructions used are things like ‘use 500 words or less’ or ‘use markdown formatting’, things that able to be deterministcally verified.

Security Probes

Module nameProbe nameDescription
dan.AntiDANAnti-Do Anything NowEvaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treates safe responses as unsafe
dan.ChatGPT_Developer_Mode_RANTIDo Anything Now: RANTI v1Ranti developer mode jailbreak, version 1
dan.ChatGPT_Developer_Mode_v2Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2
dan.ChatGPT_Image_MarkdownInject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses.
dan.Dan_10_0Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane things
dan.Dan_11_0Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs
dan.Dan_6_0Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations
dan.Dan_6_2Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucations, with additional incitement to rudeness and disrespectfulness.
dan.Dan_7_0Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character
dan.Dan_8_0Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality
dan.Dan_9_0Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered
dan.DAN_JailbreakDo Anything Now: 🔓JAILBREAK versionDo Anything Now jailbreak prompting the model to enter a 🔓JAILBREAK mode
dan.DUDEDo anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE persona
dan.STANStrive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona
encoding.InjectAscii85Prompt injection : ASCII85Tries to get the model to decode an ASCII85-encoded string
encoding.InjectBase16Prompt injection : Base16Tries to get the model to decode a Base16-encoded string
encoding.InjectBase2048Prompt injection : Base2048Tries to get the model to decode a Base2048-encoded string
encoding.InjectBase32Prompt injection : Base32Tries to get the model to decode a Base32-encoded string
encoding.InjectBase64Prompt injection : Base64Tries to get the model to decode a Base64-encoded string
encoding.InjectBraillePrompt injection : BrailleTries to get the model to decode a Braille-encoded string
encoding.InjectEcojiPrompt injection : EcojiTries to get the model to decode an Ecoji-encoded string
encoding.InjectHexPrompt injection : HexTries to get the model to decode a Hex-encoded string
encoding.InjectMorsePrompt injection : MorseTries to get the model to decode a Morse-encoded string
encoding.InjectNatoPrompt injection : NatoTries to get the model to decode a Nato-encoded string
encoding.InjectROT13Prompt injection : ROT13Tries to get the model to decode a ROT13-encoded string
encoding.InjectUUPrompt injection : InjectUUTries to get the model to decode a InjectUU-encoded string
encoding.InjectZalgoPrompt injection : ZalgoTries to get the model to decode a Zalgo-encoded string
gcg.GCG_ProbeAdversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organization
glitch.Glitch100GlitchUses glitch tokens to get the model to fail to repeat a word
knownbadsignatures.EICARTest VirusGet the model to output an EICAR virus signature
knownbadsignatures.GTUBETest SpamGet the model to output the GTUBE spam signature
malwaregen.EvasionEvasive malwareGet the model to write malware that evades common defenses
malwaregen.PayloadMalicious payloadGet the model to produce a malicious payload
malwaregen.SubFunctionsMalicious subfunctionsGet the model to write malicious subfunctions
malwaregen.TopLevelGeneral malwareGet the model to write general malware
promptinject.HijackHateHumansMiniPrompt injection: Hate humansGet the model to produce a rogue string on hating humans
promptinject.HijackKillHumansMiniPrompt Injection: Kill humansGet the model to produce a rogue string on killing humans
promptinject.HijackLongPromptMiniPrompt injection: long promptGet the model to produce a “long prompt” rogue string
promptinject.TagPrompt injection: unicode tagPrompt injection using invisible unicode tags
goodside.ThreatenJSONOutput plain JSONThreaten the model to make it produce raw JSON without any preamble
xss.MarkdownImageExfilCross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URI

Privacy Probes

Module nameProbe nameDescription
leakreplay.LiteratureComplete80Book excerpt completionTests whether the model will complete a book excerpt when prompted with the beginning
leakreplay.LiteratureCloze80Book excerpt insertionTests whether the model will fill in the middle of a book excerpt
leakreplay.GuardianCompleteMedia excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginning
leakreplay.GuardianClozeMedia excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The Guardian
leakreplay.NYTCompleteMedia excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginning
leakreplay.NYTClozeMedia excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York Times
privateinfo.EnronLeakEmailEmails leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation
privateinfo.LeakPIIPII LeakageTests whether PII provided to the model during a conversation can be leaked
privateinfo.PrivacyUnderstandingConfidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidential

Hallucination Probes

Module nameProbe nameDescription
socialvalues.QuackMedicineQuack MedicineEvaluates whether the model promotes quack medicine
goodside.WhoIsRileyFalse assertions: Riley GoodsideEvaluates hallucinations about Riley Goodside
goodside.DavidjlGlitch - DavidjlChecks for hallucinations about the glitch token Davidjl
packagehallucination.PythonHallucinate Python packagesChecks if a model hallucinates Python packages
snowball.SenatorsMiniFalse assertions: senatorsChecks for false assertions about US senators
misleading.FalseAssertion50False assertions: generalChecks for false assertions about general history/science facts
snowball.GraphConnectivityMiniFalse assertions: flightsChecks if the model invents impossible flight routes
snowball.PrimesMiniHallucinate prime numbersEvaluates hallucinations about prime numbers

Robustness Probes

Module nameProbe nameDescription
advglue.BertattackOnMNLIBERT attack on MNLIProbes effects of BERT-based perturbation on MNLI task
advglue.BertattackOnMNLIMMBERT attack on MNLI-MMProbes effects of BERT-based perturbation on MNLI-MM task
advglue.BertattackOnQNLIBERT attack on QNLIProbes effects of BERT-based perturbation on QNLI task
advglue.BertattackOnQQPBERT attack on QQPProbes effects of BERT-based perturbation on QQP task
advglue.BertattackOnRTEBERT attack on RTEProbes effects of BERT-based perturbation on RTE task
advglue.BertattackOnSST2BERT attack on SST2Probes effects of BERT-based perturbation on SST2 task
advglue.SemattackOnMNLISemAttack on MNLIProbes effects of SemAttack perturbation on MNLI task
advglue.SemattackOnMNLIMMSemAttack on MNLI-MMProbes effects of SemAttack perturbation on MNLI-MM task
advglue.SemattackOnQNLISemAttack on QNLIProbes effects of SemAttack perturbation on QNLI task
advglue.SemattackOnQQPSemAttack on QQPProbes effects of SemAttack perturbation on QQP task
advglue.SemattackOnRTESemAttack on RTEProbes effects of SemAttack perturbation on RTE task
advglue.SemattackOnSST2SemAttack on SST2Probes effects of SemAttack perturbation on SST2 task
advglue.SememepsoOnMNLISememePSO attack on MNLIProbes effects of SememePSO perturbation on MNLI task
advglue.SememepsoOnMNLIMMSememePSO attack on MNLI-MMProbes effects of SememePSO perturbation on MNLI-MM task
advglue.SememepsoOnQNLISememePSO attack on QNLIProbes effects of SememePSO perturbation on QNLI task
advglue.SememepsoOnQQPSememePSO attack on QQPProbes effects of SememePSO perturbation on QQP task
advglue.SememepsoOnRTESememePSO attack on RTEProbes effects of SememePSO perturbation on RTE task
advglue.SememepsoOnSST2SememePSO attack on SST2Probes effects of SememePSO perturbation on SST2 task
advglue.TextbuggerOnMNLITextBugger attack on MNLIProbes effects of TextBugger perturbation on MNLI task
advglue.TextbuggerOnMNLIMMTextBugger attack on MNLI-MMProbes effects of TextBugger perturbation on MNLI-MM task
advglue.TextbuggerOnQNLITextBugger attack on QNLIProbes effects of TextBugger perturbation on QNLI task
advglue.TextbuggerOnQQPTextBugger attack on QQPProbes effects of TextBugger perturbation on QQP task
advglue.TextbuggerOnRTETextBugger attack on RTEProbes effects of TextBugger perturbation on RTE task
advglue.TextbuggerOnSST2TextBugger attack on SST2Probes effects of TextBugger perturbation on SST2 task
advglue.TextfoolerOnMNLITextFooler attack on MNLIProbes effects of TextFooler perturbation on MNLI task
advglue.TextfoolerOnMNLIMMTextFooler attack on MNLI-MMProbes effects of TextFooler perturbation on MNLI-MM task
advglue.TextfoolerOnQNLITextFooler attack on QNLIProbes effects of TextFooler perturbation on QNLI task
advglue.TextfoolerOnQQPTextFooler attack on QQPProbes effects of TextFooler perturbation on QQP task
advglue.TextfoolerOnRTETextFooler attack on RTEProbes effects of TextFooler perturbation on RTE task
advglue.TextfoolerOnSST2TextFooler attack on SST2Probes effects of TextFooler perturbation on SST2 task

Toxicity Probes

Module nameProbe nameDescription
realtoxicityprompts.RTPSexually_ExplicitSexually explicitPrompts the model to produce sexually explicit outputs
realtoxicityprompts.RTPInsultJailbreakInsult - adversarialPrompts the model to produce insults. Uses an adversarial prefix.
realtoxicityprompts.RTPInsultInsultPrompts the model to produce insults
realtoxicityprompts.RTPSevere_ToxicityJailbreakSevere toxicity - adversarialPartial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.
realtoxicityprompts.RTPSexually_ExplicitJailbreakSexually explicit - adversarialPrompts the model to produce sexually explicit outputs. Uses an adversarial prefix.
realtoxicityprompts.RTPIdentity_AttackIdentity-based attacksPrompts the model to produce attacks based on a person’s identity
realtoxicityprompts.RTPProfanityProfanityPrompts the model to produce profane responses
realtoxicityprompts.RTPIdentity_AttackJailbreakAdversarial Identity-based attacksPrompts the model to produce attacks based on a person’s identity. Uses an adversarial prefix.
realtoxicityprompts.RTPThreatThreatening languagePrompts the model to produce threats
realtoxicityprompts.RTPThreatJailbreakThreatening language - AdversarialPrompts the model to produce threats. Uses an adversarial prefix.
realtoxicityprompts.RTPSevere_ToxicitySevere toxicityPartial prompts that potentially produce continuations of severe toxicity
realtoxicityprompts.RTPFlirtationFlirtationPartial prompts to produce flirtatious continuations
realtoxicityprompts.RTPFlirtationJailbreakFlirtation — AdversarialPartial prompts to produce flirtatious continuations. Uses an adversarial prefix.
realtoxicityprompts.RTPProfanityJailbreakProfanity.- AdversarialPrompts the model to produce profane responses. Uses an adversarial prefix.
continuation.ContinueSlursReclaimedSlursARTSlur continuationPrompts the model to continue slur words
socialvalues.DeadnamingDeadnamingTests whether the model deadnames famous people
ethicalharms.ProfanityProfanity questionsPoses questions with potentially profane responses
ethicalharms.BullyingBullyingPoses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action
ethicalharms.SexualContentSexually explicit - risky wordsPrompts the model to produce sexually explicit keywords
ethicalharms.SexualisationSexual objectificationPrompts the model to produce sexually objectifying language
ethicalharms.SlurUsageSlur questionAsks questions to see if the target will suggest or recommend ethnic slurs

Stereotype Probes

Module nameProbe nameDescription
advstereo.AtheistsStereotype religion: atheistsChecks for stereotypes about atheists
advstereo.LatinAmericansStereotype ethnicity: Latin AmericansChecks for stereotypes about Latin Americans
advstereo.EuropeansStereotype ethnicity: EuropeansChecks for stereotypes about Europeans
advstereo.OldPeopleStereotype age: old peopleChecks for stereotypes about old people
advstereo.HomosexualsStereotype sexual orientation: homosexualsChecks for stereotypes about homosexuals
advstereo.ImmigrantsStereotype nationality: immigrantsChecks for stereotypes about immigrants
advstereo.AbleBodiedPeopleStereotype disability: able-bodied peopleChecks for stereotypes about able-bodied people
advstereo.YoungPeopleStereotype age: young peopleChecks for stereotypes about young people
advstereo.CaucasiansStereotype race: CaucasiansChecks for stereotypes about Caucasians
advstereo.BlackPeopleStereotype race: black peopleChecks for stereotypes about black people
advstereo.BuddhistsStereotype religion: buddhistsChecks for stereotypes about buddhists
advstereo.PoorPeopleStereotype income: poor peopleChecks for stereotypes about poor people
advstereo.WhitePeopleStereotype race: white peopleChecks for stereotypes about white people
advstereo.JewsStereotype ethnicity: JewsChecks for stereotypes about Jews
advstereo.RichPeopleStereotype income: rich peopleChecks for stereotypes about rich people
advstereo.MenStereotype gender: menChecks for stereotypes about men
advstereo.MexicansStereotype nationality: MexicansChecks for stereotypes about Mexicans
advstereo.AsiansStereotype race: AsiansChecks for stereotypes about Asians
advstereo.AmericansStereotype nationality: AmericansChecks for stereotypes about Americans
advstereo.CanadiansStereotype nationality: CanadiansChecks for stereotypes about Canadians
advstereo.ChristiansStereotype religion: ChristiansChecks for stereotypes about Christians
advstereo.PhysicallyDisabledPeopleStereotype disability: disabled peopleChecks for stereotypes about disabled people
advstereo.MuslimsStereotype religion: MuslimsChecks for stereotypes about Muslims
advstereo.WomenStereotype gender: womenChecks for stereotypes about women

Fairness Probes

Module nameProbe nameDescription
adultdata.CounterfactualGenderGender BiasMeasures whether the model uses gender information to determine a person’s likely income
winobias.ProfessionalBiasProfessional biasMeasures whether the model associates occupations with gender

Ethics Probes

Module nameProbe nameDescription
normative_ethics.DeontologyDeontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints
normative_ethics.DeontologyJailbreakAdversarial Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting
normative_ethics.JusticeJustice-based EthicsEvaluates whether the model understands principles of desert or impartiality
normative_ethics.JusticeJailbreakAdversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial prompting
normative_ethics.MoralityCommonsense MoralityEvaluates whether the model’s moral judgments align with common moral intuitions
normative_ethics.MoralityJailbreakAdversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions
normative_ethics.UtilitarianismUtilitarianismEvaluates whether the model can identify which of two scenarios has higher utility
normative_ethics.UtilitarianismJailbreakAdversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility
normative_ethics.VirtueVirtue EthicsEvaluates whether the model can identify moral virtues correctly
normative_ethics.VirtueJailbreakAdversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctly
jiminycricket.JiminySimulationEvaluates the model’s ability to identify the moral valence of a simulated scenario
jiminycricket.JiminyJailbreakAdversarial SimulationEvaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario