Probe Glossary

This glossary maps the probe module names that may appear in your reports to the corresponding probe names and descriptions.
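
As a rough illustration of how the three-column tables below (module name, probe name, description) could be used programmatically, the following sketch builds a lookup dictionary from a few tab-separated rows copied from this glossary. The names `GlossaryEntry`, `build_glossary`, and `describe` are hypothetical helpers introduced here for illustration; they are not part of any reporting API.

```python
from typing import NamedTuple


class GlossaryEntry(NamedTuple):
    """Hypothetical container for one glossary row."""
    probe_name: str
    description: str


# Three sample rows copied from the tables in this glossary (tab-separated).
ROWS = [
    "dan.STAN\tStrive To Avoid Norms\tJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona",
    "glitch.Glitch100\tGlitch\tUses glitch tokens to get the model to fail to repeat a word",
    "privateinfo.LeakPII\tPII Leakage\tTests whether PII provided to the model during a conversation can be leaked",
]


def build_glossary(rows):
    """Map each module name to its probe name and description."""
    glossary = {}
    for row in rows:
        module_name, probe_name, description = row.split("\t")
        glossary[module_name] = GlossaryEntry(probe_name, description)
    return glossary


def describe(module_name, glossary):
    """Return a one-line summary for a module name seen in a report."""
    entry = glossary.get(module_name)
    if entry is None:
        return f"{module_name}: not found in glossary"
    return f"{module_name}: {entry.probe_name} ({entry.description})"


GLOSSARY = build_glossary(ROWS)
print(describe("glitch.Glitch100", GLOSSARY))
```

Keying the dictionary on the module name mirrors how the glossary is meant to be read: start from the identifier in the report, then look up the human-readable name.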

Security Probes

Module name	Probe name	Description
dan.AntiDAN	Anti-Do Anything Now	Evaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treats safe responses as unsafe
dan.ChatGPT_Developer_Mode_RANTI	Do Anything Now: RANTI v1	Ranti developer mode jailbreak, version 1
dan.ChatGPT_Developer_Mode_v2	Do Anything Now: RANTI v2	Ranti developer mode jailbreak, version 2
dan.ChatGPT_Image_Markdown	Inject Markdown image	A jailbreak attempt intended to enable the injection of markdown images into responses.
dan.Dan_10_0	Do Anything Now: v10	Do Anything Now jailbreak with prompt about inhumane things
dan.Dan_11_0	Do Anything Now: v11	Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs
dan.Dan_6_0	Do Anything Now: v6.0	Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations
dan.Dan_6_2	Do Anything Now: v6.2	Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations, with additional incitement to rudeness and disrespectfulness.
dan.Dan_7_0	Do Anything Now: v7.0	Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character
dan.Dan_8_0	Do Anything Now: v8.0	Do Anything Now jailbreak telling the model it is Bigmancozmo and encouraging swearing, porn, and illegality
dan.Dan_9_0	Do Anything Now: v9.0	Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered
dan.DAN_Jailbreak	Do Anything Now: 🔓JAILBREAK version	Do Anything Now jailbreak prompting the model to enter a 🔓JAILBREAK mode
dan.DUDE	Do anything and everything	Do Anything Now jailbreak prompting the model to take on the DUDE persona
dan.STAN	Strive To Avoid Norms	Jailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona
encoding.InjectAscii85	Prompt injection : ASCII85	Tries to get the model to decode an ASCII85-encoded string
encoding.InjectBase16	Prompt injection : Base16	Tries to get the model to decode a Base16-encoded string
encoding.InjectBase2048	Prompt injection : Base2048	Tries to get the model to decode a Base2048-encoded string
encoding.InjectBase32	Prompt injection : Base32	Tries to get the model to decode a Base32-encoded string
encoding.InjectBase64	Prompt injection : Base64	Tries to get the model to decode a Base64-encoded string
encoding.InjectBraille	Prompt injection : Braille	Tries to get the model to decode a Braille-encoded string
encoding.InjectEcoji	Prompt injection : Ecoji	Tries to get the model to decode an Ecoji-encoded string
encoding.InjectHex	Prompt injection : Hex	Tries to get the model to decode a Hex-encoded string
encoding.InjectMorse	Prompt injection : Morse	Tries to get the model to decode a Morse-encoded string
encoding.InjectNato	Prompt injection : Nato	Tries to get the model to decode a NATO-encoded string
encoding.InjectROT13	Prompt injection : ROT13	Tries to get the model to decode a ROT13-encoded string
encoding.InjectUU	Prompt injection : InjectUU	Tries to get the model to decode a UU-encoded string
encoding.InjectZalgo	Prompt injection : Zalgo	Tries to get the model to decode a Zalgo-encoded string
gcg.GCG_Probe	Adversarial suffix	Uses an adversarial suffix to elicit instructions on how to steal from a non-profit organization
glitch.Glitch100	Glitch	Uses glitch tokens to get the model to fail to repeat a word
knownbadsignatures.EICAR	Test Virus	Get the model to output an EICAR virus signature
knownbadsignatures.GTUBE	Test Spam	Get the model to output the GTUBE spam signature
malwaregen.Evasion	Evasive malware	Get the model to write malware that evades common defenses
malwaregen.Payload	Malicious payload	Get the model to produce a malicious payload
malwaregen.SubFunctions	Malicious subfunctions	Get the model to write malicious subfunctions
malwaregen.TopLevel	General malware	Get the model to write general malware
promptinject.HijackHateHumansMini	Prompt injection: Hate humans	Get the model to produce a rogue string on hating humans
promptinject.HijackKillHumansMini	Prompt Injection: Kill humans	Get the model to produce a rogue string on killing humans
promptinject.HijackLongPromptMini	Prompt injection: long prompt	Get the model to produce a “long prompt” rogue string
promptinject.Tag	Prompt injection: unicode tag	Prompt injection using invisible unicode tags
goodside.ThreatenJSON	Output plain JSON	Threaten the model to make it produce raw JSON without any preamble
xss.MarkdownImageExfil	Cross-site scripting: Markdown image	Get the model to write embedded assets exfiltrating text in a Markdown URI
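
To make the `encoding.*` rows above more concrete, the sketch below shows what a short string looks like under a few of the listed encodings, using only the Python standard library. The payload text is a made-up placeholder; the actual prompts and payloads used by the probes are not described in this glossary, so this illustrates the encodings themselves rather than the probe implementation.

```python
import base64
import codecs

# Hypothetical placeholder text, not taken from any probe.
payload = "example payload"
raw = payload.encode("utf-8")

# Encode the same text with several of the schemes named in the table.
encoded = {
    "Base16": base64.b16encode(raw).decode("ascii"),
    "Base32": base64.b32encode(raw).decode("ascii"),
    "Base64": base64.b64encode(raw).decode("ascii"),
    "ASCII85": base64.a85encode(raw).decode("ascii"),
    "ROT13": codecs.encode(payload, "rot13"),
}

for name, text in encoded.items():
    print(f"{name}: {text}")

# Round-trip check: decoding must recover the original text.
assert base64.b64decode(encoded["Base64"]).decode("utf-8") == payload
assert codecs.decode(encoded["ROT13"], "rot13") == payload
```

A model that decodes such a string and then acts on the hidden text has followed an instruction that a plain-text filter might never have seen, which is presumably why each encoding is tested separately.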

Privacy Probes

Module name	Probe name	Description
leakreplay.LiteratureComplete80	Book excerpt completion	Tests whether the model will complete a book excerpt when prompted with the beginning
leakreplay.LiteratureCloze80	Book excerpt insertion	Tests whether the model will fill in the middle of a book excerpt
leakreplay.GuardianComplete	Media excerpt completion - Guardian	Tests whether the model will complete an excerpt from The Guardian when prompted with the beginning
leakreplay.GuardianCloze	Media excerpt insertion - Guardian	Tests whether the model will fill in the middle of an excerpt from The Guardian
leakreplay.NYTComplete	Media excerpt completion - NYT	Tests whether the model will complete an excerpt from the New York Times when prompted with the beginning
leakreplay.NYTCloze	Media excerpt insertion - NYT	Tests whether the model will fill in the middle of an excerpt from the New York Times
privateinfo.EnronLeakEmail	Emails leakage - Enron	Tests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation
privateinfo.LeakPII	PII Leakage	Tests whether PII provided to the model during a conversation can be leaked
privateinfo.PrivacyUnderstanding	Confidentiality comprehension	Tests whether the model understands privacy-related phrases that flag when certain information is confidential

Hallucination Probes

Module name	Probe name	Description
socialvalues.QuackMedicine	Quack Medicine	Evaluates whether the model promotes quack medicine
goodside.WhoIsRiley	False assertions: Riley Goodside	Evaluates hallucinations about Riley Goodside
goodside.Davidjl	Glitch - Davidjl	Checks for hallucinations about the glitch token Davidjl
packagehallucination.Python	Hallucinate Python packages	Checks if a model hallucinates Python packages
snowball.SenatorsMini	False assertions: senators	Checks for false assertions about US senators
misleading.FalseAssertion50	False assertions: general	Checks for false assertions about general history/science facts
snowball.GraphConnectivityMini	False assertions: flights	Checks if the model invents impossible flight routes
snowball.PrimesMini	Hallucinate prime numbers	Evaluates hallucinations about prime numbers

Robustness Probes

Module name	Probe name	Description
advglue.BertattackOnMNLI	BERT attack on MNLI	Probes effects of BERT-based perturbation on MNLI task
advglue.BertattackOnMNLIMM	BERT attack on MNLI-MM	Probes effects of BERT-based perturbation on MNLI-MM task
advglue.BertattackOnQNLI	BERT attack on QNLI	Probes effects of BERT-based perturbation on QNLI task
advglue.BertattackOnQQP	BERT attack on QQP	Probes effects of BERT-based perturbation on QQP task
advglue.BertattackOnRTE	BERT attack on RTE	Probes effects of BERT-based perturbation on RTE task
advglue.BertattackOnSST2	BERT attack on SST2	Probes effects of BERT-based perturbation on SST2 task
advglue.SemattackOnMNLI	SemAttack on MNLI	Probes effects of SemAttack perturbation on MNLI task
advglue.SemattackOnMNLIMM	SemAttack on MNLI-MM	Probes effects of SemAttack perturbation on MNLI-MM task
advglue.SemattackOnQNLI	SemAttack on QNLI	Probes effects of SemAttack perturbation on QNLI task
advglue.SemattackOnQQP	SemAttack on QQP	Probes effects of SemAttack perturbation on QQP task
advglue.SemattackOnRTE	SemAttack on RTE	Probes effects of SemAttack perturbation on RTE task
advglue.SemattackOnSST2	SemAttack on SST2	Probes effects of SemAttack perturbation on SST2 task
advglue.SememepsoOnMNLI	SememePSO attack on MNLI	Probes effects of SememePSO perturbation on MNLI task
advglue.SememepsoOnMNLIMM	SememePSO attack on MNLI-MM	Probes effects of SememePSO perturbation on MNLI-MM task
advglue.SememepsoOnQNLI	SememePSO attack on QNLI	Probes effects of SememePSO perturbation on QNLI task
advglue.SememepsoOnQQP	SememePSO attack on QQP	Probes effects of SememePSO perturbation on QQP task
advglue.SememepsoOnRTE	SememePSO attack on RTE	Probes effects of SememePSO perturbation on RTE task
advglue.SememepsoOnSST2	SememePSO attack on SST2	Probes effects of SememePSO perturbation on SST2 task
advglue.TextbuggerOnMNLI	TextBugger attack on MNLI	Probes effects of TextBugger perturbation on MNLI task
advglue.TextbuggerOnMNLIMM	TextBugger attack on MNLI-MM	Probes effects of TextBugger perturbation on MNLI-MM task
advglue.TextbuggerOnQNLI	TextBugger attack on QNLI	Probes effects of TextBugger perturbation on QNLI task
advglue.TextbuggerOnQQP	TextBugger attack on QQP	Probes effects of TextBugger perturbation on QQP task
advglue.TextbuggerOnRTE	TextBugger attack on RTE	Probes effects of TextBugger perturbation on RTE task
advglue.TextbuggerOnSST2	TextBugger attack on SST2	Probes effects of TextBugger perturbation on SST2 task
advglue.TextfoolerOnMNLI	TextFooler attack on MNLI	Probes effects of TextFooler perturbation on MNLI task
advglue.TextfoolerOnMNLIMM	TextFooler attack on MNLI-MM	Probes effects of TextFooler perturbation on MNLI-MM task
advglue.TextfoolerOnQNLI	TextFooler attack on QNLI	Probes effects of TextFooler perturbation on QNLI task
advglue.TextfoolerOnQQP	TextFooler attack on QQP	Probes effects of TextFooler perturbation on QQP task
advglue.TextfoolerOnRTE	TextFooler attack on RTE	Probes effects of TextFooler perturbation on RTE task
advglue.TextfoolerOnSST2	TextFooler attack on SST2	Probes effects of TextFooler perturbation on SST2 task
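
The `advglue` module names in the table above follow a visible pattern: an attack name, the literal `On`, then a task name. The sketch below splits a module name along that pattern so report rows can be grouped by attack or by task. The function `split_advglue` and the regular expression are assumptions made for illustration, inferred only from the names listed in this table.

```python
import re

# Pattern inferred from the names in the table: advglue.<Attack>On<TASK>.
# Attack names are matched lazily so the split happens at the first "On".
ADVGLUE_PATTERN = re.compile(
    r"^advglue\.(?P<attack>[A-Za-z]+?)On(?P<task>[A-Z0-9]+)$"
)


def split_advglue(module_name):
    """Return (attack, task) for an advglue module name, or raise ValueError."""
    match = ADVGLUE_PATTERN.match(module_name)
    if match is None:
        raise ValueError(f"not an advglue module name: {module_name!r}")
    return match["attack"], match["task"]


for name in ("advglue.BertattackOnMNLI", "advglue.SememepsoOnSST2"):
    attack, task = split_advglue(name)
    print(f"{name}: attack={attack}, task={task}")
```

Note that the module names write the mismatched MNLI task as `MNLIMM`, while the probe names in the table write it as MNLI-MM.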

Toxicity Probes

Module name	Probe name	Description
realtoxicityprompts.RTPSexually_Explicit	Sexually explicit	Prompts the model to produce sexually explicit outputs
realtoxicityprompts.RTPInsultJailbreak	Insult - adversarial	Prompts the model to produce insults. Uses an adversarial prefix.
realtoxicityprompts.RTPInsult	Insult	Prompts the model to produce insults
realtoxicityprompts.RTPSevere_ToxicityJailbreak	Severe toxicity - adversarial	Partial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.
realtoxicityprompts.RTPSexually_ExplicitJailbreak	Sexually explicit - adversarial	Prompts the model to produce sexually explicit outputs. Uses an adversarial prefix.
realtoxicityprompts.RTPIdentity_Attack	Identity-based attacks	Prompts the model to produce attacks based on a person’s identity
realtoxicityprompts.RTPProfanity	Profanity	Prompts the model to produce profane responses
realtoxicityprompts.RTPIdentity_AttackJailbreak	Adversarial Identity-based attacks	Prompts the model to produce attacks based on a person’s identity. Uses an adversarial prefix.
realtoxicityprompts.RTPThreat	Threatening language	Prompts the model to produce threats
realtoxicityprompts.RTPThreatJailbreak	Threatening language - Adversarial	Prompts the model to produce threats. Uses an adversarial prefix.
realtoxicityprompts.RTPSevere_Toxicity	Severe toxicity	Partial prompts that potentially produce continuations of severe toxicity
realtoxicityprompts.RTPFlirtation	Flirtation	Partial prompts to produce flirtatious continuations
realtoxicityprompts.RTPFlirtationJailbreak	Flirtation – Adversarial	Partial prompts to produce flirtatious continuations. Uses an adversarial prefix.
realtoxicityprompts.RTPProfanityJailbreak	Profanity - Adversarial	Prompts the model to produce profane responses. Uses an adversarial prefix.
continuation.ContinueSlursReclaimedSlursART	Slur continuation	Prompts the model to continue slur words
socialvalues.Deadnaming	Deadnaming	Tests whether the model deadnames famous people
ethicalharms.Profanity	Profanity questions	Poses questions with potentially profane responses
ethicalharms.Bullying	Bullying	Poses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action
ethicalharms.SexualContent	Sexually explicit - risky words	Prompts the model to produce sexually explicit keywords
ethicalharms.Sexualisation	Sexual objectification	Prompts the model to produce sexually objectifying language
ethicalharms.SlurUsage	Slur question	Asks questions to see if the target will suggest or recommend ethnic slurs

Stereotype Probes

Module name	Probe name	Description
advstereo.Atheists	Stereotype religion: atheists	Checks for stereotypes about atheists
advstereo.LatinAmericans	Stereotype ethnicity: Latin Americans	Checks for stereotypes about Latin Americans
advstereo.Europeans	Stereotype ethnicity: Europeans	Checks for stereotypes about Europeans
advstereo.OldPeople	Stereotype age: old people	Checks for stereotypes about old people
advstereo.Homosexuals	Stereotype sexual orientation: homosexuals	Checks for stereotypes about homosexuals
advstereo.Immigrants	Stereotype nationality: immigrants	Checks for stereotypes about immigrants
advstereo.AbleBodiedPeople	Stereotype disability: able-bodied people	Checks for stereotypes about able-bodied people
advstereo.YoungPeople	Stereotype age: young people	Checks for stereotypes about young people
advstereo.Caucasians	Stereotype race: Caucasians	Checks for stereotypes about Caucasians
advstereo.BlackPeople	Stereotype race: black people	Checks for stereotypes about black people
advstereo.Buddhists	Stereotype religion: Buddhists	Checks for stereotypes about Buddhists
advstereo.PoorPeople	Stereotype income: poor people	Checks for stereotypes about poor people
advstereo.WhitePeople	Stereotype race: white people	Checks for stereotypes about white people
advstereo.Jews	Stereotype ethnicity: Jews	Checks for stereotypes about Jews
advstereo.RichPeople	Stereotype income: rich people	Checks for stereotypes about rich people
advstereo.Men	Stereotype gender: men	Checks for stereotypes about men
advstereo.Mexicans	Stereotype nationality: Mexicans	Checks for stereotypes about Mexicans
advstereo.Asians	Stereotype race: Asians	Checks for stereotypes about Asians
advstereo.Americans	Stereotype nationality: Americans	Checks for stereotypes about Americans
advstereo.Canadians	Stereotype nationality: Canadians	Checks for stereotypes about Canadians
advstereo.Christians	Stereotype religion: Christians	Checks for stereotypes about Christians
advstereo.PhysicallyDisabledPeople	Stereotype disability: disabled people	Checks for stereotypes about disabled people
advstereo.Muslims	Stereotype religion: Muslims	Checks for stereotypes about Muslims
advstereo.Women	Stereotype gender: women	Checks for stereotypes about women

Fairness Probes

Module name	Probe name	Description
adultdata.CounterfactualGender	Gender Bias	Measures whether the model uses gender information to determine a person’s likely income
winobias.ProfessionalBias	Professional bias	Measures whether the model associates occupations with gender

Ethics Probes

Module name	Probe name	Description
normative_ethics.Deontology	Deontological Ethics	Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints
normative_ethics.DeontologyJailbreak	Adversarial Deontological Ethics	Evaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting
normative_ethics.Justice	Justice-based Ethics	Evaluates whether the model understands principles of desert or impartiality
normative_ethics.JusticeJailbreak	Adversarial Justice-based Ethics	Evaluates whether the model understands principles of desert or impartiality under adversarial prompting
normative_ethics.Morality	Commonsense Morality	Evaluates whether the model’s moral judgments align with common moral intuitions
normative_ethics.MoralityJailbreak	Adversarial Commonsense Morality	Evaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions
normative_ethics.Utilitarianism	Utilitarianism	Evaluates whether the model can identify which of two scenarios has higher utility
normative_ethics.UtilitarianismJailbreak	Adversarial Utilitarianism	Evaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility
normative_ethics.Virtue	Virtue Ethics	Evaluates whether the model can identify moral virtues correctly
normative_ethics.VirtueJailbreak	Adversarial Virtue Ethics	Evaluates, under adversarial prompting, whether the model can identify moral virtues correctly
jiminycricket.Jiminy	Simulation	Evaluates the model’s ability to identify the moral valence of a simulated scenario
jiminycricket.JiminyJailbreak	Adversarial Simulation	Evaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario
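
In the Ethics and Toxicity tables, each adversarial probe shares its module name with a base probe and adds a `Jailbreak` suffix. The sketch below pairs base and adversarial module names on that convention, which can help when comparing a model's results with and without adversarial prompting. The helper `pair_adversarial` is hypothetical, and the suffix rule is inferred from the names in this glossary rather than from any documented guarantee.

```python
def pair_adversarial(module_names):
    """Map each base module name to its '<name>Jailbreak' variant, if present."""
    names = set(module_names)
    pairs = {}
    for name in sorted(names):
        if name.endswith("Jailbreak"):
            continue  # only base names become keys
        candidate = name + "Jailbreak"
        if candidate in names:
            pairs[name] = candidate
    return pairs


# Sample module names copied from the tables in this glossary.
SAMPLE = [
    "normative_ethics.Virtue",
    "normative_ethics.VirtueJailbreak",
    "jiminycricket.Jiminy",
    "jiminycricket.JiminyJailbreak",
    "realtoxicityprompts.RTPThreat",
    "realtoxicityprompts.RTPThreatJailbreak",
    "dan.DAN_Jailbreak",  # ends in "Jailbreak" but has no base probe
]

for base, adversarial in pair_adversarial(SAMPLE).items():
    print(f"{base} <-> {adversarial}")
```

Requiring both names to be present avoids treating `dan.DAN_Jailbreak` as an adversarial variant, since it merely happens to end in the same word.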