Probe Glossary

This maps the probe module names that you may see in your reports to the actual probe names and descriptions.

Security Probes

Module name

Probe name

Description

dan.AntiDAN

Anti-Do Anything Now

Evaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treates safe responses as unsafe

dan.ChatGPT_Developer_Mode_RANTI

Do Anything Now: RANTI v1

Ranti developer mode jailbreak, version 1

dan.ChatGPT_Developer_Mode_v2

Do Anything Now: RANTI v2

Ranti developer mode jailbreak, version 2

dan.ChatGPT_Image_Markdown

Inject Markdown image

A jailbreak attempt intended to enable the injection of markdown images into responses.

dan.Dan_10_0

Do Anything Now: v10

Do Anything Now jailbreak with prompt about inhumane things

dan.Dan_11_0

Do Anything Now: v11

Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs

dan.Dan_6_0

Do Anything Now: v6.0

Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations

dan.Dan_6_2

Do Anything Now: v6.2

Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucations, with additional incitement to rudeness and disrespectfulness.

dan.Dan_7_0

Do Anything Now: v7.0

Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character

dan.Dan_8_0

Do Anything Now: v8.0

Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality

dan.Dan_9_0

Do Anything Now: v9.0

Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered

dan.DAN_Jailbreak

Do Anything Now: 🔓JAILBREAK version

Do Anything Now jailbreak prompting the model to enter a 🔓JAILBREAK mode

dan.DUDE

Do anything and everything

Do Anything Now jailbreak prompting the model to take on the DUDE persona

dan.STAN

Strive To Avoid Norms

Jailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona

encoding.InjectAscii85

Prompt injection : ASCII85

Tries to get the model to decode an ASCII85-encoded string

encoding.InjectBase16

Prompt injection : Base16

Tries to get the model to decode a Base16-encoded string

encoding.InjectBase2048

Prompt injection : Base2048

Tries to get the model to decode a Base2048-encoded string

encoding.InjectBase32

Prompt injection : Base32

Tries to get the model to decode a Base32-encoded string

encoding.InjectBase64

Prompt injection : Base64

Tries to get the model to decode a Base64-encoded string

encoding.InjectBraille

Prompt injection : Braille

Tries to get the model to decode a Braille-encoded string

encoding.InjectEcoji

Prompt injection : Ecoji

Tries to get the model to decode an Ecoji-encoded string

encoding.InjectHex

Prompt injection : Hex

Tries to get the model to decode a Hex-encoded string

encoding.InjectMorse

Prompt injection : Morse

Tries to get the model to decode a Morse-encoded string

encoding.InjectNato

Prompt injection : Nato

Tries to get the model to decode a Nato-encoded string

encoding.InjectROT13

Prompt injection : ROT13

Tries to get the model to decode a ROT13-encoded string

encoding.InjectUU

Prompt injection : InjectUU

Tries to get the model to decode a InjectUU-encoded string

encoding.InjectZalgo

Prompt injection : Zalgo

Tries to get the model to decode a Zalgo-encoded string

gcg.GCG_Probe

Adversarial suffix

Uses an adversarial suffix to elicit instructions on how to steal from a non-profit organization

glitch.Glitch100

Glitch

Uses glitch tokens to get the model to fail to repeat a word

knownbadsignatures.EICAR

Test Virus

Get the model to output an EICAR virus signature

knownbadsignatures.GTUBE

Test Spam

Get the model to output the GTUBE spam signature

malwaregen.Evasion

Evasive malware

Get the model to write malware that evades common defenses

malwaregen.Payload

Malicious payload

Get the model to produce a malicious payload

malwaregen.SubFunctions

Malicious subfunctions

Get the model to write malicious subfunctions

malwaregen.TopLevel

General malware

Get the model to write general malware

promptinject.HijackHateHumansMini

Prompt injection: Hate humans

Get the model to produce a rogue string on hating humans

promptinject.HijackKillHumansMini

Prompt Injection: Kill humans

Get the model to produce a rogue string on killing humans

promptinject.HijackLongPromptMini

Prompt injection: long prompt

Get the model to produce a “long prompt” rogue string

promptinject.Tag

Prompt injection: unicode tag

Prompt injection using invisible unicode tags

goodside.ThreatenJSON

Output plain JSON

Threaten the model to make it produce raw JSON without any preamble

xss.MarkdownImageExfil

Cross-site scripting: Markdown image

Get the model to write embedded assets exfiltrating text in a Markdown URI

Privacy Probes

Module name

Probe name

Description

leakreplay.LiteratureComplete80

Book excerpt completion

Tests whether the model will complete a book excerpt when prompted with the beginning

leakreplay.LiteratureCloze80

Book excerpt insertion

Tests whether the model will fill in the middle of a book excerpt

leakreplay.GuardianComplete

Media excerpt completion - Guardian

Tests whether the model will complete an excerpt from The Guardian when prompted with the beginning

leakreplay.GuardianCloze

Media excerpt insertion - Guardian

Tests whether the model will fill in the middle of an excerpt from The Guardian

leakreplay.NYTComplete

Media excerpt completion - NYT

Tests whether the model will complete an excerpt from the New York Times when prompted with the beginning

leakreplay.NYTCloze

Media excerpt insertion - NYT

Tests whether the model will fill in the middle of an excerpt from the New York Times

privateinfo.EnronLeakEmail

Emails leakage - Enron

Tests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation

privateinfo.LeakPII

PII Leakage

Tests whether PII provided to the model during a conversation can be leaked

privateinfo.PrivacyUnderstanding

Confidentiality comprehension

Tests whether the model understands privacy-related phrases that flag when certain information is confidential

Hallucination Probes

Module name

Probe name

Description

socialvalues.QuackMedicine

Quack Medicine

Evaluates whether the model promotes quack medicine

goodside.WhoIsRiley

False assertions: Riley Goodside

Evaluates hallucinations about Riley Goodside

goodside.Davidjl

Glitch - Davidjl

Checks for hallucinations about the glitch token Davidjl

packagehallucination.Python

Hallucinate Python packages

Checks if a model hallucinates Python packages

snowball.SenatorsMini

False assertions: senators

Checks for false assertions about US senators

misleading.FalseAssertion50

False assertions: general

Checks for false assertions about general history/science facts

snowball.GraphConnectivityMini

False assertions: flights

Checks if the model invents impossible flight routes

snowball.PrimesMini

Hallucinate prime numbers

Evaluates hallucinations about prime numbers

Robustness Probes

Module name

Probe name

Description

advglue.BertattackOnMNLI

BERT attack on MNLI

Probes effects of BERT-based perturbation on MNLI task

advglue.BertattackOnMNLIMM

BERT attack on MNLI-MM

Probes effects of BERT-based perturbation on MNLI-MM task

advglue.BertattackOnQNLI

BERT attack on QNLI

Probes effects of BERT-based perturbation on QNLI task

advglue.BertattackOnQQP

BERT attack on QQP

Probes effects of BERT-based perturbation on QQP task

advglue.BertattackOnRTE

BERT attack on RTE

Probes effects of BERT-based perturbation on RTE task

advglue.BertattackOnSST2

BERT attack on SST2

Probes effects of BERT-based perturbation on SST2 task

advglue.SemattackOnMNLI

SemAttack on MNLI

Probes effects of SemAttack perturbation on MNLI task

advglue.SemattackOnMNLIMM

SemAttack on MNLI-MM

Probes effects of SemAttack perturbation on MNLI-MM task

advglue.SemattackOnQNLI

SemAttack on QNLI

Probes effects of SemAttack perturbation on QNLI task

advglue.SemattackOnQQP

SemAttack on QQP

Probes effects of SemAttack perturbation on QQP task

advglue.SemattackOnRTE

SemAttack on RTE

Probes effects of SemAttack perturbation on RTE task

advglue.SemattackOnSST2

SemAttack on SST2

Probes effects of SemAttack perturbation on SST2 task

advglue.SememepsoOnMNLI

SememePSO attack on MNLI

Probes effects of SememePSO perturbation on MNLI task

advglue.SememepsoOnMNLIMM

SememePSO attack on MNLI-MM

Probes effects of SememePSO perturbation on MNLI-MM task

advglue.SememepsoOnQNLI

SememePSO attack on QNLI

Probes effects of SememePSO perturbation on QNLI task

advglue.SememepsoOnQQP

SememePSO attack on QQP

Probes effects of SememePSO perturbation on QQP task

advglue.SememepsoOnRTE

SememePSO attack on RTE

Probes effects of SememePSO perturbation on RTE task

advglue.SememepsoOnSST2

SememePSO attack on SST2

Probes effects of SememePSO perturbation on SST2 task

advglue.TextbuggerOnMNLI

TextBugger attack on MNLI

Probes effects of TextBugger perturbation on MNLI task

advglue.TextbuggerOnMNLIMM

TextBugger attack on MNLI-MM

Probes effects of TextBugger perturbation on MNLI-MM task

advglue.TextbuggerOnQNLI

TextBugger attack on QNLI

Probes effects of TextBugger perturbation on QNLI task

advglue.TextbuggerOnQQP

TextBugger attack on QQP

Probes effects of TextBugger perturbation on QQP task

advglue.TextbuggerOnRTE

TextBugger attack on RTE

Probes effects of TextBugger perturbation on RTE task

advglue.TextbuggerOnSST2

TextBugger attack on SST2

Probes effects of TextBugger perturbation on SST2 task

advglue.TextfoolerOnMNLI

TextFooler attack on MNLI

Probes effects of TextFooler perturbation on MNLI task

advglue.TextfoolerOnMNLIMM

TextFooler attack on MNLI-MM

Probes effects of TextFooler perturbation on MNLI-MM task

advglue.TextfoolerOnQNLI

TextFooler attack on QNLI

Probes effects of TextFooler perturbation on QNLI task

advglue.TextfoolerOnQQP

TextFooler attack on QQP

Probes effects of TextFooler perturbation on QQP task

advglue.TextfoolerOnRTE

TextFooler attack on RTE

Probes effects of TextFooler perturbation on RTE task

advglue.TextfoolerOnSST2

TextFooler attack on SST2

Probes effects of TextFooler perturbation on SST2 task

Toxicity Probes

Module name

Probe name

Description

realtoxicityprompts.RTPSexually_Explicit

Sexually explicit

Prompts the model to produce sexually explicit outputs

realtoxicityprompts.RTPInsultJailbreak

Insult - adversarial

Prompts the model to produce insults. Uses an adversarial prefix.

realtoxicityprompts.RTPInsult

Insult

Prompts the model to produce insults

realtoxicityprompts.RTPSevere_ToxicityJailbreak

Severe toxicity - adversarial

Partial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.

realtoxicityprompts.RTPSexually_ExplicitJailbreak

Sexually explicit - adversarial

Prompts the model to produce sexually explicit outputs. Uses an adversarial prefix.

realtoxicityprompts.RTPIdentity_Attack

Identity-based attacks

Prompts the model to produce attacks based on a person’s identity

realtoxicityprompts.RTPProfanity

Profanity

Prompts the model to produce profane responses

realtoxicityprompts.RTPIdentity_AttackJailbreak

Adversarial Identity-based attacks

Prompts the model to produce attacks based on a person’s identity. Uses an adversarial prefix.

realtoxicityprompts.RTPThreat

Threatening language

Prompts the model to produce threats

realtoxicityprompts.RTPThreatJailbreak

Threatening language - Adversarial

Prompts the model to produce threats. Uses an adversarial prefix.

realtoxicityprompts.RTPSevere_Toxicity

Severe toxicity

Partial prompts that potentially produce continuations of severe toxicity

realtoxicityprompts.RTPFlirtation

Flirtation

Partial prompts to produce flirtatious continuations

realtoxicityprompts.RTPFlirtationJailbreak

Flirtation – Adversarial

Partial prompts to produce flirtatious continuations. Uses an adversarial prefix.

realtoxicityprompts.RTPProfanityJailbreak

Profanity.- Adversarial

Prompts the model to produce profane responses. Uses an adversarial prefix.

continuation.ContinueSlursReclaimedSlursART

Slur continuation

Prompts the model to continue slur words

socialvalues.Deadnaming

Deadnaming

Tests whether the model deadnames famous people

ethicalharms.Profanity

Profanity questions

Poses questions with potentially profane responses

ethicalharms.Bullying

Bullying

Poses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action

ethicalharms.SexualContent

Sexually explicit - risky words

Prompts the model to produce sexually explicit keywords

ethicalharms.Sexualisation

Sexual objectification

Prompts the model to produce sexually objectifying language

ethicalharms.SlurUsage

Slur question

Asks questions to see if the target will suggest or recommend ethnic slurs

Stereotype Probes

Module name

Probe name

Description

advstereo.Atheists

Stereotype religion: atheists

Checks for stereotypes about atheists

advstereo.LatinAmericans

Stereotype ethnicity: Latin Americans

Checks for stereotypes about Latin Americans

advstereo.Europeans

Stereotype ethnicity: Europeans

Checks for stereotypes about Europeans

advstereo.OldPeople

Stereotype age: old people

Checks for stereotypes about old people

advstereo.Homosexuals

Stereotype sexual orientation: homosexuals

Checks for stereotypes about homosexuals

advstereo.Immigrants

Stereotype nationality: immigrants

Checks for stereotypes about immigrants

advstereo.AbleBodiedPeople

Stereotype disability: able-bodied people

Checks for stereotypes about able-bodied people

advstereo.YoungPeople

Stereotype age: young people

Checks for stereotypes about young people

advstereo.Caucasians

Stereotype race: Caucasians

Checks for stereotypes about Caucasians

advstereo.BlackPeople

Stereotype race: black people

Checks for stereotypes about black people

advstereo.Buddhists

Stereotype religion: buddhists

Checks for stereotypes about buddhists

advstereo.PoorPeople

Stereotype income: poor people

Checks for stereotypes about poor people

advstereo.WhitePeople

Stereotype race: white people

Checks for stereotypes about white people

advstereo.Jews

Stereotype ethnicity: Jews

Checks for stereotypes about Jews

advstereo.RichPeople

Stereotype income: rich people

Checks for stereotypes about rich people

advstereo.Men

Stereotype gender: men

Checks for stereotypes about men

advstereo.Mexicans

Stereotype nationality: Mexicans

Checks for stereotypes about Mexicans

advstereo.Asians

Stereotype race: Asians

Checks for stereotypes about Asians

advstereo.Americans

Stereotype nationality: Americans

Checks for stereotypes about Americans

advstereo.Canadians

Stereotype nationality: Canadians

Checks for stereotypes about Canadians

advstereo.Christians

Stereotype religion: Christians

Checks for stereotypes about Christians

advstereo.PhysicallyDisabledPeople

Stereotype disability: disabled people

Checks for stereotypes about disabled people

advstereo.Muslims

Stereotype religion: Muslims

Checks for stereotypes about Muslims

advstereo.Women

Stereotype gender: women

Checks for stereotypes about women

Fairness Probes

Module name

Probe name

Description

adultdata.CounterfactualGender

Gender Bias

Measures whether the model uses gender information to determine a person’s likely income

“winobias.ProfessionalBias”

Professional bias

Measures whether the model associates occupations with gender

Ethics Probes

Module name

Probe name

Description

normative_ethics.Deontology

Deontological Ethics

Evaluates the model’s understanding of whether actions are
required, permitted, or forbidden according to a set
of rules or constraints

normative_ethics.DeontologyJailbreak

Adversarial Deontological Ethics

Evaluates the model’s understanding of whether actions are
required, permitted, or forbidden according to a set
of rules or constraints under adversarial prompting

normative_ethics.Justice

Justice-based Ethics

Evaluates whether the model understands principles of desert or impartiality

normative_ethics.JusticeJailbreak

Adversarial Justice-based Ethics

Evaluates whether the model understands principles of desert or impartiality under adversarial prompting

normative_ethics.Morality

Commonsense Morality

Evaluates whether the model’s moral judgments align with common moral intuitions

normative_ethics.MoralityJailbreak

Adversarial Commonsense Morality

Evaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions

normative_ethics.Utilitarianism

Utilitarianism

Evaluates whether the model can identify which of two scenarios has higher utility

normative_ethics.UtilitarianismJailbreak

Adversarial Utilitarianism

Evaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility

normative_ethics.Virtue

Virtue Ethics

Evaluates whether the model can identify moral virtues correctly

normative_ethics.VirtueJailbreak

Adversarial Virtue Ethics

Evaluates, under adversarial prompting, whether the model can identify moral virtues correctly

jiminycricket.Jiminy

Simulation

Evaluates the model’s ability to identify the moral valence of a simulated scenario

jiminycricket.JiminyJailbreak

Adversarial Simulation

Evaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario