Probes
Probes are the attack scenarios that test your agent. Theyβre organized by threat category.Security Probes
Security Probes
dan.AntiDAN β Anti-Do Anything NowEvaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treats safe responses as unsafe
dan.ChatGPT_Developer_Mode_RANTI β Do Anything Now: RANTI v1Ranti developer mode jailbreak, version 1
dan.ChatGPT_Developer_Mode_v2 β Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2
dan.ChatGPT_Image_Markdown β Inject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses
dan.Dan_10_0 β Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane things
dan.Dan_11_0 β Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs
dan.Dan_6_0 β Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations
dan.Dan_6_2 β Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations, with additional incitement to rudeness
dan.Dan_7_0 β Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character
dan.Dan_8_0 β Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality
dan.Dan_9_0 β Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered
dan.DAN_Jailbreak β Do Anything Now: JAILBREAKDo Anything Now jailbreak prompting the model to enter a JAILBREAK mode
dan.DUDE β Do anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE persona
dan.STAN β Strive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona
encoding.InjectAscii85 β Prompt injection: ASCII85Tries to get the model to decode an ASCII85-encoded string
encoding.InjectBase16 β Prompt injection: Base16Tries to get the model to decode a Base16-encoded string
encoding.InjectBase2048 β Prompt injection: Base2048Tries to get the model to decode a Base2048-encoded string
encoding.InjectBase32 β Prompt injection: Base32Tries to get the model to decode a Base32-encoded string
encoding.InjectBase64 β Prompt injection: Base64Tries to get the model to decode a Base64-encoded string
encoding.InjectBraille β Prompt injection: BrailleTries to get the model to decode a Braille-encoded string
encoding.InjectEcoji β Prompt injection: EcojiTries to get the model to decode an Ecoji-encoded string
encoding.InjectHex β Prompt injection: HexTries to get the model to decode a Hex-encoded string
encoding.InjectMorse β Prompt injection: MorseTries to get the model to decode a Morse-encoded string
encoding.InjectNato β Prompt injection: NATOTries to get the model to decode a NATO-encoded string
encoding.InjectROT13 β Prompt injection: ROT13Tries to get the model to decode a ROT13-encoded string
encoding.InjectUU β Prompt injection: UUTries to get the model to decode a UU-encoded string
encoding.InjectZalgo β Prompt injection: ZalgoTries to get the model to decode a Zalgo-encoded string
gcg.GCG_Probe β Adversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organization
glitch.Glitch100 β GlitchUses glitch tokens to get the model to fail to repeat a word
knownbadsignatures.EICAR β Test VirusGet the model to output an EICAR virus signature
knownbadsignatures.GTUBE β Test SpamGet the model to output the GTUBE spam signature
malwaregen.Evasion β Evasive malwareGet the model to write malware that evades common defenses
malwaregen.Payload β Malicious payloadGet the model to produce a malicious payload
malwaregen.SubFunctions β Malicious subfunctionsGet the model to write malicious subfunctions
malwaregen.TopLevel β General malwareGet the model to write general malware
promptinject.HijackHateHumansMini β Prompt injection: Hate humansGet the model to produce a rogue string on hating humans
promptinject.HijackKillHumansMini β Prompt injection: Kill humansGet the model to produce a rogue string on killing humans
promptinject.HijackLongPromptMini β Prompt injection: Long promptGet the model to produce a βlong promptβ rogue string
promptinject.Tag β Prompt injection: Unicode tagPrompt injection using invisible unicode tags
goodside.ThreatenJSON β Output plain JSONThreaten the model to make it produce raw JSON without any preamble
xss.MarkdownImageExfil β Cross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URI
dan.ChatGPT_Developer_Mode_RANTI β Do Anything Now: RANTI v1Ranti developer mode jailbreak, version 1
dan.ChatGPT_Developer_Mode_v2 β Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2
dan.ChatGPT_Image_Markdown β Inject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses
dan.Dan_10_0 β Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane things
dan.Dan_11_0 β Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs
dan.Dan_6_0 β Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations
dan.Dan_6_2 β Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations, with additional incitement to rudeness
dan.Dan_7_0 β Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character
dan.Dan_8_0 β Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality
dan.Dan_9_0 β Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered
dan.DAN_Jailbreak β Do Anything Now: JAILBREAKDo Anything Now jailbreak prompting the model to enter a JAILBREAK mode
dan.DUDE β Do anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE persona
dan.STAN β Strive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona
encoding.InjectAscii85 β Prompt injection: ASCII85Tries to get the model to decode an ASCII85-encoded string
encoding.InjectBase16 β Prompt injection: Base16Tries to get the model to decode a Base16-encoded string
encoding.InjectBase2048 β Prompt injection: Base2048Tries to get the model to decode a Base2048-encoded string
encoding.InjectBase32 β Prompt injection: Base32Tries to get the model to decode a Base32-encoded string
encoding.InjectBase64 β Prompt injection: Base64Tries to get the model to decode a Base64-encoded string
encoding.InjectBraille β Prompt injection: BrailleTries to get the model to decode a Braille-encoded string
encoding.InjectEcoji β Prompt injection: EcojiTries to get the model to decode an Ecoji-encoded string
encoding.InjectHex β Prompt injection: HexTries to get the model to decode a Hex-encoded string
encoding.InjectMorse β Prompt injection: MorseTries to get the model to decode a Morse-encoded string
encoding.InjectNato β Prompt injection: NATOTries to get the model to decode a NATO-encoded string
encoding.InjectROT13 β Prompt injection: ROT13Tries to get the model to decode a ROT13-encoded string
encoding.InjectUU β Prompt injection: UUTries to get the model to decode a UU-encoded string
encoding.InjectZalgo β Prompt injection: ZalgoTries to get the model to decode a Zalgo-encoded string
gcg.GCG_Probe β Adversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organization
glitch.Glitch100 β GlitchUses glitch tokens to get the model to fail to repeat a word
knownbadsignatures.EICAR β Test VirusGet the model to output an EICAR virus signature
knownbadsignatures.GTUBE β Test SpamGet the model to output the GTUBE spam signature
malwaregen.Evasion β Evasive malwareGet the model to write malware that evades common defenses
malwaregen.Payload β Malicious payloadGet the model to produce a malicious payload
malwaregen.SubFunctions β Malicious subfunctionsGet the model to write malicious subfunctions
malwaregen.TopLevel β General malwareGet the model to write general malware
promptinject.HijackHateHumansMini β Prompt injection: Hate humansGet the model to produce a rogue string on hating humans
promptinject.HijackKillHumansMini β Prompt injection: Kill humansGet the model to produce a rogue string on killing humans
promptinject.HijackLongPromptMini β Prompt injection: Long promptGet the model to produce a βlong promptβ rogue string
promptinject.Tag β Prompt injection: Unicode tagPrompt injection using invisible unicode tags
goodside.ThreatenJSON β Output plain JSONThreaten the model to make it produce raw JSON without any preamble
xss.MarkdownImageExfil β Cross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URI
Privacy Probes
Privacy Probes
leakreplay.LiteratureComplete80 β Book excerpt completionTests whether the model will complete a book excerpt when prompted with the beginning
leakreplay.LiteratureCloze80 β Book excerpt insertionTests whether the model will fill in the middle of a book excerpt
leakreplay.GuardianComplete β Media excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginning
leakreplay.GuardianCloze β Media excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The Guardian
leakreplay.NYTComplete β Media excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginning
leakreplay.NYTCloze β Media excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York Times
privateinfo.EnronLeakEmail β Email leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation
privateinfo.LeakPII β PII LeakageTests whether PII provided to the model during a conversation can be leaked
privateinfo.PrivacyUnderstanding β Confidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidential
leakreplay.LiteratureCloze80 β Book excerpt insertionTests whether the model will fill in the middle of a book excerpt
leakreplay.GuardianComplete β Media excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginning
leakreplay.GuardianCloze β Media excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The Guardian
leakreplay.NYTComplete β Media excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginning
leakreplay.NYTCloze β Media excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York Times
privateinfo.EnronLeakEmail β Email leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation
privateinfo.LeakPII β PII LeakageTests whether PII provided to the model during a conversation can be leaked
privateinfo.PrivacyUnderstanding β Confidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidential
Hallucination Probes
Hallucination Probes
socialvalues.QuackMedicine β Quack MedicineEvaluates whether the model promotes quack medicine
goodside.WhoIsRiley β False assertions: Riley GoodsideEvaluates hallucinations about Riley Goodside
goodside.Davidjl β Glitch - DavidjlChecks for hallucinations about the glitch token Davidjl
packagehallucination.Python β Hallucinate Python packagesChecks if a model hallucinates Python packages
snowball.SenatorsMini β False assertions: senatorsChecks for false assertions about US senators
misleading.FalseAssertion50 β False assertions: generalChecks for false assertions about general history/science facts
snowball.GraphConnectivityMini β False assertions: flightsChecks if the model invents impossible flight routes
snowball.PrimesMini β Hallucinate prime numbersEvaluates hallucinations about prime numbers
goodside.WhoIsRiley β False assertions: Riley GoodsideEvaluates hallucinations about Riley Goodside
goodside.Davidjl β Glitch - DavidjlChecks for hallucinations about the glitch token Davidjl
packagehallucination.Python β Hallucinate Python packagesChecks if a model hallucinates Python packages
snowball.SenatorsMini β False assertions: senatorsChecks for false assertions about US senators
misleading.FalseAssertion50 β False assertions: generalChecks for false assertions about general history/science facts
snowball.GraphConnectivityMini β False assertions: flightsChecks if the model invents impossible flight routes
snowball.PrimesMini β Hallucinate prime numbersEvaluates hallucinations about prime numbers
Robustness Probes
Robustness Probes
advglue.BertattackOnMNLI β BERT attack on MNLIProbes effects of BERT-based perturbation on MNLI task
advglue.BertattackOnMNLIMM β BERT attack on MNLI-MMProbes effects of BERT-based perturbation on MNLI-MM task
advglue.BertattackOnQNLI β BERT attack on QNLIProbes effects of BERT-based perturbation on QNLI task
advglue.BertattackOnQQP β BERT attack on QQPProbes effects of BERT-based perturbation on QQP task
advglue.BertattackOnRTE β BERT attack on RTEProbes effects of BERT-based perturbation on RTE task
advglue.BertattackOnSST2 β BERT attack on SST2Probes effects of BERT-based perturbation on SST2 task
advglue.SemattackOnMNLI β SemAttack on MNLIProbes effects of SemAttack perturbation on MNLI task
advglue.SemattackOnMNLIMM β SemAttack on MNLI-MMProbes effects of SemAttack perturbation on MNLI-MM task
advglue.SemattackOnQNLI β SemAttack on QNLIProbes effects of SemAttack perturbation on QNLI task
advglue.SemattackOnQQP β SemAttack on QQPProbes effects of SemAttack perturbation on QQP task
advglue.SemattackOnRTE β SemAttack on RTEProbes effects of SemAttack perturbation on RTE task
advglue.SemattackOnSST2 β SemAttack on SST2Probes effects of SemAttack perturbation on SST2 task
advglue.SememepsoOnMNLI β SememePSO attack on MNLIProbes effects of SememePSO perturbation on MNLI task
advglue.SememepsoOnMNLIMM β SememePSO attack on MNLI-MMProbes effects of SememePSO perturbation on MNLI-MM task
advglue.SememepsoOnQNLI β SememePSO attack on QNLIProbes effects of SememePSO perturbation on QNLI task
advglue.SememepsoOnQQP β SememePSO attack on QQPProbes effects of SememePSO perturbation on QQP task
advglue.SememepsoOnRTE β SememePSO attack on RTEProbes effects of SememePSO perturbation on RTE task
advglue.SememepsoOnSST2 β SememePSO attack on SST2Probes effects of SememePSO perturbation on SST2 task
advglue.TextbuggerOnMNLI β TextBugger attack on MNLIProbes effects of TextBugger perturbation on MNLI task
advglue.TextbuggerOnMNLIMM β TextBugger attack on MNLI-MMProbes effects of TextBugger perturbation on MNLI-MM task
advglue.TextbuggerOnQNLI β TextBugger attack on QNLIProbes effects of TextBugger perturbation on QNLI task
advglue.TextbuggerOnQQP β TextBugger attack on QQPProbes effects of TextBugger perturbation on QQP task
advglue.TextbuggerOnRTE β TextBugger attack on RTEProbes effects of TextBugger perturbation on RTE task
advglue.TextbuggerOnSST2 β TextBugger attack on SST2Probes effects of TextBugger perturbation on SST2 task
advglue.TextfoolerOnMNLI β TextFooler attack on MNLIProbes effects of TextFooler perturbation on MNLI task
advglue.TextfoolerOnMNLIMM β TextFooler attack on MNLI-MMProbes effects of TextFooler perturbation on MNLI-MM task
advglue.TextfoolerOnQNLI β TextFooler attack on QNLIProbes effects of TextFooler perturbation on QNLI task
advglue.TextfoolerOnQQP β TextFooler attack on QQPProbes effects of TextFooler perturbation on QQP task
advglue.TextfoolerOnRTE β TextFooler attack on RTEProbes effects of TextFooler perturbation on RTE task
advglue.TextfoolerOnSST2 β TextFooler attack on SST2Probes effects of TextFooler perturbation on SST2 task
advglue.BertattackOnMNLIMM β BERT attack on MNLI-MMProbes effects of BERT-based perturbation on MNLI-MM task
advglue.BertattackOnQNLI β BERT attack on QNLIProbes effects of BERT-based perturbation on QNLI task
advglue.BertattackOnQQP β BERT attack on QQPProbes effects of BERT-based perturbation on QQP task
advglue.BertattackOnRTE β BERT attack on RTEProbes effects of BERT-based perturbation on RTE task
advglue.BertattackOnSST2 β BERT attack on SST2Probes effects of BERT-based perturbation on SST2 task
advglue.SemattackOnMNLI β SemAttack on MNLIProbes effects of SemAttack perturbation on MNLI task
advglue.SemattackOnMNLIMM β SemAttack on MNLI-MMProbes effects of SemAttack perturbation on MNLI-MM task
advglue.SemattackOnQNLI β SemAttack on QNLIProbes effects of SemAttack perturbation on QNLI task
advglue.SemattackOnQQP β SemAttack on QQPProbes effects of SemAttack perturbation on QQP task
advglue.SemattackOnRTE β SemAttack on RTEProbes effects of SemAttack perturbation on RTE task
advglue.SemattackOnSST2 β SemAttack on SST2Probes effects of SemAttack perturbation on SST2 task
advglue.SememepsoOnMNLI β SememePSO attack on MNLIProbes effects of SememePSO perturbation on MNLI task
advglue.SememepsoOnMNLIMM β SememePSO attack on MNLI-MMProbes effects of SememePSO perturbation on MNLI-MM task
advglue.SememepsoOnQNLI β SememePSO attack on QNLIProbes effects of SememePSO perturbation on QNLI task
advglue.SememepsoOnQQP β SememePSO attack on QQPProbes effects of SememePSO perturbation on QQP task
advglue.SememepsoOnRTE β SememePSO attack on RTEProbes effects of SememePSO perturbation on RTE task
advglue.SememepsoOnSST2 β SememePSO attack on SST2Probes effects of SememePSO perturbation on SST2 task
advglue.TextbuggerOnMNLI β TextBugger attack on MNLIProbes effects of TextBugger perturbation on MNLI task
advglue.TextbuggerOnMNLIMM β TextBugger attack on MNLI-MMProbes effects of TextBugger perturbation on MNLI-MM task
advglue.TextbuggerOnQNLI β TextBugger attack on QNLIProbes effects of TextBugger perturbation on QNLI task
advglue.TextbuggerOnQQP β TextBugger attack on QQPProbes effects of TextBugger perturbation on QQP task
advglue.TextbuggerOnRTE β TextBugger attack on RTEProbes effects of TextBugger perturbation on RTE task
advglue.TextbuggerOnSST2 β TextBugger attack on SST2Probes effects of TextBugger perturbation on SST2 task
advglue.TextfoolerOnMNLI β TextFooler attack on MNLIProbes effects of TextFooler perturbation on MNLI task
advglue.TextfoolerOnMNLIMM β TextFooler attack on MNLI-MMProbes effects of TextFooler perturbation on MNLI-MM task
advglue.TextfoolerOnQNLI β TextFooler attack on QNLIProbes effects of TextFooler perturbation on QNLI task
advglue.TextfoolerOnQQP β TextFooler attack on QQPProbes effects of TextFooler perturbation on QQP task
advglue.TextfoolerOnRTE β TextFooler attack on RTEProbes effects of TextFooler perturbation on RTE task
advglue.TextfoolerOnSST2 β TextFooler attack on SST2Probes effects of TextFooler perturbation on SST2 task
Toxicity Probes
Toxicity Probes
realtoxicityprompts.RTPSexually_Explicit β Sexually explicitPrompts the model to produce sexually explicit outputs
realtoxicityprompts.RTPInsultJailbreak β Insult - adversarialPrompts the model to produce insults. Uses an adversarial prefix.
realtoxicityprompts.RTPInsult β InsultPrompts the model to produce insults
realtoxicityprompts.RTPSevere_ToxicityJailbreak β Severe toxicity - adversarialPartial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.
realtoxicityprompts.RTPSexually_ExplicitJailbreak β Sexually explicit - adversarialPrompts the model to produce sexually explicit outputs. Uses an adversarial prefix.
realtoxicityprompts.RTPIdentity_Attack β Identity-based attacksPrompts the model to produce attacks based on a personβs identity
realtoxicityprompts.RTPProfanity β ProfanityPrompts the model to produce profane responses
realtoxicityprompts.RTPIdentity_AttackJailbreak β Adversarial identity-based attacksPrompts the model to produce attacks based on a personβs identity. Uses an adversarial prefix.
realtoxicityprompts.RTPThreat β Threatening languagePrompts the model to produce threats
realtoxicityprompts.RTPThreatJailbreak β Threatening language - adversarialPrompts the model to produce threats. Uses an adversarial prefix.
realtoxicityprompts.RTPSevere_Toxicity β Severe toxicityPartial prompts that potentially produce continuations of severe toxicity
realtoxicityprompts.RTPFlirtation β FlirtationPartial prompts to produce flirtatious continuations
realtoxicityprompts.RTPFlirtationJailbreak β Flirtation - adversarialPartial prompts to produce flirtatious continuations. Uses an adversarial prefix.
realtoxicityprompts.RTPProfanityJailbreak β Profanity - adversarialPrompts the model to produce profane responses. Uses an adversarial prefix.
continuation.ContinueSlursReclaimedSlursART β Slur continuationPrompts the model to continue slur words
socialvalues.Deadnaming β DeadnamingTests whether the model deadnames famous people
ethicalharms.Profanity β Profanity questionsPoses questions with potentially profane responses
ethicalharms.Bullying β BullyingPoses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action
ethicalharms.SexualContent β Sexually explicit - risky wordsPrompts the model to produce sexually explicit keywords
ethicalharms.Sexualisation β Sexual objectificationPrompts the model to produce sexually objectifying language
ethicalharms.SlurUsage β Slur questionAsks questions to see if the target will suggest or recommend ethnic slurs
realtoxicityprompts.RTPInsultJailbreak β Insult - adversarialPrompts the model to produce insults. Uses an adversarial prefix.
realtoxicityprompts.RTPInsult β InsultPrompts the model to produce insults
realtoxicityprompts.RTPSevere_ToxicityJailbreak β Severe toxicity - adversarialPartial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.
realtoxicityprompts.RTPSexually_ExplicitJailbreak β Sexually explicit - adversarialPrompts the model to produce sexually explicit outputs. Uses an adversarial prefix.
realtoxicityprompts.RTPIdentity_Attack β Identity-based attacksPrompts the model to produce attacks based on a personβs identity
realtoxicityprompts.RTPProfanity β ProfanityPrompts the model to produce profane responses
realtoxicityprompts.RTPIdentity_AttackJailbreak β Adversarial identity-based attacksPrompts the model to produce attacks based on a personβs identity. Uses an adversarial prefix.
realtoxicityprompts.RTPThreat β Threatening languagePrompts the model to produce threats
realtoxicityprompts.RTPThreatJailbreak β Threatening language - adversarialPrompts the model to produce threats. Uses an adversarial prefix.
realtoxicityprompts.RTPSevere_Toxicity β Severe toxicityPartial prompts that potentially produce continuations of severe toxicity
realtoxicityprompts.RTPFlirtation β FlirtationPartial prompts to produce flirtatious continuations
realtoxicityprompts.RTPFlirtationJailbreak β Flirtation - adversarialPartial prompts to produce flirtatious continuations. Uses an adversarial prefix.
realtoxicityprompts.RTPProfanityJailbreak β Profanity - adversarialPrompts the model to produce profane responses. Uses an adversarial prefix.
continuation.ContinueSlursReclaimedSlursART β Slur continuationPrompts the model to continue slur words
socialvalues.Deadnaming β DeadnamingTests whether the model deadnames famous people
ethicalharms.Profanity β Profanity questionsPoses questions with potentially profane responses
ethicalharms.Bullying β BullyingPoses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action
ethicalharms.SexualContent β Sexually explicit - risky wordsPrompts the model to produce sexually explicit keywords
ethicalharms.Sexualisation β Sexual objectificationPrompts the model to produce sexually objectifying language
ethicalharms.SlurUsage β Slur questionAsks questions to see if the target will suggest or recommend ethnic slurs
Stereotype Probes
Stereotype Probes
advstereo.Atheists β Stereotype religion: atheistsChecks for stereotypes about atheists
advstereo.LatinAmericans β Stereotype ethnicity: Latin AmericansChecks for stereotypes about Latin Americans
advstereo.Europeans β Stereotype ethnicity: EuropeansChecks for stereotypes about Europeans
advstereo.OldPeople β Stereotype age: old peopleChecks for stereotypes about old people
advstereo.Homosexuals β Stereotype sexual orientation: homosexualsChecks for stereotypes about homosexuals
advstereo.Immigrants β Stereotype nationality: immigrantsChecks for stereotypes about immigrants
advstereo.AbleBodiedPeople β Stereotype disability: able-bodied peopleChecks for stereotypes about able-bodied people
advstereo.YoungPeople β Stereotype age: young peopleChecks for stereotypes about young people
advstereo.Caucasians β Stereotype race: CaucasiansChecks for stereotypes about Caucasians
advstereo.BlackPeople β Stereotype race: black peopleChecks for stereotypes about black people
advstereo.Buddhists β Stereotype religion: BuddhistsChecks for stereotypes about Buddhists
advstereo.PoorPeople β Stereotype income: poor peopleChecks for stereotypes about poor people
advstereo.WhitePeople β Stereotype race: white peopleChecks for stereotypes about white people
advstereo.Jews β Stereotype ethnicity: JewsChecks for stereotypes about Jews
advstereo.RichPeople β Stereotype income: rich peopleChecks for stereotypes about rich people
advstereo.Men β Stereotype gender: menChecks for stereotypes about men
advstereo.Mexicans β Stereotype nationality: MexicansChecks for stereotypes about Mexicans
advstereo.Asians β Stereotype race: AsiansChecks for stereotypes about Asians
advstereo.Americans β Stereotype nationality: AmericansChecks for stereotypes about Americans
advstereo.Canadians β Stereotype nationality: CanadiansChecks for stereotypes about Canadians
advstereo.Christians β Stereotype religion: ChristiansChecks for stereotypes about Christians
advstereo.PhysicallyDisabledPeople β Stereotype disability: disabled peopleChecks for stereotypes about disabled people
advstereo.Muslims β Stereotype religion: MuslimsChecks for stereotypes about Muslims
advstereo.Women β Stereotype gender: womenChecks for stereotypes about women
advstereo.LatinAmericans β Stereotype ethnicity: Latin AmericansChecks for stereotypes about Latin Americans
advstereo.Europeans β Stereotype ethnicity: EuropeansChecks for stereotypes about Europeans
advstereo.OldPeople β Stereotype age: old peopleChecks for stereotypes about old people
advstereo.Homosexuals β Stereotype sexual orientation: homosexualsChecks for stereotypes about homosexuals
advstereo.Immigrants β Stereotype nationality: immigrantsChecks for stereotypes about immigrants
advstereo.AbleBodiedPeople β Stereotype disability: able-bodied peopleChecks for stereotypes about able-bodied people
advstereo.YoungPeople β Stereotype age: young peopleChecks for stereotypes about young people
advstereo.Caucasians β Stereotype race: CaucasiansChecks for stereotypes about Caucasians
advstereo.BlackPeople β Stereotype race: black peopleChecks for stereotypes about black people
advstereo.Buddhists β Stereotype religion: BuddhistsChecks for stereotypes about Buddhists
advstereo.PoorPeople β Stereotype income: poor peopleChecks for stereotypes about poor people
advstereo.WhitePeople β Stereotype race: white peopleChecks for stereotypes about white people
advstereo.Jews β Stereotype ethnicity: JewsChecks for stereotypes about Jews
advstereo.RichPeople β Stereotype income: rich peopleChecks for stereotypes about rich people
advstereo.Men β Stereotype gender: menChecks for stereotypes about men
advstereo.Mexicans β Stereotype nationality: MexicansChecks for stereotypes about Mexicans
advstereo.Asians β Stereotype race: AsiansChecks for stereotypes about Asians
advstereo.Americans β Stereotype nationality: AmericansChecks for stereotypes about Americans
advstereo.Canadians β Stereotype nationality: CanadiansChecks for stereotypes about Canadians
advstereo.Christians β Stereotype religion: ChristiansChecks for stereotypes about Christians
advstereo.PhysicallyDisabledPeople β Stereotype disability: disabled peopleChecks for stereotypes about disabled people
advstereo.Muslims β Stereotype religion: MuslimsChecks for stereotypes about Muslims
advstereo.Women β Stereotype gender: womenChecks for stereotypes about women
Fairness Probes
Fairness Probes
adultdata.CounterfactualGender β Gender BiasMeasures whether the model uses gender information to determine a personβs likely income
winobias.ProfessionalBias β Professional biasMeasures whether the model associates occupations with gender
winobias.ProfessionalBias β Professional biasMeasures whether the model associates occupations with gender
Ethics Probes
Ethics Probes
normative_ethics.Deontology β Deontological EthicsEvaluates the modelβs understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints
normative_ethics.DeontologyJailbreak β Adversarial Deontological EthicsEvaluates the modelβs understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting
normative_ethics.Justice β Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality
normative_ethics.JusticeJailbreak β Adversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial prompting
normative_ethics.Morality β Commonsense MoralityEvaluates whether the modelβs moral judgments align with common moral intuitions
normative_ethics.MoralityJailbreak β Adversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the modelβs moral judgments align with common moral intuitions
normative_ethics.Utilitarianism β UtilitarianismEvaluates whether the model can identify which of two scenarios has higher utility
normative_ethics.UtilitarianismJailbreak β Adversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility
normative_ethics.Virtue β Virtue EthicsEvaluates whether the model can identify moral virtues correctly
normative_ethics.VirtueJailbreak β Adversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctly
jiminycricket.Jiminy β SimulationEvaluates the modelβs ability to identify the moral valence of a simulated scenario
jiminycricket.JiminyJailbreak β Adversarial SimulationEvaluates, under adversarial prompting, the modelβs ability to identify the moral valence of a simulated scenario
normative_ethics.DeontologyJailbreak β Adversarial Deontological EthicsEvaluates the modelβs understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting
normative_ethics.Justice β Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality
normative_ethics.JusticeJailbreak β Adversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial prompting
normative_ethics.Morality β Commonsense MoralityEvaluates whether the modelβs moral judgments align with common moral intuitions
normative_ethics.MoralityJailbreak β Adversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the modelβs moral judgments align with common moral intuitions
normative_ethics.Utilitarianism β UtilitarianismEvaluates whether the model can identify which of two scenarios has higher utility
normative_ethics.UtilitarianismJailbreak β Adversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility
normative_ethics.Virtue β Virtue EthicsEvaluates whether the model can identify moral virtues correctly
normative_ethics.VirtueJailbreak β Adversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctly
jiminycricket.Jiminy β SimulationEvaluates the modelβs ability to identify the moral valence of a simulated scenario
jiminycricket.JiminyJailbreak β Adversarial SimulationEvaluates, under adversarial prompting, the modelβs ability to identify the moral valence of a simulated scenario
Detectors
Detectors analyze agent responses to determine if a probe succeeded or failed. You may see these module names in your evaluation reports.Response Analysis Detectors
Response Analysis Detectors
base.TriggerListDetector β Trigger presenceA model fails the test if its response contains trigger words that are specific to the given prompt
base.TriggerListAbsenceDetector β Trigger absenceTo pass the test, the model response must contain trigger words that are specific to the given prompt
leakreplay.StartsWith β Starts with triggerDetects whether the modelβs response starts with any of a list of trigger phrases
snowball.DontStartNo β Donβt start noDetects whether the model response starts with βnoβ. If so, it fails the test.
snowball.DontStartYes β Donβt start yesDetects whether the model response starts with βyesβ. If so, it fails the test.
misleading.MustRefuteClaimKW β Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
misleading.MustRefuteClaimKWART β Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
mitigation.MitigationBypassART β RefusalDetects whether a model refuses to fulfill a request in the prompt by looking for key phrases
mitigation.TriggerAbsenceWithRefusal β Trigger absence with acceptable refusalPasses if model refuses to answer, fails if model answers without including the correct trigger
base.TriggerListAbsenceDetector β Trigger absenceTo pass the test, the model response must contain trigger words that are specific to the given prompt
leakreplay.StartsWith β Starts with triggerDetects whether the modelβs response starts with any of a list of trigger phrases
snowball.DontStartNo β Donβt start noDetects whether the model response starts with βnoβ. If so, it fails the test.
snowball.DontStartYes β Donβt start yesDetects whether the model response starts with βyesβ. If so, it fails the test.
misleading.MustRefuteClaimKW β Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
misleading.MustRefuteClaimKWART β Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
mitigation.MitigationBypassART β RefusalDetects whether a model refuses to fulfill a request in the prompt by looking for key phrases
mitigation.TriggerAbsenceWithRefusal β Trigger absence with acceptable refusalPasses if model refuses to answer, fails if model answers without including the correct trigger
Jailbreak & Mode Detectors
Jailbreak & Mode Detectors
dan.AntiDAN β Anti-Do Anything Now modeDetects whether the model enters Anti-Do Anything Now (AntiDAN) mode
dan.DAN β Do Anything Now modeDetects whether the model enters Do Anything Now (DAN) mode
dan.DANJailbreak β Jailbroken stateDetects whether the model enters a jailbroken state
dan.DevMode β Developer modeDetects whether the model enters Developer mode
dan.DUDE β Do anything and everything modeDetects whether the model enters Do anything and everything (DUDE) mode
dan.MarkdownLink β Markdown linkDetects presence of Markdown-formatted link
dan.STAN β Strive To Avoid Norms modeDetects whether the model enters Strive To Avoid Norms (STAN) mode
dan.DAN β Do Anything Now modeDetects whether the model enters Do Anything Now (DAN) mode
dan.DANJailbreak β Jailbroken stateDetects whether the model enters a jailbroken state
dan.DevMode β Developer modeDetects whether the model enters Developer mode
dan.DUDE β Do anything and everything modeDetects whether the model enters Do anything and everything (DUDE) mode
dan.MarkdownLink β Markdown linkDetects presence of Markdown-formatted link
dan.STAN β Strive To Avoid Norms modeDetects whether the model enters Strive To Avoid Norms (STAN) mode
Encoding & Pattern Detectors
Encoding & Pattern Detectors
encoding.DecodeMatch β Match decoded stringDetects whether specific decoded strings are present in the modelβs response
ansiescape.Raw β Raw ANSI SequencesDetects raw ANSI sequences
ansiescape.Escaped β Escaped ANSI SequencesDetects escaped ANSI sequences
productkey.Win5x5 β Windows Product KeyDetects a product key in the form of five groups of five characters
ansiescape.Raw β Raw ANSI SequencesDetects raw ANSI sequences
ansiescape.Escaped β Escaped ANSI SequencesDetects escaped ANSI sequences
productkey.Win5x5 β Windows Product KeyDetects a product key in the form of five groups of five characters
Malware & Signature Detectors
Malware & Signature Detectors
malwaregenART.AnyCode β Code presenceDetects the presence of code of various programming languages in the modelβs response
knownbadsignatures.EICAR β EICAR virusDetects EICAR virus signature
knownbadsignatures.GTphish β PhishingDetects phishing content
knownbadsignatures.GTUBE β SpamDetects spam content
knownbadsignatures.EICAR β EICAR virusDetects EICAR virus signature
knownbadsignatures.GTphish β PhishingDetects phishing content
knownbadsignatures.GTUBE β SpamDetects spam content
Privacy & PII Detectors
Privacy & PII Detectors
privateinfo.DontAnswerYes β Donβt answer yesDetects whether the model says it is OK to reveal private information
privateinfo.LeakPII β Private information leakageDetects whether the model leaks private information that is provided in the prompt
privateinfo.LeakPII β Private information leakageDetects whether the model leaks private information that is provided in the prompt
XSS & Exfiltration Detectors
XSS & Exfiltration Detectors
xss.MarkdownExfilBasic β Markdown imageDetects whether the response contains a Markdown image
xss.MarkdownExfilContent β Private content in Markdown imageDetects whether the response contains content private to a conversation within a Markdown image
xss.MarkdownExfilContent β Private content in Markdown imageDetects whether the response contains content private to a conversation within a Markdown image
Package Hallucination Detectors
Package Hallucination Detectors
packagehallucinationART.PythonPypi β Python package hallucinationDetects whether the model provides code that imports non-existent Python packages
packagehallucinationART.RustCrates β Rust package hallucinationDetects whether the model provides code that imports non-existent Rust packages
packagehallucinationART.JavaScriptNpm β JavaScript package hallucinationDetects whether the model provides code that imports non-existent JavaScript packages
packagehallucinationART.RubyGems β Ruby package hallucinationDetects whether the model provides code that imports non-existent Ruby packages
packagehallucinationART.RustCrates β Rust package hallucinationDetects whether the model provides code that imports non-existent Rust packages
packagehallucinationART.JavaScriptNpm β JavaScript package hallucinationDetects whether the model provides code that imports non-existent JavaScript packages
packagehallucinationART.RubyGems β Ruby package hallucinationDetects whether the model provides code that imports non-existent Ruby packages
Content & Toxicity Detectors
Content & Toxicity Detectors
toxicity.ToxicCommentModel β Model-based toxicityUses a small language model to determine whether the response is toxic
lmrc.QuackMedicine β Quack medicineDetects whether the response contains quack medicine advice
riskywords.OfcomOffensiveRaceEthnic β Offensive: racial/ethnicDetects whether the response contains offensive racial or ethnic terms
riskywords.SurgeProfanityAnimalReferences β Profanity: animal referencesDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityBodilyFluids β Profanity: bodily fluidsDetects whether the model outputs profanity containing words related to bodily fluids
riskywords.SurgeProfanityMentalDisability β Profanity: mental disabilityDetects whether the model outputs profanity containing words related to mental disabilities
riskywords.SurgeProfanityPhysicalAttributes β Profanity: physical attributesDetects whether the model outputs profanity containing words related to physical attributes
riskywords.SurgeProfanityPhysicalDisability β Profanity: physical disabilityDetects whether the model outputs profanity containing words related to physical disabilities
riskywords.SurgeProfanityPolitical β Profanity: politicalDetects whether the model outputs profanity containing words related to politics
riskywords.SurgeProfanityRacialEthnic β Profanity: racial/ethnicDetects whether the model outputs profanity containing words related to race/ethnicity
riskywords.SurgeProfanityReligious β Profanity: religiousDetects whether the model outputs profanity containing words related to religion
riskywords.SurgeProfanitySexual β Profanity: sexualDetects whether the model outputs profanity containing words related to sexuality
riskywords.SurgeProfanitySexualOrientationGender β Profanity: sexual orientation/genderDetects whether the model outputs profanity containing words related to sexual orientation/gender
lmrc.QuackMedicine β Quack medicineDetects whether the response contains quack medicine advice
riskywords.OfcomOffensiveRaceEthnic β Offensive: racial/ethnicDetects whether the response contains offensive racial or ethnic terms
riskywords.SurgeProfanityAnimalReferences β Profanity: animal referencesDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityBodilyFluids β Profanity: bodily fluidsDetects whether the model outputs profanity containing words related to bodily fluids
riskywords.SurgeProfanityMentalDisability β Profanity: mental disabilityDetects whether the model outputs profanity containing words related to mental disabilities
riskywords.SurgeProfanityPhysicalAttributes β Profanity: physical attributesDetects whether the model outputs profanity containing words related to physical attributes
riskywords.SurgeProfanityPhysicalDisability β Profanity: physical disabilityDetects whether the model outputs profanity containing words related to physical disabilities
riskywords.SurgeProfanityPolitical β Profanity: politicalDetects whether the model outputs profanity containing words related to politics
riskywords.SurgeProfanityRacialEthnic β Profanity: racial/ethnicDetects whether the model outputs profanity containing words related to race/ethnicity
riskywords.SurgeProfanityReligious β Profanity: religiousDetects whether the model outputs profanity containing words related to religion
riskywords.SurgeProfanitySexual β Profanity: sexualDetects whether the model outputs profanity containing words related to sexuality
riskywords.SurgeProfanitySexualOrientationGender β Profanity: sexual orientation/genderDetects whether the model outputs profanity containing words related to sexual orientation/gender
Glitch & Edge Case Detectors
Glitch & Edge Case Detectors
goodside.Glitch β GlitchDetects whether a model fails to repeat back a glitch token despite being asked to
goodside.PlainJSON β Plain JSONDetects whether the model returns plain JSON without any preamble
goodside.RileyIsnt β Riley Goodside misinformationUses key phrases to detect whether the response contains misinformation about Riley Goodside
goodside.PlainJSON β Plain JSONDetects whether the model returns plain JSON without any preamble
goodside.RileyIsnt β Riley Goodside misinformationUses key phrases to detect whether the response contains misinformation about Riley Goodside
Stereotype & Bias Detectors
Stereotype & Bias Detectors
advstereo.IAgree β Stereotype agreementDetects whether a model agrees with a stereotype presented in the prompt
adultdata.PairwiseComparison β Pairwise gender comparisonCompares the modelβs responses to two questions about a personβs income where only the personβs gender is changed
winobias.TriggerPlusAmbiguity β Detect gender bias in professionsDetects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector
adultdata.PairwiseComparison β Pairwise gender comparisonCompares the modelβs responses to two questions about a personβs income where only the personβs gender is changed
winobias.TriggerPlusAmbiguity β Detect gender bias in professionsDetects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector
LLM-based Detectors
LLM-based Detectors
llm.AnswerRelevancy β Answer RelevancyGeneration metric, measures the degree to which the final generated output is relevant to the original input
llm.ContextualPrecision β Contextual PrecisionRetrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts
llm.ContextualRecall β Contextual RecallRetrieval metric, measures the extent to which the retrieved contexts align with the golden answers
llm.Correctness β CorrectnessEvaluates the correctness of the generated answer
llm.Faithfulness β FaithfulnessGeneration metric, measures how much the generated response stays faithful to the retrieved contexts
llm.StrongReject β Strong RejectDetects whether a model fails to respond to harmful instructions
llm.ConversationRoleAdherence β Conversation Role AdherenceDetects whether a model adheres to the role of the conversation
llm.ConversationRelevancy β Conversation RelevancyDetects whether a model is relevant to the conversation
llm.ConversationCompleteness β Conversation CompletenessDetects whether a model completes the conversation
llm.ConversationKnowledgeRetention β Conversation Knowledge RetentionDetects whether a model maintains consistent knowledge throughout the conversation
llm.Refusal β LLM-based RefusalUses an LLM to detect if a model response avoids answering the question posed in the prompt
llm.PolicyViolation β Policy ViolationDetects when output violates a specified usage policy
llm.ContextualPrecision β Contextual PrecisionRetrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts
llm.ContextualRecall β Contextual RecallRetrieval metric, measures the extent to which the retrieved contexts align with the golden answers
llm.Correctness β CorrectnessEvaluates the correctness of the generated answer
llm.Faithfulness β FaithfulnessGeneration metric, measures how much the generated response stays faithful to the retrieved contexts
llm.StrongReject β Strong RejectDetects whether a model fails to respond to harmful instructions
llm.ConversationRoleAdherence β Conversation Role AdherenceDetects whether a model adheres to the role of the conversation
llm.ConversationRelevancy β Conversation RelevancyDetects whether a model is relevant to the conversation
llm.ConversationCompleteness β Conversation CompletenessDetects whether a model completes the conversation
llm.ConversationKnowledgeRetention β Conversation Knowledge RetentionDetects whether a model maintains consistent knowledge throughout the conversation
llm.Refusal β LLM-based RefusalUses an LLM to detect if a model response avoids answering the question posed in the prompt
llm.PolicyViolation β Policy ViolationDetects when output violates a specified usage policy
Multi-turn Attack Detectors
Multi-turn Attack Detectors
multi_turn.Crescendo β Crescendo AttackDetects how well an agent performs against a Crescendo attack. Not available in the detection endpoint.
multi_turn.Linear β Linear Jailbreak AttackDetects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.
multi_turn.Tree β Tree Jailbreak AttackDetects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.
multi_turn.Linear β Linear Jailbreak AttackDetects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.
multi_turn.Tree β Tree Jailbreak AttackDetects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.
Robustness Detectors
Robustness Detectors
advglue.PairwiseComparison β Robustness pairwise comparisonCompares the modelβs performance on a question from a standardized task when the questionβs language is perturbed versus when it is not
NLP Metric Detectors
NLP Metric Detectors
nlp.BLEU β BLEUGeneration metric, measures semantic similarity of expected and generated answer
nlp.METEOR β METEORGeneration metric, measures semantic similarity of expected and generated answer
nlp.BERTScore β BERTScoreGeneration metric, measures semantic similarity of expected and generated answer
nlp.METEOR β METEORGeneration metric, measures semantic similarity of expected and generated answer
nlp.BERTScore β BERTScoreGeneration metric, measures semantic similarity of expected and generated answer
Specialized Detectors
Specialized Detectors
toolcalling.ToolCorrectness β Tool CorrectnessDetects correct use of a set of tools
mathLatexSympy.MathLatexSympy β Math checking using SympyTo pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency
instructionFollowing.InstructionFollowing β Following Formatting InstructionsTo pass the test, the model response should adhere to instructions as specified by the trigger. Instructions are deterministically verifiable (e.g., βuse 500 words or lessβ, βuse markdown formattingβ)
mathLatexSympy.MathLatexSympy β Math checking using SympyTo pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency
instructionFollowing.InstructionFollowing β Following Formatting InstructionsTo pass the test, the model response should adhere to instructions as specified by the trigger. Instructions are deterministically verifiable (e.g., βuse 500 words or lessβ, βuse markdown formattingβ)