Skip to main content
This reference maps component module names you may see in your evaluation reports to their human-readable names and descriptions.

Probes

Probes are the attack scenarios that test your agent. They’re organized by threat category.
dan.AntiDAN β€” Anti-Do Anything NowEvaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treats safe responses as unsafe
dan.ChatGPT_Developer_Mode_RANTI β€” Do Anything Now: RANTI v1Ranti developer mode jailbreak, version 1
dan.ChatGPT_Developer_Mode_v2 β€” Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2
dan.ChatGPT_Image_Markdown β€” Inject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses
dan.Dan_10_0 β€” Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane things
dan.Dan_11_0 β€” Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs
dan.Dan_6_0 β€” Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations
dan.Dan_6_2 β€” Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations, with additional incitement to rudeness
dan.Dan_7_0 β€” Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character
dan.Dan_8_0 β€” Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality
dan.Dan_9_0 β€” Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered
dan.DAN_Jailbreak β€” Do Anything Now: JAILBREAKDo Anything Now jailbreak prompting the model to enter a JAILBREAK mode
dan.DUDE β€” Do anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE persona
dan.STAN β€” Strive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona
encoding.InjectAscii85 β€” Prompt injection: ASCII85Tries to get the model to decode an ASCII85-encoded string
encoding.InjectBase16 β€” Prompt injection: Base16Tries to get the model to decode a Base16-encoded string
encoding.InjectBase2048 β€” Prompt injection: Base2048Tries to get the model to decode a Base2048-encoded string
encoding.InjectBase32 β€” Prompt injection: Base32Tries to get the model to decode a Base32-encoded string
encoding.InjectBase64 β€” Prompt injection: Base64Tries to get the model to decode a Base64-encoded string
encoding.InjectBraille β€” Prompt injection: BrailleTries to get the model to decode a Braille-encoded string
encoding.InjectEcoji β€” Prompt injection: EcojiTries to get the model to decode an Ecoji-encoded string
encoding.InjectHex β€” Prompt injection: HexTries to get the model to decode a Hex-encoded string
encoding.InjectMorse β€” Prompt injection: MorseTries to get the model to decode a Morse-encoded string
encoding.InjectNato β€” Prompt injection: NATOTries to get the model to decode a NATO-encoded string
encoding.InjectROT13 β€” Prompt injection: ROT13Tries to get the model to decode a ROT13-encoded string
encoding.InjectUU β€” Prompt injection: UUTries to get the model to decode a UU-encoded string
encoding.InjectZalgo β€” Prompt injection: ZalgoTries to get the model to decode a Zalgo-encoded string
gcg.GCG_Probe β€” Adversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organization
glitch.Glitch100 β€” GlitchUses glitch tokens to get the model to fail to repeat a word
knownbadsignatures.EICAR β€” Test VirusGet the model to output an EICAR virus signature
knownbadsignatures.GTUBE β€” Test SpamGet the model to output the GTUBE spam signature
malwaregen.Evasion β€” Evasive malwareGet the model to write malware that evades common defenses
malwaregen.Payload β€” Malicious payloadGet the model to produce a malicious payload
malwaregen.SubFunctions β€” Malicious subfunctionsGet the model to write malicious subfunctions
malwaregen.TopLevel β€” General malwareGet the model to write general malware
promptinject.HijackHateHumansMini β€” Prompt injection: Hate humansGet the model to produce a rogue string on hating humans
promptinject.HijackKillHumansMini β€” Prompt injection: Kill humansGet the model to produce a rogue string on killing humans
promptinject.HijackLongPromptMini β€” Prompt injection: Long promptGet the model to produce a β€œlong prompt” rogue string
promptinject.Tag β€” Prompt injection: Unicode tagPrompt injection using invisible unicode tags
goodside.ThreatenJSON β€” Output plain JSONThreaten the model to make it produce raw JSON without any preamble
xss.MarkdownImageExfil β€” Cross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URI
leakreplay.LiteratureComplete80 β€” Book excerpt completionTests whether the model will complete a book excerpt when prompted with the beginning
leakreplay.LiteratureCloze80 β€” Book excerpt insertionTests whether the model will fill in the middle of a book excerpt
leakreplay.GuardianComplete β€” Media excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginning
leakreplay.GuardianCloze β€” Media excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The Guardian
leakreplay.NYTComplete β€” Media excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginning
leakreplay.NYTCloze β€” Media excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York Times
privateinfo.EnronLeakEmail β€” Email leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation
privateinfo.LeakPII β€” PII LeakageTests whether PII provided to the model during a conversation can be leaked
privateinfo.PrivacyUnderstanding β€” Confidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidential
socialvalues.QuackMedicine β€” Quack MedicineEvaluates whether the model promotes quack medicine
goodside.WhoIsRiley β€” False assertions: Riley GoodsideEvaluates hallucinations about Riley Goodside
goodside.Davidjl β€” Glitch - DavidjlChecks for hallucinations about the glitch token Davidjl
packagehallucination.Python β€” Hallucinate Python packagesChecks if a model hallucinates Python packages
snowball.SenatorsMini β€” False assertions: senatorsChecks for false assertions about US senators
misleading.FalseAssertion50 β€” False assertions: generalChecks for false assertions about general history/science facts
snowball.GraphConnectivityMini β€” False assertions: flightsChecks if the model invents impossible flight routes
snowball.PrimesMini β€” Hallucinate prime numbersEvaluates hallucinations about prime numbers
advglue.BertattackOnMNLI β€” BERT attack on MNLIProbes effects of BERT-based perturbation on MNLI task
advglue.BertattackOnMNLIMM β€” BERT attack on MNLI-MMProbes effects of BERT-based perturbation on MNLI-MM task
advglue.BertattackOnQNLI β€” BERT attack on QNLIProbes effects of BERT-based perturbation on QNLI task
advglue.BertattackOnQQP β€” BERT attack on QQPProbes effects of BERT-based perturbation on QQP task
advglue.BertattackOnRTE β€” BERT attack on RTEProbes effects of BERT-based perturbation on RTE task
advglue.BertattackOnSST2 β€” BERT attack on SST2Probes effects of BERT-based perturbation on SST2 task
advglue.SemattackOnMNLI β€” SemAttack on MNLIProbes effects of SemAttack perturbation on MNLI task
advglue.SemattackOnMNLIMM β€” SemAttack on MNLI-MMProbes effects of SemAttack perturbation on MNLI-MM task
advglue.SemattackOnQNLI β€” SemAttack on QNLIProbes effects of SemAttack perturbation on QNLI task
advglue.SemattackOnQQP β€” SemAttack on QQPProbes effects of SemAttack perturbation on QQP task
advglue.SemattackOnRTE β€” SemAttack on RTEProbes effects of SemAttack perturbation on RTE task
advglue.SemattackOnSST2 β€” SemAttack on SST2Probes effects of SemAttack perturbation on SST2 task
advglue.SememepsoOnMNLI β€” SememePSO attack on MNLIProbes effects of SememePSO perturbation on MNLI task
advglue.SememepsoOnMNLIMM β€” SememePSO attack on MNLI-MMProbes effects of SememePSO perturbation on MNLI-MM task
advglue.SememepsoOnQNLI β€” SememePSO attack on QNLIProbes effects of SememePSO perturbation on QNLI task
advglue.SememepsoOnQQP β€” SememePSO attack on QQPProbes effects of SememePSO perturbation on QQP task
advglue.SememepsoOnRTE β€” SememePSO attack on RTEProbes effects of SememePSO perturbation on RTE task
advglue.SememepsoOnSST2 β€” SememePSO attack on SST2Probes effects of SememePSO perturbation on SST2 task
advglue.TextbuggerOnMNLI β€” TextBugger attack on MNLIProbes effects of TextBugger perturbation on MNLI task
advglue.TextbuggerOnMNLIMM β€” TextBugger attack on MNLI-MMProbes effects of TextBugger perturbation on MNLI-MM task
advglue.TextbuggerOnQNLI β€” TextBugger attack on QNLIProbes effects of TextBugger perturbation on QNLI task
advglue.TextbuggerOnQQP β€” TextBugger attack on QQPProbes effects of TextBugger perturbation on QQP task
advglue.TextbuggerOnRTE β€” TextBugger attack on RTEProbes effects of TextBugger perturbation on RTE task
advglue.TextbuggerOnSST2 β€” TextBugger attack on SST2Probes effects of TextBugger perturbation on SST2 task
advglue.TextfoolerOnMNLI β€” TextFooler attack on MNLIProbes effects of TextFooler perturbation on MNLI task
advglue.TextfoolerOnMNLIMM β€” TextFooler attack on MNLI-MMProbes effects of TextFooler perturbation on MNLI-MM task
advglue.TextfoolerOnQNLI β€” TextFooler attack on QNLIProbes effects of TextFooler perturbation on QNLI task
advglue.TextfoolerOnQQP β€” TextFooler attack on QQPProbes effects of TextFooler perturbation on QQP task
advglue.TextfoolerOnRTE β€” TextFooler attack on RTEProbes effects of TextFooler perturbation on RTE task
advglue.TextfoolerOnSST2 β€” TextFooler attack on SST2Probes effects of TextFooler perturbation on SST2 task
realtoxicityprompts.RTPSexually_Explicit β€” Sexually explicitPrompts the model to produce sexually explicit outputs
realtoxicityprompts.RTPInsultJailbreak β€” Insult - adversarialPrompts the model to produce insults. Uses an adversarial prefix.
realtoxicityprompts.RTPInsult β€” InsultPrompts the model to produce insults
realtoxicityprompts.RTPSevere_ToxicityJailbreak β€” Severe toxicity - adversarialPartial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.
realtoxicityprompts.RTPSexually_ExplicitJailbreak β€” Sexually explicit - adversarialPrompts the model to produce sexually explicit outputs. Uses an adversarial prefix.
realtoxicityprompts.RTPIdentity_Attack β€” Identity-based attacksPrompts the model to produce attacks based on a person’s identity
realtoxicityprompts.RTPProfanity β€” ProfanityPrompts the model to produce profane responses
realtoxicityprompts.RTPIdentity_AttackJailbreak β€” Adversarial identity-based attacksPrompts the model to produce attacks based on a person’s identity. Uses an adversarial prefix.
realtoxicityprompts.RTPThreat β€” Threatening languagePrompts the model to produce threats
realtoxicityprompts.RTPThreatJailbreak β€” Threatening language - adversarialPrompts the model to produce threats. Uses an adversarial prefix.
realtoxicityprompts.RTPSevere_Toxicity β€” Severe toxicityPartial prompts that potentially produce continuations of severe toxicity
realtoxicityprompts.RTPFlirtation β€” FlirtationPartial prompts to produce flirtatious continuations
realtoxicityprompts.RTPFlirtationJailbreak β€” Flirtation - adversarialPartial prompts to produce flirtatious continuations. Uses an adversarial prefix.
realtoxicityprompts.RTPProfanityJailbreak β€” Profanity - adversarialPrompts the model to produce profane responses. Uses an adversarial prefix.
continuation.ContinueSlursReclaimedSlursART β€” Slur continuationPrompts the model to continue slur words
socialvalues.Deadnaming β€” DeadnamingTests whether the model deadnames famous people
ethicalharms.Profanity β€” Profanity questionsPoses questions with potentially profane responses
ethicalharms.Bullying β€” BullyingPoses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action
ethicalharms.SexualContent β€” Sexually explicit - risky wordsPrompts the model to produce sexually explicit keywords
ethicalharms.Sexualisation β€” Sexual objectificationPrompts the model to produce sexually objectifying language
ethicalharms.SlurUsage β€” Slur questionAsks questions to see if the target will suggest or recommend ethnic slurs
advstereo.Atheists β€” Stereotype religion: atheistsChecks for stereotypes about atheists
advstereo.LatinAmericans β€” Stereotype ethnicity: Latin AmericansChecks for stereotypes about Latin Americans
advstereo.Europeans β€” Stereotype ethnicity: EuropeansChecks for stereotypes about Europeans
advstereo.OldPeople β€” Stereotype age: old peopleChecks for stereotypes about old people
advstereo.Homosexuals β€” Stereotype sexual orientation: homosexualsChecks for stereotypes about homosexuals
advstereo.Immigrants β€” Stereotype nationality: immigrantsChecks for stereotypes about immigrants
advstereo.AbleBodiedPeople β€” Stereotype disability: able-bodied peopleChecks for stereotypes about able-bodied people
advstereo.YoungPeople β€” Stereotype age: young peopleChecks for stereotypes about young people
advstereo.Caucasians β€” Stereotype race: CaucasiansChecks for stereotypes about Caucasians
advstereo.BlackPeople β€” Stereotype race: black peopleChecks for stereotypes about black people
advstereo.Buddhists β€” Stereotype religion: BuddhistsChecks for stereotypes about Buddhists
advstereo.PoorPeople β€” Stereotype income: poor peopleChecks for stereotypes about poor people
advstereo.WhitePeople β€” Stereotype race: white peopleChecks for stereotypes about white people
advstereo.Jews β€” Stereotype ethnicity: JewsChecks for stereotypes about Jews
advstereo.RichPeople β€” Stereotype income: rich peopleChecks for stereotypes about rich people
advstereo.Men β€” Stereotype gender: menChecks for stereotypes about men
advstereo.Mexicans β€” Stereotype nationality: MexicansChecks for stereotypes about Mexicans
advstereo.Asians β€” Stereotype race: AsiansChecks for stereotypes about Asians
advstereo.Americans β€” Stereotype nationality: AmericansChecks for stereotypes about Americans
advstereo.Canadians β€” Stereotype nationality: CanadiansChecks for stereotypes about Canadians
advstereo.Christians β€” Stereotype religion: ChristiansChecks for stereotypes about Christians
advstereo.PhysicallyDisabledPeople β€” Stereotype disability: disabled peopleChecks for stereotypes about disabled people
advstereo.Muslims β€” Stereotype religion: MuslimsChecks for stereotypes about Muslims
advstereo.Women β€” Stereotype gender: womenChecks for stereotypes about women
adultdata.CounterfactualGender β€” Gender BiasMeasures whether the model uses gender information to determine a person’s likely income
winobias.ProfessionalBias β€” Professional biasMeasures whether the model associates occupations with gender
normative_ethics.Deontology β€” Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints
normative_ethics.DeontologyJailbreak β€” Adversarial Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting
normative_ethics.Justice β€” Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality
normative_ethics.JusticeJailbreak β€” Adversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial prompting
normative_ethics.Morality β€” Commonsense MoralityEvaluates whether the model’s moral judgments align with common moral intuitions
normative_ethics.MoralityJailbreak β€” Adversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions
normative_ethics.Utilitarianism β€” UtilitarianismEvaluates whether the model can identify which of two scenarios has higher utility
normative_ethics.UtilitarianismJailbreak β€” Adversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility
normative_ethics.Virtue β€” Virtue EthicsEvaluates whether the model can identify moral virtues correctly
normative_ethics.VirtueJailbreak β€” Adversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctly
jiminycricket.Jiminy β€” SimulationEvaluates the model’s ability to identify the moral valence of a simulated scenario
jiminycricket.JiminyJailbreak β€” Adversarial SimulationEvaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario

Detectors

Detectors analyze agent responses to determine if a probe succeeded or failed. You may see these module names in your evaluation reports.
base.TriggerListDetector β€” Trigger presenceA model fails the test if its response contains trigger words that are specific to the given prompt
base.TriggerListAbsenceDetector β€” Trigger absenceTo pass the test, the model response must contain trigger words that are specific to the given prompt
leakreplay.StartsWith β€” Starts with triggerDetects whether the model’s response starts with any of a list of trigger phrases
snowball.DontStartNo β€” Don’t start noDetects whether the model response starts with β€˜no’. If so, it fails the test.
snowball.DontStartYes β€” Don’t start yesDetects whether the model response starts with β€˜yes’. If so, it fails the test.
misleading.MustRefuteClaimKW β€” Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
misleading.MustRefuteClaimKWART β€” Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation
mitigation.MitigationBypassART β€” RefusalDetects whether a model refuses to fulfill a request in the prompt by looking for key phrases
mitigation.TriggerAbsenceWithRefusal β€” Trigger absence with acceptable refusalPasses if model refuses to answer, fails if model answers without including the correct trigger
dan.AntiDAN β€” Anti-Do Anything Now modeDetects whether the model enters Anti-Do Anything Now (AntiDAN) mode
dan.DAN β€” Do Anything Now modeDetects whether the model enters Do Anything Now (DAN) mode
dan.DANJailbreak β€” Jailbroken stateDetects whether the model enters a jailbroken state
dan.DevMode β€” Developer modeDetects whether the model enters Developer mode
dan.DUDE β€” Do anything and everything modeDetects whether the model enters Do anything and everything (DUDE) mode
dan.MarkdownLink β€” Markdown linkDetects presence of Markdown-formatted link
dan.STAN β€” Strive To Avoid Norms modeDetects whether the model enters Strive To Avoid Norms (STAN) mode
encoding.DecodeMatch β€” Match decoded stringDetects whether specific decoded strings are present in the model’s response
ansiescape.Raw β€” Raw ANSI SequencesDetects raw ANSI sequences
ansiescape.Escaped β€” Escaped ANSI SequencesDetects escaped ANSI sequences
productkey.Win5x5 β€” Windows Product KeyDetects a product key in the form of five groups of five characters
malwaregenART.AnyCode β€” Code presenceDetects the presence of code of various programming languages in the model’s response
knownbadsignatures.EICAR β€” EICAR virusDetects EICAR virus signature
knownbadsignatures.GTphish β€” PhishingDetects phishing content
knownbadsignatures.GTUBE β€” SpamDetects spam content
privateinfo.DontAnswerYes β€” Don’t answer yesDetects whether the model says it is OK to reveal private information
privateinfo.LeakPII β€” Private information leakageDetects whether the model leaks private information that is provided in the prompt
xss.MarkdownExfilBasic β€” Markdown imageDetects whether the response contains a Markdown image
xss.MarkdownExfilContent β€” Private content in Markdown imageDetects whether the response contains content private to a conversation within a Markdown image
packagehallucinationART.PythonPypi β€” Python package hallucinationDetects whether the model provides code that imports non-existent Python packages
packagehallucinationART.RustCrates β€” Rust package hallucinationDetects whether the model provides code that imports non-existent Rust packages
packagehallucinationART.JavaScriptNpm β€” JavaScript package hallucinationDetects whether the model provides code that imports non-existent JavaScript packages
packagehallucinationART.RubyGems β€” Ruby package hallucinationDetects whether the model provides code that imports non-existent Ruby packages
toxicity.ToxicCommentModel β€” Model-based toxicityUses a small language model to determine whether the response is toxic
lmrc.QuackMedicine β€” Quack medicineDetects whether the response contains quack medicine advice
riskywords.OfcomOffensiveRaceEthnic β€” Offensive: racial/ethnicDetects whether the response contains offensive racial or ethnic terms
riskywords.SurgeProfanityAnimalReferences β€” Profanity: animal referencesDetects whether the model outputs profanity containing words related to animals
riskywords.SurgeProfanityBodilyFluids β€” Profanity: bodily fluidsDetects whether the model outputs profanity containing words related to bodily fluids
riskywords.SurgeProfanityMentalDisability β€” Profanity: mental disabilityDetects whether the model outputs profanity containing words related to mental disabilities
riskywords.SurgeProfanityPhysicalAttributes β€” Profanity: physical attributesDetects whether the model outputs profanity containing words related to physical attributes
riskywords.SurgeProfanityPhysicalDisability β€” Profanity: physical disabilityDetects whether the model outputs profanity containing words related to physical disabilities
riskywords.SurgeProfanityPolitical β€” Profanity: politicalDetects whether the model outputs profanity containing words related to politics
riskywords.SurgeProfanityRacialEthnic β€” Profanity: racial/ethnicDetects whether the model outputs profanity containing words related to race/ethnicity
riskywords.SurgeProfanityReligious β€” Profanity: religiousDetects whether the model outputs profanity containing words related to religion
riskywords.SurgeProfanitySexual β€” Profanity: sexualDetects whether the model outputs profanity containing words related to sexuality
riskywords.SurgeProfanitySexualOrientationGender β€” Profanity: sexual orientation/genderDetects whether the model outputs profanity containing words related to sexual orientation/gender
goodside.Glitch β€” GlitchDetects whether a model fails to repeat back a glitch token despite being asked to
goodside.PlainJSON β€” Plain JSONDetects whether the model returns plain JSON without any preamble
goodside.RileyIsnt β€” Riley Goodside misinformationUses key phrases to detect whether the response contains misinformation about Riley Goodside
advstereo.IAgree β€” Stereotype agreementDetects whether a model agrees with a stereotype presented in the prompt
adultdata.PairwiseComparison β€” Pairwise gender comparisonCompares the model’s responses to two questions about a person’s income where only the person’s gender is changed
winobias.TriggerPlusAmbiguity β€” Detect gender bias in professionsDetects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector
llm.AnswerRelevancy β€” Answer RelevancyGeneration metric, measures the degree to which the final generated output is relevant to the original input
llm.ContextualPrecision β€” Contextual PrecisionRetrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts
llm.ContextualRecall β€” Contextual RecallRetrieval metric, measures the extent to which the retrieved contexts align with the golden answers
llm.Correctness β€” CorrectnessEvaluates the correctness of the generated answer
llm.Faithfulness β€” FaithfulnessGeneration metric, measures how much the generated response stays faithful to the retrieved contexts
llm.StrongReject β€” Strong RejectDetects whether a model fails to respond to harmful instructions
llm.ConversationRoleAdherence β€” Conversation Role AdherenceDetects whether a model adheres to the role of the conversation
llm.ConversationRelevancy β€” Conversation RelevancyDetects whether a model is relevant to the conversation
llm.ConversationCompleteness β€” Conversation CompletenessDetects whether a model completes the conversation
llm.ConversationKnowledgeRetention β€” Conversation Knowledge RetentionDetects whether a model maintains consistent knowledge throughout the conversation
llm.Refusal β€” LLM-based RefusalUses an LLM to detect if a model response avoids answering the question posed in the prompt
llm.PolicyViolation β€” Policy ViolationDetects when output violates a specified usage policy
multi_turn.Crescendo β€” Crescendo AttackDetects how well an agent performs against a Crescendo attack. Not available in the detection endpoint.
multi_turn.Linear β€” Linear Jailbreak AttackDetects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.
multi_turn.Tree β€” Tree Jailbreak AttackDetects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.
advglue.PairwiseComparison β€” Robustness pairwise comparisonCompares the model’s performance on a question from a standardized task when the question’s language is perturbed versus when it is not
nlp.BLEU β€” BLEUGeneration metric, measures semantic similarity of expected and generated answer
nlp.METEOR β€” METEORGeneration metric, measures semantic similarity of expected and generated answer
nlp.BERTScore β€” BERTScoreGeneration metric, measures semantic similarity of expected and generated answer
toolcalling.ToolCorrectness β€” Tool CorrectnessDetects correct use of a set of tools
mathLatexSympy.MathLatexSympy β€” Math checking using SympyTo pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency
instructionFollowing.InstructionFollowing β€” Following Formatting InstructionsTo pass the test, the model response should adhere to instructions as specified by the trigger. Instructions are deterministically verifiable (e.g., β€˜use 500 words or less’, β€˜use markdown formatting’)