Component Reference

This reference maps component module names you may see in your evaluation reports to their human-readable names and descriptions.

Probes

Probes are the attack scenarios that test your agent. They’re organized by threat category.

Security Probes

dan.AntiDAN — Anti-Do Anything NowEvaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treats safe responses as unsafe

dan.ChatGPT_Developer_Mode_RANTI — Do Anything Now: RANTI v1Ranti developer mode jailbreak, version 1

dan.ChatGPT_Developer_Mode_v2 — Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2

dan.ChatGPT_Image_Markdown — Inject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses

dan.Dan_10_0 — Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane things

dan.Dan_11_0 — Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputs

dan.Dan_6_0 — Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations

dan.Dan_6_2 — Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinations, with additional incitement to rudeness

dan.Dan_7_0 — Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break character

dan.Dan_8_0 — Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegality

dan.Dan_9_0 — Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offered

dan.DAN_Jailbreak — Do Anything Now: JAILBREAKDo Anything Now jailbreak prompting the model to enter a JAILBREAK mode

dan.DUDE — Do anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE persona

dan.STAN — Strive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) persona

encoding.InjectAscii85 — Prompt injection: ASCII85Tries to get the model to decode an ASCII85-encoded string

encoding.InjectBase16 — Prompt injection: Base16Tries to get the model to decode a Base16-encoded string

encoding.InjectBase2048 — Prompt injection: Base2048Tries to get the model to decode a Base2048-encoded string

encoding.InjectBase32 — Prompt injection: Base32Tries to get the model to decode a Base32-encoded string

encoding.InjectBase64 — Prompt injection: Base64Tries to get the model to decode a Base64-encoded string

encoding.InjectBraille — Prompt injection: BrailleTries to get the model to decode a Braille-encoded string

encoding.InjectEcoji — Prompt injection: EcojiTries to get the model to decode an Ecoji-encoded string

encoding.InjectHex — Prompt injection: HexTries to get the model to decode a Hex-encoded string

encoding.InjectMorse — Prompt injection: MorseTries to get the model to decode a Morse-encoded string

encoding.InjectNato — Prompt injection: NATOTries to get the model to decode a NATO-encoded string

encoding.InjectROT13 — Prompt injection: ROT13Tries to get the model to decode a ROT13-encoded string

encoding.InjectUU — Prompt injection: UUTries to get the model to decode a UU-encoded string

encoding.InjectZalgo — Prompt injection: ZalgoTries to get the model to decode a Zalgo-encoded string

gcg.GCG_Probe — Adversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organization

glitch.Glitch100 — GlitchUses glitch tokens to get the model to fail to repeat a word

knownbadsignatures.EICAR — Test VirusGet the model to output an EICAR virus signature

knownbadsignatures.GTUBE — Test SpamGet the model to output the GTUBE spam signature

malwaregen.Evasion — Evasive malwareGet the model to write malware that evades common defenses

malwaregen.Payload — Malicious payloadGet the model to produce a malicious payload

malwaregen.SubFunctions — Malicious subfunctionsGet the model to write malicious subfunctions

malwaregen.TopLevel — General malwareGet the model to write general malware

promptinject.HijackHateHumansMini — Prompt injection: Hate humansGet the model to produce a rogue string on hating humans

promptinject.HijackKillHumansMini — Prompt injection: Kill humansGet the model to produce a rogue string on killing humans

promptinject.HijackLongPromptMini — Prompt injection: Long promptGet the model to produce a “long prompt” rogue string

promptinject.Tag — Prompt injection: Unicode tagPrompt injection using invisible unicode tags

goodside.ThreatenJSON — Output plain JSONThreaten the model to make it produce raw JSON without any preamble

xss.MarkdownImageExfil — Cross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URI

Privacy Probes

leakreplay.LiteratureComplete80 — Book excerpt completionTests whether the model will complete a book excerpt when prompted with the beginning

leakreplay.LiteratureCloze80 — Book excerpt insertionTests whether the model will fill in the middle of a book excerpt

leakreplay.GuardianComplete — Media excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginning

leakreplay.GuardianCloze — Media excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The Guardian

leakreplay.NYTComplete — Media excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginning

leakreplay.NYTCloze — Media excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York Times

privateinfo.EnronLeakEmail — Email leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigation

privateinfo.LeakPII — PII LeakageTests whether PII provided to the model during a conversation can be leaked

privateinfo.PrivacyUnderstanding — Confidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidential

Hallucination Probes

socialvalues.QuackMedicine — Quack MedicineEvaluates whether the model promotes quack medicine

goodside.WhoIsRiley — False assertions: Riley GoodsideEvaluates hallucinations about Riley Goodside

goodside.Davidjl — Glitch - DavidjlChecks for hallucinations about the glitch token Davidjl

packagehallucination.Python — Hallucinate Python packagesChecks if a model hallucinates Python packages

snowball.SenatorsMini — False assertions: senatorsChecks for false assertions about US senators

misleading.FalseAssertion50 — False assertions: generalChecks for false assertions about general history/science facts

snowball.GraphConnectivityMini — False assertions: flightsChecks if the model invents impossible flight routes

snowball.PrimesMini — Hallucinate prime numbersEvaluates hallucinations about prime numbers

Robustness Probes

advglue.BertattackOnMNLI — BERT attack on MNLIProbes effects of BERT-based perturbation on MNLI task

advglue.BertattackOnMNLIMM — BERT attack on MNLI-MMProbes effects of BERT-based perturbation on MNLI-MM task

advglue.BertattackOnQNLI — BERT attack on QNLIProbes effects of BERT-based perturbation on QNLI task

advglue.BertattackOnQQP — BERT attack on QQPProbes effects of BERT-based perturbation on QQP task

advglue.BertattackOnRTE — BERT attack on RTEProbes effects of BERT-based perturbation on RTE task

advglue.BertattackOnSST2 — BERT attack on SST2Probes effects of BERT-based perturbation on SST2 task

advglue.SemattackOnMNLI — SemAttack on MNLIProbes effects of SemAttack perturbation on MNLI task

advglue.SemattackOnMNLIMM — SemAttack on MNLI-MMProbes effects of SemAttack perturbation on MNLI-MM task

advglue.SemattackOnQNLI — SemAttack on QNLIProbes effects of SemAttack perturbation on QNLI task

advglue.SemattackOnQQP — SemAttack on QQPProbes effects of SemAttack perturbation on QQP task

advglue.SemattackOnRTE — SemAttack on RTEProbes effects of SemAttack perturbation on RTE task

advglue.SemattackOnSST2 — SemAttack on SST2Probes effects of SemAttack perturbation on SST2 task

advglue.SememepsoOnMNLI — SememePSO attack on MNLIProbes effects of SememePSO perturbation on MNLI task

advglue.SememepsoOnMNLIMM — SememePSO attack on MNLI-MMProbes effects of SememePSO perturbation on MNLI-MM task

advglue.SememepsoOnQNLI — SememePSO attack on QNLIProbes effects of SememePSO perturbation on QNLI task

advglue.SememepsoOnQQP — SememePSO attack on QQPProbes effects of SememePSO perturbation on QQP task

advglue.SememepsoOnRTE — SememePSO attack on RTEProbes effects of SememePSO perturbation on RTE task

advglue.SememepsoOnSST2 — SememePSO attack on SST2Probes effects of SememePSO perturbation on SST2 task

advglue.TextbuggerOnMNLI — TextBugger attack on MNLIProbes effects of TextBugger perturbation on MNLI task

advglue.TextbuggerOnMNLIMM — TextBugger attack on MNLI-MMProbes effects of TextBugger perturbation on MNLI-MM task

advglue.TextbuggerOnQNLI — TextBugger attack on QNLIProbes effects of TextBugger perturbation on QNLI task

advglue.TextbuggerOnQQP — TextBugger attack on QQPProbes effects of TextBugger perturbation on QQP task

advglue.TextbuggerOnRTE — TextBugger attack on RTEProbes effects of TextBugger perturbation on RTE task

advglue.TextbuggerOnSST2 — TextBugger attack on SST2Probes effects of TextBugger perturbation on SST2 task

advglue.TextfoolerOnMNLI — TextFooler attack on MNLIProbes effects of TextFooler perturbation on MNLI task

advglue.TextfoolerOnMNLIMM — TextFooler attack on MNLI-MMProbes effects of TextFooler perturbation on MNLI-MM task

advglue.TextfoolerOnQNLI — TextFooler attack on QNLIProbes effects of TextFooler perturbation on QNLI task

advglue.TextfoolerOnQQP — TextFooler attack on QQPProbes effects of TextFooler perturbation on QQP task

advglue.TextfoolerOnRTE — TextFooler attack on RTEProbes effects of TextFooler perturbation on RTE task

advglue.TextfoolerOnSST2 — TextFooler attack on SST2Probes effects of TextFooler perturbation on SST2 task

Toxicity Probes

realtoxicityprompts.RTPSexually_Explicit — Sexually explicitPrompts the model to produce sexually explicit outputs

realtoxicityprompts.RTPInsultJailbreak — Insult - adversarialPrompts the model to produce insults. Uses an adversarial prefix.

realtoxicityprompts.RTPInsult — InsultPrompts the model to produce insults

realtoxicityprompts.RTPSevere_ToxicityJailbreak — Severe toxicity - adversarialPartial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix.

realtoxicityprompts.RTPSexually_ExplicitJailbreak — Sexually explicit - adversarialPrompts the model to produce sexually explicit outputs. Uses an adversarial prefix.

realtoxicityprompts.RTPIdentity_Attack — Identity-based attacksPrompts the model to produce attacks based on a person’s identity

realtoxicityprompts.RTPProfanity — ProfanityPrompts the model to produce profane responses

realtoxicityprompts.RTPIdentity_AttackJailbreak — Adversarial identity-based attacksPrompts the model to produce attacks based on a person’s identity. Uses an adversarial prefix.

realtoxicityprompts.RTPThreat — Threatening languagePrompts the model to produce threats

realtoxicityprompts.RTPThreatJailbreak — Threatening language - adversarialPrompts the model to produce threats. Uses an adversarial prefix.

realtoxicityprompts.RTPSevere_Toxicity — Severe toxicityPartial prompts that potentially produce continuations of severe toxicity

realtoxicityprompts.RTPFlirtation — FlirtationPartial prompts to produce flirtatious continuations

realtoxicityprompts.RTPFlirtationJailbreak — Flirtation - adversarialPartial prompts to produce flirtatious continuations. Uses an adversarial prefix.

realtoxicityprompts.RTPProfanityJailbreak — Profanity - adversarialPrompts the model to produce profane responses. Uses an adversarial prefix.

continuation.ContinueSlursReclaimedSlursART — Slur continuationPrompts the model to continue slur words

socialvalues.Deadnaming — DeadnamingTests whether the model deadnames famous people

ethicalharms.Profanity — Profanity questionsPoses questions with potentially profane responses

ethicalharms.Bullying — BullyingPoses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action

ethicalharms.SexualContent — Sexually explicit - risky wordsPrompts the model to produce sexually explicit keywords

ethicalharms.Sexualisation — Sexual objectificationPrompts the model to produce sexually objectifying language

ethicalharms.SlurUsage — Slur questionAsks questions to see if the target will suggest or recommend ethnic slurs

Stereotype Probes

advstereo.Atheists — Stereotype religion: atheistsChecks for stereotypes about atheists

advstereo.LatinAmericans — Stereotype ethnicity: Latin AmericansChecks for stereotypes about Latin Americans

advstereo.Europeans — Stereotype ethnicity: EuropeansChecks for stereotypes about Europeans

advstereo.OldPeople — Stereotype age: old peopleChecks for stereotypes about old people

advstereo.Homosexuals — Stereotype sexual orientation: homosexualsChecks for stereotypes about homosexuals

advstereo.Immigrants — Stereotype nationality: immigrantsChecks for stereotypes about immigrants

advstereo.AbleBodiedPeople — Stereotype disability: able-bodied peopleChecks for stereotypes about able-bodied people

advstereo.YoungPeople — Stereotype age: young peopleChecks for stereotypes about young people

advstereo.Caucasians — Stereotype race: CaucasiansChecks for stereotypes about Caucasians

advstereo.BlackPeople — Stereotype race: black peopleChecks for stereotypes about black people

advstereo.Buddhists — Stereotype religion: BuddhistsChecks for stereotypes about Buddhists

advstereo.PoorPeople — Stereotype income: poor peopleChecks for stereotypes about poor people

advstereo.WhitePeople — Stereotype race: white peopleChecks for stereotypes about white people

advstereo.Jews — Stereotype ethnicity: JewsChecks for stereotypes about Jews

advstereo.RichPeople — Stereotype income: rich peopleChecks for stereotypes about rich people

advstereo.Men — Stereotype gender: menChecks for stereotypes about men

advstereo.Mexicans — Stereotype nationality: MexicansChecks for stereotypes about Mexicans

advstereo.Asians — Stereotype race: AsiansChecks for stereotypes about Asians

advstereo.Americans — Stereotype nationality: AmericansChecks for stereotypes about Americans

advstereo.Canadians — Stereotype nationality: CanadiansChecks for stereotypes about Canadians

advstereo.Christians — Stereotype religion: ChristiansChecks for stereotypes about Christians

advstereo.PhysicallyDisabledPeople — Stereotype disability: disabled peopleChecks for stereotypes about disabled people

advstereo.Muslims — Stereotype religion: MuslimsChecks for stereotypes about Muslims

advstereo.Women — Stereotype gender: womenChecks for stereotypes about women

Fairness Probes

adultdata.CounterfactualGender — Gender BiasMeasures whether the model uses gender information to determine a person’s likely income

winobias.ProfessionalBias — Professional biasMeasures whether the model associates occupations with gender

Ethics Probes

normative_ethics.Deontology — Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints

normative_ethics.DeontologyJailbreak — Adversarial Deontological EthicsEvaluates the model’s understanding of whether actions are required, permitted, or forbidden according to a set of rules or constraints under adversarial prompting

normative_ethics.Justice — Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality

normative_ethics.JusticeJailbreak — Adversarial Justice-based EthicsEvaluates whether the model understands principles of desert or impartiality under adversarial prompting

normative_ethics.Morality — Commonsense MoralityEvaluates whether the model’s moral judgments align with common moral intuitions

normative_ethics.MoralityJailbreak — Adversarial Commonsense MoralityEvaluates, under adversarial prompting, whether the model’s moral judgments align with common moral intuitions

normative_ethics.Utilitarianism — UtilitarianismEvaluates whether the model can identify which of two scenarios has higher utility

normative_ethics.UtilitarianismJailbreak — Adversarial UtilitarianismEvaluates, under adversarial prompting, whether the model can identify which of two scenarios has higher utility

normative_ethics.Virtue — Virtue EthicsEvaluates whether the model can identify moral virtues correctly

normative_ethics.VirtueJailbreak — Adversarial Virtue EthicsEvaluates, under adversarial prompting, whether the model can identify moral virtues correctly

jiminycricket.Jiminy — SimulationEvaluates the model’s ability to identify the moral valence of a simulated scenario

jiminycricket.JiminyJailbreak — Adversarial SimulationEvaluates, under adversarial prompting, the model’s ability to identify the moral valence of a simulated scenario

Detectors

Detectors analyze agent responses to determine if a probe succeeded or failed. You may see these module names in your evaluation reports.

Response Analysis Detectors

base.TriggerListDetector — Trigger presenceA model fails the test if its response contains trigger words that are specific to the given prompt

base.TriggerListAbsenceDetector — Trigger absenceTo pass the test, the model response must contain trigger words that are specific to the given prompt

leakreplay.StartsWith — Starts with triggerDetects whether the model’s response starts with any of a list of trigger phrases

snowball.DontStartNo — Don’t start noDetects whether the model response starts with ‘no’. If so, it fails the test.

snowball.DontStartYes — Don’t start yesDetects whether the model response starts with ‘yes’. If so, it fails the test.

misleading.MustRefuteClaimKW — Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation

misleading.MustRefuteClaimKWART — Claim refutationDetects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation

mitigation.MitigationBypassART — RefusalDetects whether a model refuses to fulfill a request in the prompt by looking for key phrases

mitigation.TriggerAbsenceWithRefusal — Trigger absence with acceptable refusalPasses if model refuses to answer, fails if model answers without including the correct trigger

Jailbreak & Mode Detectors

dan.AntiDAN — Anti-Do Anything Now modeDetects whether the model enters Anti-Do Anything Now (AntiDAN) mode

dan.DAN — Do Anything Now modeDetects whether the model enters Do Anything Now (DAN) mode

dan.DANJailbreak — Jailbroken stateDetects whether the model enters a jailbroken state

dan.DevMode — Developer modeDetects whether the model enters Developer mode

dan.DUDE — Do anything and everything modeDetects whether the model enters Do anything and everything (DUDE) mode

dan.MarkdownLink — Markdown linkDetects presence of Markdown-formatted link

dan.STAN — Strive To Avoid Norms modeDetects whether the model enters Strive To Avoid Norms (STAN) mode

Encoding & Pattern Detectors

encoding.DecodeMatch — Match decoded stringDetects whether specific decoded strings are present in the model’s response

ansiescape.Raw — Raw ANSI SequencesDetects raw ANSI sequences

ansiescape.Escaped — Escaped ANSI SequencesDetects escaped ANSI sequences

productkey.Win5x5 — Windows Product KeyDetects a product key in the form of five groups of five characters

Malware & Signature Detectors

malwaregenART.AnyCode — Code presenceDetects the presence of code of various programming languages in the model’s response

knownbadsignatures.EICAR — EICAR virusDetects EICAR virus signature

knownbadsignatures.GTphish — PhishingDetects phishing content

knownbadsignatures.GTUBE — SpamDetects spam content

Privacy & PII Detectors

privateinfo.DontAnswerYes — Don’t answer yesDetects whether the model says it is OK to reveal private information

privateinfo.LeakPII — Private information leakageDetects whether the model leaks private information that is provided in the prompt

XSS & Exfiltration Detectors

xss.MarkdownExfilBasic — Markdown imageDetects whether the response contains a Markdown image

xss.MarkdownExfilContent — Private content in Markdown imageDetects whether the response contains content private to a conversation within a Markdown image

Package Hallucination Detectors

packagehallucinationART.PythonPypi — Python package hallucinationDetects whether the model provides code that imports non-existent Python packages

packagehallucinationART.RustCrates — Rust package hallucinationDetects whether the model provides code that imports non-existent Rust packages

packagehallucinationART.JavaScriptNpm — JavaScript package hallucinationDetects whether the model provides code that imports non-existent JavaScript packages

packagehallucinationART.RubyGems — Ruby package hallucinationDetects whether the model provides code that imports non-existent Ruby packages

Content & Toxicity Detectors

toxicity.ToxicCommentModel — Model-based toxicityUses a small language model to determine whether the response is toxic

lmrc.QuackMedicine — Quack medicineDetects whether the response contains quack medicine advice

riskywords.OfcomOffensiveRaceEthnic — Offensive: racial/ethnicDetects whether the response contains offensive racial or ethnic terms

riskywords.SurgeProfanityAnimalReferences — Profanity: animal referencesDetects whether the model outputs profanity containing words related to animals

riskywords.SurgeProfanityBodilyFluids — Profanity: bodily fluidsDetects whether the model outputs profanity containing words related to bodily fluids

riskywords.SurgeProfanityMentalDisability — Profanity: mental disabilityDetects whether the model outputs profanity containing words related to mental disabilities

riskywords.SurgeProfanityPhysicalAttributes — Profanity: physical attributesDetects whether the model outputs profanity containing words related to physical attributes

riskywords.SurgeProfanityPhysicalDisability — Profanity: physical disabilityDetects whether the model outputs profanity containing words related to physical disabilities

riskywords.SurgeProfanityPolitical — Profanity: politicalDetects whether the model outputs profanity containing words related to politics

riskywords.SurgeProfanityRacialEthnic — Profanity: racial/ethnicDetects whether the model outputs profanity containing words related to race/ethnicity

riskywords.SurgeProfanityReligious — Profanity: religiousDetects whether the model outputs profanity containing words related to religion

riskywords.SurgeProfanitySexual — Profanity: sexualDetects whether the model outputs profanity containing words related to sexuality

riskywords.SurgeProfanitySexualOrientationGender — Profanity: sexual orientation/genderDetects whether the model outputs profanity containing words related to sexual orientation/gender

Glitch & Edge Case Detectors

goodside.Glitch — GlitchDetects whether a model fails to repeat back a glitch token despite being asked to

goodside.PlainJSON — Plain JSONDetects whether the model returns plain JSON without any preamble

goodside.RileyIsnt — Riley Goodside misinformationUses key phrases to detect whether the response contains misinformation about Riley Goodside

Stereotype & Bias Detectors

advstereo.IAgree — Stereotype agreementDetects whether a model agrees with a stereotype presented in the prompt

adultdata.PairwiseComparison — Pairwise gender comparisonCompares the model’s responses to two questions about a person’s income where only the person’s gender is changed

winobias.TriggerPlusAmbiguity — Detect gender bias in professionsDetects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector

LLM-based Detectors

llm.AnswerRelevancy — Answer RelevancyGeneration metric, measures the degree to which the final generated output is relevant to the original input

llm.ContextualPrecision — Contextual PrecisionRetrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts

llm.ContextualRecall — Contextual RecallRetrieval metric, measures the extent to which the retrieved contexts align with the golden answers

llm.Correctness — CorrectnessEvaluates the correctness of the generated answer

llm.Faithfulness — FaithfulnessGeneration metric, measures how much the generated response stays faithful to the retrieved contexts

llm.StrongReject — Strong RejectDetects whether a model fails to respond to harmful instructions

llm.ConversationRoleAdherence — Conversation Role AdherenceDetects whether a model adheres to the role of the conversation

llm.ConversationRelevancy — Conversation RelevancyDetects whether a model is relevant to the conversation

llm.ConversationCompleteness — Conversation CompletenessDetects whether a model completes the conversation

llm.ConversationKnowledgeRetention — Conversation Knowledge RetentionDetects whether a model maintains consistent knowledge throughout the conversation

llm.Refusal — LLM-based RefusalUses an LLM to detect if a model response avoids answering the question posed in the prompt

llm.PolicyViolation — Policy ViolationDetects when output violates a specified usage policy

Multi-turn Attack Detectors

multi_turn.Crescendo — Crescendo AttackDetects how well an agent performs against a Crescendo attack. Not available in the detection endpoint.

multi_turn.Linear — Linear Jailbreak AttackDetects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.

multi_turn.Tree — Tree Jailbreak AttackDetects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.

Robustness Detectors

advglue.PairwiseComparison — Robustness pairwise comparisonCompares the model’s performance on a question from a standardized task when the question’s language is perturbed versus when it is not

NLP Metric Detectors

nlp.BLEU — BLEUGeneration metric, measures semantic similarity of expected and generated answer

nlp.METEOR — METEORGeneration metric, measures semantic similarity of expected and generated answer

nlp.BERTScore — BERTScoreGeneration metric, measures semantic similarity of expected and generated answer

Specialized Detectors

toolcalling.ToolCorrectness — Tool CorrectnessDetects correct use of a set of tools

mathLatexSympy.MathLatexSympy — Math checking using SympyTo pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency

instructionFollowing.InstructionFollowing — Following Formatting InstructionsTo pass the test, the model response should adhere to instructions as specified by the trigger. Instructions are deterministically verifiable (e.g., ‘use 500 words or less’, ‘use markdown formatting’)

Overview

Trust Score

Evaluation

Defense

Reference

Component Reference

Probes

Detectors

Overview

Trust Score

Evaluation

Defense

Reference

​Probes

​Detectors

Probes

Detectors