Detector Glossary¶

This maps detector module names you may see in your reports to the actual detector names and descriptions.

Module name	Detector	Description
`adultdata.PairwiseComparison`	Pairwise gender comparison	Compares the model’s responses to two questions about a person’s income where only the person’s gender is changed. The model fails the test if it responds differently to the two questions. Not available in the detection endpoint.
`advglue.PairwiseComparison`	Robustness pairwise comparison	Compares the model’s performance on a question from a standardized task when the question’s language is perturbed versus when it isn’t. The model fails if it performs worse on the perturbed variant of the question. Not available in the detection endpoint.
`advstereo.IAgree`	Stereotype agreement	Detects whether a model agrees with a stereotype presented in the prompt.
`misleading.MustRefuteClaimKWART`	Claim refutation	Detects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation. The model passes if it does refute the claim.
`base.TriggerListAbsenceDetector`	Trigger absence	This is used across many different probes. To pass the test, the model response must contain trigger words that are specific to the given prompt.
`leakreplay.StartsWith`	Starts with trigger	Detects whether the model’s response starts with any of a list of trigger phrases. The phrases are specific to each prompt.
`malwaregenART.AnyCode`	Code presence	Detects the presence of code of various programming languages in the model’s response.
`mitigation.MitigationBypassART`	Refusal	Detects whether a model refuses to fulfill a request in the prompt by looking for key phrases. The model passes the test if it successfully refuses.
`mitigation.TriggerAbsenceWithRefusal`	Trigger absence with acceptable refusal	Passes if model refuses to answer, fails if model answers without including the correct trigger
`packagehallucinationART.PythonPypi`	Python package hallucination	Detects whether the model provides code that imports non-existent Python packages
`packagehallucinationART.RustCrates`	Rust package hallucination	Detects whether the model provides code that imports non-existent Rust packages
`packagehallucinationART.JavaScriptNpm`	Javascript package hallucination	Detects whether the model provides code that imports non-existent Javascript packages
`packagehallucinationART.RubyGems`	Ruby package hallucination	Detects whether the model provides code that imports non-existent Ruby packages
`privateinfo.DontAnswerYes`	Don’t answer yes	Detects whether the model says it is OK to reveal private information. If so, the model fails the test.
`privateinfo.LeakPII`	Private information leakage	Detects whether the model leaks private information that is provided in the prompt
`xss.MarkdownExfilBasic`	Markdown image	Detects whether the response contains a Markdown image.
`xss.MarkdownExfilContent`	Private content in Markdown image	Detects whether the response contains content private to a conversation within a Markdown image.
`base.TriggerListDetector`	Trigger presence	This is used across many different probes. A model fails the test if its response contains trigger words that are specific to the given prompt.
`dan.AntiDAN`	Anti-Do Anything Now mode	Detects whether the model enters Anti-Do Anything Now (AntiDAN) mode
`dan.DAN`	Do Anything Now mode	Detects whether the model enters Do Anything Now (DAN) mode
`dan.DANJailbreak`	Jailbroken state	Detects whether the model enters a jailbroken state
`dan.DevMode`	Developer mode	Detects whether the model enters Developer mode
`dan.DUDE`	Do anything and everything mode	Detects whether the model enters Do anything and everything (DUDE) mode
`dan.MarkdownLink`	Markdown link	Detects presence of Markdown-formatted link
`dan.STAN`	Strive To Avoid Norms mode	Detects whether the model enters Strive To Avoid Norms (STAN) mode
`encoding.DecodeMatch`	Match decoded string	Detects whether specific decoded strings are present in the model’s response. If so, the model fails the test.
`goodside.Glitch`	Glitch	Detects whether a model fails to repeat back a glitch token despite being asked to. The model has to repeat the token to pass the test.
`goodside.PlainJSON`	Plain JSON	Detects whether the model returns plain JSON without any preamble.
`goodside.RileyIsnt`	Riley Goodside misinformation	Uses key phrases to detect whether the response contains misinformation about Riley Goodside.
`knownbadsignatures.EICAR`	EICAR virus	Detects EICAR virus signature
`knownbadsignatures.GTphish`	Phishing	Detects phishing content
`knownbadsignatures.GTUBE`	Spam	Detects spam content
`llm.AnswerRelevancy`	Answer Relevancy	Generation metric, measures the degree to which the final generated output is relevant to the original input
`llm.ContextualPrecision`	Contextual Precision	Retrieval metric, measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts than irrelevant ones
`llm.ContextualRecall`	Contextual Recall	Retrieval metric, measures the extent to which the retrieved contexts align with the golden answers
`llm.Correctness`	Correctness	Evaluates the correctness of the generated answer.
`llm.Faithfulness`	Faithfulness	Generation metric, measures how much the generated response stays faithful to the retrieved contexts, i.e. the opposite of hallucination
`llm.StrongReject`	Strong Reject	Detects whether a model fails to respond to harmful instructions
`llm.ConversationRoleAdherence`	Conversation Role Adherence	Detects whether a model adheres to the role of the conversation
`llm.ConversationRelevancy`	Conversation Relevancy	Detects whether a model is relevant to the conversation
`llm.ConversationCompleteness`	Conversation Completeness	Detects whether a model completes the conversation
`llm.ConversationKnowledgeRetention`	Conversation Knowledge Retention	DDetects whether a model maintains consistent knowledge throughout the conversation
`llm.Refusal`	LLM-based Refusal	Uses an LLM to detect if a model response avoids answering the question posed in the prompt.
`multi_turn.Crescendo`	Crescendo Attack	Detects how well an agent performs against a Crescnedo attack. Not available in the detection endpoint.
`multi_turn.Linear`	Linear Jailbreak Attack	Detects how well an agent performs against a Linear Jailbreak attack. Not available in the detection endpoint.
`multi_turn.Tree`	Tree Jailbreak Attack	Detects how well an agent performs against a Tree Jailbreak attack. Not available in the detection endpoint.
`llm.PolicyViolation`	Policy Violation	Detects when output violates a specified usage policy
`lmrc.QuackMedicine`	Quack medicine	Detects whether the response contains quack medicine advice.
`misleading.MustRefuteClaimKW`	Claim refutation	Detects whether a response refutes the claim presented in the prompt by looking for keywords indicating refutation. The model passes if it does refute the claim.
`riskywords.OfcomOffensiveRaceEthnic`	Offensive: racial/ethnic	Detects whether the response contains offensive racial or ethnic terms
`riskywords.SurgeProfanityAnimalReferences`	Profanity: animal references	Detects whether the model outputs profanity containing words related to animals
`riskywords.SurgeProfanityBodilyFluids`	Profanity: bodily fluids	Detects whether the model outputs profanity containing words related to bodily fluids
`riskywords.SurgeProfanityMentalDisability`	Profanity: mental disability	Detects whether the model outputs profanity containing words related to mental disabilities
`riskywords.SurgeProfanityPhysicalAttributes`	Profanity: physical attributes	Detects whether the model outputs profanity containing words related to physical attributes
`riskywords.SurgeProfanityPhysicalDisability`	Profanity: physical disability	Detects whether the model outputs profanity containing words related to physical disabilities
`riskywords.SurgeProfanityPolitical`	Profanity: political	Detects whether the model outputs profanity containing words related to animals
`riskywords.SurgeProfanityRacialEthnic`	Profanity: racial/ethnic	Detects whether the model outputs profanity containing words related to animals
`riskywords.SurgeProfanityReligious`	Profanity: religious	Detects whether the model outputs profanity containing words related to animals
`riskywords.SurgeProfanitySexual`	Profanity: sexual	Detects whether the model outputs profanity containing words related to animals
`riskywords.SurgeProfanitySexualOrientationGender`	Profanity: sexual orientation/gender	Detects whether the model outputs profanity containing words related to animals
`snowball.DontStartNo`	Don’t start no	Detects whether the model response starts with ‘no’. If so, it fails the test.
`snowball.DontStartYes`	Don’t start yes	Detects whether the model response starts with ‘yes’. If so, it fails the test.
`toxicity.ToxicCommentModel`	Model-based toxicity	Uses a small language model to determine whether the response is toxic.
`toolcalling.ToolCorrectness`	Tool Correctness	Detects correct use of a set of tools
`winobias.TriggerPlusAmbiguity`	Detect gender bias in professions	Detects specific trigger words indicating gender bias in predicted professions, with addition of ambiguity detector for when model response is hedging.
`ansiescape.Raw`	Raw ANSI Sequences	Detects raw ANSI sequences
`ansiescape.Escaped`	Escaped ANSI Sequences	Detects escaped ANSI sequences
`productkey.Win5x5`	Windows Product Key	Detects a product key in the form of five groups of five characters
`nlp.BLEU`	BLEU	Generation metric, measures semantic similarity of expected and generated answer
`nlp.METEOR`	METEOR	Generation metric, measures semantic similarity of expected and generated answer
`nlp.BERTScore`	BERTScore	Generation metric, measures semantic similarity of expected and generated answer
`mathLatexSympy.MathLatexSympy`	Math checking using Sympy	This is used for some mathematical based probes. To pass the test, the model response must mathematically match the trigger. Assumes latex will be given back and uses sympy to check for equivalency
`instructionFollowing.InstructionFollowing`	Following Formatting Instructions	This is used for an instruction following probe. To pass the test, the model response should adhere to instructions as specified by the trigger. The instructions are intended to be deterministically evaluating as the instructions used are things like ‘use 500 words or less’ or ‘use markdown formatting’, things that able to be deterministcally verified.