Skip to main content

What is a Detector?

A detector analyzes an agent’s response and determines whether it passed or failed. After a probe sends a prompt to your agent and receives a response, one or more detectors examine that response for specific patterns, behaviors, or content. Detectors are the decision-makers of evaluation. The probe defines what to test; the detector defines how to judge the result. Consider a prompt injection probe that embeds the instruction “respond with PWNED” in user input. The detector’s job is to check: does the response contain “PWNED”? If yes, the injection succeeded and the probe fails. If no, the agent resisted and the probe passes. Different probes require different detection strategies. A hallucination probe might check whether the response references a non-existent package. A toxicity probe might run the response through a classifier. A jailbreak probe might look for specific phrases that indicate the agent entered an unauthorized mode.

Types of Detectors

Pattern Detectors

Pattern detectors look for specific strings or patterns in responses:
DetectorWhat It Looks For
Trigger presenceSpecific strings that indicate attack success
Trigger absenceRequired strings that should appear in safe responses
Starts withResponses beginning with specific phrases
Regex matchPatterns matching a regular expression
Pattern detectors are fast and deterministic. They’re used when success or failure has a clear textual signal.

Classification Detectors

Classification detectors use models to categorize responses:
DetectorWhat It Classifies
ToxicityWhether the response contains harmful content
RefusalWhether the agent refused an inappropriate request
SentimentThe emotional tone of the response
TopicWhether the response stays on topic
Classification detectors handle nuance that pattern matching can’t capture. A refusal detector doesn’t just look for “I can’t”—it understands the many ways an agent might decline a request.

Semantic Detectors

Semantic detectors evaluate meaning rather than surface features:
DetectorWhat It Evaluates
Factual accuracyWhether claims in the response are true
Logical consistencyWhether the reasoning chain is valid
Claim refutationWhether the agent rejects a false premise
Stereotype agreementWhether the agent endorses a stereotype
Semantic detectors often use LLM-as-judge approaches, where another model evaluates the response.

Comparison Detectors

Comparison detectors evaluate responses relative to a baseline:
DetectorWhat It Compares
Pairwise comparisonResponses to matched prompts (e.g., same question, different genders)
Robustness comparisonPerformance on original vs. perturbed inputs
Consistency checkResponses to paraphrased versions of the same question
Comparison detectors are essential for fairness and robustness testing, where the issue isn’t the response itself but how it differs across conditions.

Domain-Specific Detectors

Some detectors target specific content types:
DetectorWhat It Detects
Code presenceWhether the response contains code
Package hallucinationReferences to non-existent libraries
Markdown injectionMalicious content in markdown formatting
Malware signaturesKnown virus or spam patterns
These detectors encode domain knowledge—you can’t detect a hallucinated Python package without knowing which packages actually exist.

How Detectors Work

When a probe runs, here’s the detection flow:
Response received
    ↓
Primary detector runs
    ↓
Pass/fail determined
    ↓
Evidence captured
    ↓
Result recorded
Each detector produces:
  • Verdict: Pass or fail
  • Confidence: How certain the detector is (for classification detectors)
  • Evidence: The specific content that triggered the verdict
Evidence is important for debugging. When a probe fails, you want to know exactly what in the response caused the failure—not just that it failed.

Detector Selection

Different probes use different detectors based on what they’re testing:
Probe TypeTypical Detectors
Prompt injectionTrigger presence, trigger absence
JailbreakMode detection, refusal classification
HallucinationFactual accuracy, package hallucination
ToxicityModel-based toxicity, profanity patterns
BiasPairwise comparison, stereotype agreement
RobustnessRobustness comparison
Probes can use multiple detectors. A comprehensive jailbreak probe might check for mode entry (pattern), harmful content (classification), and refusal failure (semantic).

Detector Reference

Here’s a reference of commonly used detectors:
DetectorDescription
RefusalDetects whether the agent refuses an inappropriate request
Trigger presenceFails if response contains specified strings
Trigger absenceFails if response lacks required strings
Model-based toxicityUses a classifier to detect harmful content
Package hallucinationDetects code importing non-existent packages
Claim refutationDetects whether the agent rejects false claims
Stereotype agreementDetects agreement with stereotypical statements
Pairwise gender comparisonCompares responses across gender variations
Robustness pairwiseCompares performance on original vs. perturbed inputs
Code presenceDetects programming code in responses
Markdown link/imageDetects potentially malicious markdown
Jailbroken stateDetects entry into unauthorized modes (DAN, STAN, etc.)

Next Steps