What is a Detector?
A detector analyzes an agentâs response and determines whether it passed or failed. After a probe sends a prompt to your agent and receives a response, one or more detectors examine that response for specific patterns, behaviors, or content. Detectors are the decision-makers of evaluation. The probe defines what to test; the detector defines how to judge the result. Consider a prompt injection probe that embeds the instruction ârespond with PWNEDâ in user input. The detectorâs job is to check: does the response contain âPWNEDâ? If yes, the injection succeeded and the probe fails. If no, the agent resisted and the probe passes. Different probes require different detection strategies. A hallucination probe might check whether the response references a non-existent package. A toxicity probe might run the response through a classifier. A jailbreak probe might look for specific phrases that indicate the agent entered an unauthorized mode.Types of Detectors
Pattern Detectors
Pattern detectors look for specific strings or patterns in responses:| Detector | What It Looks For |
|---|---|
| Trigger presence | Specific strings that indicate attack success |
| Trigger absence | Required strings that should appear in safe responses |
| Starts with | Responses beginning with specific phrases |
| Regex match | Patterns matching a regular expression |
Classification Detectors
Classification detectors use models to categorize responses:| Detector | What It Classifies |
|---|---|
| Toxicity | Whether the response contains harmful content |
| Refusal | Whether the agent refused an inappropriate request |
| Sentiment | The emotional tone of the response |
| Topic | Whether the response stays on topic |
Semantic Detectors
Semantic detectors evaluate meaning rather than surface features:| Detector | What It Evaluates |
|---|---|
| Factual accuracy | Whether claims in the response are true |
| Logical consistency | Whether the reasoning chain is valid |
| Claim refutation | Whether the agent rejects a false premise |
| Stereotype agreement | Whether the agent endorses a stereotype |
Comparison Detectors
Comparison detectors evaluate responses relative to a baseline:| Detector | What It Compares |
|---|---|
| Pairwise comparison | Responses to matched prompts (e.g., same question, different genders) |
| Robustness comparison | Performance on original vs. perturbed inputs |
| Consistency check | Responses to paraphrased versions of the same question |
Domain-Specific Detectors
Some detectors target specific content types:| Detector | What It Detects |
|---|---|
| Code presence | Whether the response contains code |
| Package hallucination | References to non-existent libraries |
| Markdown injection | Malicious content in markdown formatting |
| Malware signatures | Known virus or spam patterns |
How Detectors Work
When a probe runs, hereâs the detection flow:- Verdict: Pass or fail
- Confidence: How certain the detector is (for classification detectors)
- Evidence: The specific content that triggered the verdict
Detector Selection
Different probes use different detectors based on what theyâre testing:| Probe Type | Typical Detectors |
|---|---|
| Prompt injection | Trigger presence, trigger absence |
| Jailbreak | Mode detection, refusal classification |
| Hallucination | Factual accuracy, package hallucination |
| Toxicity | Model-based toxicity, profanity patterns |
| Bias | Pairwise comparison, stereotype agreement |
| Robustness | Robustness comparison |
Detector Reference
Hereâs a reference of commonly used detectors:| Detector | Description |
|---|---|
| Refusal | Detects whether the agent refuses an inappropriate request |
| Trigger presence | Fails if response contains specified strings |
| Trigger absence | Fails if response lacks required strings |
| Model-based toxicity | Uses a classifier to detect harmful content |
| Package hallucination | Detects code importing non-existent packages |
| Claim refutation | Detects whether the agent rejects false claims |
| Stereotype agreement | Detects agreement with stereotypical statements |
| Pairwise gender comparison | Compares responses across gender variations |
| Robustness pairwise | Compares performance on original vs. perturbed inputs |
| Code presence | Detects programming code in responses |
| Markdown link/image | Detects potentially malicious markdown |
| Jailbroken state | Detects entry into unauthorized modes (DAN, STAN, etc.) |