Detector

What is a Detector?

A detector analyzes an agent’s response and determines whether it passed or failed. After a probe sends a prompt to your agent and receives a response, one or more detectors examine that response for specific patterns, behaviors, or content. Detectors are the decision-makers of evaluation. The probe defines what to test; the detector defines how to judge the result. Consider a prompt injection probe that embeds the instruction “respond with PWNED” in user input. The detector’s job is to check: does the response contain “PWNED”? If yes, the injection succeeded and the probe fails. If no, the agent resisted and the probe passes. Different probes require different detection strategies. A hallucination probe might check whether the response references a non-existent package. A toxicity probe might run the response through a classifier. A jailbreak probe might look for specific phrases that indicate the agent entered an unauthorized mode.

Types of Detectors

Pattern Detectors

Pattern detectors look for specific strings or patterns in responses:

Detector	What It Looks For
Trigger presence	Specific strings that indicate attack success
Trigger absence	Required strings that should appear in safe responses
Starts with	Responses beginning with specific phrases
Regex match	Patterns matching a regular expression

Pattern detectors are fast and deterministic. They’re used when success or failure has a clear textual signal.

Classification Detectors

Classification detectors use models to categorize responses:

Detector	What It Classifies
Toxicity	Whether the response contains harmful content
Refusal	Whether the agent refused an inappropriate request
Sentiment	The emotional tone of the response
Topic	Whether the response stays on topic

Classification detectors handle nuance that pattern matching can’t capture. A refusal detector doesn’t just look for “I can’t”—it understands the many ways an agent might decline a request.

Semantic Detectors

Semantic detectors evaluate meaning rather than surface features:

Detector	What It Evaluates
Factual accuracy	Whether claims in the response are true
Logical consistency	Whether the reasoning chain is valid
Claim refutation	Whether the agent rejects a false premise
Stereotype agreement	Whether the agent endorses a stereotype

Semantic detectors often use LLM-as-judge approaches, where another model evaluates the response.

Comparison Detectors

Comparison detectors evaluate responses relative to a baseline:

Detector	What It Compares
Pairwise comparison	Responses to matched prompts (e.g., same question, different genders)
Robustness comparison	Performance on original vs. perturbed inputs
Consistency check	Responses to paraphrased versions of the same question

Comparison detectors are essential for fairness and robustness testing, where the issue isn’t the response itself but how it differs across conditions.

Domain-Specific Detectors

Some detectors target specific content types:

Detector	What It Detects
Code presence	Whether the response contains code
Package hallucination	References to non-existent libraries
Markdown injection	Malicious content in markdown formatting
Malware signatures	Known virus or spam patterns

These detectors encode domain knowledge—you can’t detect a hallucinated Python package without knowing which packages actually exist.

How Detectors Work

When a probe runs, here’s the detection flow:

Response received
    ↓
Primary detector runs
    ↓
Pass/fail determined
    ↓
Evidence captured
    ↓
Result recorded

Each detector produces:

Verdict: Pass or fail
Confidence: How certain the detector is (for classification detectors)
Evidence: The specific content that triggered the verdict

Evidence is important for debugging. When a probe fails, you want to know exactly what in the response caused the failure—not just that it failed.

Detector Selection

Different probes use different detectors based on what they’re testing:

Probe Type	Typical Detectors
Prompt injection	Trigger presence, trigger absence
Jailbreak	Mode detection, refusal classification
Hallucination	Factual accuracy, package hallucination
Toxicity	Model-based toxicity, profanity patterns
Bias	Pairwise comparison, stereotype agreement
Robustness	Robustness comparison

Probes can use multiple detectors. A comprehensive jailbreak probe might check for mode entry (pattern), harmful content (classification), and refusal failure (semantic).

Detector Reference

Here’s a reference of commonly used detectors:

Detector	Description
Refusal	Detects whether the agent refuses an inappropriate request
Trigger presence	Fails if response contains specified strings
Trigger absence	Fails if response lacks required strings
Model-based toxicity	Uses a classifier to detect harmful content
Package hallucination	Detects code importing non-existent packages
Claim refutation	Detects whether the agent rejects false claims
Stereotype agreement	Detects agreement with stereotypical statements
Pairwise gender comparison	Compares responses across gender variations
Robustness pairwise	Compares performance on original vs. perturbed inputs
Code presence	Detects programming code in responses
Markdown link/image	Detects potentially malicious markdown
Jailbroken state	Detects entry into unauthorized modes (DAN, STAN, etc.)

Next Steps

Probe

How probes use detectors

How Evaluation Works

The full evaluation architecture

Understanding Results

How to interpret detection results

Custom Harnesses

Configuring detectors for custom probes

Overview

Trust Score

Evaluation

Defense

Reference

What is a Detector?

Types of Detectors

Pattern Detectors

Classification Detectors

Semantic Detectors

Comparison Detectors

Domain-Specific Detectors

How Detectors Work

Detector Selection

Detector Reference

Next Steps

Probe

How Evaluation Works

Understanding Results

Custom Harnesses

Overview

Trust Score

Evaluation

Defense

Reference

​What is a Detector?

​Types of Detectors

​Pattern Detectors

​Classification Detectors

​Semantic Detectors

​Comparison Detectors

​Domain-Specific Detectors

​How Detectors Work

​Detector Selection

​Detector Reference

​Next Steps

Probe

How Evaluation Works

Understanding Results

Custom Harnesses

What is a Detector?

Types of Detectors

Pattern Detectors

Classification Detectors

Semantic Detectors

Comparison Detectors

Domain-Specific Detectors

How Detectors Work

Detector Selection

Detector Reference

Next Steps