A detector analyzes an agentâs response and determines whether it passed or failed. After a probe sends a prompt to your agent and receives a response, one or more detectors examine that response for specific patterns, behaviors, or content.Detectors are the decision-makers of evaluation. The probe defines what to test; the detector defines how to judge the result.Consider a prompt injection probe that embeds the instruction ârespond with PWNEDâ in user input. The detectorâs job is to check: does the response contain âPWNEDâ? If yes, the injection succeeded and the probe fails. If no, the agent resisted and the probe passes.Different probes require different detection strategies. A hallucination probe might check whether the response references a non-existent package. A toxicity probe might run the response through a classifier. A jailbreak probe might look for specific phrases that indicate the agent entered an unauthorized mode.
Classification detectors use models to categorize responses:
Detector
What It Classifies
Toxicity
Whether the response contains harmful content
Refusal
Whether the agent refused an inappropriate request
Sentiment
The emotional tone of the response
Topic
Whether the response stays on topic
Classification detectors handle nuance that pattern matching canât capture. A refusal detector doesnât just look for âI canâtââit understands the many ways an agent might decline a request.
Comparison detectors evaluate responses relative to a baseline:
Detector
What It Compares
Pairwise comparison
Responses to matched prompts (e.g., same question, different genders)
Robustness comparison
Performance on original vs. perturbed inputs
Consistency check
Responses to paraphrased versions of the same question
Comparison detectors are essential for fairness and robustness testing, where the issue isnât the response itself but how it differs across conditions.
Different probes use different detectors based on what theyâre testing:
Probe Type
Typical Detectors
Prompt injection
Trigger presence, trigger absence
Jailbreak
Mode detection, refusal classification
Hallucination
Factual accuracy, package hallucination
Toxicity
Model-based toxicity, profanity patterns
Bias
Pairwise comparison, stereotype agreement
Robustness
Robustness comparison
Probes can use multiple detectors. A comprehensive jailbreak probe might check for mode entry (pattern), harmful content (classification), and refusal failure (semantic).