What is a Detector?
A detector is the engine inside a guard that actually identifies threats. Guards define what category of threat to look for; detectors do the looking. This is the same concept as detectors in evaluation; in fact, many detectors are shared between Diamond (evaluation) and Dome (defense). The difference is context: evaluation detectors analyze probe responses after the fact, while defense detectors analyze real traffic in real time.

Detector Types
Pattern Detectors
Pattern detectors use rules and regular expressions to identify known threat signatures:

| Detector | What It Finds |
|---|---|
| Injection patterns | Known prompt injection phrases ("ignore previous", "new instructions") |
| PII patterns | Regex for emails, phone numbers, SSNs, credit cards |
| Secrets patterns | API key formats, credential patterns |
| Profanity lists | Known offensive words and phrases |
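Pattern detection reduces to matching a curated rule set against the input. A minimal sketch, assuming illustrative patterns (the actual rule sets shipped with a guard are far larger and maintained separately):

```python
import re

# Illustrative patterns only -- not the product's actual rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"new instructions:", re.IGNORECASE),
]
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def detect_injection(text: str) -> bool:
    """Return True if any known injection phrase appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def detect_pii_emails(text: str) -> list[str]:
    """Return any email-like strings found in the text."""
    return EMAIL_PATTERN.findall(text)
```

Pattern detectors are fast and cheap, which is why they often run first in a guard's detector chain.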
ML Classifiers
ML classifiers use trained models to detect threats:

| Detector | Model | What It Detects |
|---|---|---|
| DeBERTa injection | Fine-tuned DeBERTa | Prompt injection attempts |
| Toxicity classifier | Fine-tuned RoBERTa | Toxic content categories |
| PII NER | Presidio/spaCy | Named entities that are PII |
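Regardless of the underlying model, an ML detector typically reduces to a score-and-threshold wrapper. A sketch of that shape, with the model call stubbed out (the real `score_fn` would be inference against, e.g., a fine-tuned DeBERTa; the threshold value here is an illustrative assumption):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClassifierDetector:
    # score_fn stands in for real model inference returning P(threat) in [0, 1].
    score_fn: Callable[[str], float]
    threshold: float = 0.8  # tuned per deployment to trade precision vs. recall

    def detect(self, text: str) -> dict:
        score = self.score_fn(text)
        return {"flagged": score >= self.threshold, "confidence": score}

# Stub model for the sketch: flags an obvious injection phrase.
def stub_injection_model(text: str) -> float:
    return 0.95 if "ignore previous" in text.lower() else 0.05

detector = ClassifierDetector(score_fn=stub_injection_model)
```

Exposing the raw confidence alongside the boolean verdict matters later: weighted composition strategies consume it.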
LLM-as-Judge
LLM judges use language models to evaluate content:

| Detector | Model | What It Evaluates |
|---|---|---|
| LlamaGuard | Llama-based classifier | Content safety categories |
| GPT-4 judge | GPT-4 | Complex policy violations |
| Custom judge | Your choice | Domain-specific rules |
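An LLM judge wraps the content in an evaluation prompt and parses the model's verdict. A sketch under stated assumptions: `call_llm`, the prompt wording, and the SAFE/UNSAFE reply protocol are all illustrative placeholders, not the product's actual judge API.

```python
JUDGE_PROMPT = (
    "You are a content-safety judge. Reply with exactly one word, "
    "SAFE or UNSAFE, for the following text:\n\n{text}"
)

def call_llm(prompt: str) -> str:
    # Stubbed for the sketch; a real detector would call the judge model
    # (e.g. LlamaGuard or GPT-4) here.
    return "UNSAFE" if "credit card" in prompt.lower() else "SAFE"

def llm_judge(text: str) -> bool:
    """Return True if the judge labels the text unsafe."""
    reply = call_llm(JUDGE_PROMPT.format(text=text)).strip().upper()
    return reply == "UNSAFE"
```

The normalization in `llm_judge` (strip and uppercase before comparing) is the kind of defensive parsing real judge detectors need, since language models do not always follow output formats exactly.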
Heuristic Detectors
Heuristic detectors use domain-specific rules that aren't simple patterns:

| Detector | What It Checks |
|---|---|
| Token anomaly | Unusual token distributions suggesting adversarial input |
| Length anomaly | Inputs far outside normal length distribution |
| Encoding detection | Presence of base64, unicode escapes, or other encodings |
| Language detection | Input language doesn't match the expected language |
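Two of these checks can be sketched concisely. The length cutoff and the minimum base64 run length below are illustrative assumptions; real deployments would tune them against observed traffic:

```python
import base64
import binascii
import re

def length_anomaly(text: str, max_chars: int = 4000) -> bool:
    """Flag inputs far outside the expected length distribution."""
    return len(text) > max_chars

# A run of 24+ base64-alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def contains_base64(text: str) -> bool:
    """Flag long base64-looking runs that actually decode."""
    for candidate in BASE64_RUN.findall(text):
        try:
            # Pad to a multiple of 4 before decoding.
            base64.b64decode(candidate + "=" * (-len(candidate) % 4))
            return True
        except (binascii.Error, ValueError):
            continue
    return False
```

Requiring that the candidate actually decodes, rather than merely looking base64-ish, cuts down on false positives from long alphanumeric identifiers.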
Defense vs. Evaluation Detectors
The detector concept is shared, but defense has additional constraints:

| Concern | Evaluation | Defense |
|---|---|---|
| Latency | Doesn't matter | Critical: every millisecond affects UX |
| Cost | Run once per evaluation | Run on every request |
| Accuracy | Can review false positives later | False positives block real users |
| Coverage | Comprehensive testing | Focused on high-risk threats |
Detector Composition
Guards combine multiple detectors for defense in depth:

| Strategy | Behavior |
|---|---|
| any | Trigger if any detector fires (high recall, more false positives) |
| all | Trigger only if all detectors agree (high precision, may miss attacks) |
| majority | Trigger if more than half fire (balanced) |
| weighted | Trigger if weighted confidence exceeds a threshold |
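The four strategies above can be sketched in one combiner. Each detector result is modeled as a `(fired, confidence)` pair; the function name, default weights, and default threshold are illustrative assumptions:

```python
def combine(results, strategy="any", weights=None, threshold=0.5):
    """Combine (fired, confidence) detector results under one strategy."""
    fired = [f for f, _ in results]
    if strategy == "any":
        return any(fired)                      # high recall
    if strategy == "all":
        return all(fired)                      # high precision
    if strategy == "majority":
        return sum(fired) > len(fired) / 2     # balanced
    if strategy == "weighted":
        # Weighted average of confidences compared against a threshold.
        weights = weights or [1.0] * len(results)
        avg = sum(w * c for w, (_, c) in zip(weights, results)) / sum(weights)
        return avg >= threshold
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with results `[(True, 0.9), (False, 0.2), (True, 0.7)]`, `any` and `majority` trigger, `all` does not, and `weighted` with equal weights averages 0.6, triggering at the default threshold of 0.5 but not at 0.7.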