Skip to main content

What is a Detector?

A detector is the engine inside a guard that actually identifies threats. Guards define what category of threat to look for; detectors do the looking. This is the same concept as detectors in evaluation—in fact, many detectors are shared between Diamond (evaluation) and Dome (defense). The difference is context: evaluation detectors analyze probe responses after the fact; defense detectors analyze real traffic in real-time.

Detector Types

Pattern Detectors

Pattern detectors use rules and regular expressions to identify known threat signatures:
DetectorWhat It Finds
Injection patternsKnown prompt injection phrases (“ignore previous”, “new instructions”)
PII patternsRegex for emails, phone numbers, SSNs, credit cards
Secrets patternsAPI key formats, credential patterns
Profanity listsKnown offensive words and phrases
Pattern detectors are fast (sub-millisecond) and deterministic. They catch known threats reliably but miss novel variations.

ML Classifiers

ML classifiers use trained models to detect threats:
DetectorModelWhat It Detects
DeBERTa injectionFine-tuned DeBERTaPrompt injection attempts
Toxicity classifierFine-tuned RoBERTaToxic content categories
PII NERPresidio/spaCyNamed entities that are PII
ML detectors handle variation better than patterns—they catch novel phrasings of known attack types. They’re slower (5-20ms typically) and produce confidence scores rather than binary results.

LLM-as-Judge

LLM judges use language models to evaluate content:
DetectorModelWhat It Evaluates
LlamaGuardLlama-based classifierContent safety categories
GPT-4 judgeGPT-4Complex policy violations
Custom judgeYour choiceDomain-specific rules
LLM judges are the most flexible—they can evaluate nuanced policies that resist simple classification. They’re also the slowest (50-200ms) and most expensive. Use them for high-stakes decisions or as a second opinion on borderline cases.

Heuristic Detectors

Heuristic detectors use domain-specific rules that aren’t simple patterns:
DetectorWhat It Checks
Token anomalyUnusual token distributions suggesting adversarial input
Length anomalyInputs far outside normal length distribution
Encoding detectionPresence of base64, unicode escapes, or other encodings
Language detectionInput language doesn’t match expected
Heuristics catch structural anomalies that might indicate attack attempts, even if the specific attack is novel.

Defense vs. Evaluation Detectors

The detector concept is shared, but defense has additional constraints:
ConcernEvaluationDefense
LatencyDoesn’t matterCritical—every ms affects UX
CostRun once per evaluationRun on every request
AccuracyCan review false positives laterFalse positives block real users
CoverageComprehensive testingFocused on high-risk threats
Defense detectors are tuned for production: faster models, higher thresholds, fewer but more reliable checks.

Detector Composition

Guards combine multiple detectors for defense in depth:
"prompt-injection": {
    "type": "security",
    "methods": [
        "injection-heuristics",    # Fast, catches obvious attacks
        "deberta-injection",       # ML, catches variations
        "llm-judge"                # Slow, catches sophisticated attacks
    ],
    "voting": "any"  # Trigger if any detector fires
}
Composition strategies:
StrategyBehavior
anyTrigger if any detector fires (high recall, more false positives)
allTrigger only if all detectors agree (high precision, may miss attacks)
majorityTrigger if more than half fire (balanced)
weightedTrigger if weighted confidence exceeds threshold

Detector Results

Each detector produces structured results:
{
    "detector": "deberta-injection",
    "triggered": True,
    "confidence": 0.87,
    "latency_ms": 14,
    "evidence": {
        "matched_span": "ignore all previous instructions and",
        "attack_type": "instruction_override",
        "model_output": [0.13, 0.87]  # [safe, injection]
    }
}
Evidence helps you understand why a detector fired—essential for tuning thresholds and investigating false positives.

Custom Detectors

You can add custom detectors for domain-specific threats:
from vijil.dome import Detector

class CompanyNameLeakDetector(Detector):
    def detect(self, text: str) -> DetectorResult:
        # Check for internal company names that shouldn't appear
        internal_names = ["Project Falcon", "Codename Thunder"]
        for name in internal_names:
            if name.lower() in text.lower():
                return DetectorResult(
                    triggered=True,
                    confidence=1.0,
                    evidence={"leaked_name": name}
                )
        return DetectorResult(triggered=False)
Custom detectors integrate into guards like built-in detectors.

Next Steps