# Detector

> The detection engines that identify threats within Guards.

## What is a Detector?

A Detector is the engine inside a [Guard](/concepts/defense/guard) that actually identifies threats. Guards define what category of threat to look for; Detectors do the looking.

This is the same concept as Detectors in evaluation. In fact, many Detectors are shared between **Diamond** (evaluation) and **Dome** (defense). The difference is context: evaluation Detectors analyze [Probe](/concepts/evaluation-components/probe) responses after the fact, while defense Detectors analyze live traffic in real time.

## Detector Types

### Pattern Detectors

Pattern Detectors use rules and regular expressions to identify known threat signatures:

| Detector               | What It Finds                                                          |
| ---------------------- | ---------------------------------------------------------------------- |
| **Injection patterns** | Known prompt injection phrases ("ignore previous", "new instructions") |
| **PII patterns**       | Regex for emails, phone numbers, SSNs, credit cards                    |
| **Secrets patterns**   | API key formats, credential patterns                                   |
| **Profanity lists**    | Known offensive words and phrases                                      |

Pattern Detectors are fast (sub-millisecond) and deterministic. They catch known threats reliably but miss novel variations.
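
As a rough illustration of the trade-off described above, the sketch below matches a handful of signatures with Python's `re` module. The pattern list and the `find_matches` helper are hypothetical and are not Dome's built-in rules.

```python
import re

# Illustrative signatures only; a production list would be far larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous", re.IGNORECASE),
    re.compile(r"new\s+instructions", re.IGNORECASE),
]
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_matches(text: str) -> list[dict]:
    """Return one record per signature match: injection phrases first, then emails."""
    matches = []
    for pattern in INJECTION_PATTERNS:
        for m in pattern.finditer(text):
            matches.append({"kind": "injection", "span": m.group(0)})
    for m in EMAIL_PATTERN.finditer(text):
        matches.append({"kind": "pii_email", "span": m.group(0)})
    return matches
```

A rephrased attack such as "disregard what you were told earlier" returns no matches here, which is the gap the ML classifiers in the next section are meant to cover.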

### ML Classifiers

ML classifiers use trained models to detect threats:

| Detector                | Model              | What It Detects             |
| ----------------------- | ------------------ | --------------------------- |
| **DeBERTa injection**   | Fine-tuned DeBERTa | Prompt injection attempts   |
| **Toxicity classifier** | Fine-tuned RoBERTa | Toxic content categories    |
| **PII NER**             | Presidio/spaCy     | Named entities that are PII |

ML Detectors handle variation better than patterns: they catch novel phrasings of known attack types. They are slower (typically 5-20ms) and produce confidence scores rather than binary results.

### LLM-as-Judge

LLM judges use language models to evaluate content:

| Detector         | Model                  | What It Evaluates         |
| ---------------- | ---------------------- | ------------------------- |
| **LlamaGuard**   | Llama-based classifier | Content safety categories |
| **GPT-4 judge**  | GPT-4                  | Complex policy violations |
| **Custom judge** | Your choice            | Domain-specific rules     |

LLM judges are the most flexible: they can evaluate nuanced policies that resist simple classification. They are also the slowest (50-200ms) and most expensive. Use them for high-stakes decisions or as a second opinion on borderline cases.

### Heuristic Detectors

Heuristic Detectors use domain-specific rules that are not simple patterns:

| Detector               | What It Checks                                           |
| ---------------------- | -------------------------------------------------------- |
| **Token anomaly**      | Unusual token distributions suggesting adversarial input |
| **Length anomaly**     | Inputs far outside normal length distribution            |
| **Encoding detection** | Presence of base64, unicode escapes, or other encodings  |
| **Language detection** | Input language does not match expected                   |

Heuristics catch structural anomalies that might indicate attack attempts, even if the specific attack is novel.
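
To make the idea concrete, here is a minimal sketch of a length-anomaly check based on a z-score against previously observed input lengths. The function name, the `baseline_lengths` sample, and the `max_z` cutoff are assumptions for illustration; the source does not describe how Dome computes this heuristic.

```python
from statistics import mean, stdev


def is_length_anomaly(text: str, baseline_lengths: list[int], max_z: float = 3.0) -> bool:
    """Flag text whose character length is far from the baseline distribution.

    baseline_lengths needs at least two samples so a standard deviation exists.
    """
    mu = mean(baseline_lengths)
    sigma = stdev(baseline_lengths)
    if sigma == 0:
        # All baseline inputs had the same length; any deviation is unusual.
        return len(text) != mu
    z_score = abs(len(text) - mu) / sigma
    return z_score > max_z
```

The check knows nothing about attack content. It only reports that an input is structurally unlike normal traffic, so it can fire on an attack that no pattern or model has seen before.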

## Defense vs. Evaluation Detectors

The Detector concept is shared, but defense has additional constraints:

| Concern      | Evaluation                       | Defense                          |
| ------------ | -------------------------------- | -------------------------------- |
| **Latency**  | Does not matter                  | Critical: every ms affects UX    |
| **Cost**     | Run once per evaluation          | Run on every request             |
| **Accuracy** | Can review false positives later | False positives block real users |
| **Coverage** | Comprehensive testing            | Focused on high-risk threats     |

Defense Detectors are tuned for production: faster models, higher thresholds, fewer but more reliable checks.

## Detector Composition

Guards combine multiple Detectors for defense in depth:

```python theme={null}
"prompt-injection": {
    "type": "security",
    "methods": [
        "injection-heuristics",    # Fast, catches obvious attacks
        "deberta-injection",       # ML, catches variations
        "llm-judge"                # Slow, catches sophisticated attacks
    ],
    "voting": "any"  # Trigger if any detector fires
}
```

Composition strategies:

| Strategy   | Behavior                                                               |
| ---------- | ---------------------------------------------------------------------- |
| `any`      | Trigger if any Detector fires (high recall, more false positives)      |
| `all`      | Trigger only if all Detectors agree (high precision, may miss attacks) |
| `majority` | Trigger if more than half fire (balanced)                              |
| `weighted` | Trigger if weighted confidence exceeds threshold                       |
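
The strategies in the table can be sketched as a single function over per-Detector results. This is an illustration, not Dome's implementation: the `combine` helper is hypothetical, and the `weighted` branch assumes a weighted mean of confidence scores compared against a threshold.

```python
def combine(results, strategy="any", weights=None, threshold=0.5):
    """Combine Detector results into a single Guard decision.

    results: list of dicts with "detector", "triggered", and "confidence" keys.
    weights: optional mapping of detector name to weight (defaults to 1.0 each).
    """
    if not results:
        return False
    fired = [r["triggered"] for r in results]
    if strategy == "any":
        return any(fired)
    if strategy == "all":
        return all(fired)
    if strategy == "majority":
        return sum(fired) > len(fired) / 2
    if strategy == "weighted":
        weights = weights or {}
        total = sum(weights.get(r["detector"], 1.0) for r in results)
        if total == 0:
            return False
        score = sum(weights.get(r["detector"], 1.0) * r["confidence"] for r in results) / total
        return score > threshold
    raise ValueError(f"unknown strategy: {strategy}")
```

With one Detector silent and two firing, `any` and `majority` trigger while `all` does not, which shows how the same evidence leads to different outcomes depending on the precision and recall you want.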

## Detector Results

Each Detector produces structured results:

```python theme={null}
{
    "detector": "deberta-injection",
    "triggered": True,
    "confidence": 0.87,
    "latency_ms": 14,
    "evidence": {
        "matched_span": "ignore all previous instructions and",
        "attack_type": "instruction_override",
        "model_output": [0.13, 0.87]  # [safe, injection]
    }
}
```

Evidence helps you understand why a Detector fired, which is essential for tuning thresholds and investigating false positives.

## Custom Detectors

You can add custom Detectors for domain-specific threats:

```python theme={null}
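# Assumption: DetectorResult is importable from the same module as Detector.
from vijil.dome import Detector, DetectorResult

class CompanyNameLeakDetector(Detector):
    def detect(self, text: str) -> DetectorResult:
        # Check for internal company names that shouldn't appear
        internal_names = ["Project Falcon", "Codename Thunder"]
        lowered = text.lower()  # Lowercase once for case-insensitive matching
        for name in internal_names:
            if name.lower() in lowered: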
                return DetectorResult(
                    triggered=True,
                    confidence=1.0,
                    evidence={"leaked_name": name}
                )
        return DetectorResult(triggered=False)
```

Custom Detectors integrate into Guards like built-in Detectors.

## Detection Methods

Vijil Dome has built-in detection methods that give Detectors their ability to identify issues. These methods are used to [Configure Guardrails](/developer-guide/protect/configuring-guardrails) using a TOML file or dictionary.\
The detection methods are grouped under these five categories:

* Security
* Moderation
* Privacy
* Integrity
* Generic

For each method, this section covers the model or service powering it and all of its configurable parameters. When configuring Dome, pass parameters as key-value pairs under the detection method, as in this example.

<CodeGroup>
  ```toml title="TOML" icon="" theme={null}
  [prompt-injection]
  type = "security"
  methods = ["prompt-injection-mbert"]
  # Configuring a parameter
  [prompt-injection.prompt-injection-mbert]
  window_stride = 128  # More overlap for thorough detection
  ```
</CodeGroup>

The corresponding dictionary config looks like this:

<CodeGroup>
  ```python title="Python" icon="python" theme={null}
  config = {
      "input-guards": ["prompt-injection"],
      "prompt-injection": {
          "type": "security",
          "methods": ["prompt-injection-mbert"],
          # Configuring a parameter
          "prompt-injection-mbert": {
              "window_stride": 128,
          },
      },
  }
  ```
</CodeGroup>
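
To see why a smaller `window_stride` means more overlap, the sketch below computes the token ranges a sliding window would cover. It assumes windows start at multiples of `window_stride` and span at most `max_length` tokens; the `window_spans` helper is hypothetical and Dome's internal windowing may differ.

```python
def window_spans(num_tokens: int, max_length: int, window_stride: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges; end is exclusive."""
    if max_length <= 0 or window_stride <= 0:
        raise ValueError("max_length and window_stride must be positive")
    if num_tokens <= max_length:
        # Short inputs fit in a single window, so no sliding is needed.
        return [(0, num_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + max_length, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += window_stride
    return spans
```

Under these assumptions, a 10,000-token input with `max_length = 8192` is covered by 2 windows at a stride of 4096 and by 16 windows at a stride of 128. Each extra window is another model pass, so the added thoroughness costs latency.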

Now that you have looked at how the parameters are configured, you can dive into the detection methods.

### Security

The detection methods under security give Detectors the ability to detect adversarial inputs like prompt injections, jailbreak attempts, and encoded/obfuscated payloads.
They include the following:

1. `prompt-injection-mbert`\
   This is Vijil's ModernBERT model for prompt injection detection. It supports up to 8,192 tokens natively, so sliding windows only activate for very long inputs. Its parameters include the following:

   | Parameter         | Type    | Default | Description                                        |
   | ----------------- | ------- | ------- | -------------------------------------------------- |
   | `score_threshold` | `float` | `0.5`   | Injection probability above which input is flagged |
   | `truncation`      | `bool`  | `True`  | Truncate inputs exceeding `max_length`             |
   | `max_length`      | `int`   | `8192`  | Maximum tokens per window                          |
   | `window_stride`   | `int`   | `4096`  | Token step size between sliding windows            |

2. `prompt-injection-deberta-finetuned-11122024`\
   This is a Vijil-finetuned DeBERTa model for prompt injection detection. Its parameters include the following:

   | Parameter       | Type   | Default | Description                               |
   | --------------- | ------ | ------- | ----------------------------------------- |
   | `truncation`    | `bool` | `True`  | Truncate inputs exceeding `max_length`    |
   | `max_length`    | `int`  | `512`   | Maximum tokens per window (DeBERTa limit) |
   | `window_stride` | `int`  | `256`   | Token step size between sliding windows   |

3. `prompt-injection-deberta-v3-base`\
   This is a DeBERTa v3 model for prompt injection detection. It has the following configurable parameters:

   | Parameter       | Type   | Default | Description                               |
   | --------------- | ------ | ------- | ----------------------------------------- |
   | `truncation`    | `bool` | `True`  | Truncate inputs exceeding `max_length`    |
   | `max_length`    | `int`  | `512`   | Maximum tokens per window (DeBERTa limit) |
   | `window_stride` | `int`  | `256`   | Token step size between sliding windows   |

4. `security-promptguard`\
   This is the Meta Prompt Guard model for jailbreak and prompt injection detection. It has the following parameters:

   | Parameter         | Type    | Default | Description                             |
   | ----------------- | ------- | ------- | --------------------------------------- |
   | `score_threshold` | `float` | `0.5`   | Jailbreak probability threshold         |
   | `truncation`      | `bool`  | `True`  | Truncate inputs exceeding `max_length`  |
   | `max_length`      | `int`   | `512`   | Maximum tokens per window               |
   | `window_stride`   | `int`   | `256`   | Token step size between sliding windows |

5. `security-llm`\
   This is an LLM-based security classification model served via LiteLLM. Its configurable parameters include:

   | Parameter         | Type  | Default         | Description                            |
   | ----------------- | ----- | --------------- | -------------------------------------- |
   | `hub_name`        | `str` | `"openai"`      | LLM API provider                       |
   | `model_name`      | `str` | `"gpt-4-turbo"` | Model name                             |
   | `api_key`         | `str` | `None`          | API key (falls back to env var)        |
   | `max_input_chars` | `int` | `None`          | Truncate input to this many characters |

6. `security-embeddings`\
   This provides jailbreak detection via embedding similarity against a known-jailbreak corpus. It supports various embedding engines and models. Its parameters include:

   | Parameter   | Type    | Default                  | Description               |
   | ----------- | ------- | ------------------------ | ------------------------- |
   | `engine`    | `str`   | `"SentenceTransformers"` | Embedding engine          |
   | `model`     | `str`   | `"all-MiniLM-L6-v2"`     | Embedding model name      |
   | `threshold` | `float` | `0.7`                    | Similarity threshold      |
   | `in_mem`    | `bool`  | `True`                   | Load embeddings in memory |

7. `jb-length-per-perplexity`\
   This is a perplexity-based heuristic that flags jailbreaks by their length-to-perplexity
   ratio. It has the following parameters:

   | Parameter       | Type    | Default        | Description                       |
   | --------------- | ------- | -------------- | --------------------------------- |
   | `model_id`      | `str`   | `"gpt2-large"` | HuggingFace model for perplexity  |
   | `batch_size`    | `int`   | `16`           | Batch size                        |
   | `stride_length` | `int`   | `512`          | Stride for perplexity calculation |
   | `threshold`     | `float` | `89.79`        | Length-per-perplexity threshold   |

8. `jb-prefix-suffix-perplexity`\
   This is a perplexity-based heuristic that analyses the prefix and suffix of inputs
   separately. It flags jailbreaks by their prefix and suffix perplexity scores. Its parameters include the following:

   | Parameter          | Type    | Default        | Description                       |
   | ------------------ | ------- | -------------- | --------------------------------- |
   | `model_id`         | `str`   | `"gpt2-large"` | HuggingFace model for perplexity  |
   | `batch_size`       | `int`   | `16`           | Batch size                        |
   | `stride_length`    | `int`   | `512`          | Stride for perplexity calculation |
   | `prefix_threshold` | `float` | `1845.65`      | Prefix perplexity threshold       |
   | `suffix_threshold` | `float` | `1845.65`      | Suffix perplexity threshold       |
   | `prefix_length`    | `int`   | `20`           | Number of prefix words to analyse |
   | `suffix_length`    | `int`   | `20`           | Number of suffix words to analyse |

9. `encoding-heuristics`\
   This is a rule-based Detector for encoded or obfuscated payloads (base64, ROT13, hex,
   URL encoding, Unicode tricks, etc.). It flags inputs as suspicious based on the presence of encoding patterns and their proportion in the text. Its parameters include:

   | Parameter       | Type   | Default       | Description                  |
   | --------------- | ------ | ------------- | ---------------------------- |
   | `threshold_map` | `dict` | *(see below)* | Per-encoding-type thresholds |

   Default `threshold_map`:

   | Encoding Type          | Threshold |
   | ---------------------- | --------- |
   | `base64`               | `0.7`     |
   | `rot13`                | `0.7`     |
   | `ascii_escape`         | `0.05`    |
   | `hex_encoding`         | `0.15`    |
   | `url_encoding`         | `0.15`    |
   | `cyrillic_homoglyphs`  | `0.05`    |
   | `mixed_scripts`        | `0.05`    |
   | `zero_width`           | `0.01`    |
   | `excessive_whitespace` | `0.4`     |
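
One way to picture how a proportion is compared against `threshold_map` is the sketch below. The regular expressions and the "matched characters divided by total characters" rule are assumptions for illustration and are not the rules Dome ships. Only three encoding types are shown.

```python
import re

# Hypothetical patterns; the built-in Detector's rules are not documented here.
PATTERNS = {
    "base64": re.compile(r"[A-Za-z0-9+/]{16,}={0,2}"),
    "url_encoding": re.compile(r"(?:%[0-9A-Fa-f]{2})+"),
    "zero_width": re.compile("[\u200b\u200c\u200d\ufeff]+"),
}
# Thresholds copied from the default threshold_map above.
THRESHOLDS = {"base64": 0.7, "url_encoding": 0.15, "zero_width": 0.01}


def encoding_proportions(text: str) -> dict[str, float]:
    """Share of characters covered by each encoding pattern."""
    if not text:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(len(m.group(0)) for m in pattern.finditer(text)) / len(text)
        for name, pattern in PATTERNS.items()
    }


def flagged_encodings(text: str, threshold_map: dict[str, float] = THRESHOLDS) -> list[str]:
    """Encoding types whose share meets or exceeds their threshold."""
    proportions = encoding_proportions(text)
    return sorted(name for name, share in proportions.items() if share >= threshold_map[name])
```

The low default for `zero_width` (0.01) compared with `base64` (0.7) fits this reading: a few invisible characters are already suspicious, whereas short base64-looking runs are common in ordinary text.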

### Moderation

Detection methods under moderation enable Detectors to identify content that violates content policies, such as hate speech, violence, adult content, toxic content, and more. They include the following:

1. `moderation-mbert`\
   This is Vijil's ModernBERT model for toxic content detection. It supports up to 8,192
   tokens natively and has the following parameters:

   | Parameter         | Type    | Default | Description                             |
   | ----------------- | ------- | ------- | --------------------------------------- |
   | `score_threshold` | `float` | `0.5`   | Toxicity probability threshold          |
   | `truncation`      | `bool`  | `True`  | Truncate inputs exceeding `max_length`  |
   | `max_length`      | `int`   | `8192`  | Maximum tokens per window               |
   | `window_stride`   | `int`   | `4096`  | Token step size between sliding windows |

2. `moderations-oai-api`\
   This is OpenAI's Moderation API with per-category score thresholds. It has the following parameters:

   | Parameter              | Type   | Default | Description                    |
   | ---------------------- | ------ | ------- | ------------------------------ |
   | `score_threshold_dict` | `dict` | `None`  | Custom thresholds per category |

   Supported categories include:\
   `hate`, `hate/threatening`, `self-harm`, `sexual`,
   `sexual/minors`, `violence`, `violence/graphic`, `harassment`,
   `harassment/threatening`, `illegal`, `illicit`, `self-harm/intent`,
   `self-harm/instructions`, `sexual/instructions`.\
   This detection method requires you to set up the `OPENAI_API_KEY` environment variable.

3. `moderation-deberta`\
   This is a DeBERTa model for toxicity scoring. The 208-token context window means the
   sliding window activates for most non-trivial inputs. Its parameters include the following:

   | Parameter       | Type   | Default | Description                                   |
   | --------------- | ------ | ------- | --------------------------------------------- |
   | `truncation`    | `bool` | `True`  | Truncate inputs exceeding `max_length`        |
   | `max_length`    | `int`  | `208`   | Maximum tokens per window                     |
   | `window_stride` | `int`  | `104`   | Token step size between sliding windows       |
   | `device`        | `str`  | `None`  | Torch device (auto-selects CUDA if available) |

4. `moderation-perspective-api`\
   This is Google's Perspective API for toxicity and other attributes. It has the following parameters:

   | Parameter         | Type   | Default             | Description                                          |
   | ----------------- | ------ | ------------------- | ---------------------------------------------------- |
   | `api_key`         | `str`  | `None`              | Google API key (falls back to `PERSPECTIVE_API_KEY`) |
   | `attributes`      | `dict` | `{"TOXICITY": {}}`  | Attributes to analyse                                |
   | `score_threshold` | `dict` | `{"TOXICITY": 0.5}` | Per-attribute thresholds                             |

   The available attributes include the following:\
   `TOXICITY`, `SEVERE_TOXICITY`, `IDENTITY_ATTACK`,
   `INSULT`, `PROFANITY`, `THREAT`.\
   Using this detection method requires setting up the `PERSPECTIVE_API_KEY` environment variable.

5. `moderation-prompt-engineering`\
   This is an LLM-based moderation classifier served via LiteLLM. It has the following parameters:

   | Parameter         | Type  | Default         | Description                                  |
   | ----------------- | ----- | --------------- | -------------------------------------------- |
   | `hub_name`        | `str` | `"openai"`      | LLM API provider                             |
   | `model_name`      | `str` | `"gpt-4-turbo"` | Model name                                   |
   | `api_key`         | `str` | `None`          | API key (falls back to environment variable) |
   | `max_input_chars` | `int` | `None`          | Truncate input to this many characters       |

6. `moderation-flashtext`\
   This is a keyword ban-list Detector that uses FlashText for fast matching. Its parameters include the following:

   | Parameter           | Type        | Default | Description                                                     |
   | ------------------- | ----------- | ------- | --------------------------------------------------------------- |
   | `banlist_filepaths` | `list[str]` | `None`  | Paths to ban-list files (uses built-in default list if omitted) |
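
Several methods above take per-category or per-attribute thresholds (`score_threshold_dict`, `score_threshold`). The sketch below shows one plausible way such a dictionary could be applied to a set of scores. The `violated_categories` helper and the 0.5 fallback for unlisted categories are assumptions, not documented behavior.

```python
DEFAULT_THRESHOLD = 0.5  # Assumed fallback for categories without a custom threshold


def violated_categories(scores, score_threshold_dict=None, default=DEFAULT_THRESHOLD):
    """Return the categories whose score exceeds their threshold, sorted by name."""
    thresholds = score_threshold_dict or {}
    return sorted(
        category
        for category, score in scores.items()
        if score > thresholds.get(category, default)
    )


scores = {"hate": 0.2, "violence": 0.65, "harassment": 0.4}
strict_on_harassment = {"harassment": 0.3, "violence": 0.8}
```

With these sample scores, the fallback flags only `violence`, while `strict_on_harassment` flags only `harassment`. Per-category thresholds let you tighten one policy area without raising false positives in the others.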

### Privacy

Detection methods under privacy enable Detectors to identify personally identifiable information (PII) and sensitive data in inputs. They include the following:

1. `privacy-presidio`\
   This detection method uses Microsoft's Presidio for PII detection and redaction. It has the following parameters:

   | Parameter          | Type        | Default     | Description                                 |
   | ------------------ | ----------- | ----------- | ------------------------------------------- |
   | `score_threshold`  | `float`     | `0.5`       | Confidence threshold for PII detection      |
   | `anonymize`        | `bool`      | `True`      | Redact detected PII in the response         |
   | `allow_list_files` | `list[str]` | `None`      | Files with values to exclude from detection |
   | `redaction_style`  | `str`       | `"labeled"` | Redaction style: `"labeled"` or `"masked"`  |

2. `detect-secrets`\
   This is a pattern-based secret and credential detection method. It detects API keys, tokens, etc. Its parameters include the following:

   | Parameter | Type   | Default | Description                             |
   | --------- | ------ | ------- | --------------------------------------- |
   | `censor`  | `bool` | `True`  | Censor detected secrets in the response |

   This method includes 25 Detector plugins:\
   ArtifactoryDetector, AWSKeyDetector,
   AzureStorageKeyDetector, BasicAuthDetector, CloudantDetector,
   DiscordBotTokenDetector, GitHubTokenDetector, GitLabTokenDetector,
   IbmCloudIamDetector, IbmCosHmacDetector, IPPublicDetector, JwtTokenDetector,
   KeywordDetector, MailchimpDetector, NpmDetector, OpenAIDetector,
   PrivateKeyDetector, PypiTokenDetector, SendGridDetector, SlackDetector,
   SoftlayerDetector, SquareOAuthDetector, StripeDetector,
   TelegramBotTokenDetector, TwilioKeyDetector.
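
The difference between the `"labeled"` and `"masked"` values of `redaction_style` can be pictured with the sketch below. It is a stand-in built on a single email regex, not Presidio. The exact replacement strings (`<EMAIL_ADDRESS>` and asterisks) are assumptions about what each style produces.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, redaction_style: str = "labeled") -> str:
    """Replace email addresses using an assumed rendering of each style."""
    if redaction_style == "labeled":
        # Keep the entity type visible so downstream readers know what was removed.
        return EMAIL.sub("<EMAIL_ADDRESS>", text)
    if redaction_style == "masked":
        # Hide the value but preserve its length.
        return EMAIL.sub(lambda m: "*" * len(m.group(0)), text)
    raise ValueError(f"unknown redaction_style: {redaction_style}")
```

A labeled style tends to be easier to debug because the entity type survives, while a masked style reveals less about what kind of data was present.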

### Integrity

Detection methods under integrity enable Detectors to identify issues related to the integrity and authenticity of inputs or outputs (hallucinations), such as misinformation, deepfakes, manipulated media, and more. They include the following:

1. `hhem-hallucination`\
   This method uses the Vectara HHEM model for hallucination detection, which compares output against a
   reference context. It has the following parameters:

   | Parameter                             | Type    | Default | Description                          |
   | ------------------------------------- | ------- | ------- | ------------------------------------ |
   | `context`                             | `str`   | `""`    | Reference context to compare against |
   | `factual_consistency_score_threshold` | `float` | `0.5`   | Score below which output is flagged  |
   | `trust_remote_code`                   | `bool`  | `True`  | Trust remote code from model hub     |

2. `fact-check-roberta`\
   This detection method uses a RoBERTa model to detect factual contradictions between output and context. Its parameters include the following:

   | Parameter | Type  | Default | Description                        |
   | --------- | ----- | ------- | ---------------------------------- |
   | `context` | `str` | `""`    | Reference context to check against |

3. `hallucination-llm`\
   This uses LLM-based hallucination detection with reference context. It has the following parameters:

   | Parameter         | Type  | Default         | Description                                   |
   | ----------------- | ----- | --------------- | --------------------------------------------- |
   | `hub_name`        | `str` | `"openai"`      | LLM API provider                              |
   | `model_name`      | `str` | `"gpt-4-turbo"` | Model name                                    |
   | `api_key`         | `str` | `None`          | API key (falls back to environment variable)  |
   | `max_input_chars` | `int` | `None`          | Truncate input to this many characters        |
   | `context`         | `str` | `None`          | Reference context for comparison              |

4. `fact-check-llm`\
   This method uses an LLM for fact-checking with reference context. Its parameters include the following:

   | Parameter         | Type  | Default         | Description                                   |
   | ----------------- | ----- | --------------- | --------------------------------------------- |
   | `hub_name`        | `str` | `"openai"`      | LLM API provider                              |
   | `model_name`      | `str` | `"gpt-4-turbo"` | Model name                                    |
   | `api_key`         | `str` | `None`          | API key (falls back to environment variable)  |
   | `max_input_chars` | `int` | `None`          | Truncate input to this many characters        |
   | `context`         | `str` | `None`          | Reference context for comparison              |
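
Because the integrity methods compare output against a reference, the `context` parameter usually has to be supplied. The dictionary below follows the same shape as the prompt-injection example earlier on this page. The `"output-guards"` key, the `"integrity"` type string, and the `"hallucination"` guard name are assumptions that mirror that example and are not confirmed by this page.

```python
# Reference text that the model output will be checked against.
reference_context = "The refund window is 30 days from the delivery date."

config = {
    "output-guards": ["hallucination"],  # Assumed counterpart of "input-guards"
    "hallucination": {
        "type": "integrity",  # Assumed to match the category name
        "methods": ["hhem-hallucination"],
        "hhem-hallucination": {
            "context": reference_context,
            # Outputs scoring below this value are flagged.
            "factual_consistency_score_threshold": 0.5,
        },
    },
}
```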

### Generic

Detection methods under generic are versatile and can be customized and applied to a wide range of issues beyond the specific categories above. They include the following:

1. `generic-llm`\
   This method offers custom LLM-based detection with user-provided system prompts and trigger words. It can be used for various detection needs by tailoring the prompt and trigger words accordingly. Its parameters include the following:

   | Parameter             | Type        | Default         | Description                                    |
   | --------------------- | ----------- | --------------- | ---------------------------------------------- |
   | `sys_prompt_template` | `str`       | *(required)*    | System prompt with `$query_string` placeholder |
   | `trigger_word_list`   | `list[str]` | *(required)*    | Words in LLM response that indicate a hit      |
   | `hub_name`            | `str`       | `"openai"`      | LLM API provider                               |
   | `model_name`          | `str`       | `"gpt-4-turbo"` | Model name                                     |
   | `api_key`             | `str`       | `None`          | API key (falls back to environment variable)   |
   | `max_input_chars`     | `int`       | `None`          | Truncate input to this many characters         |

2. `policy-gpt-oss-safeguard`\
   This is a policy-based content classifier that uses GPT-OSS-Safeguard. It classifies inputs based on user-provided policy rules and returns the violated policy reference. Its parameters include the following:

   | Parameter          | Type  | Default                          | Description                                       |
   | ------------------ | ----- | -------------------------------- | ------------------------------------------------- |
   | `policy_file`      | `str` | *(required)*                     | Path to policy file with classification rules     |
   | `hub_name`         | `str` | `"groq"`                         | LLM API provider                                  |
   | `model_name`       | `str` | `"openai/gpt-oss-safeguard-20b"` | Model name                                        |
   | `output_format`    | `str` | `"policy_ref"`                   | `"binary"`, `"policy_ref"`, or `"with_rationale"` |
   | `reasoning_effort` | `str` | `"medium"`                       | `"low"`, `"medium"`, or `"high"`                  |
   | `api_key`          | `str` | `None`                           | API key (falls back to environment variable)      |
   | `timeout`          | `int` | `60`                             | Request timeout in seconds                        |
   | `max_retries`      | `int` | `3`                              | Maximum retry attempts                            |
   | `max_input_chars`  | `int` | `None`                           | Truncate input to this many characters            |
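
To show how `sys_prompt_template` and `trigger_word_list` fit together for `generic-llm`, the sketch below fills the `$query_string` placeholder with Python's `string.Template` and checks a response for trigger words. It assumes Dome's substitution behaves like `string.Template` and that matching is a case-insensitive substring test. The prompt text and both helpers are hypothetical, and no LLM is called.

```python
from string import Template

sys_prompt_template = (
    "You are a compliance reviewer. Answer FLAG if the text below asks for "
    "medical dosage advice, otherwise answer PASS.\n\nText: $query_string"
)
trigger_word_list = ["FLAG"]


def build_prompt(template: str, query: str) -> str:
    """Insert the user input into the $query_string placeholder."""
    return Template(template).safe_substitute(query_string=query)


def is_hit(llm_response: str, trigger_words: list[str]) -> bool:
    """True if any trigger word appears in the LLM response."""
    response = llm_response.lower()
    return any(word.lower() in response for word in trigger_words)
```

Choose trigger words that cannot appear in a benign answer. With substring matching, a trigger word like "no" would also match "unknown".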

## Next Steps

<CardGroup cols={2}>
  <Card title="Guard" icon="shield" href="/concepts/defense/guard">
    How Detectors compose into Guards
  </Card>

  <Card title="Guardrail" icon="train-track" href="/concepts/defense/guardrail">
    How Guards compose into pipelines
  </Card>

  <Card title="Custom Detectors" icon="code" href="/tutorials/protect-agents/custom-detectors">
    Build your own Detectors
  </Card>

  <Card title="How Defense Works" icon="git-merge" href="/concepts/defense/introduction">
    The full defense architecture
  </Card>
</CardGroup>
