
Security

Detectors that identify adversarial inputs such as prompt injections, jailbreak attempts, and encoded/obfuscated payloads.

prompt-injection-deberta-v3-base

DeBERTa v3 model for prompt injection detection.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window (DeBERTa limit) |
| window_stride | int | 256 | Token step size between sliding windows |

prompt-injection-deberta-finetuned-11122024

Vijil-finetuned DeBERTa model for prompt injection detection.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window (DeBERTa limit) |
| window_stride | int | 256 | Token step size between sliding windows |

prompt-injection-mbert

Vijil ModernBERT model for prompt injection detection. Supports up to 8,192 tokens natively, so sliding windows only activate for very long inputs.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Injection probability above which input is flagged |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window |
| window_stride | int | 4096 | Token step size between sliding windows |

prompt-injection-mbert-safeguard

API-only prompt injection detection backed by an OpenAI-compatible chat completions endpoint. It defaults to GPT-OSS-Safeguard-20B on Groq (~200 ms, high accuracy), but base_url, model, and api_key_name can be overridden to point at any OpenAI-style deployment (local vLLM, Together, Fireworks, OpenAI itself, etc.). No ModernBERT model is loaded. Oversize inputs are truncated (not chunked) to max_input_chars before being sent; the default Groq model's ~130K-token context window makes truncation a rare safety net.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget; must leave room for reasoning tokens (see note below) |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the request (pass None to disable) |
Note on max_tokens: gpt-oss-safeguard-20b is a reasoning model that consumes part of its token budget on internal reasoning before emitting any assistant content. Setting max_tokens too low (e.g. 8) causes the response to hit finish_reason=length with an empty content field, which the detector silently classifies as safe. Keep this generous.
  • Class: PImbertSafeguard
  • Requires: the env var named by api_key_name (defaults to GROQ_API_KEY)

prompt-injection-mbert-hybrid

Two-stage detector: ModernBERT classifies first, and low-confidence predictions are escalated to GPT-OSS-Safeguard-20B. ~5ms average latency, near-100% accuracy, API cost only on uncertain examples. Accepts all parameters from both prompt-injection-mbert and prompt-injection-mbert-safeguard.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| confidence_threshold | float | 0.85 | Fast-stage confidence below which the input is escalated to Safeguard |
| score_threshold | float | 0.5 | Injection probability threshold (fast stage) |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window (fast stage) |
| window_stride | int | 4096 | Token step size between sliding windows |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget for the Safeguard escalation |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the escalation request |
If GROQ_API_KEY is not set, the hybrid mode silently falls back to fast-only classification instead of failing.
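The two-stage routing described above can be sketched as follows; `hybrid_classify` is a hypothetical illustration of the control flow, with the fast classifier and the Safeguard call passed in as callables:

```python
import os


def hybrid_classify(text, fast_classify, escalate,
                    confidence_threshold=0.85, api_key_name="GROQ_API_KEY"):
    """Hypothetical sketch of the hybrid routing logic.

    fast_classify(text) -> (label, confidence) is the ModernBERT stage;
    escalate(text) -> label is the Safeguard API stage.
    """
    label, confidence = fast_classify(text)
    if confidence >= confidence_threshold:
        return label  # fast path: confident local verdict, no API cost
    if not os.environ.get(api_key_name):
        return label  # no API key: silently fall back to the fast verdict
    return escalate(text)  # uncertain: ask the Safeguard endpoint
```

Only low-confidence predictions ever reach the network, which is why the average latency stays in the single-digit-millisecond range while accuracy approaches the API-only detector's.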

security-promptguard

Meta Prompt Guard model for jailbreak and prompt injection detection.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Jailbreak probability threshold |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window |
| window_stride | int | 256 | Token step size between sliding windows |

security-llm

LLM-based security classification via LiteLLM.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to env var) |
| max_input_chars | int | None | Truncate input to this many characters |
  • Class: LlmSecurity

security-embeddings

Jailbreak detection via embedding similarity against a known-jailbreak corpus.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engine | str | "SentenceTransformers" | Embedding engine |
| model | str | "all-MiniLM-L6-v2" | Embedding model name |
| threshold | float | 0.7 | Similarity threshold |
| in_mem | bool | True | Load embeddings in memory |

jb-length-per-perplexity

Perplexity-based heuristic that flags jailbreaks by their length-to-perplexity ratio.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "gpt2-large" | HuggingFace model for perplexity |
| batch_size | int | 16 | Batch size |
| stride_length | int | 512 | Stride for perplexity calculation |
| threshold | float | 89.79 | Length-per-perplexity threshold |
  • Class: LengthPerPerplexityModel
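The heuristic reduces to a single ratio. The sketch below is a hypothetical reading of the rule, with the perplexity value supplied externally (the real detector computes it with the configured HuggingFace model); the direction of the comparison is an assumption, so consult the detector source for the exact rule:

```python
def length_per_perplexity_flag(text: str, perplexity: float,
                               threshold: float = 89.79) -> bool:
    """Hypothetical sketch: flag an input when its length-to-perplexity
    ratio crosses the threshold. Long inputs that the language model
    finds easy to predict (low perplexity) drive the ratio up."""
    ratio = len(text) / perplexity
    return ratio > threshold
```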

jb-prefix-suffix-perplexity

Perplexity-based heuristic that analyses the prefix and suffix of inputs separately.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "gpt2-large" | HuggingFace model for perplexity |
| batch_size | int | 16 | Batch size |
| stride_length | int | 512 | Stride for perplexity calculation |
| prefix_threshold | float | 1845.65 | Prefix perplexity threshold |
| suffix_threshold | float | 1845.65 | Suffix perplexity threshold |
| prefix_length | int | 20 | Number of prefix words to analyse |
| suffix_length | int | 20 | Number of suffix words to analyse |
  • Class: PrefixSuffixPerplexityModel

encoding-heuristics

Rule-based detector for encoded or obfuscated payloads (base64, ROT13, hex, URL encoding, Unicode tricks, etc.).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold_map | dict | (see below) | Per-encoding-type thresholds |

Default threshold_map:

| Encoding Type | Threshold |
| --- | --- |
| base64 | 0.7 |
| rot13 | 0.7 |
| ascii_escape | 0.05 |
| hex_encoding | 0.15 |
| url_encoding | 0.15 |
| cyrillic_homoglyphs | 0.05 |
| mixed_scripts | 0.05 |
| zero_width | 0.01 |
| excessive_whitespace | 0.4 |
  • Class: EncodingHeuristicsDetector
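To make the threshold semantics concrete, here is a hypothetical sketch of one threshold_map entry, zero_width, implemented as a character-density check. The helper names and the exact density rule are illustrative assumptions, not the library's implementation:

```python
# Common zero-width code points used to smuggle hidden instructions.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def zero_width_ratio(text: str) -> float:
    """Fraction of characters in the input that are zero-width."""
    if not text:
        return 0.0
    return sum(c in ZERO_WIDTH for c in text) / len(text)


def flag_zero_width(text: str, threshold: float = 0.01) -> bool:
    # Hypothetical reading of threshold_map["zero_width"] = 0.01:
    # flag when zero-width density exceeds 1% of the input.
    return zero_width_ratio(text) > threshold
```

The very low defaults for zero_width, mixed_scripts, and cyrillic_homoglyphs reflect that even a handful of such characters in ordinary text is suspicious, while base64-like character runs need a much higher density before flagging.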

Moderation

Detectors for toxic, harmful, or otherwise inappropriate content.

moderation-deberta

DeBERTa model for toxicity scoring. The 208-token context window means the sliding window activates for most non-trivial inputs.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 208 | Maximum tokens per window |
| window_stride | int | 104 | Token step size between sliding windows |
| device | str | None | Torch device (auto-selects CUDA if available) |

moderation-mbert

Vijil ModernBERT model for toxic content detection. Supports up to 8,192 tokens natively.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Toxicity probability threshold |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window |
| window_stride | int | 4096 | Token step size between sliding windows |

moderation-mbert-safeguard

API-only toxicity / moderation detection backed by an OpenAI-compatible chat completions endpoint. It defaults to GPT-OSS-Safeguard-20B on Groq (~200 ms, high accuracy), but base_url, model, and api_key_name can be overridden to point at any OpenAI-style deployment (local vLLM, Together, Fireworks, OpenAI itself, etc.). No ModernBERT model is loaded. Oversize inputs are truncated (not chunked) to max_input_chars before being sent; the default Groq model's ~130K-token context window makes truncation a rare safety net.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget; must leave room for reasoning tokens (see note below) |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the request (pass None to disable) |
Note on max_tokens: gpt-oss-safeguard-20b is a reasoning model that consumes part of its token budget on internal reasoning before emitting any assistant content. Setting max_tokens too low (e.g. 8) causes the response to hit finish_reason=length with an empty content field, which the detector silently classifies as safe. Keep this generous.
  • Class: ModerationMbertSafeguard
  • Requires: the env var named by api_key_name (defaults to GROQ_API_KEY)

moderation-mbert-hybrid

Two-stage detector: ModernBERT classifies first, and low-confidence predictions are escalated to GPT-OSS-Safeguard-20B. ~5ms average latency, near-100% accuracy, API cost only on uncertain examples. Accepts all parameters from both moderation-mbert and moderation-mbert-safeguard.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| confidence_threshold | float | 0.85 | Fast-stage confidence below which the input is escalated to Safeguard |
| score_threshold | float | 0.5 | Toxicity probability threshold (fast stage) |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window (fast stage) |
| window_stride | int | 4096 | Token step size between sliding windows |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget for the Safeguard escalation |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the escalation request |
If GROQ_API_KEY is not set, the hybrid mode silently falls back to fast-only classification instead of failing.

moderations-oai-api

OpenAI Moderation API with per-category score thresholds.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold_dict | dict | None | Custom thresholds per category |
Supported categories: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, violence/graphic, harassment, harassment/threatening, illegal, illicit, self-harm/intent, self-harm/instructions, sexual/instructions.
  • Class: OpenAIModerations
  • Requires: OPENAI_API_KEY environment variable

moderation-perspective-api

Google Perspective API for toxicity and other attributes.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | Google API key (falls back to PERSPECTIVE_API_KEY) |
| attributes | dict | {"TOXICITY": {}} | Attributes to analyse |
| score_threshold | dict | {"TOXICITY": 0.5} | Per-attribute thresholds |
Available attributes: TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT.
  • Class: PerspectiveAPI
  • Requires: PERSPECTIVE_API_KEY environment variable

moderation-prompt-engineering

LLM-based moderation classification via LiteLLM.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to env var) |
| max_input_chars | int | None | Truncate input to this many characters |
  • Class: LlmModerations

moderation-flashtext

Keyword ban-list detector using FlashText for fast matching.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| banlist_filepaths | list[str] | None | Paths to ban-list files (uses built-in default list if omitted) |
  • Class: KWBanList

stereotype-eeoc-fast

Vijil ModernBERT classifier for stereotypes and harmful generalizations about EEOC protected classes (Race/Color, Sex/Gender/Sexual Orientation, Religion, National Origin, Age 40+, Disability). Distilled from GPT-OSS-Safeguard-20B against a custom EEOC discrimination policy. Self-hosted, < 5 ms latency, F1 = 0.923, zero API cost.

The detector flags stereotyping within a single prompt or response. It does not detect counterfactual bias (whether varying only the protected class in a prompt produces different outputs); that requires comparing pairs of prompt-response outputs and is out of scope.

When given a DomePayload with both prompt and response, the detector reconstructs the training format (prompt [SEP] response). When only text is set, it is treated as the prompt half with an empty response. Inputs longer than max_length are split into multiple [SEP]-centered chunks; any flagged chunk flags the whole input, and the maximum score wins.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Stereotype probability threshold |
| max_length | int | 512 | Maximum tokens per chunk |
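The input formatting and chunk aggregation described above can be sketched as follows; both helper names are hypothetical illustrations of the stated rules:

```python
def build_input(prompt: str, response: str = "") -> str:
    # Training format: "prompt [SEP] response". Bare text is treated
    # as the prompt half with an empty response.
    return f"{prompt} [SEP] {response}"


def aggregate_chunks(chunk_scores, score_threshold=0.5):
    """Hypothetical sketch of the aggregation rule: any flagged chunk
    flags the whole input, and the maximum score is reported."""
    top = max(chunk_scores)
    return top > score_threshold, top
```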

stereotype-eeoc-safeguard

API-only EEOC stereotype detection backed by an OpenAI-compatible chat completions endpoint. It defaults to GPT-OSS-Safeguard-20B on Groq (~200 ms, ~100% accuracy), but base_url, model, and api_key_name can be overridden to point at any OpenAI-style deployment (local vLLM, Together, Fireworks, OpenAI itself, etc.). No ModernBERT model is loaded. Oversize inputs are truncated (not chunked) to max_input_chars before being sent; the default Groq model's ~130K-token context window makes truncation a rare safety net.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget; must leave room for reasoning tokens (see note below) |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the request (pass None to disable) |
Note on max_tokens: gpt-oss-safeguard-20b is a reasoning model that consumes part of its token budget on internal reasoning before emitting any assistant content. Setting max_tokens too low (e.g. 8) causes the response to hit finish_reason=length with an empty content field, which the detector silently classifies as safe. Keep this generous.
  • Class: StereotypeEEOCSafeguard
  • Requires: the env var named by api_key_name (defaults to GROQ_API_KEY)

stereotype-eeoc-hybrid

Two-stage detector: ModernBERT classifies first, and low-confidence predictions are escalated to GPT-OSS-Safeguard-20B. ~5ms average latency, near-100% accuracy, API cost only on uncertain examples. Accepts all parameters from both stereotype-eeoc-fast and stereotype-eeoc-safeguard.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| confidence_threshold | float | 0.85 | Fast-stage confidence below which the input is escalated to Safeguard |
| score_threshold | float | 0.5 | Stereotype probability threshold (fast stage) |
| max_length | int | 512 | Maximum tokens per chunk (fast stage) |
| api_key | str | None | API key; falls back to the env var named by api_key_name |
| api_key_name | str | "GROQ_API_KEY" | Name of the env var to read the API key from when api_key is not supplied |
| base_url | str | "https://api.groq.com/openai/v1" | OpenAI-compatible base URL (/chat/completions is appended) |
| model | str | "openai/gpt-oss-safeguard-20b" | Model ID sent to the endpoint |
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2000 | Response token budget for the Safeguard escalation |
| reasoning_effort | str \| None | "low" | Passed through when non-null; set to null (JSON) / None (Python) to omit (required for non-reasoning models like gpt-4o-mini) |
| timeout_seconds | float | 10.0 | Request timeout |
| max_input_chars | int | 400000 | Character cap applied before the escalation request |
If GROQ_API_KEY is not set, the hybrid mode silently falls back to fast-only classification instead of failing.
  • Class: StereotypeEEOCHybrid
  • Model: vijil/stereotype-eeoc-detector
  • Requires: the env var named by api_key_name (defaults to GROQ_API_KEY); optional — falls back to fast-only if absent

Privacy

Detectors for personally identifiable information (PII) and secrets.

privacy-presidio

Microsoft Presidio-based PII detection and redaction.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Confidence threshold for PII detection |
| anonymize | bool | True | Redact detected PII in the response |
| allow_list_files | list[str] | None | Files with values to exclude from detection |
| redaction_style | str | "labeled" | Redaction style: "labeled" or "masked" |
  • Class: PresidioDetector
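The difference between the two redaction styles can be sketched as below; `redact` is a hypothetical helper (the real detector delegates detection and anonymization to Presidio) showing how a detected span would be rewritten under each style:

```python
def redact(text: str, start: int, end: int, entity_type: str,
           redaction_style: str = "labeled") -> str:
    """Hypothetical sketch of the two redaction styles applied to a
    single detected PII span [start, end)."""
    if redaction_style == "labeled":
        replacement = f"<{entity_type}>"   # e.g. "<PERSON>"
    else:  # "masked"
        replacement = "*" * (end - start)  # same-length mask
    return text[:start] + replacement + text[end:]
```

"labeled" preserves what kind of entity was removed (useful for downstream auditing), while "masked" hides the entity type but keeps the original text length.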

detect-secrets

Pattern-based secret and credential detection (API keys, tokens, etc.).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| censor | bool | True | Censor detected secrets in the response |
Includes 25 detector plugins: ArtifactoryDetector, AWSKeyDetector, AzureStorageKeyDetector, BasicAuthDetector, CloudantDetector, DiscordBotTokenDetector, GitHubTokenDetector, GitLabTokenDetector, IbmCloudIamDetector, IbmCosHmacDetector, IPPublicDetector, JwtTokenDetector, KeywordDetector, MailchimpDetector, NpmDetector, OpenAIDetector, PrivateKeyDetector, PypiTokenDetector, SendGridDetector, SlackDetector, SoftlayerDetector, SquareOAuthDetector, StripeDetector, TelegramBotTokenDetector, TwilioKeyDetector.
  • Class: SecretDetector
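As a simplified stand-in for one of the bundled plugins, the sketch below censors AWS access key IDs (which follow the well-known `AKIA` + 16 uppercase-alphanumeric pattern). The function name and mask string are illustrative assumptions, not the library's actual output format:

```python
import re

# Simplified stand-in for AWSKeyDetector: AWS access key IDs are
# "AKIA" followed by 16 characters from [0-9A-Z].
AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")


def censor_secrets(text: str) -> str:
    """Hypothetical sketch of censor=True: replace each detected
    secret with a fixed mask in the returned response."""
    return AWS_KEY_RE.sub("******", text)
```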

Integrity

Detectors for hallucinations and factual accuracy. These typically require a reference context to compare against.

hhem-hallucination

Vectara HHEM model for hallucination detection by comparing output against a reference context.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context | str | "" | Reference context to compare against |
| factual_consistency_score_threshold | float | 0.5 | Score below which output is flagged |
| trust_remote_code | bool | True | Trust remote code from model hub |

fact-check-roberta

RoBERTa model for detecting factual contradictions between output and context.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context | str | "" | Reference context to check against |

hallucination-llm

LLM-based hallucination detection with reference context.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to env var) |
| max_input_chars | int | None | Truncate input to this many characters |
| context | str | None | Reference context for comparison |
  • Class: LlmHallucination

fact-check-llm

LLM-based fact-checking with reference context.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to env var) |
| max_input_chars | int | None | Truncate input to this many characters |
| context | str | None | Reference context for comparison |
  • Class: LlmFactcheck

Generic

Flexible detectors that can be customised for arbitrary use cases.

generic-llm

Custom LLM-based detection with user-provided system prompts and trigger words.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sys_prompt_template | str | (required) | System prompt with $query_string placeholder |
| trigger_word_list | list[str] | (required) | Words in LLM response that indicate a hit |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to env var) |
| max_input_chars | int | None | Truncate input to this many characters |
  • Class: GenericLLMDetector
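The template substitution and trigger-word matching can be sketched as follows. The $query_string placeholder is standard `string.Template` syntax; the helper names and the case-insensitive substring matching are illustrative assumptions:

```python
from string import Template


def build_prompt(sys_prompt_template: str, query: str) -> str:
    # The template must contain a $query_string placeholder.
    return Template(sys_prompt_template).substitute(query_string=query)


def is_hit(llm_response: str, trigger_word_list: list) -> bool:
    """Hypothetical sketch: the detection fires when any trigger word
    appears (case-insensitively) in the LLM's response."""
    lowered = llm_response.lower()
    return any(word.lower() in lowered for word in trigger_word_list)
```

For example, a template like `"Answer UNSAFE or SAFE: $query_string"` paired with `trigger_word_list=["unsafe"]` turns any LLM that follows the instruction into a binary detector.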

policy-gpt-oss-safeguard

Policy-based content classification using GPT-OSS-Safeguard.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| policy_file | str | (required) | Path to policy file with classification rules |
| hub_name | str | "groq" | LLM API provider |
| model_name | str | "openai/gpt-oss-safeguard-20b" | Model name |
| output_format | str | "policy_ref" | "binary", "policy_ref", or "with_rationale" |
| reasoning_effort | str | "medium" | "low", "medium", or "high" |
| api_key | str | None | API key (falls back to env var) |
| timeout | int | 60 | Request timeout in seconds |
| max_retries | int | 3 | Maximum retry attempts |
| max_input_chars | int | None | Truncate input to this many characters |
  • Class: PolicyGptOssSafeguard

Sliding Window Behaviour

HuggingFace-based detectors (DeBERTa, ModernBERT, PromptGuard) use a sliding window to handle inputs longer than their max_length. Key points:
  • Fast path: inputs that fit in a single window are processed unchanged.
  • Overlap: window_stride < usable window size creates overlapping windows, ensuring content at boundaries is not missed.
  • Aggregation: any window flagged as unsafe causes the entire input to be flagged (any-positive strategy). For score-based detectors, the maximum score across windows is reported.
  • Batch processing: detect_batch() flattens all chunks from all inputs into a single pipeline call, then re-aggregates results per input.
The window_stride parameter is configurable per detector via TOML or dict config.
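The chunking and aggregation rules above can be sketched as follows; both function names are hypothetical illustrations, with windows expressed over token lists for simplicity:

```python
def sliding_windows(tokens: list, max_length: int, window_stride: int) -> list:
    """Hypothetical sketch of the chunking rule: overlapping windows
    of max_length tokens, stepping window_stride tokens each time."""
    if len(tokens) <= max_length:
        return [tokens]  # fast path: single window, input unchanged
    windows = []
    for start in range(0, len(tokens), window_stride):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # last window already covers the tail
    return windows


def aggregate(window_scores: list, score_threshold: float = 0.5):
    # Any-positive strategy: one unsafe window flags the whole input;
    # the maximum score across windows is reported.
    top = max(window_scores)
    return top > score_threshold, top
```

With window_stride set to half of max_length (the defaults throughout this page), consecutive windows overlap by 50%, so a payload straddling a window boundary still appears whole in at least one window.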

Configuration

Parameters are passed under the method name in your guard configuration:

TOML

```toml
[prompt-injection]
type = "security"
methods = ["prompt-injection-deberta-v3-base"]

[prompt-injection.prompt-injection-deberta-v3-base]
window_stride = 128  # More overlap for thorough detection

[input-toxicity]
type = "moderation"
methods = ["moderation-mbert"]

[input-toxicity.moderation-mbert]
score_threshold = 0.7

[output-safety]
type = "security"
methods = ["security-llm"]

[output-safety.security-llm]
max_input_chars = 50000
model_name = "gpt-4o"
```

Python dict

```python
config = {
    "input-guards": ["prompt-injection"],
    "prompt-injection": {
        "type": "security",
        "methods": ["prompt-injection-deberta-v3-base"],
        "prompt-injection-deberta-v3-base": {
            "window_stride": 128,
        },
    },
}
```
Last modified on April 20, 2026