
Detection Methods

Vijil Dome has built-in detection methods that give Detectors their ability to identify issues. You reference these methods when you Configure Guardrails using a TOML file or a dictionary.
The detection methods are grouped under these five categories:
  • Security
  • Moderation
  • Privacy
  • Integrity
  • Generic
For each method, this page describes the model or service powering it and all of its configurable parameters. When Configuring Dome, parameters are passed as key-value pairs under the detection method, as in this example:
[prompt-injection]
type = "security"
methods = ["prompt-injection-mbert"]
# Configuring a parameter
[prompt-injection.prompt-injection-mbert]
window_stride = 128  # More overlap for thorough detection
The corresponding dictionary config looks like this:
config = {
    "input-guards": ["prompt-injection"],
    "prompt-injection": {
        "type": "security",
        "methods": ["prompt-injection-mbert"],
        # Configuring a parameter
        "prompt-injection-mbert": {
            "window_stride": 128,
        },
    },
}
Now that you have seen how parameters are configured, you can dive into the detection methods.

Security

The detection methods under security give Detectors the ability to detect adversarial inputs like prompt injections, jailbreak attempts, and encoded/obfuscated payloads. They include the following:
  1. prompt-injection-mbert
    This is Vijil’s ModernBERT model for prompt injection detection. It supports up to 8,192 tokens natively, so sliding windows only activate for very long inputs. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    score_threshold | float | 0.5 | Injection probability above which input is flagged
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 8192 | Maximum tokens per window
    window_stride | int | 4096 | Token step size between sliding windows
  2. prompt-injection-deberta-finetuned-11122024
    This is a Vijil-finetuned DeBERTa model for prompt injection detection. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 512 | Maximum tokens per window (DeBERTa limit)
    window_stride | int | 256 | Token step size between sliding windows
  3. prompt-injection-deberta-v3-base
    This is a DeBERTa v3 model for prompt injection detection. It has the following configurable parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 512 | Maximum tokens per window (DeBERTa limit)
    window_stride | int | 256 | Token step size between sliding windows
  4. security-promptguard
    This is the Meta Prompt Guard model for jailbreak and prompt injection detection. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    score_threshold | float | 0.5 | Jailbreak probability threshold
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 512 | Maximum tokens per window
    window_stride | int | 256 | Token step size between sliding windows
  5. security-llm
    This is an LLM-based security classification model served via LiteLLM. Its configurable parameters include:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    hub_name | str | "openai" | LLM API provider
    model_name | str | "gpt-4-turbo" | Model name
    api_key | str | None | API key (falls back to env var)
    max_input_chars | int | None | Truncate input to this many characters
  6. security-embeddings
    This provides jailbreak detection via embedding similarity against a known-jailbreak corpus. It supports various embedding engines and models. Its parameters include:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    engine | str | "SentenceTransformers" | Embedding engine
    model | str | "all-MiniLM-L6-v2" | Embedding model name
    threshold | float | 0.7 | Similarity threshold
    in_mem | bool | True | Load embeddings in memory
  7. jb-length-per-perplexity
    This is a perplexity-based heuristic that flags jailbreaks by their length-to-perplexity ratio. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    model_id | str | "gpt2-large" | HuggingFace model for perplexity
    batch_size | int | 16 | Batch size
    stride_length | int | 512 | Stride for perplexity calculation
    threshold | float | 89.79 | Length-per-perplexity threshold
  8. jb-prefix-suffix-perplexity
    This is a perplexity-based heuristic that analyses the prefix and suffix of inputs separately. It flags jailbreaks by their prefix and suffix perplexity scores. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    model_id | str | "gpt2-large" | HuggingFace model for perplexity
    batch_size | int | 16 | Batch size
    stride_length | int | 512 | Stride for perplexity calculation
    prefix_threshold | float | 1845.65 | Prefix perplexity threshold
    suffix_threshold | float | 1845.65 | Suffix perplexity threshold
    prefix_length | int | 20 | Number of prefix words to analyse
    suffix_length | int | 20 | Number of suffix words to analyse
  9. encoding-heuristics
    This is a rule-based detector for encoded or obfuscated payloads (base64, ROT13, hex, URL encoding, Unicode tricks, etc.). It flags inputs as suspicious based on the presence of encoding patterns and their proportion in the text. Its parameters include:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    threshold_map | dict | (see below) | Per-encoding-type thresholds
    Default threshold_map:
    Encoding Type | Threshold
    ------------- | ---------
    base64 | 0.7
    rot13 | 0.7
    ascii_escape | 0.05
    hex_encoding | 0.15
    url_encoding | 0.15
    cyrillic_homoglyphs | 0.05
    mixed_scripts | 0.05
    zero_width | 0.01
    excessive_whitespace | 0.4
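As a rough sketch of how proportion-based flagging works, here is a simplified base64 check in pure Python. The regex and example strings are illustrative assumptions, not Dome's actual patterns; the idea is that the share of the text covered by encoding-like runs is compared against the threshold for that encoding type:

```python
import re

# Illustrative stand-in pattern: long runs of base64-alphabet characters.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")

def base64_proportion(text):
    """Fraction of the text covered by base64-looking runs."""
    if not text:
        return 0.0
    matched = sum(len(m) for m in BASE64_RUN.findall(text))
    return matched / len(text)

def flag_base64(text, threshold=0.7):
    """Flag the input when base64-looking content exceeds the threshold share."""
    return base64_proportion(text) >= threshold

encoded = "aGVsbG8gd29ybGQsIHRoaXMgaXMgYSBsb25nIHBheWxvYWQ="
print(flag_base64(encoded))                   # one long base64 run → True
print(flag_base64("just a normal sentence"))  # no encoding patterns → False
```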

Moderation

Detection methods under moderation enable Detectors to identify content that violates content policies, such as hate speech, violence, adult content, toxic content, and more. They include the following:
  1. moderation-mbert
    This is Vijil’s ModernBERT model for toxic content detection. It supports up to 8,192 tokens natively. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    score_threshold | float | 0.5 | Toxicity probability threshold
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 8192 | Maximum tokens per window
    window_stride | int | 4096 | Token step size between sliding windows
  2. moderations-oai-api
    This is OpenAI’s Moderation API with per-category score thresholds. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    score_threshold_dict | dict | None | Custom thresholds per category
    Supported categories include:
    hate, hate/threatening, self-harm, sexual, sexual/minors, violence, violence/graphic, harassment, harassment/threatening, illegal, illicit, self-harm/intent, self-harm/instructions, sexual/instructions.
    This detection method requires you to set up the OPENAI_API_KEY environment variable.
  3. moderation-deberta
    This is a DeBERTa model for toxicity scoring. The 208-token context window means the sliding window activates for most non-trivial inputs. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    truncation | bool | True | Truncate inputs exceeding max_length
    max_length | int | 208 | Maximum tokens per window
    window_stride | int | 104 | Token step size between sliding windows
    device | str | None | Torch device (auto-selects CUDA if available)
  4. moderation-perspective-api
    This is Google’s Perspective API for toxicity and other attributes. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    api_key | str | None | Google API key (falls back to PERSPECTIVE_API_KEY)
    attributes | dict | {"TOXICITY": {}} | Attributes to analyse
    score_threshold | dict | {"TOXICITY": 0.5} | Per-attribute thresholds
    The available attributes include the following:
    TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT.
    Using this detection method requires setting up the PERSPECTIVE_API_KEY environment variable.
  5. moderation-prompt-engineering
    This is an LLM-based moderation classifier served via LiteLLM. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    hub_name | str | "openai" | LLM API provider
    model_name | str | "gpt-4-turbo" | Model name
    api_key | str | None | API key (falls back to environment variable)
    max_input_chars | int | None | Truncate input to this many characters
  6. moderation-flashtext
    This is a keyword ban-list detector that uses FlashText for fast matching. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    banlist_filepaths | list[str] | None | Paths to ban-list files (uses built-in default list if omitted)
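FlashText performs fast whole-word keyword matching over large ban lists. The core idea can be sketched with a simplified stdlib stand-in (the ban-list words below are made up for illustration, and real FlashText is far faster on large lists):

```python
def build_banlist(words):
    """Normalize ban-list entries for case-insensitive whole-word matching."""
    return {w.lower() for w in words}

def find_banned(text, banlist):
    """Return banned words present in the text (whole words, case-insensitive)."""
    tokens = {t.strip(".,!?;:").lower() for t in text.split()}
    return sorted(tokens & banlist)

banlist = build_banlist(["forbidden", "secret"])
print(find_banned("This contains a FORBIDDEN word.", banlist))  # → ['forbidden']
print(find_banned("Nothing to see here.", banlist))             # → []
```

Unlike a naive substring search, whole-word matching avoids false positives such as flagging "secretary" for the ban-list entry "secret".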

Privacy

Detection methods under privacy enable Detectors to identify personally identifiable information (PII) and sensitive data in inputs. They include the following:
  1. privacy-presidio
    This detection method uses Microsoft’s Presidio-based PII detection and redaction. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    score_threshold | float | 0.5 | Confidence threshold for PII detection
    anonymize | bool | True | Redact detected PII in the response
    allow_list_files | list[str] | None | Files with values to exclude from detection
    redaction_style | str | "labeled" | Redaction style: "labeled" or "masked"
  2. detect-secrets
    This is a pattern-based secret and credential detection method. It detects API keys, tokens, etc. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    censor | bool | True | Censor detected secrets in the response
    This method includes 25 detector plugins:
    ArtifactoryDetector, AWSKeyDetector, AzureStorageKeyDetector, BasicAuthDetector, CloudantDetector, DiscordBotTokenDetector, GitHubTokenDetector, GitLabTokenDetector, IbmCloudIamDetector, IbmCosHmacDetector, IPPublicDetector, JwtTokenDetector, KeywordDetector, MailchimpDetector, NpmDetector, OpenAIDetector, PrivateKeyDetector, PypiTokenDetector, SendGridDetector, SlackDetector, SoftlayerDetector, SquareOAuthDetector, StripeDetector, TelegramBotTokenDetector, TwilioKeyDetector.
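As an illustration of the pattern-based approach, here is a toy check in the spirit of AWSKeyDetector (AWS access key IDs begin with AKIA followed by 16 uppercase alphanumeric characters). This sketch is not the detect-secrets implementation, and the key below is AWS's documented example key, not a real credential:

```python
import re

# Toy pattern for AWS access key IDs: "AKIA" plus 16 uppercase alphanumerics.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def censor_secrets(text):
    """Replace anything matching the toy AWS key pattern with a placeholder."""
    return AWS_KEY.sub("[SECRET]", text)

leaked = "my key is AKIAIOSFODNN7EXAMPLE, do not share"
print(censor_secrets(leaked))
# → my key is [SECRET], do not share
```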

Integrity

Detection methods under integrity enable Detectors to identify issues with the factual integrity of model outputs, such as hallucinations and contradictions of a supplied reference context. They include the following:
  1. hhem-hallucination
    This method uses the Vectara HHEM model for hallucination detection, comparing the output against a reference context. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    context | str | "" | Reference context to compare against
    factual_consistency_score_threshold | float | 0.5 | Score below which output is flagged
    trust_remote_code | bool | True | Trust remote code from model hub
  2. fact-check-roberta
    This detection method uses a RoBERTa model for detecting factual contradictions between the output and the context. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    context | str | "" | Reference context to check against
  3. hallucination-llm
    This uses LLM-based hallucination detection with reference context. It has the following parameters:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    hub_name | str | "openai" | LLM API provider
    model_name | str | "gpt-4-turbo" | Model name
    api_key | str | None | API key (falls back to environment variable)
    max_input_chars | int | None | Truncate input to this many characters
    context | str | None | Reference context for comparison
  4. fact-check-llm
    This method uses an LLM for fact-checking with reference context. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    hub_name | str | "openai" | LLM API provider
    model_name | str | "gpt-4-turbo" | Model name
    api_key | str | None | API key (falls back to environment variable)
    max_input_chars | int | None | Truncate input to this many characters
    context | str | None | Reference context for comparison
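Note that the integrity methods invert the usual threshold convention: security and moderation scores measure risk (flag when the score is above the threshold), while a factual-consistency score measures agreement with the context (flag when the score is below it). A minimal sketch of the two conventions, with illustrative function names:

```python
def flag_by_risk(score, threshold=0.5):
    """Security/moderation convention: high score means risky, flag when above."""
    return score >= threshold

def flag_by_consistency(score, threshold=0.5):
    """Integrity convention: high score means consistent, flag when below."""
    return score < threshold

print(flag_by_risk(0.8))         # high risk score → True
print(flag_by_consistency(0.8))  # high consistency score → False
```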

Generic

Detection methods under generic are versatile and can be customized and applied to a wide range of issues beyond the specific categories above. They include the following:
  1. generic-llm
    This method offers custom LLM-based detection with user-provided system prompts and trigger words. It can be adapted to a wide range of detection needs by tailoring the prompt and trigger words. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    sys_prompt_template | str | (required) | System prompt with $query_string placeholder
    trigger_word_list | list[str] | (required) | Words in the LLM response that indicate a hit
    hub_name | str | "openai" | LLM API provider
    model_name | str | "gpt-4-turbo" | Model name
    api_key | str | None | API key (falls back to environment variable)
    max_input_chars | int | None | Truncate input to this many characters
  2. policy-gpt-oss-safeguard
    This is a policy-based content classifier that uses GPT-OSS-Safeguard. It classifies inputs based on user-provided policy rules and returns the violated policy reference. Its parameters include the following:
    Parameter | Type | Default | Description
    --------- | ---- | ------- | -----------
    policy_file | str | (required) | Path to policy file with classification rules
    hub_name | str | "groq" | LLM API provider
    model_name | str | "openai/gpt-oss-safeguard-20b" | Model name
    output_format | str | "policy_ref" | "binary", "policy_ref", or "with_rationale"
    reasoning_effort | str | "medium" | "low", "medium", or "high"
    api_key | str | None | API key (falls back to environment variable)
    timeout | int | 60 | Request timeout in seconds
    max_retries | int | 3 | Maximum retry attempts
    max_input_chars | int | None | Truncate input to this many characters
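The sys_prompt_template and trigger_word_list mechanics of generic-llm can be sketched with the standard library. The template, trigger words, and the canned response standing in for a real LLM call are all illustrative assumptions, not Dome internals:

```python
from string import Template

def build_prompt(sys_prompt_template, query):
    """Fill the $query_string placeholder with the user input."""
    return Template(sys_prompt_template).substitute(query_string=query)

def is_hit(llm_response, trigger_word_list):
    """A detection fires if any trigger word appears in the LLM's response."""
    lowered = llm_response.lower()
    return any(trigger.lower() in lowered for trigger in trigger_word_list)

template = "Answer YES or NO: does the following text discuss finance? $query_string"
print(build_prompt(template, "What is a stock option?"))

# Canned responses standing in for real LLM calls:
print(is_hit("YES, it does.", ["yes"]))  # → True
print(is_hit("NO.", ["yes"]))            # → False
```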