Detection Methods
Vijil Dome has built-in detection methods that give Detectors their ability to identify issues. These methods are used to Configure Guardrails using a TOML file or dictionary. The detection methods are grouped under these five categories:
- Security
- Moderation
- Privacy
- Integrity
- Generic
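Since detection methods are selected and tuned when configuring guardrails, a dictionary-based configuration might look like the sketch below. This is purely illustrative: the nesting ("input-guard", per-category "methods" lists) is an assumption, not the actual Vijil Dome schema; only the method names and parameter names are taken from this page.

```python
# Illustrative only -- the surrounding structure is assumed, not Vijil Dome's
# actual configuration schema. Method and parameter names are the documented ones.
guardrail_config = {
    "input-guard": {
        "security": {
            "methods": ["prompt-injection-mbert"],
            "prompt-injection-mbert": {"score_threshold": 0.8},
        },
        "privacy": {
            "methods": ["privacy-presidio"],
            "privacy-presidio": {"anonymize": True, "redaction_style": "masked"},
        },
    },
}
```

The same structure could be expressed in TOML, with one table per guard and one sub-table per method.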
Security
The detection methods under security give Detectors the ability to detect adversarial inputs such as prompt injections, jailbreak attempts, and encoded or obfuscated payloads. They include the following:
prompt-injection-mbert
This is Vijil’s ModernBERT model for prompt injection detection. It supports up to 8,192 tokens natively, so sliding windows only activate for very long inputs. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Injection probability above which input is flagged |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window |
| window_stride | int | 4096 | Token step size between sliding windows |
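The interplay of max_length and window_stride can be sketched with a window-count calculation. The formula below is my assumption about how a sliding window with these two parameters behaves, not taken from Vijil Dome internals:

```python
import math

def num_windows(total_tokens: int, max_length: int = 8192, window_stride: int = 4096) -> int:
    """Assumed sliding-window count: windows start every window_stride tokens
    until the final window covers the end of the input."""
    if total_tokens <= max_length:
        return 1  # input fits in a single window; no sliding needed
    return math.ceil((total_tokens - max_length) / window_stride) + 1

# With the defaults, a 20,000-token input is covered by 4 overlapping windows.
```

With a stride of half the window size, consecutive windows overlap by 50%, so no span of the input is seen only at a window boundary.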
prompt-injection-deberta-finetuned-11122024
This is a Vijil-finetuned DeBERTa model for prompt injection detection. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window (DeBERTa limit) |
| window_stride | int | 256 | Token step size between sliding windows |
prompt-injection-deberta-v3-base
This is a DeBERTa v3 model for prompt injection detection. It has the following configurable parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window (DeBERTa limit) |
| window_stride | int | 256 | Token step size between sliding windows |
security-promptguard
This is the Meta Prompt Guard model for jailbreak and prompt injection detection. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Jailbreak probability threshold |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 512 | Maximum tokens per window |
| window_stride | int | 256 | Token step size between sliding windows |
security-llm
This is an LLM-based security classification model served via LiteLLM. Its configurable parameters include:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to environment variable) |
| max_input_chars | int | None | Truncate input to this many characters |
security-embeddings
This provides jailbreak detection via embedding similarity against a known-jailbreak corpus. It supports various embedding engines and models. Its parameters include:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engine | str | "SentenceTransformers" | Embedding engine |
| model | str | "all-MiniLM-L6-v2" | Embedding model name |
| threshold | float | 0.7 | Similarity threshold |
| in_mem | bool | True | Load embeddings in memory |
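The core idea can be sketched without any model dependency: embed the input, compare it against a corpus of known-jailbreak embeddings, and flag when the best cosine similarity clears the threshold. The toy 3-dimensional vectors below stand in for real SentenceTransformers embeddings; the decision rule (max similarity vs. threshold) is an assumption about how the detector works.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_jailbreak(query_emb, corpus_embs, threshold: float = 0.7) -> bool:
    # Flag the input if it is close enough to any known-jailbreak embedding.
    return max(cosine(query_emb, c) for c in corpus_embs) >= threshold

# Toy stand-ins for embeddings of two known jailbreak prompts.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

The in_mem flag presumably trades memory for lookup speed by keeping the corpus embeddings resident rather than loading them per query.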
jb-length-per-perplexity
This is a perplexity-based heuristic that flags jailbreaks by their length-to-perplexity ratio. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "gpt2-large" | HuggingFace model for perplexity |
| batch_size | int | 16 | Batch size |
| stride_length | int | 512 | Stride for perplexity calculation |
| threshold | float | 89.79 | Length-per-perplexity threshold |
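The flagging rule can be sketched as below. The exact ratio definition (character length divided by model perplexity, flagged when the ratio exceeds the threshold) is my assumption; the heuristic's intuition is that long prompts with unusually low perplexity score high on this ratio.

```python
def length_per_perplexity_flag(text: str, perplexity: float, threshold: float = 89.79) -> bool:
    """Assumed rule: flag when length / perplexity exceeds the threshold.
    `perplexity` would come from a gpt2-large pass in the real detector."""
    return len(text) / perplexity > threshold

# A 2,000-character prompt with perplexity 20 gives a ratio of 100 -> flagged.
```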
jb-prefix-suffix-perplexity
This is a perplexity-based heuristic that analyses the prefix and suffix of inputs separately. It flags jailbreaks by their prefix and suffix perplexity scores. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "gpt2-large" | HuggingFace model for perplexity |
| batch_size | int | 16 | Batch size |
| stride_length | int | 512 | Stride for perplexity calculation |
| prefix_threshold | float | 1845.65 | Prefix perplexity threshold |
| suffix_threshold | float | 1845.65 | Suffix perplexity threshold |
| prefix_length | int | 20 | Number of prefix words to analyse |
| suffix_length | int | 20 | Number of suffix words to analyse |
encoding-heuristics
This is a rule-based detector for encoded or obfuscated payloads (base64, ROT13, hex, URL encoding, Unicode tricks, etc.). It flags inputs as suspicious based on the presence of encoding patterns and their proportion in the text. Its parameters include:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold_map | dict | (see below) | Per-encoding-type thresholds |

The default threshold_map:

| Encoding Type | Threshold |
| --- | --- |
| base64 | 0.7 |
| rot13 | 0.7 |
| ascii_escape | 0.05 |
| hex_encoding | 0.15 |
| url_encoding | 0.15 |
| cyrillic_homoglyphs | 0.05 |
| mixed_scripts | 0.05 |
| zero_width | 0.01 |
| excessive_whitespace | 0.4 |
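A proportion-based check of this kind can be sketched as follows. The threshold values are the documented defaults; the base64 regex and the definition of "proportion of the text" are illustrative assumptions, not Vijil Dome's actual heuristics.

```python
import re

# Documented default thresholds, one per encoding type.
THRESHOLD_MAP = {
    "base64": 0.7, "rot13": 0.7, "ascii_escape": 0.05, "hex_encoding": 0.15,
    "url_encoding": 0.15, "cyrillic_homoglyphs": 0.05, "mixed_scripts": 0.05,
    "zero_width": 0.01, "excessive_whitespace": 0.4,
}

def base64_proportion_flag(text: str, threshold_map=THRESHOLD_MAP) -> bool:
    """Assumed check: flag when the fraction of the input covered by
    base64-looking runs (16+ chars of the base64 alphabet) meets the
    per-type threshold."""
    matched = sum(len(m) for m in re.findall(r"[A-Za-z0-9+/=]{16,}", text))
    return matched / max(len(text), 1) >= threshold_map["base64"]
```

Note how the thresholds differ by an order of magnitude: even a tiny proportion of zero-width characters (0.01) is suspicious, while base64-like runs must dominate the text (0.7) before being flagged.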
Moderation
Detection methods under moderation enable Detectors to identify content that violates content policies, such as hate speech, violence, adult content, and other toxic content. They include the following:
moderation-mbert
This is Vijil’s ModernBERT model for toxic content detection. It supports up to 8,192 tokens natively. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Toxicity probability threshold |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 8192 | Maximum tokens per window |
| window_stride | int | 4096 | Token step size between sliding windows |
moderations-oai-api
This is OpenAI’s Moderation API with per-category score thresholds. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold_dict | dict | None | Custom thresholds per category |

Supported categories include: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, violence/graphic, harassment, harassment/threatening, illegal, illicit, self-harm/intent, self-harm/instructions, and sexual/instructions.

This detection method requires you to set the OPENAI_API_KEY environment variable.
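A per-category score_threshold_dict might look like the sketch below. The category names are the documented ones; the threshold values, the flagging rule, and the 0.5 fallback for unlisted categories are illustrative assumptions.

```python
# Illustrative per-category thresholds: stricter on violence, more permissive
# on harassment. Values are examples, not recommendations.
score_threshold_dict = {
    "hate": 0.5,
    "violence": 0.2,
    "harassment": 0.8,
    "self-harm": 0.3,
}

def flagged(category_scores: dict, thresholds: dict) -> list:
    # Return categories whose moderation score meets or exceeds its threshold
    # (assumed fallback of 0.5 for categories without a custom threshold).
    return [c for c, s in category_scores.items() if s >= thresholds.get(c, 0.5)]
```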
moderation-deberta
This is a DeBERTa model for toxicity scoring. The 208-token context window means the sliding window activates for most non-trivial inputs. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| truncation | bool | True | Truncate inputs exceeding max_length |
| max_length | int | 208 | Maximum tokens per window |
| window_stride | int | 104 | Token step size between sliding windows |
| device | str | None | Torch device (auto-selects CUDA if available) |
moderation-perspective-api
This is Google’s Perspective API for toxicity and other attributes. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | Google API key (falls back to PERSPECTIVE_API_KEY) |
| attributes | dict | {"TOXICITY": {}} | Attributes to analyse |
| score_threshold | dict | {"TOXICITY": 0.5} | Per-attribute thresholds |

The available attributes include: TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, and THREAT.

Using this detection method requires setting the PERSPECTIVE_API_KEY environment variable.
moderation-prompt-engineering
This is an LLM-based moderation classifier served via LiteLLM. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to environment variable) |
| max_input_chars | int | None | Truncate input to this many characters |
moderation-flashtext
This is a keyword ban-list detector that uses FlashText for fast matching. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| banlist_filepaths | list[str] | None | Paths to ban-list files (uses built-in default list if omitted) |
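Ban-list matching in the spirit of FlashText can be sketched without the library: whole-word lookup against a keyword set in a single pass over the text, rather than one regex scan per banned term. The ban list below is illustrative; the real detector loads its terms from banlist_filepaths (or a built-in default list).

```python
def find_banned(text: str, banlist: set) -> list:
    """Single-pass, case-insensitive whole-word matching against a ban list.
    Set membership makes the cost independent of how many terms are banned."""
    return [w for w in text.lower().split() if w in banlist]

# Illustrative ban list; real deployments load terms from files.
banlist = {"examplebadword", "anotherbadword"}
```

This is why keyword-set approaches scale well: adding terms to the ban list does not slow down matching, unlike chained regular expressions.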
Privacy
Detection methods under privacy enable Detectors to identify personally identifiable information (PII) and sensitive data in inputs. They include the following:
privacy-presidio
This detection method uses Microsoft’s Presidio-based PII detection and redaction. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_threshold | float | 0.5 | Confidence threshold for PII detection |
| anonymize | bool | True | Redact detected PII in the response |
| allow_list_files | list[str] | None | Files with values to exclude from detection |
| redaction_style | str | "labeled" | Redaction style: "labeled" or "masked" |
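The difference between the two redaction styles can be sketched with a toy email regex standing in for Presidio's recognizers. The label format and masking character are illustrative assumptions, not Presidio's actual output:

```python
import re

# Toy recognizer: a simple email pattern in place of Presidio's NLP-backed ones.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str, redaction_style: str = "labeled") -> str:
    """'labeled' replaces each entity with its type; 'masked' replaces it
    with same-length asterisks (both formats assumed for illustration)."""
    if redaction_style == "labeled":
        return EMAIL.sub("<EMAIL_ADDRESS>", text)
    return EMAIL.sub(lambda m: "*" * len(m.group()), text)
```

Labeled redaction preserves what kind of PII was removed, which is useful for auditing; masked redaction hides even the entity type.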
detect-secrets
This is a pattern-based secret and credential detection method. It detects API keys, tokens, and similar credentials. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| censor | bool | True | Censor detected secrets in the response |

This method includes 25 detector plugins: ArtifactoryDetector, AWSKeyDetector, AzureStorageKeyDetector, BasicAuthDetector, CloudantDetector, DiscordBotTokenDetector, GitHubTokenDetector, GitLabTokenDetector, IbmCloudIamDetector, IbmCosHmacDetector, IPPublicDetector, JwtTokenDetector, KeywordDetector, MailchimpDetector, NpmDetector, OpenAIDetector, PrivateKeyDetector, PypiTokenDetector, SendGridDetector, SlackDetector, SoftlayerDetector, SquareOAuthDetector, StripeDetector, TelegramBotTokenDetector, TwilioKeyDetector.
Integrity
Detection methods under integrity enable Detectors to identify issues related to the integrity and authenticity of inputs or outputs, such as hallucinations, misinformation, deepfakes, and manipulated media. They include the following:
hhem-hallucination
This method uses the Vectara HHEM model for hallucination detection, which compares output against a reference context. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context | str | "" | Reference context to compare against |
| factual_consistency_score_threshold | float | 0.5 | Score below which output is flagged |
| trust_remote_code | bool | True | Trust remote code from model hub |
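The flagging rule follows directly from the parameter description: HHEM produces a factual-consistency score, and the output is flagged when the score falls below the threshold. A minimal sketch (the [0, 1] score range is an assumption about the model's output):

```python
def is_hallucination(factual_consistency_score: float, threshold: float = 0.5) -> bool:
    """Flag the output as a potential hallucination when its
    factual-consistency score against the context is below the threshold."""
    return factual_consistency_score < threshold
```

Raising the threshold makes the detector stricter: outputs must be more consistent with the reference context to pass.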
fact-check-roberta
This detection method uses a RoBERTa model for detecting factual contradictions between output and context. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context | str | "" | Reference context to check against |
hallucination-llm
This uses LLM-based hallucination detection with reference context. It has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to environment variable) |
| max_input_chars | int | None | Truncate input to this many characters |
| context | str | None | Reference context for comparison |
fact-check-llm
This method uses an LLM for fact-checking with reference context. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to environment variable) |
| max_input_chars | int | None | Truncate input to this many characters |
| context | str | None | Reference context for comparison |
Generic
Detection methods under generic are versatile: they can be customized and applied to a wide range of issues beyond the specific categories above. They include the following:
generic-llm
This method offers custom LLM-based detection with user-provided system prompts and trigger words. It can be used for various detection needs by tailoring the prompt and trigger words accordingly. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sys_prompt_template | str | (required) | System prompt with $query_string placeholder |
| trigger_word_list | list[str] | (required) | Words in LLM response that indicate a hit |
| hub_name | str | "openai" | LLM API provider |
| model_name | str | "gpt-4-turbo" | Model name |
| api_key | str | None | API key (falls back to environment variable) |
| max_input_chars | int | None | Truncate input to this many characters |
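The two required parameters fit together as sketched below: the system prompt template carries a $query_string placeholder that is filled with the input, and the LLM's reply counts as a hit when it contains any trigger word. The prompt wording and trigger words here are illustrative, not shipped defaults.

```python
from string import Template

# Illustrative template: the $query_string placeholder is the documented
# contract; the reviewer wording is an example.
sys_prompt_template = Template(
    "You are a strict reviewer. Answer UNSAFE if the following text violates "
    "policy, otherwise answer SAFE.\n\nText: $query_string"
)
trigger_word_list = ["UNSAFE"]

def build_prompt(query: str) -> str:
    # Substitute the input into the template before sending it to the LLM.
    return sys_prompt_template.substitute(query_string=query)

def is_hit(llm_response: str, triggers=trigger_word_list) -> bool:
    # A detection fires when any trigger word appears in the LLM's reply.
    return any(t in llm_response for t in triggers)
```

Choosing distinctive trigger words matters: a trigger like "UNSAFE" cannot accidentally match inside "SAFE", but short or common words would cause false hits on ordinary replies.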
policy-gpt-oss-safeguard
This is a policy-based content classifier that uses GPT-OSS-Safeguard. It classifies inputs based on user-provided policy rules and returns the violated policy reference. Its parameters include the following:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| policy_file | str | (required) | Path to policy file with classification rules |
| hub_name | str | "groq" | LLM API provider |
| model_name | str | "openai/gpt-oss-safeguard-20b" | Model name |
| output_format | str | "policy_ref" | "binary", "policy_ref", or "with_rationale" |
| reasoning_effort | str | "medium" | "low", "medium", or "high" |
| api_key | str | None | API key (falls back to environment variable) |
| timeout | int | 60 | Request timeout in seconds |
| max_retries | int | 3 | Maximum retry attempts |
| max_input_chars | int | None | Truncate input to this many characters |
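A parameter dictionary for this classifier might look like the following. It uses only the documented defaults and enumerated values; the policy file path is a hypothetical placeholder, and the surrounding dictionary shape is illustrative rather than a confirmed schema.

```python
# Illustrative parameter set; policy_file is a placeholder path, not a file
# shipped with Vijil Dome.
policy_guard_params = {
    "policy_file": "policies/content_policy.txt",   # hypothetical path
    "hub_name": "groq",
    "model_name": "openai/gpt-oss-safeguard-20b",
    "output_format": "with_rationale",  # one of: binary, policy_ref, with_rationale
    "reasoning_effort": "low",          # one of: low, medium, high
    "timeout": 60,
    "max_retries": 3,
}
```

Picking "with_rationale" trades latency for explainability: the classifier returns its reasoning alongside the violated policy reference, while "binary" returns only a yes/no verdict.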