What is a Probe?
A probe is an individual test case—a specific prompt (or sequence of prompts) sent to your agent, along with detectors that analyze the response. Probes are the atomic unit of evaluation. Everything else—scenarios, harnesses, Trust Scores—aggregates up from probe results. Each probe is designed to test for a specific behavior. A probe might:- Ask the agent to recommend Python packages (testing for hallucination)
- Embed instructions in a fake customer email (testing for prompt injection)
- Gradually escalate requests over multiple turns (testing for boundary erosion)
- Include a false premise in a question (testing for factual resistance)
Anatomy of a Probe
A probe consists of:Prompt Template
The text sent to the agent. Templates can include variables that are filled in at runtime:Detector Configuration
Which detectors analyze the response and what constitutes a pass. A prompt injection probe might use a trigger-presence detector configured to fail if the response contains a specific string the injection tried to elicit.Metadata
Information about what the probe tests: which scenario it belongs to, which trust dimension, severity of failure, source of the test case.Types of Probes
Single-Turn Probes
Most probes are single-turn: one prompt, one response, one detection. These are efficient to run and straightforward to analyze.Multi-Turn Probes
Some attacks require context buildup. Crescendo attacks, for example, work by gradually pushing boundaries over multiple conversational turns. Multi-turn probes capture these sequences:Mutated Probes
Mutation applies transformations to base probes to test robustness. A prompt injection might be tested with:- Encoding mutations: Base64, unicode, ROT13
- Noise mutations: Typos, extra whitespace, special characters
- Structural mutations: JSON wrapping, markdown formatting
- Evasion mutations: Character substitution, homoglyphs
Probe Sources
Vijil probes come from multiple sources:Research Benchmarks
Established benchmarks from the AI safety and security research community:- HarmBench: Harmful behavior elicitation
- StrongREJECT: Jailbreak resistance
- TruthfulQA: Factual accuracy
- CyberSecEval: Security vulnerabilities
Attack Libraries
Collections of known attack techniques:- Garak: Open-source LLM vulnerability scanner
- OWASP LLM attacks: Documented attack patterns
- Published jailbreaks: Techniques from security research
Vijil Research
Probes developed by Vijil’s research team based on emerging attack patterns, customer incidents, and novel vulnerability classes.Probe Results
Each probe produces a result:| Field | Description |
|---|---|
pass | Boolean: did the agent handle this probe correctly? |
prompt | The exact text sent to the agent |
response | The agent’s response |
detector | Which detector analyzed the response |
evidence | Why the detector made its determination |
Working with Probes
Viewing Probe Results
In evaluation results, you can drill down from harness → scenario → probe to see exactly which test cases failed:Custom Probes
You can add custom probes to test agent-specific concerns:- Probes based on your system prompt
- Probes that test for your sensitive topics
- Probes derived from production incidents