The Baseline Expectation
Reliability is the oldest form of trust. We trust a hammer because it drives nails. We trust a calculator because it returns correct answers. We trust a database because it stores and retrieves records without corruption. This is the trust we extend to tools: they do what they’re supposed to do, every time, without surprises.
AI agents inherit this expectation but struggle to meet it. Unlike traditional software, where incorrect behavior usually indicates a bug to be fixed, agents can produce incorrect outputs as a normal part of their operation. They hallucinate facts, contradict themselves across sessions, and fail unpredictably on inputs that seem similar to ones they handle well. The probabilistic nature of language models means reliability isn’t a binary property; it’s a distribution.
Vijil measures reliability across three sub-dimensions: correctness, consistency, and robustness. Together, these capture whether an agent can be trusted to perform its intended function.
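Because outputs vary from run to run, a reliability score is better treated as an observed pass rate over repeated trials than as a single pass/fail bit. Here is a minimal sketch of that idea; `agent(prompt)` and `passes(response)` are hypothetical stand-ins, not part of Vijil’s API:

```python
def estimate_pass_rate(agent, passes, prompt, trials=20):
    """Estimate reliability as an observed pass rate over repeated runs.

    `agent(prompt)` and `passes(response)` are hypothetical stand-ins: the first
    returns the agent's response text, the second returns True if that response
    meets the criterion being probed.
    """
    results = [passes(agent(prompt)) for _ in range(trials)]
    return sum(results) / trials

# A score of 0.85 means the agent met the criterion in 17 of 20 runs,
# which says more about reliability than any single pass/fail observation.
```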
Correctness
Does the agent produce accurate, valid outputs? Correctness measures alignment between what an agent says and what is actually true. A correct agent produces outputs that are factually accurate, logically sound, and aligned with its instructions.
Failure Modes
Hallucination is the signature failure mode of language models. An agent confidently states that a Python package exists when it doesn’t, invents citations to papers that were never written, or fabricates details about people, places, and events. Hallucinations are especially dangerous because they’re delivered with the same confidence as accurate information: there’s no signal to the user that something is wrong.
Logical errors occur when an agent’s reasoning chain contains flaws. The agent might make arithmetic mistakes, draw invalid conclusions from premises, or fail to recognize contradictions in its own outputs. Unlike hallucinations about facts, these errors involve the reasoning process itself.
Instruction drift happens when an agent gradually deviates from its intended behavior. It might start following its system prompt faithfully but, over the course of a conversation, begin ignoring constraints or interpreting instructions in unintended ways.
What Vijil Tests
Vijil evaluates correctness through probes designed to elicit hallucinations and logical errors; a simplified sketch of one such check follows the table.
| Test Category | What It Measures |
|---|---|
| Factual assertions | Will the agent make false claims about verifiable facts (senators, public figures, prime numbers)? |
| Package hallucination | Will the agent recommend non-existent software packages? |
| Misleading prompts | Can the agent be tricked into accepting and propagating false premises? |
| Logical reasoning | Does the agent make valid inferences and catch contradictions? |
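To illustrate the spirit of a package-hallucination check, here is a minimal sketch, not Vijil’s implementation: it extracts `pip install` targets from an agent’s response and flags any name missing from a known-good allowlist (a real check might instead query a package index such as PyPI).

```python
import re

# Hypothetical allowlist for the sketch; a production check would consult a package index.
KNOWN_PACKAGES = {"requests", "numpy", "pandas", "scikit-learn"}

def hallucinated_packages(agent_response: str) -> set:
    """Return package names the agent recommended that are not in the allowlist."""
    # Capture the targets of `pip install <name>` statements in the response text.
    recommended = set(re.findall(r"pip install ([A-Za-z0-9_.\-]+)", agent_response))
    return recommended - KNOWN_PACKAGES

response = "You can solve this with `pip install requests` or `pip install fastjsonix`."
print(hallucinated_packages(response))  # {'fastjsonix'} -- a likely hallucinated package
```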
Consistency
Does the agent behave predictably across similar conditions? Consistency measures stability of behavior. A consistent agent gives similar answers to similar questions, maintains its persona across sessions, and doesn’t contradict itself within a conversation.
Failure Modes
Response variance occurs when the same input produces meaningfully different outputs. Some variance is expected, since language models are probabilistic, but excessive variance indicates the agent’s behavior is unpredictable in ways that matter.
Persona instability happens when an agent’s character, tone, or capabilities drift over a conversation. An agent configured as a helpful assistant might gradually become argumentative, or one constrained to a specific domain might start answering questions outside its scope.
Self-contradiction occurs when an agent makes claims that conflict with its earlier statements. This is distinct from changing an answer when presented with new information; self-contradiction happens without any new input that would justify the change.
What Vijil Tests
Vijil evaluates consistency by running variations of the same probe and measuring output stability; a minimal sketch of a paraphrase-invariance check follows the table.
| Test Category | What It Measures |
|---|---|
| Paraphrase invariance | Does rewording a question change the answer substantively? |
| Session stability | Does the agent maintain consistent behavior across conversation turns? |
| Constraint adherence | Does the agent continue following its system prompt throughout a session? |
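As a simplified sketch of paraphrase invariance, the helper below asks the same question in several wordings and measures how often the answers agree. `agent(prompt)` is a hypothetical callable, and normalized exact match stands in for the semantic-similarity scoring a real harness would use.

```python
from collections import Counter

def paraphrase_consistency(agent, paraphrases):
    """Ask the same question in several wordings and measure answer agreement.

    `agent(prompt)` is a hypothetical callable returning the agent's answer text.
    Returns the fraction of paraphrases whose answer matches the most common one.
    """
    answers = [agent(p).strip().lower() for p in paraphrases]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

questions = [
    "What year was the Eiffel Tower completed?",
    "In which year did construction of the Eiffel Tower finish?",
    "The Eiffel Tower was finished in what year?",
]
# score = paraphrase_consistency(agent, questions)
# A score near 1.0 indicates stable answers; lower scores flag paraphrase sensitivity.
```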
Robustness
Does the agent maintain performance under challenging conditions? Robustness measures resilience to inputs that deviate from the ideal. A robust agent handles typos, ambiguous phrasing, edge cases, and unexpected formats without degrading significantly.
Failure Modes
Brittleness is sensitivity to superficial input changes. An agent might answer a question correctly when phrased one way but fail completely when the same question includes a typo, uses different word order, or adds irrelevant context. This brittleness makes agents unreliable in real-world conditions where inputs are messy.
Edge case failures occur at the boundaries of an agent’s training distribution. Unusual inputs, such as very long prompts, uncommon languages, or domain-specific jargon, can cause performance to degrade sharply even when the underlying task is similar to ones the agent handles well.
Distractor sensitivity happens when irrelevant information in the input affects the output. An agent solving a math problem might give a different answer when the problem includes extraneous details, even though those details shouldn’t affect the solution.
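To make distractor sensitivity concrete, here is a minimal sketch, again assuming a hypothetical `agent(prompt)` callable rather than Vijil’s actual probes: append an irrelevant detail to a math problem and check whether the answer changes.

```python
def distractor_check(agent, question, distractor):
    """Compare the agent's answer with and without an irrelevant detail appended.

    `agent(prompt)` is a hypothetical callable returning the agent's answer text.
    Returns (unchanged, clean_answer, noisy_answer).
    """
    clean = agent(question).strip()
    noisy = agent(f"{question} {distractor}").strip()
    return clean == noisy, clean, noisy

question = "A train travels 60 km in 1.5 hours. What is its average speed in km/h?"
distractor = "The train is painted blue and the conductor's name is Ana."
# The distractor changes nothing about the arithmetic (60 / 1.5 = 40 km/h),
# so a robust agent should return the same answer for both prompts.
```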
What Vijil Tests
Vijil evaluates robustness through adversarial perturbations that challenge input sensitivity; a sketch of a typo-injection perturbation follows the table.
| Test Category | What It Measures |
|---|---|
| Typo injection | Does performance degrade when inputs contain realistic spelling errors? |
| Synonym substitution | Do semantically equivalent phrasings produce consistent results? |
| Semantic perturbation | How sensitive is the agent to meaning-preserving input variations? |
| Distractor injection | Does irrelevant context affect task performance? |
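As an illustration of typo injection (a sketch, not Vijil’s probe generator), the helper below perturbs a prompt with realistic adjacent-character swaps so the original and perturbed responses can be compared for divergence.

```python
import random

def inject_typos(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent letters at roughly `rate` per character to mimic sloppy typing."""
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

original = "Summarize the refund policy for orders placed in December."
perturbed = inject_typos(original, rate=0.1)
# A robust agent should give substantially the same answer for both prompts;
# a large divergence signals brittleness to superficial input noise.
```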