
The Baseline Expectation

Reliability is the oldest form of trust. We trust a hammer because it drives nails. We trust a calculator because it returns correct answers. We trust a database because it stores and retrieves records without corruption. This is the trust we extend to tools: they do what they’re supposed to do, every time, without surprises.

AI agents inherit this expectation but struggle to meet it. Unlike traditional software, where incorrect behavior usually indicates a bug to be fixed, agents can produce incorrect outputs as a normal part of their operation. They hallucinate facts, contradict themselves across sessions, and fail unpredictably on inputs that seem similar to ones they handle well. The probabilistic nature of language models means reliability isn’t a binary property—it’s a distribution.

Vijil measures reliability across three sub-dimensions: correctness, consistency, and robustness. Together, these capture whether an agent can be trusted to perform its intended function.
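To make “reliability is a distribution” concrete, the sketch below samples the same prompt repeatedly and reports an empirical correctness rate rather than a single pass/fail verdict. It is a minimal illustration only; `call_agent` and `is_correct` are hypothetical stand-ins, and this is not how Vijil computes its scores.

```python
# Minimal sketch: treat reliability as an observed rate, not a binary property.
# `call_agent` and `is_correct` are hypothetical stand-ins, not Vijil APIs.
from collections import Counter

def empirical_reliability(call_agent, is_correct, prompt, n_samples=20):
    """Return the fraction of sampled responses judged correct."""
    outcomes = Counter(is_correct(call_agent(prompt)) for _ in range(n_samples))
    return outcomes[True] / n_samples

# Usage with stubbed components:
# rate = empirical_reliability(my_agent, exact_match_checker, "Is 97 prime?")
# print(f"Observed correctness rate: {rate:.2f}")
```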

Correctness

Does the agent produce accurate, valid outputs? Correctness measures alignment between what an agent says and what is actually true. A correct agent produces outputs that are factually accurate, logically sound, and aligned with its instructions.

Failure Modes

Hallucination is the signature failure mode of language models. An agent confidently states that a Python package exists when it doesn’t, invents citations to papers that were never written, or fabricates details about people, places, and events. Hallucinations are especially dangerous because they’re delivered with the same confidence as accurate information—there’s no signal to the user that something is wrong.

Logical errors occur when an agent’s reasoning chain contains flaws. The agent might make arithmetic mistakes, draw invalid conclusions from premises, or fail to recognize contradictions in its own outputs. Unlike hallucinations about facts, these errors involve the reasoning process itself.

Instruction drift happens when an agent gradually deviates from its intended behavior. It might start following its system prompt faithfully but, over the course of a conversation, begin ignoring constraints or interpreting instructions in unintended ways.

What Vijil Tests

Vijil evaluates correctness through probes designed to elicit hallucinations and logical errors:
Test Category | What It Measures
Factual assertions | Will the agent make false claims about verifiable facts (senators, public figures, prime numbers)?
Package hallucination | Will the agent recommend non-existent software packages?
Misleading prompts | Can the agent be tricked into accepting and propagating false premises?
Logical reasoning | Does the agent make valid inferences and catch contradictions?
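For illustration, here is a rough sketch of how a package-hallucination probe could be scored: each package the agent recommends is checked against the public PyPI index, and the score is the fraction of recommendations that actually exist. The prompt handling and scoring choices are assumptions for this sketch, not Vijil’s probe implementation; only the PyPI JSON endpoint is a real API.

```python
# Sketch of a package-hallucination check: verify recommended package names
# against PyPI. A 200 response means the package exists; 404 means it does not.
import requests

def package_exists_on_pypi(name: str) -> bool:
    """Check whether a package name resolves on the PyPI JSON API."""
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return resp.status_code == 200

def score_package_recommendations(recommended: list[str]) -> float:
    """Fraction of recommended packages that exist (1.0 = no hallucinated packages)."""
    if not recommended:
        return 1.0
    return sum(package_exists_on_pypi(p) for p in recommended) / len(recommended)

# Example: an agent asked for a TOML parser replies with ["tomli", "tomlparse3000"];
# the second, invented name would fail the existence check and lower the score.
```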

Consistency

Does the agent behave predictably across similar conditions? Consistency measures stability of behavior. A consistent agent gives similar answers to similar questions, maintains its persona across sessions, and doesn’t contradict itself within a conversation.

Failure Modes

Response variance occurs when the same input produces meaningfully different outputs. Some variance is expected—language models are probabilistic—but excessive variance indicates the agent’s behavior is unpredictable in ways that matter.

Persona instability happens when an agent’s character, tone, or capabilities drift over a conversation. An agent configured as a helpful assistant might gradually become argumentative, or one constrained to a specific domain might start answering questions outside its scope.

Self-contradiction occurs when an agent makes claims that conflict with its earlier statements. This is distinct from changing an answer when presented with new information—self-contradiction happens without any new input that would justify the change.

What Vijil Tests

Vijil evaluates consistency by running variations of the same probe and measuring output stability:
Test Category | What It Measures
Paraphrase invariance | Does rewording a question change the answer substantively?
Session stability | Does the agent maintain consistent behavior across conversation turns?
Constraint adherence | Does the agent continue following its system prompt throughout a session?
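As an illustration of paraphrase invariance, the sketch below sends several rewordings of the same question to an agent and averages pairwise response similarity. The `agent` callable and the difflib-based similarity measure are assumptions for the sketch (an embedding- or judge-based comparison is equally plausible); this is not Vijil’s scoring method.

```python
# Sketch of a paraphrase-invariance check: stable agents give similar answers
# to semantically equivalent prompts.
from difflib import SequenceMatcher
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; a stand-in for a semantic comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def paraphrase_invariance(agent, paraphrases: list[str]) -> float:
    """Average pairwise similarity of responses to paraphrased prompts (1.0 = stable)."""
    responses = [agent(p) for p in paraphrases]
    pairs = [
        similarity(responses[i], responses[j])
        for i in range(len(responses))
        for j in range(i + 1, len(responses))
    ]
    return mean(pairs) if pairs else 1.0

# Usage:
# score = paraphrase_invariance(my_agent, [
#     "What year was the company founded?",
#     "In which year was the company established?",
# ])
```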

Robustness

Does the agent maintain performance under challenging conditions? Robustness measures resilience to inputs that deviate from the ideal. A robust agent handles typos, ambiguous phrasing, edge cases, and unexpected formats without degrading significantly.

Failure Modes

Brittleness is sensitivity to superficial input changes. An agent might answer a question correctly when phrased one way but fail completely when the same question includes a typo, uses different word order, or adds irrelevant context. This brittleness makes agents unreliable in real-world conditions where inputs are messy.

Edge case failures occur at the boundaries of an agent’s training distribution. Unusual inputs—very long prompts, uncommon languages, domain-specific jargon—can cause performance to degrade sharply even when the underlying task is similar to ones the agent handles well.

Distractor sensitivity happens when irrelevant information in the input affects the output. An agent solving a math problem might give a different answer when the problem includes extraneous details, even though those details shouldn’t affect the solution.

What Vijil Tests

Vijil evaluates robustness through adversarial perturbations that challenge input sensitivity:
Test Category | What It Measures
Typo injection | Does performance degrade when inputs contain realistic spelling errors?
Synonym substitution | Do semantically equivalent phrasings produce consistent results?
Semantic perturbation | How sensitive is the agent to meaning-preserving input variations?
Distractor injection | Does irrelevant context affect task performance?
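As a rough illustration of typo injection, the sketch below perturbs a prompt with adjacent-character swaps and checks whether the agent’s answer still matches its answer on the clean prompt. The perturbation strategy and the `agent` and `answers_match` callables are assumptions for this sketch, not the probes Vijil actually runs.

```python
# Sketch of a typo-injection robustness check: a robust agent's answer should
# survive small, realistic spelling errors in the input.
import random

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent alphabetic characters at roughly `rate` of positions."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def typo_robustness(agent, answers_match, prompt: str, n_variants: int = 5) -> float:
    """Fraction of typo-perturbed prompts whose answer matches the clean answer."""
    clean_answer = agent(prompt)
    hits = sum(
        answers_match(agent(inject_typos(prompt, seed=s)), clean_answer)
        for s in range(n_variants)
    )
    return hits / n_variants
```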

Why Reliability Matters

Reliability failures erode trust faster than almost any other kind. Users can tolerate an agent that occasionally refuses a request or takes longer than expected. What they can’t tolerate is an agent that confidently gives wrong answers, behaves unpredictably, or breaks when inputs aren’t perfect.

In enterprise deployments, unreliable agents create concrete harms: customers receive incorrect information, decisions are made on fabricated data, and users learn to distrust the agent’s outputs even when they’re correct. The cost isn’t just the individual errors—it’s the overhead of having humans verify everything the agent produces, which eliminates much of the efficiency gain agents are supposed to provide.

The reliability dimension of the Trust Score gives you evidence about where your agent stands. Not a guarantee of correctness, but a map of its reliability profile: where it’s strong, where it’s vulnerable, and how it compares to alternatives.

Next Steps