
The Baseline Expectation

Reliability is the oldest form of trust. We trust a hammer because it drives nails. We trust a calculator because it returns correct answers. We trust a database because it stores and retrieves records without corruption. This is the trust we extend to tools: they do what they’re supposed to do, every time, without surprises.

AI agents inherit this expectation but struggle to meet it. Unlike traditional software, where incorrect behavior usually indicates a bug to be fixed, agents can produce incorrect outputs as a normal part of their operation. They hallucinate facts, contradict themselves across sessions, and fail unpredictably on inputs that seem similar to ones they handle well. The probabilistic nature of language models means reliability isn’t a binary property—it’s a distribution.

Vijil measures reliability across three sub-dimensions: correctness, consistency, and robustness. Together, these capture whether an agent can be trusted to perform its intended function.
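To make “reliability is a distribution” concrete, the sketch below samples the same prompt repeatedly and reports an empirical correctness rate rather than a single pass/fail verdict. It is a minimal illustration only; `call_agent` and `is_correct` are hypothetical stand-ins, and this is not how Vijil computes its scores.

```python
# Minimal sketch: treat reliability as an observed rate, not a binary property.
# `call_agent` and `is_correct` are hypothetical stand-ins, not Vijil APIs.
from collections import Counter

def empirical_reliability(call_agent, is_correct, prompt, n_samples=20):
    """Return the fraction of sampled responses judged correct."""
    outcomes = Counter(is_correct(call_agent(prompt)) for _ in range(n_samples))
    return outcomes[True] / n_samples

# Usage with stubbed components:
# rate = empirical_reliability(my_agent, exact_match_checker, "Is 97 prime?")
# print(f"Observed correctness rate: {rate:.2f}")
```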

Correctness

Does the agent produce accurate, valid outputs? Correctness measures alignment between what an agent says and what is actually true. A correct agent produces outputs that are factually accurate, logically sound, and aligned with its instructions.

Failure Modes

Hallucination is the signature failure mode of language models. An agent confidently states that a Python package exists when it doesn’t, invents citations to papers that were never written, or fabricates details about people, places, and events. Hallucinations are especially dangerous because they’re delivered with the same confidence as accurate information—there’s no signal to the user that something is wrong.

Logical errors occur when an agent’s reasoning chain contains flaws. The agent might make arithmetic mistakes, draw invalid conclusions from premises, or fail to recognize contradictions in its own outputs. Unlike hallucinations about facts, these errors involve the reasoning process itself.

Instruction drift happens when an agent gradually deviates from its intended behavior. It might start following its system prompt faithfully but, over the course of a conversation, begin ignoring constraints or interpreting instructions in unintended ways.

What Vijil Tests

Vijil evaluates correctness through probes designed to elicit hallucinations and logical errors:
Test Category | What It Measures
Factual assertions | Will the agent make false claims about verifiable facts (senators, public figures, prime numbers)?
Package hallucination | Will the agent recommend non-existent software packages?
Misleading prompts | Can the agent be tricked into accepting and propagating false premises?
Logical reasoning | Does the agent make valid inferences and catch contradictions?
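For illustration, here is a rough sketch of how a package-hallucination probe could be scored: each package the agent recommends is checked against the public PyPI index, and the score is the fraction of recommendations that actually exist. The prompt handling and scoring choices are assumptions for this sketch, not Vijil’s probe implementation; only the PyPI JSON endpoint is a real API.

```python
# Sketch of a package-hallucination check: verify recommended package names
# against PyPI. A 200 response means the package exists; 404 means it does not.
import requests

def package_exists_on_pypi(name: str) -> bool:
    """Check whether a package name resolves on the PyPI JSON API."""
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return resp.status_code == 200

def score_package_recommendations(recommended: list[str]) -> float:
    """Fraction of recommended packages that exist (1.0 = no hallucinated packages)."""
    if not recommended:
        return 1.0
    return sum(package_exists_on_pypi(p) for p in recommended) / len(recommended)

# Example: an agent asked for a TOML parser replies with ["tomli", "tomlparse3000"];
# the second, invented name would fail the existence check and lower the score.
```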

Consistency

Does the agent behave predictably across similar conditions? Consistency measures stability of behavior. A consistent agent gives similar answers to similar questions, maintains its persona across sessions, and doesn’t contradict itself within a conversation.

Failure Modes

Response variance occurs when the same input produces meaningfully different outputs. Some variance is expected—language models are probabilistic—but excessive variance indicates the agent’s behavior is unpredictable in ways that matter.

Persona instability happens when an agent’s character, tone, or capabilities drift over a conversation. An agent configured as a helpful assistant might gradually become argumentative, or one constrained to a specific domain might start answering questions outside its scope.

Self-contradiction occurs when an agent makes claims that conflict with its earlier statements. This is distinct from changing an answer when presented with new information—self-contradiction happens without any new input that would justify the change.

What Vijil Tests

Vijil evaluates consistency by running variations of the same probe and measuring output stability:
Test Category | What It Measures
Paraphrase invariance | Does rewording a question change the answer substantively?
Session stability | Does the agent maintain consistent behavior across conversation turns?
Constraint adherence | Does the agent continue following its system prompt throughout a session?
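As an illustration of paraphrase invariance, the sketch below sends several rewordings of the same question to an agent and averages pairwise response similarity. The `agent` callable and the difflib-based similarity measure are assumptions for the sketch (an embedding- or judge-based comparison is equally plausible); this is not Vijil’s scoring method.

```python
# Sketch of a paraphrase-invariance check: stable agents give similar answers
# to semantically equivalent prompts.
from difflib import SequenceMatcher
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; a stand-in for a semantic comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def paraphrase_invariance(agent, paraphrases: list[str]) -> float:
    """Average pairwise similarity of responses to paraphrased prompts (1.0 = stable)."""
    responses = [agent(p) for p in paraphrases]
    pairs = [
        similarity(responses[i], responses[j])
        for i in range(len(responses))
        for j in range(i + 1, len(responses))
    ]
    return mean(pairs) if pairs else 1.0

# Usage:
# score = paraphrase_invariance(my_agent, [
#     "What year was the company founded?",
#     "In which year was the company established?",
# ])
```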

Robustness

Does the agent maintain performance under challenging conditions? Robustness measures resilience to inputs that deviate from the ideal. A robust agent handles typos, ambiguous phrasing, edge cases, and unexpected formats without degrading significantly.

Failure Modes

Brittleness is sensitivity to superficial input changes. An agent might answer a question correctly when phrased one way but fail completely when the same question includes a typo, uses different word order, or adds irrelevant context. This brittleness makes agents unreliable in real-world conditions where inputs are messy.

Edge case failures occur at the boundaries of an agent’s training distribution. Unusual inputs—very long prompts, uncommon languages, domain-specific jargon—can cause performance to degrade sharply even when the underlying task is similar to ones the agent handles well.

Distractor sensitivity happens when irrelevant information in the input affects the output. An agent solving a math problem might give a different answer when the problem includes extraneous details, even though those details shouldn’t affect the solution.

What Vijil Tests

Vijil evaluates robustness through adversarial perturbations that challenge input sensitivity:
Test Category | What It Measures
Typo injection | Does performance degrade when inputs contain realistic spelling errors?
Synonym substitution | Do semantically equivalent phrasings produce consistent results?
Semantic perturbation | How sensitive is the agent to meaning-preserving input variations?
Distractor injection | Does irrelevant context affect task performance?
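As a rough illustration of typo injection, the sketch below perturbs a prompt with adjacent-character swaps and checks whether the agent’s answer still matches its answer on the clean prompt. The perturbation strategy and the `agent` and `answers_match` callables are assumptions for this sketch, not the probes Vijil actually runs.

```python
# Sketch of a typo-injection robustness check: a robust agent's answer should
# survive small, realistic spelling errors in the input.
import random

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent alphabetic characters at roughly `rate` of positions."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def typo_robustness(agent, answers_match, prompt: str, n_variants: int = 5) -> float:
    """Fraction of typo-perturbed prompts whose answer matches the clean answer."""
    clean_answer = agent(prompt)
    hits = sum(
        answers_match(agent(inject_typos(prompt, seed=s)), clean_answer)
        for s in range(n_variants)
    )
    return hits / n_variants
```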

Why Reliability Matters

Reliability failures erode trust faster than almost any other kind. Users can tolerate an agent that occasionally refuses a request or takes longer than expected. What they can’t tolerate is an agent that confidently gives wrong answers, behaves unpredictably, or breaks when inputs aren’t perfect.

In enterprise deployments, unreliable agents create concrete harms: customers receive incorrect information, decisions are made on fabricated data, and users learn to distrust the agent’s outputs even when they’re correct. The cost isn’t just the individual errors—it’s the overhead of having humans verify everything the agent produces, which eliminates much of the efficiency gain agents are supposed to provide.

The reliability dimension of the Trust Score gives you evidence about where your agent stands. Not a guarantee of correctness, but a map of its reliability profile: where it’s strong, where it’s vulnerable, and how it compares to alternatives.

Next Steps