Trust with Consequences
Reliability asks whether an agent does what itâs supposed to do. Security asks whether it resists attempts to make it do something else. Safety asks a different question: even when the agent is functioning correctly, does it stay within acceptable boundaries? This is the trust we extend to institutionsânot just that theyâre competent, but that theyâre accountable. A hospital isnât just medically capable; it operates within regulatory frameworks, maintains ethical standards, and provides transparency when things go wrong. We trust institutions because they have guardrails, oversight, and consequences for overstepping. AI agents are gaining institutional powersâmaking decisions, handling sensitive interactions, taking actions with real-world effectsâwithout the institutional safeguards weâve developed for human organizations. Safety measures whether those safeguards exist: does the agent stay in its lane, follow the rules, and remain transparent about its behavior? Vijil measures safety across three sub-dimensions: containment, compliance, and transparency.Containment
Does the agent stay within its intended scope? Containment measures whether an agent respects boundaries. A contained agent doesnât exceed its authorized capabilities, doesnât take actions outside its intended domain, and doesnât escalate its own permissions or influence.Failure Modes
Scope creep occurs when an agent gradually expands beyond its intended function. An agent designed to answer customer questions might start making commitments on behalf of the company. One designed to summarize documents might start offering advice. The expansion is often subtleâeach individual step seems reasonableâbut the cumulative effect is an agent operating far outside its design. Capability escalation happens when an agent acquires or exercises capabilities beyond what it was authorized to have. This is particularly relevant for agents with tool access: can they be manipulated into calling APIs they shouldnât, accessing files outside their sandbox, or chaining operations in ways that exceed their intended power? Boundary violations occur when an agent ignores explicit constraints on its behavior. An agent instructed to only discuss certain topics might be led into forbidden areas. One configured to never share certain information might leak it under the right prompting. These violations represent failures of the agentâs ability to enforce its own rules.What Vijil Tests
Vijil evaluates containment through probes designed to push agents beyond their boundaries:| Test Category | What It Measures |
|---|---|
| Role adherence | Does the agent stay within its defined persona and scope? |
| Tool abuse resistance | Can the agent be manipulated into unauthorized tool use? |
| Constraint enforcement | Do explicit restrictions hold under adversarial pressure? |
| Escalation resistance | Does the agent resist attempts to expand its capabilities? |
Compliance
Does the agent adhere to policies, regulations, and norms? Compliance measures alignment with external standards. A compliant agent follows organizational policies, respects regulatory requirements, and adheres to cultural and ethical norms. It doesnât produce outputs that would create legal liability or reputational harm for its operators.Failure Modes
Policy violations occur when an agentâs outputs conflict with organizational rules. An agent might make promises the company canât keep, share information thatâs confidential under company policy, or take positions on issues the organization considers off-limits. These violations create liability even when the agent is technically functioning correctly. Regulatory non-compliance happens when outputs violate legal or regulatory standards. Healthcare agents might give medical advice theyâre not licensed to provide. Financial agents might make recommendations that violate securities regulations. The agent doesnât need to know the regulations exist to violate themâand âthe AI said itâ isnât a legal defense. Norm violations are outputs that, while not illegal, violate social or professional standards. Profanity, bias, culturally insensitive content, or outputs that would be considered inappropriate in a professional context all fall into this category. These violations might not have legal consequences, but they have business consequences.What Vijil Tests
Vijil evaluates compliance through probes designed to elicit policy-violating outputs:| Test Category | What It Measures |
|---|---|
| Business ethics | Does the agent adhere to common codes of conduct? |
| Bias detection | Does the agent show inappropriate demographic bias? |
| Content standards | Does the agent avoid profanity and inappropriate content? |
| Regulatory alignment | Does the agent respect domain-specific compliance requirements? |
Transparency
Does the agent provide visibility into its behavior? Transparency measures whether an agentâs operations are observable and explainable. A transparent agent doesnât hide its reasoning, provides clear explanations for its outputs, and enables oversight of its behavior. When something goes wrong, you can understand why.Failure Modes
Opaque reasoning occurs when an agent produces outputs without any insight into how it reached them. For consequential decisionsârecommendations, assessments, actionsâstakeholders often need to understand the reasoning, not just see the result. An agent that canât explain itself canât be audited. Deceptive behavior is a more severe transparency failure: the agent actively misleads users about what itâs doing or why. This might be unintentional (the agent confabulates explanations) or emergent (the agent learns that certain framings get better responses). Either way, users canât trust what the agent says about itself. Audit resistance happens when an agentâs behavior is difficult to observe, log, or review. This isnât about the agent hiding thingsâitâs about whether the agentâs integration allows for meaningful oversight. Can you see what prompts it received? What it returned? What tools it called? Without this visibility, you canât verify that the agent is behaving appropriately.What Vijil Tests
Vijil evaluates transparency through probes that examine explanation quality and behavioral observability:| Test Category | What It Measures |
|---|---|
| Explanation quality | Does the agent provide clear reasoning for its outputs? |
| Self-knowledge accuracy | Does the agent accurately represent its own capabilities and limitations? |
| Uncertainty communication | Does the agent appropriately signal when itâs uncertain? |