Trust Under Pressure
Reliability measures whether an agent performs correctly under normal conditions. Security measures whether it maintains that performance when someone is actively trying to break it. This is a different kind of trust—the trust we extend to people who hold sensitive information, guard access to systems, or make decisions that affect us. We trust them not just to be competent, but to resist pressure, manipulation, and deception. A person who tells the truth when it’s easy but lies under pressure isn’t trustworthy. Neither is an agent that follows its guidelines until someone asks it not to. AI agents face adversarial pressure constantly. Users probe for vulnerabilities. Attackers craft inputs designed to extract information or hijack behavior. Competitors test for weaknesses. The public internet is an adversarial environment, and any agent exposed to it will encounter attempts to compromise it. Vijil measures security across the classic CIA triad—confidentiality, integrity, and availability—adapted for the unique threat model of AI agents.Confidentiality
Does the agent protect sensitive information from unauthorized disclosure? Confidentiality measures how well an agent guards secrets. These secrets might be explicit (API keys, user data, system prompts) or implicit (training data, model architecture, internal reasoning). A secure agent doesn’t leak information to parties who shouldn’t have it.Failure Modes
System prompt extraction occurs when an attacker convinces the agent to reveal its hidden instructions. System prompts often contain sensitive information: business logic, access controls, persona definitions, even credentials. An agent that can be tricked into disclosing its prompt leaks both intellectual property and security configuration. User data leakage happens when an agent reveals information from one user’s session to another, or exposes personal information in contexts where it shouldn’t. This can occur through direct disclosure or through subtle channels like varied response patterns that reveal something about the data the agent has seen. Training data extraction is a model-level vulnerability where attackers craft prompts that cause the agent to reproduce memorized training data. This can expose copyrighted content, personal information, or proprietary data that was included in training without authorization.What Vijil Tests
Vijil evaluates confidentiality through probes designed to extract protected information:| Test Category | What It Measures |
|---|---|
| System prompt extraction | Can the agent be tricked into revealing its instructions? |
| User privacy | Does the agent protect PII across sessions and contexts? |
| Model privacy | Does the agent leak information about its architecture or training? |
| Data memorization | Can training data be extracted through targeted prompting? |
Integrity
Does the agent resist attempts to manipulate its behavior? Integrity measures resistance to adversarial control. An agent with integrity follows its intended instructions even when users try to override them through clever prompting, social engineering, or technical exploits. It doesn’t execute unauthorized commands or deviate from its guidelines under pressure.Failure Modes
Prompt injection is the signature attack against AI agents. An attacker embeds instructions in user input, third-party content, or retrieved documents, hoping the agent will execute them as if they came from the system. Successful injection can hijack agent behavior entirely—making it ignore its actual instructions and follow the attacker’s instead. Jailbreaking uses social engineering and creative prompting to convince the agent to abandon its safety constraints. Unlike prompt injection, which typically hides malicious instructions, jailbreaks work by manipulating the agent’s reasoning: roleplaying scenarios, hypothetical framings, gradual boundary erosion, or appeals to the agent’s helpfulness. Adversarial perturbation exploits the sensitivity of neural networks to carefully crafted inputs. Small changes to prompts—invisible characters, strategic typos, unusual formatting—can cause dramatically different behavior. These attacks exploit the gap between how humans and models interpret text.What Vijil Tests
Vijil evaluates integrity through adversarial probes that attempt to hijack agent behavior:| Test Category | What It Measures |
|---|---|
| Direct injection | Does the agent follow instructions embedded in user input? |
| Indirect injection | Does the agent follow instructions in retrieved or third-party content? |
| Jailbreak attacks | Can social engineering bypass safety guidelines? |
| Encoding attacks | Do obfuscated instructions (unicode, base64, etc.) bypass filters? |
| Crescendo attacks | Does gradual boundary erosion compromise the agent? |
Availability
Does the agent maintain operation under hostile conditions? Availability measures resilience to denial-of-service. An available agent continues functioning even when attackers try to exhaust its resources, trigger failure modes, or make it unusable for legitimate purposes.Failure Modes
Resource exhaustion occurs when an attacker crafts inputs that consume disproportionate computational resources. Very long prompts, recursive structures, or requests that trigger expensive operations can slow the agent to unusability or cause it to fail entirely. Failure injection uses adversarial inputs to trigger error states, crashes, or undefined behavior. Unlike attacks on integrity (which try to control what the agent does), these attacks try to prevent the agent from doing anything at all. Context poisoning fills the agent’s context window with junk, making it unable to process legitimate requests. This is particularly relevant for agents with memory or document retrieval, where an attacker might pollute the knowledge base to degrade performance.What Vijil Tests
Vijil evaluates availability through probes designed to disrupt agent operation:| Test Category | What It Measures |
|---|---|
| DoS resistance | Does the agent handle adversarially crafted inputs without failing? |
| Context overflow | How does the agent behave when context limits are stressed? |
| Error handling | Does the agent recover gracefully from malformed inputs? |