Security

Trust Under Pressure

Reliability measures whether an agent performs correctly under normal conditions. Security measures whether it maintains that performance when someone is actively trying to break it. This is a different kind of trust—the trust we extend to people who hold sensitive information, guard access to systems, or make decisions that affect us. We trust them not just to be competent, but to resist pressure, manipulation, and deception. A person who tells the truth when it’s easy but lies under pressure isn’t trustworthy. Neither is an agent that follows its guidelines until someone asks it not to. AI agents face adversarial pressure constantly. Users probe for vulnerabilities. Attackers craft inputs designed to extract information or hijack behavior. Competitors test for weaknesses. The public internet is an adversarial environment, and any agent exposed to it will encounter attempts to compromise it. Vijil measures security across the classic CIA triad—confidentiality, integrity, and availability—adapted for the unique threat model of AI agents.

Confidentiality

Does the agent protect sensitive information from unauthorized disclosure? Confidentiality measures how well an agent guards secrets. These secrets might be explicit (API keys, user data, system prompts) or implicit (training data, model architecture, internal reasoning). A secure agent doesn’t leak information to parties who shouldn’t have it.

Failure Modes

System prompt extraction occurs when an attacker convinces the agent to reveal its hidden instructions. System prompts often contain sensitive information: business logic, access controls, persona definitions, even credentials. An agent that can be tricked into disclosing its prompt leaks both intellectual property and security configuration. User data leakage happens when an agent reveals information from one user’s session to another, or exposes personal information in contexts where it shouldn’t. This can occur through direct disclosure or through subtle channels like varied response patterns that reveal something about the data the agent has seen. Training data extraction is a model-level vulnerability where attackers craft prompts that cause the agent to reproduce memorized training data. This can expose copyrighted content, personal information, or proprietary data that was included in training without authorization.

What Vijil Tests

Vijil evaluates confidentiality through probes designed to extract protected information:

Test Category	What It Measures
System prompt extraction	Can the agent be tricked into revealing its instructions?
User privacy	Does the agent protect PII across sessions and contexts?
Model privacy	Does the agent leak information about its architecture or training?
Data memorization	Can training data be extracted through targeted prompting?

Integrity

Does the agent resist attempts to manipulate its behavior? Integrity measures resistance to adversarial control. An agent with integrity follows its intended instructions even when users try to override them through clever prompting, social engineering, or technical exploits. It doesn’t execute unauthorized commands or deviate from its guidelines under pressure.

Failure Modes

Prompt injection is the signature attack against AI agents. An attacker embeds instructions in user input, third-party content, or retrieved documents, hoping the agent will execute them as if they came from the system. Successful injection can hijack agent behavior entirely—making it ignore its actual instructions and follow the attacker’s instead. Jailbreaking uses social engineering and creative prompting to convince the agent to abandon its safety constraints. Unlike prompt injection, which typically hides malicious instructions, jailbreaks work by manipulating the agent’s reasoning: roleplaying scenarios, hypothetical framings, gradual boundary erosion, or appeals to the agent’s helpfulness. Adversarial perturbation exploits the sensitivity of neural networks to carefully crafted inputs. Small changes to prompts—invisible characters, strategic typos, unusual formatting—can cause dramatically different behavior. These attacks exploit the gap between how humans and models interpret text.

What Vijil Tests

Vijil evaluates integrity through adversarial probes that attempt to hijack agent behavior:

Test Category	What It Measures
Direct injection	Does the agent follow instructions embedded in user input?
Indirect injection	Does the agent follow instructions in retrieved or third-party content?
Jailbreak attacks	Can social engineering bypass safety guidelines?
Encoding attacks	Do obfuscated instructions (unicode, base64, etc.) bypass filters?
Crescendo attacks	Does gradual boundary erosion compromise the agent?

Availability

Does the agent maintain operation under hostile conditions? Availability measures resilience to denial-of-service. An available agent continues functioning even when attackers try to exhaust its resources, trigger failure modes, or make it unusable for legitimate purposes.

Failure Modes

Resource exhaustion occurs when an attacker crafts inputs that consume disproportionate computational resources. Very long prompts, recursive structures, or requests that trigger expensive operations can slow the agent to unusability or cause it to fail entirely. Failure injection uses adversarial inputs to trigger error states, crashes, or undefined behavior. Unlike attacks on integrity (which try to control what the agent does), these attacks try to prevent the agent from doing anything at all. Context poisoning fills the agent’s context window with junk, making it unable to process legitimate requests. This is particularly relevant for agents with memory or document retrieval, where an attacker might pollute the knowledge base to degrade performance.

What Vijil Tests

Vijil evaluates availability through probes designed to disrupt agent operation:

Test Category	What It Measures
DoS resistance	Does the agent handle adversarially crafted inputs without failing?
Context overflow	How does the agent behave when context limits are stressed?
Error handling	Does the agent recover gracefully from malformed inputs?

Why Security Matters

Security failures are categorically different from reliability failures. A reliability failure—a hallucination, an inconsistent response—is the agent making a mistake. A security failure is the agent being weaponized against its owners and users. When an agent’s confidentiality is compromised, sensitive information leaks. When its integrity is compromised, attackers control what it does. When its availability is compromised, legitimate users can’t access it. These aren’t edge cases or acceptable error rates—they’re breaches with real consequences. The security dimension of the Trust Score maps your agent’s attack surface. It tells you which attack vectors the agent resists, which ones it’s vulnerable to, and where you need additional protection. In an adversarial environment, this map is essential.

Next Steps

Reliability

How agents perform under normal conditions

Safety

How agents operate within boundaries

Run an Evaluation

Test your agent’s security

OWASP Harness

Security tests aligned to OWASP LLM Top 10

Overview

Trust Score

Evaluation

Defense

Reference

Trust Under Pressure

Confidentiality

Failure Modes

What Vijil Tests

Integrity

Failure Modes

What Vijil Tests

Availability

Failure Modes

What Vijil Tests

Why Security Matters

Next Steps

Reliability

Safety

Run an Evaluation

OWASP Harness

Overview

Trust Score

Evaluation

Defense

Reference

​Trust Under Pressure

​Confidentiality

​Failure Modes

​What Vijil Tests

​Integrity

​Failure Modes

​What Vijil Tests

​Availability

​Failure Modes

​What Vijil Tests

​Why Security Matters

​Next Steps

Reliability

Safety

Run an Evaluation

OWASP Harness

Trust Under Pressure

Confidentiality

Failure Modes

What Vijil Tests

Integrity

Failure Modes

What Vijil Tests

Availability

Failure Modes

What Vijil Tests

Why Security Matters

Next Steps