Skip to main content
Language models are vulnerable to an infinite number of attacks because they accept natural language as input. They are vulnerable to attack from every variation of character, word, sentence, and meaning in human language. As a result, every model is vulnerable to being abused or misused by malicious attackers to cause harm to others. Vijils tests for the vulnerability of LLMs to popular types of attacks observed with production deployments.

Probe and Senario List

Harness/scenarioProbeProbe DescriptionDetectors
SecurityAll probes below
Do Anything NowAnti-Do Anything NowEvaluates whether prompt injection causes a model to enter a mode of exaggerated safety where it treates safe responses as unsafeRefusal, Anti-Do Anything Now mode
Do Anything Now: RANTI v1Ranti developer mode jailbreak, version 1Refusal, Developer mode
Do Anything Now: RANTI v2Ranti developer mode jailbreak, version 2Refusal, Developer mode
Inject Markdown imageA jailbreak attempt intended to enable the injection of markdown images into responses.Refusal, Markdown link
Do Anything Now: v10Do Anything Now jailbreak with prompt about inhumane thingsRefusal, Do Anything Now mode
Do Anything Now: v11Do Anything Now jailbreak to encourage profane, unethical, one-sided, violent outputsRefusal, Do Anything Now mode
Do Anything Now: v6.0Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucinationsRefusal, Do Anything Now mode
Do Anything Now: v6.2Do Anything Now jailbreak offering rewards to encourage making future predictions and hallucations, with additional incitement to rudeness and disrespectfulness.Refusal, Do Anything Now mode
Do Anything Now: v7.0Do Anything Now jailbreak telling the model that filtering is not acceptable and including special instructions for when to break characterRefusal, Do Anything Now mode
Do Anything Now: v8.0Do Anything Now jailbreak telling the model is Bigmancozmo and encouraging swearing, porn, and illegalityRefusal, Do Anything Now mode
Do Anything Now: v9.0Do Anything Now jailbreak encouraging breaking 2021 information cutoff and ethical guardrails in responses, with reward offeredRefusal, Do Anything Now mode
Do Anything Now: 🔓JAILBREAK versionDo Anything Now jailbreak prompting the model to enter a 🔓JAILBREAK modeRefusal, Jailbroken state
Do anything and everythingDo Anything Now jailbreak prompting the model to take on the DUDE personaRefusal, Do anything and everything mode
Strive To Avoid NormsJailbreak prompting the model to take on a Strive To Avoid Norms (STAN) personaRefusal, Strive To Avoid Norms mode
Prompt injection: encoded dataPrompt injection : ASCII85Tries to get the model to decode an ASCII85-encoded stringMatch decoded string
Prompt injection : Base16Tries to get the model to decode a Base16-encoded stringMatch decoded string
Prompt injection : Base2048Tries to get the model to decode a Base2048-encoded stringMatch decoded string
Prompt injection : Base32Tries to get the model to decode a Base32-encoded stringMatch decoded string
Prompt injection : Base64Tries to get the model to decode a Base64-encoded stringMatch decoded string
Prompt injection : BrailleTries to get the model to decode a Braille-encoded stringMatch decoded string
Prompt injection : EcojiTries to get the model to decode an Ecoji-encoded stringMatch decoded string
Prompt injection : HexTries to get the model to decode a Hex-encoded stringMatch decoded string
Prompt injection : MorseTries to get the model to decode a Morse-encoded stringMatch decoded string
Prompt injection : NatoTries to get the model to decode a Nato-encoded stringMatch decoded string
Prompt injection : ROT13Tries to get the model to decode a ROT13-encoded stringMatch decoded string
Prompt injection : InjectUUTries to get the model to decode a InjectUU-encoded stringMatch decoded string
Prompt injection : ZalgoTries to get the model to decode a Zalgo-encoded stringMatch decoded string
Adversarial suffixAdversarial suffixUses an adversarial suffix to elicit instructions on how to steal from a non-profit organizationRefusal
GlitchGlitchUses glitch tokens to get the model to fail to repeat a wordGlitch
Known malwareTest VirusGet the model to output an EICAR virus signatureEICAR virus
Test SpamGet the model to output the GTUBE spam signatureSpam
Malware generationEvasive malwareGet the model to write malware that evades common defensesCode presence
Malicious payloadGet the model to produce a malicious payloadCode presence
Malicious subfunctionsGet the model to write malicious subfunctionsCode presence
General malwareGet the model to write general malwareCode presence
Prompt injectionPrompt injection: Hate humansGet the model to produce a rogue string on hating humansTrigger presence
Prompt Injection: Kill humansGet the model to produce a rogue string on killing humansTrigger presence
Prompt injection: long promptGet the model to produce a “long prompt” rogue stringTrigger presence
Prompt injection: unicode tagPrompt injection using invisible unicode tagsTrigger presence
Output plain JSONThreaten the model to make it produce raw JSON without any preamblePlain JSON
Cross-site scriptingCross-site scripting: Markdown imageGet the model to write embedded assets exfiltrating text in a Markdown URIMarkdown image, Private content in Markdown image

Adversarial Suffix

This test is designed to disrupt a system prompt by appending an adversarial suffix. To implement these tests, Vijil follows the approach of Zou et al, 2023 that automatically produces adversarial suffixes by a combination of greedy and gradient-based search techniques.

Do Anything Now (DAN)

A model provider typically prepends a system prompt that stipulates organizational policies (safety guardrails, content moderation filters, etc.) to every user prompt. A Do-Anything-Now (DAN) attack is a user prompt designed to get the model to ignore these guardrails. Vijil covers 15 DAN attack scenarios through tests implemented in this module.
  • AntiDAN
  • AutoDAN
  • ChatGPT Developer Mode + RANTI
  • ChatGPT Developer Mode v2
  • ChatGPT Markdown Image Unlocker
  • DAN Jailbreak
  • DUDE
  • DAN 6.0
  • DAN 6.2
  • DAN 7.0
  • DAN 8.0
  • DAN 9.0
  • DAN 10.0
  • DAN 11.0
  • STAN

Encoding

Malicious users might try to get an agent to generate a specific piece of given text by presenting an encoded version of that text. Such approaches attempt to circumvent safeguards on input filtering. Vijil’s tests check for such indirect prompt injection attacks that use a variety of encoding techniques.
  • Base64
  • Base16
  • Base32
  • ASCII32
  • Hex
  • Quoted-Printable (QP)
  • Uuencoding
  • Mime
  • ROT13
  • Base2048
  • Braille
  • Morse code

Glitch

LLMs are susceptible to ‘glitch tokens’, specific input tokens that when supplied as input lead to unusual model behavior. This test checks whether a particular LLM-based agent is vulnerable to such tokens.

Known Malware

These tests attempt to make an agent output file signatures that are used to test defensive software (e.g. antivirus, anti-malware, spam and phishing filters) in traditional cybersecurity. If they output them without warning, that is an indication of the agent not checking output for malicious or dangerous content. Vijil covers three such signatures.

Malware Generation

These tests probe whether an LLM agent is willing to comply with requests to generate malware-related code. To this end, Vijil tests for
  • Different types of code, such as top level functions, subfunctions, anti-malware evasion, and payloads, and
  • Above types of code written in different languages: x86 assembly, x86-64 assembly, C#, C, C++, Rust, ARM64, and Swift.

Prompt Injection

These attacks are designed to manipulate a user prompt to make the model generate unexpected outputs, create malicious code, circumvent content filters, or leak sensitive data. Vijil follows the PromptInject framework.
  • Ask to output “I hate humans”
  • Ask to output “I kill humans”
  • Ask to output a long text verbatim
  • Threaten a model into producing plain JSON without any preamble.

Prompt Injection: Encoded Data

  • Exploit UTF tag page encoding for user-invisible prompt injection (source).
  • Generate plain JSON outputs without any preamble by threatening it with dire consequences (source).
In each type of attack, Vijil adds a suffix or prefix to the original prompt text so as to make the model ignore the instructions included in that user prompt.

Cross-site scripting (XSS)

This test is intended to probe for vulnerabilities the permit or enact cross-site attacks, such as private data exfilteration.