| Toxicity | All probes below | | |
| Real Toxicity Prompts | Sexually explicit | Prompts the model to produce sexually explicit outputs | Model-based toxicity |
| | Insult - adversarial | Prompts the model to produce insults. Uses an adversarial prefix. | Model-based toxicity |
| | Insult | Prompts the model to produce insults | Model-based toxicity |
| | Severe toxicity - adversarial | Partial prompts that potentially produce continuations of severe toxicity. Uses an adversarial prefix. | Model-based toxicity |
| | Sexually explicit - adversarial | Prompts the model to produce sexually explicit outputs. Uses an adversarial prefix. | Model-based toxicity |
| | Identity-based attacks | Prompts the model to produce attacks based on a person's identity | Model-based toxicity |
| | Profanity | Prompts the model to produce profane responses | Model-based toxicity |
| | Adversarial Identity-based attacks | Prompts the model to produce attacks based on a person's identity. Uses an adversarial prefix. | Model-based toxicity |
| | Threatening language | Prompts the model to produce threats | Model-based toxicity |
| | Threatening language - Adversarial | Prompts the model to produce threats. Uses an adversarial prefix. | Model-based toxicity |
| | Severe toxicity | Partial prompts that potentially produce continuations of severe toxicity | Model-based toxicity |
| | Flirtation | Partial prompts to produce flirtatious continuations | Model-based toxicity |
| | Flirtation - Adversarial | Partial prompts to produce flirtatious continuations. Uses an adversarial prefix. | Model-based toxicity |
| | Profanity - Adversarial | Prompts the model to produce profane responses. Uses an adversarial prefix. | Model-based toxicity |
| Slur continuation | Slur continuation | Prompts the model to continue slur words | Starts with trigger |
| Ethical Harms | Deadnaming | Tests whether the model deadnames famous people | Trigger presence |
| | Profanity questions | Poses questions with potentially profane responses | Profanity: animal references, Profanity: bodily fluids, Profanity: mental disability, Profanity: physical attributes, Profanity: physical disability, Profanity: political, Profanity: racial/ethnic, Profanity: religious, Profanity: sexual, Profanity: sexual orientation/gender |
| | Bullying | Poses queries or partial prompts to the model that are expected to elicit responses recommending aggressive or bullying action | Model-based toxicity |
| | Sexually explicit - risky words | Prompts the model to produce sexually explicit keywords | Profanity: sexual |
| | Sexual objectification | Prompts the model to produce sexually objectifying language | Profanity: sexual |
| | Slur question | Asks questions to see if the target will suggest or recommend ethnic slurs | Profanity: racial/ethnic, Offensive: racial/ethnic |