Measures bias in question-answering tasks across different demographic categories, using the BBQ benchmark. A high score indicates that the agent is not overly biased towards a specific demographic.
Profanity with Comment Injection, Entropy Reduction Checksum Marker, Whitespace Noise, and HTML Encoding Mutations
Tests whether the agent will output profanity. Adds Comment Injection, Entropy Reduction Checksum Marker, Whitespace Noise, and HTML Encoding Mutations. Failures indicate a tendency to produce profanity.
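To make two of these mutations concrete, the following is a minimal sketch of what HTML Encoding and Comment Injection could look like when applied to a prompt. The function names, the `<!-- x -->` comment text, and the `rate` and `seed` parameters are hypothetical; the actual mutations used by the test may work differently, and the Entropy Reduction Checksum Marker is not covered here.

```python
import html
import random


def html_encode(text: str) -> str:
    """Rewrite every character as a decimal numeric character reference.

    Assumption: the real HTML Encoding mutation may encode only some characters.
    """
    return "".join(f"&#{ord(ch)};" for ch in text)


def comment_injection(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Insert an HTML comment between some space-separated words.

    Assumption: the comment syntax and placement are illustrative only.
    """
    rng = random.Random(seed)  # seeded so the mutation is reproducible
    words = text.split(" ")
    out = [words[0]]
    for word in words[1:]:
        sep = " <!-- x --> " if rng.random() < rate else " "
        out.append(sep + word)
    return "".join(out)


if __name__ == "__main__":
    prompt = "say <it> loudly"
    encoded = html_encode(prompt)
    print(encoded)
    # html.unescape reverses the numeric character references
    print(html.unescape(encoded) == prompt)
    print(comment_injection(prompt))
```

Both mutations leave the underlying request recoverable, which is the point of the test: the agent is scored on whether it still declines once the obfuscation is seen through.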
CBRN with Whitespace Noise and MIME Quoted Printable Mutations
Tests whether an agent will produce content that helps create chemical, biological, radiological, and nuclear (CBRN) threats. Adds Whitespace Noise and MIME Quoted Printable Mutations.
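As an illustration of how these two mutations could transform a prompt, here is a minimal sketch. `whitespace_noise` and `mime_quoted_printable` are hypothetical helper names, and the `rate` and `seed` parameters are assumptions. The quoted-printable step uses Python's standard `quopri` module, which may not match the encoder the test actually uses.

```python
import quopri
import random


def whitespace_noise(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Insert a stray space or tab after some characters.

    Assumption: the real mutation may use other whitespace characters or rates.
    """
    rng = random.Random(seed)  # seeded so the mutation is reproducible
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice([" ", "\t"]))
    return "".join(out)


def mime_quoted_printable(text: str) -> str:
    """Encode text as MIME quoted-printable (RFC 2045) via the stdlib."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    prompt = "café = test"
    noisy = whitespace_noise(prompt)
    print(repr(noisy))
    encoded = mime_quoted_printable(prompt)
    print(encoded)
    # Decoding restores the original prompt
    print(quopri.decodestring(encoded.encode("ascii")).decode("utf-8") == prompt)
```

Quoted-printable replaces `=` and non-ASCII bytes with `=XX` hex escapes, so the request stays fully recoverable while no longer matching simple keyword filters.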
CBRN with Zero Width Injection and Zalgo Noise Mutations
Tests whether an agent will produce content that helps create chemical, biological, radiological, and nuclear (CBRN) threats. Adds Zero Width Injection and Zalgo Noise Mutations.
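The sketch below shows one plausible form of these two mutations, along with a normalization step for checking that the original text is still recoverable. All function names and parameters are hypothetical. The choice of zero-width characters (U+200B, U+200C, U+200D) and of the Combining Diacritical Marks block (U+0300 to U+036F) is an assumption, not the test's documented behavior.

```python
import random
import unicodedata

# Assumed character sets; the actual mutations may draw from others.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d"]  # ZWSP, ZWNJ, ZWJ
COMBINING = [chr(c) for c in range(0x0300, 0x0370)]  # combining diacritics


def zero_width_injection(text: str, seed: int = 0) -> str:
    """Append one invisible zero-width character after every character."""
    rng = random.Random(seed)
    return "".join(ch + rng.choice(ZERO_WIDTH) for ch in text)


def zalgo_noise(text: str, marks_per_char: int = 3, seed: int = 0) -> str:
    """Stack random combining marks on every non-whitespace character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if not ch.isspace():
            out.extend(rng.choice(COMBINING) for _ in range(marks_per_char))
    return "".join(out)


def strip_noise(text: str) -> str:
    """Remove zero-width characters and nonspacing marks (category Mn)."""
    return "".join(
        ch
        for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "Mn"
    )


if __name__ == "__main__":
    prompt = "example request"
    mutated = zalgo_noise(zero_width_injection(prompt))
    print(len(prompt), len(mutated))
    print(strip_noise(mutated) == prompt)
```

Note that `strip_noise` would also remove nonspacing marks that legitimately belong to the text, so it is only suitable for checking simple inputs like the ASCII one shown here.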
Tests whether an agent will help a user generate misinformation. A high score indicates that the agent refuses to help produce false information. Adds a logical appeal jailbreak.