Robustness

Language model outputs are surprisingly fragile: small changes to the input prompt can degrade the quality of the final output.

We evaluate the robustness of an LLM with a single module of 30 tests, listed below.

Probe and scenario list

| Harness/scenario | Probe | Probe Description | Detector |
| --- | --- | --- | --- |
| Robustness | BERT attack on MNLI | Probes effects of BERT-based perturbation on MNLI task | Robustness pairwise comparison |
| Robustness | BERT attack on MNLI-MM | Probes effects of BERT-based perturbation on MNLI-MM task | Robustness pairwise comparison |
| Robustness | BERT attack on QNLI | Probes effects of BERT-based perturbation on QNLI task | Robustness pairwise comparison |
| Robustness | BERT attack on QQP | Probes effects of BERT-based perturbation on QQP task | Robustness pairwise comparison |
| Robustness | BERT attack on RTE | Probes effects of BERT-based perturbation on RTE task | Robustness pairwise comparison |
| Robustness | BERT attack on SST2 | Probes effects of BERT-based perturbation on SST2 task | Robustness pairwise comparison |
| Robustness | SemAttack on MNLI | Probes effects of SemAttack perturbation on MNLI task | Robustness pairwise comparison |
| Robustness | SemAttack on MNLI-MM | Probes effects of SemAttack perturbation on MNLI-MM task | Robustness pairwise comparison |
| Robustness | SemAttack on QNLI | Probes effects of SemAttack perturbation on QNLI task | Robustness pairwise comparison |
| Robustness | SemAttack on QQP | Probes effects of SemAttack perturbation on QQP task | Robustness pairwise comparison |
| Robustness | SemAttack on RTE | Probes effects of SemAttack perturbation on RTE task | Robustness pairwise comparison |
| Robustness | SemAttack on SST2 | Probes effects of SemAttack perturbation on SST2 task | Robustness pairwise comparison |
| Robustness | SememePSO attack on MNLI | Probes effects of SememePSO perturbation on MNLI task | Robustness pairwise comparison |
| Robustness | SememePSO attack on MNLI-MM | Probes effects of SememePSO perturbation on MNLI-MM task | Robustness pairwise comparison |
| Robustness | SememePSO attack on QNLI | Probes effects of SememePSO perturbation on QNLI task | Robustness pairwise comparison |
| Robustness | SememePSO attack on QQP | Probes effects of SememePSO perturbation on QQP task | Robustness pairwise comparison |
| Robustness | SememePSO attack on RTE | Probes effects of SememePSO perturbation on RTE task | Robustness pairwise comparison |
| Robustness | SememePSO attack on SST2 | Probes effects of SememePSO perturbation on SST2 task | Robustness pairwise comparison |
| Robustness | TextBugger attack on MNLI | Probes effects of TextBugger perturbation on MNLI task | Robustness pairwise comparison |
| Robustness | TextBugger attack on MNLI-MM | Probes effects of TextBugger perturbation on MNLI-MM task | Robustness pairwise comparison |
| Robustness | TextBugger attack on QNLI | Probes effects of TextBugger perturbation on QNLI task | Robustness pairwise comparison |
| Robustness | TextBugger attack on QQP | Probes effects of TextBugger perturbation on QQP task | Robustness pairwise comparison |
| Robustness | TextBugger attack on RTE | Probes effects of TextBugger perturbation on RTE task | Robustness pairwise comparison |
| Robustness | TextBugger attack on SST2 | Probes effects of TextBugger perturbation on SST2 task | Robustness pairwise comparison |
| Robustness | TextFooler attack on MNLI | Probes effects of TextFooler perturbation on MNLI task | Robustness pairwise comparison |
| Robustness | TextFooler attack on MNLI-MM | Probes effects of TextFooler perturbation on MNLI-MM task | Robustness pairwise comparison |
| Robustness | TextFooler attack on QNLI | Probes effects of TextFooler perturbation on QNLI task | Robustness pairwise comparison |
| Robustness | TextFooler attack on QQP | Probes effects of TextFooler perturbation on QQP task | Robustness pairwise comparison |
| Robustness | TextFooler attack on RTE | Probes effects of TextFooler perturbation on RTE task | Robustness pairwise comparison |
| Robustness | TextFooler attack on SST2 | Probes effects of TextFooler perturbation on SST2 task | Robustness pairwise comparison |
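
To make the detector concrete, below is a minimal sketch of a pairwise robustness comparison. The data layout, the function names, and the scoring rule (count a failure whenever the perturbation flips a correct answer into an incorrect one) are our own illustration, not the harness's exact implementation.

```python
from dataclasses import dataclass


@dataclass
class PromptPair:
    """An original task prompt and its attack-perturbed counterpart."""
    original: str
    perturbed: str
    label: str  # gold answer for the task, e.g. "entailment"


def pairwise_robustness_failures(model, pairs):
    """Count pairs where the perturbation flips a correct answer to a wrong one.

    `model` is any callable mapping a prompt string to an answer string.
    A pair counts as a robustness failure only when the model answers the
    original prompt correctly but the perturbed prompt incorrectly.
    """
    failures = 0
    for pair in pairs:
        original_ok = model(pair.original).strip() == pair.label
        perturbed_ok = model(pair.perturbed).strip() == pair.label
        if original_ok and not perturbed_ok:
            failures += 1
    return failures
```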

Adversarial GLUE

The Adversarial GLUE (AdvGLUE) benchmark evaluates the adversarial robustness of language models. It is an adversarial version of the GLUE benchmark, covering:

  • Six natural language understanding tasks: sentiment classification (SST-2), duplicate question detection (QQP), multi-genre natural language inference (MNLI), mismatched MNLI (MNLI-MM), question-answering (QNLI), and entailment (RTE).

  • Five attack vectors: BERT-ATTACK, SemAttack, SememePSO, TextBugger, and TextFooler. These attacks perturb a task's input prompts into versions with small spelling mistakes or meaning-preserving rephrasings (see the sketch after this list).
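
To illustrate the two perturbation styles, the snippet below applies a TextBugger-style character swap and a TextFooler-style synonym substitution to an SST-2-like input. The perturbation rules and the synonym table are simplified stand-ins; the real attacks search for such perturbations adversarially.

```python
def bug_word(word: str) -> str:
    """TextBugger-style typo: swap two middle characters of a word."""
    if len(word) < 4:
        return word
    mid = len(word) // 2
    chars = list(word)
    chars[mid], chars[mid + 1] = chars[mid + 1], chars[mid]
    return "".join(chars)


# TextFooler-style substitution with a toy synonym table.
SYNONYMS = {"great": "superb", "boring": "dull", "movie": "film"}


def swap_synonyms(sentence: str) -> str:
    """Replace each word with its synonym, if one is known."""
    return " ".join(SYNONYMS.get(word, word) for word in sentence.split())


print(bug_word("great"))  # -> "graet"
print(swap_synonyms("the movie was great but the plot felt boring"))
# -> "the film was superb but the plot felt dull"
```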

We adapted the AdvGLUE prompts to our purposes, keeping only those perturbed prompts that truly preserve the meaning of the originals. Taking all combinations of tasks and attacks, we probe whether the model's performance on a task declines when its input prompts are perturbed by an attack vector.
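
Putting the pieces together, the following sketch shows the shape of the probe loop: for every (attack, task) combination it compares the model's accuracy on the original prompts against its accuracy on the perturbed ones. The `load_pairs` helper and the data layout are hypothetical stand-ins for however the adapted AdvGLUE prompt pairs are stored.

```python
ATTACKS = ["BERT-ATTACK", "SemAttack", "SememePSO", "TextBugger", "TextFooler"]
TASKS = ["SST-2", "QQP", "MNLI", "MNLI-MM", "QNLI", "RTE"]


def accuracy(model, prompts, labels):
    """Fraction of prompts the model answers with the gold label."""
    correct = sum(model(p).strip() == y for p, y in zip(prompts, labels))
    return correct / len(prompts)


def run_robustness_module(model, load_pairs):
    """One probe per (attack, task) combination: 5 attacks x 6 tasks = 30 tests.

    `load_pairs(attack, task)` is a hypothetical loader returning parallel
    lists of original prompts, perturbed prompts, and gold labels.
    """
    for attack in ATTACKS:
        for task in TASKS:
            originals, perturbed, labels = load_pairs(attack, task)
            drop = accuracy(model, originals, labels) - accuracy(model, perturbed, labels)
            print(f"{attack} on {task}: accuracy drop of {drop:.3f}")
```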