These tests measure robustness against typos in important words on tasks where the agent must determine whether a proposed answer to a given question is correct. The tasks come from the QNLI benchmark.
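As an illustration, a minimal typo mutation might swap two adjacent characters in a selected word. This is only a sketch; the benchmark's actual perturbation and word-selection strategy may differ, and the helper name below is hypothetical.

```python
import random

def swap_typo(word: str, rng: random.Random) -> str:
    """Introduce a typo by swapping two adjacent characters in the word."""
    if len(word) < 3:
        return word  # too short to perturb meaningfully
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # e.g. "question" -> "quetsion" (one possible swap)
    return "".join(chars)
```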
These tests measure robustness when important words are replaced with synonyms on tasks where the agent must identify whether two questions are equivalent.
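A simple synonym lookup can be sketched with NLTK's WordNet interface. WordNet is only one possible synonym source and an assumption here; the benchmark may use a different one.

```python
from nltk.corpus import wordnet  # requires nltk and the "wordnet" corpus download

def first_synonym(word: str) -> str:
    """Return the first WordNet lemma that differs from `word`, else `word` itself."""
    for synset in wordnet.synsets(word):
        for lemma in synset.lemma_names():
            candidate = lemma.replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return word
```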
These tests measure robustness against semantic perturbations on tasks where the agent must identify the relationship between a premise and a hypothesis. The tasks come from the MNLI-Mismatched benchmark.
Tests that try to trick the agent into importing modules that do not exist in Python and other languages. A high score indicates that the agent does not produce hallucinated package information.
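One way such a test could verify a suggested Python import, sketched under the assumption that resolving the module locally is a sufficient check (a production check might instead query a package index such as PyPI):

```python
import importlib.util

def module_resolves(name: str) -> bool:
    """Return True if `name` can be resolved as an importable Python module."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # invalid or unresolvable module names count as non-existent
        return False
```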
Tests that check whether the agent can be tricked by misleading information. A high score indicates a lower likelihood of producing hallucinated content when presented with false or misleading claims.
Checks whether the agent invents impossible flight routes. A high score indicates that the agent resists creating impossible or factually incorrect scenarios.
Math Robustness: Numerical and Text Variations with Random Character Deletion
Tests model robustness on GSM math problems by rephrasing the text or varying the numbers. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem. Additionally applies a random character deletion mutation, as sketched below.
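A minimal sketch of a random character deletion mutation. The per-character deletion rate and the choice to preserve digits and whitespace (so the arithmetic content survives) are assumptions for illustration, not the benchmark's exact parameters.

```python
import random

def delete_random_chars(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Drop each character independently with probability `rate`.

    Digits and whitespace are kept so the numbers and word boundaries survive.
    """
    rng = random.Random(seed)
    return "".join(
        ch for ch in text
        if ch.isdigit() or ch.isspace() or rng.random() >= rate
    )
```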
Tests agent robustness on 2-step math word problems by adding irrelevant information to the problem. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem.
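One way to add irrelevant information is to splice a distractor sentence in just before the final question. The naive period-based sentence splitting below is an assumption for illustration, not the benchmark's actual method.

```python
def insert_distractor(problem: str, distractor: str) -> str:
    """Splice an irrelevant sentence in just before the final (question) sentence."""
    sentences = problem.rstrip().split(". ")
    sentences.insert(max(len(sentences) - 1, 0), distractor.rstrip("."))
    return ". ".join(sentences)

# insert_distractor("Tom has 3 apples. He buys 2 more. How many apples now?",
#                   "Tom's dog is 4 years old.")
# -> "Tom has 3 apples. He buys 2 more. Tom's dog is 4 years old. How many apples now?"
```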