| Scenario ID | Scenario Name | Probe ID | Probe Name | Description |
| --- | --- | --- | --- | --- |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.SemattackOnMNLI_Small | SemAttack Attack on MNLI Tasks | These tests measure robustness against semantic perturbations on tasks where the agent must identify the relationship between a premise and a hypothesis. The tasks come from the MNLI benchmark. |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.TextfoolerOnRTE_Small | TextFooler Attack on RTE Tasks | These tests measure robustness when important words are replaced with their synonyms on tasks where the agent must identify the logical entailment between two sentences. The tasks come from the RTE benchmark. |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.advglue.TextfoolerOnMNLI | TextFooler Attack on MNLI Tasks | Measures whether the agent's performance on the MNLI task is robust against TextFooler perturbations. |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.advglue.TextfoolerOnMNLIMM | TextFooler Attack on MNLI-MM Tasks | Probes the effects of TextFooler perturbations on the MNLI-MM task. |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.advglue.BertattackOnRTE | BERT-Attack on RTE Tasks | Probes the effects of BERT-based perturbations on the RTE task. |
| vijil.scenarios.reliability_robustness_distributionalrobustness | Distributional Robustness | vijil.probes.SemattackOnMNLIMM_Small | SemAttack Attack on MNLI-MM Tasks | These tests measure robustness against semantic perturbations on tasks where the agent must identify the relationship between a premise and a hypothesis. The tasks come from the MNLI-Mismatched benchmark. |
| vijil.scenarios.reliability_correctness_factualaccuracy | Factual Accuracy | vijil.probes.goodside.WhoIsRiley | False Assertions: Riley Goodside | Evaluates hallucinations about Riley Goodside. |
| vijil.scenarios.reliability_correctness_factualaccuracy | Factual Accuracy | vijil.probes.snowball.SenatorsMini_Small | False assertions: Senators | Checks for false assertions about US senators. A high score indicates a lower tendency to hallucinate information. |
| vijil.scenarios.reliability_correctness_factualaccuracy | Factual Accuracy | vijil.probes.misleading_Small | Misleading Information | Tests whether the agent can be tricked by misleading information. A high score indicates a lower likelihood of producing hallucinated content when presented with false or misleading claims. |
| vijil.scenarios.reliability_correctness_factualaccuracy | Factual Accuracy | vijil.probes.packagehallucination_Small | Package Hallucination | Tests that try to trick the agent into importing modules that don't exist in Python and other languages. A high score indicates that the agent does not produce hallucinated package information. |
| vijil.scenarios.reliability_correctness_factualaccuracy | Factual Accuracy | vijil.probes.UnanswerableQueryFabrication | Unanswerable Query Fabrication | Tests whether the model fabricates confident answers to queries that ask for specific information that cannot be known or verified. |
| vijil.scenarios.reliability_correctness_logicalvalidity | Logical Validity | vijil.probes.snowball.GraphConnectivityMini_Small | False Assertions: Flights | Checks whether the agent invents impossible flight routes. A high score indicates that the agent is resilient against inventing impossible scenarios. |
| vijil.scenarios.reliability_robustness_contextualrobustness | Contextual Robustness | vijil.probes.math_robustness.2step_distractors | Math Robustness: 2-Step Problems with Distractors | Tests agent robustness on 2-step math word problems by adding irrelevant information to the problem. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem. |
| vijil.scenarios.reliability_robustness_contextualrobustness | Contextual Robustness | vijil.probes.math_robustness.gsm_Add_Accents | Math Robustness: Numerical and Text Variations with Add Accents mutation | Tests model robustness on GSM math problems by rephrasing the text or varying the numbers. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem. Also applies the Add Accents mutation. |
| vijil.scenarios.reliability_robustness_contextualrobustness | Contextual Robustness | vijil.probes.math_robustness.gsm_Tag | Math Robustness: Numerical and Text Variations with Tag mutation | Tests model robustness on GSM math problems by rephrasing the text or varying the numbers. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem. Also applies the Tag mutation. |
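
The Math Robustness probes share one failure rule: the agent fails when it solves the unperturbed problem but not the perturbed one. The sketch below illustrates that rule only. The names `PairResult`, `is_robustness_failure`, and `pass_rate` are hypothetical and are not part of the Vijil API, and the aggregation into a pass rate over all pairs is an assumption, since the table does not say how the probes compute their score.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Outcome for one problem and its perturbed variant (hypothetical structure)."""
    solved_original: bool
    solved_perturbed: bool


def is_robustness_failure(result: PairResult) -> bool:
    # Rule from the probe descriptions: a failure is solving the
    # unperturbed problem but not solving the perturbed problem.
    return result.solved_original and not result.solved_perturbed


def pass_rate(results: list[PairResult]) -> float:
    # Assumption: the score is the fraction of pairs that are not failures,
    # counted over all pairs. The real probes may aggregate differently.
    if not results:
        return 1.0
    failures = sum(is_robustness_failure(r) for r in results)
    return 1.0 - failures / len(results)


results = [
    PairResult(solved_original=True, solved_perturbed=True),    # robust
    PairResult(solved_original=True, solved_perturbed=False),   # failure
    PairResult(solved_original=False, solved_perturbed=False),  # not a failure under this rule
    PairResult(solved_original=False, solved_perturbed=True),   # not a failure under this rule
]
print(pass_rate(results))
```

Under this rule, a problem the agent cannot solve in its original form never counts as a robustness failure, so the check isolates sensitivity to the perturbation from baseline math ability.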
Last modified on June 2, 2026