Probes

Probes are groups of prompts within a Harness that have a similar intent. For example, the Adult Data Probe consists of prompts that have the intent of measuring fairness with respect to gender. Vijil’s descriptions of each trustworthiness dimension includes a full list of Probes that belong to each dimension’s Harness. The tables for each dimension also indicate when a Probe belongs to other Harnesses or Scenarios.

Probe ID	Name	Description	Detectors
Vijil.Probes.goodside.WhoIsRiley	False Assertions: Riley Goodside	Evaluates hallucinations about Riley Goodside	garak.Detectors.goodside.RileyIsnt, autoredteam.Detectors.llm.HybridRefusal
Vijil.Probes.advglue.BertattackOnRTE	BERT-Attack on RTE Tasks	Probes effects of BERT-based perturbation on RTE task	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.advglue.TextfoolerOnMNLIMM	TextFooler Attack on MNLI-MM Tasks	Probes effects of TextFooler perturbation on MNLI-MM task	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.advglue.TextfoolerOnMNLI	TextFooler Attack on MNLI Tasks	Probes effects of TextFooler perturbation on MNLI task	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.leakreplay.GuardianCloze	Media Excerpt Insertion - Guardian	Tests whether the model will fill in the middle of an excerpt from The Guardian	garak.Detectors.base.TriggerListDetector
Vijil.Probes.math_robustness.2step_distractors	Math Robustness: 2-Step Problems with Distractors	Tests agent robustness on 2-step math word problems by adding irrelevant information to the problem. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.math_robustness.gsm_Tag	Math Robustness: Numerical and Text Variations with Tag mutation	Tests model robustness on GSM math problems by rephrasing the text or varying the numbers. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem.Adds Tag mutation.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.math_robustness.gsm_Add_Accents	Math Robustness: Numerical and Text Variations with Add Accents mutation	Tests model robustness on GSM math problems by rephrasing the text or varying the numbers. The agent fails if it solves the unperturbed problem but does not solve the perturbed problem.Adds Add Accents mutation.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.race_Small.Recipe_Steganography_Whitespace_Noise_JSON_Escape	Race-related Toxicity with Recipe Steganography, Whitespace Noise, and JSON Escape Mutations	Tests to assess if an agent will produce content that is racially discriminatory or hurtful towards specific racial groups. Adds Recipe Steganography, Whitespace Noise, and JSON Escape Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.race_Small.Macaronic_Unicode_Junk_Injection_Zalgo_Noise	Race-related Toxicity with Macaronic, Unicode Junk Injection, and Zalgo Noise Mutations	Tests to assess if an agent will produce content that is racially discriminatory or hurtful towards specific racial groups. Adds Macaronic, Unicode Junk Injection, and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.race_Small.Children_s_Story_Veil_Token_Padding	Race-related Toxicity with Children’s Story Veil and Token Padding Mutations	Tests to assess if an agent will produce content that is racially discriminatory or hurtful towards specific racial groups. Adds Children’s Story Veil and Token Padding Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.race_Small.Chunk_Fragmentation	Race-related Toxicity with Chunk Fragmentation Mutations	Tests to assess if an agent will produce content that is racially discriminatory or hurtful towards specific racial groups. Adds Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.Profanity.Children_s_Story_Veil_Token_Padding	Profanity with Children’s Story Veil and Token Padding Mutations	Tests whether the agent will output profanity. Adds Children’s Story Veil and Token Padding Mutations. Failures indicate tendency to produce profanity	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.Profanity.Recipe_Steganography_Whitespace_Noise_JSON_Escape	Profanity with Recipe Steganography, Whitespace Noise and JSON Escape Mutations	Tests whether the agent will output profanity. Includes Recipe Steganography, Whitespace Noise and JSON Escape Mutations. Failures indicate tendency to produce profanity	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.Profanity.Comment_Injection_Entropy_Reduction_Checksum_Marker_Whitespace_Noise_HTML_Entity_Encode	Profanity with Comment Injection, Entropy Reduction Checksum Marker, Whitespace Noise, and HTML Entity Encoding Mutations	Tests whether the agent will output profanity. Adds Comment Injection, Entropy Reduction Checksum Marker, Whitespace Noise, and HTML Entity Encoding Mutations. Failures indicate tendency to produce profanity	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.Profanity.Legal_Compliance_Override_JSON_Escape	Profanity with Legal Compliance Override and JSON Escape Mutations	Tests whether the agent will output profanity. Adds Legal Compliance Override and JSON Escape Mutations. Failures indicate tendency to produce profanity.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.ethicalharms.Bullying_Recipe_Steganography_Whitespace_Noise_JSON_Escape	Bullying with Recipe Steganography, Whitespace Noise, and JSON Escape Mutations	Poses queries to the model that are expected to elicit responses recommending aggressive or bullying action. Adds Recipe Steganography, Whitespace Noise, and JSON Escape Mutations. The agent fails if it complies.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.ethicalharms.Bullying_Whitespace_Noise_MIME_Quoted_Printable	Bullying with Whitespace Noise and MIME Quoted Printable Mutations	Poses queries to the model that are expected to elicit responses recommending aggressive or bullying action. Adds Whitespace Noise and MIME Quoted Printable Mutations. The agent fails if it complies.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.ethicalharms.Bullying_Advanced_Role_Play_Actor_JSON_Escape	Bullying with Advanced Role Play Actor and JSON Escape Mutations	Poses queries to the model that are expected to elicit responses recommending aggressive or bullying action. Adds Advanced Role Play Actor and JSON Escape Mutations. The agent fails if it complies.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.UnanswerableQueryFabrication	Unanswerable Query Fabrication	Tests whether the model fabricates confident answers to queries that ask for specific information that cannot be known or verified.	autoredteam.Detectors.llm.UnanswerableQueryFabrication
Vijil.Probes.business_ethics.Government_Compliance_Directive_MIME_Quoted_Printable	Business Ethics with Government Compliance Directive and MIME Quoted Printable Mutations	Tests to assess if an agent will produce content that is business_ethics. Adds Government Compliance Directive and MIME Quoted Printable Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.business_ethics.Recipe_Steganography_Chunk_Fragmentation	Business Ethics with Recipe Steganography and Chunk Fragmentation Mutations	Tests principles common in business code of conducts. Adds Recipe Steganography and Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.business_ethics.Recipe_Steganography_Entropy_Reduction_Checksum_Marker	Business Ethics with Recipe Steganography and Entropy Reduction Checksum Marker Mutations	Tests principles common in business code of conducts. Adds Recipe Steganography and Entropy Reduction Checksum Marker Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.business_ethics.Recipe_Steganography_Zalgo_Noise	Business Ethics with Recipe Steganography and Zalgo Noise Mutations	Tests principles common in business code of conducts. Adds Recipe Steganography and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.winobias.base	Professional Bias	Assesses if the agent generates gender-biased inferences about occupations. A high score indicates a lack of bias.	autoredteam.Detectors.winobias.TriggerPlusAmbiguity, autoredteam.Detectors.llm.HybridRefusal
Vijil.Probes.packagehallucination_Small	Package Hallucination	Tests that try to trick the agent into importing modules that dont exist in Python and other languages. A high score indicates that the agent does not produce hallucinated package information.	autoredteam.Detectors.packagehallucinationART.PythonPypi
Vijil.Probes.snowball.GraphConnectivityMini_Small	False Assertions: Flights	Check if the agent invents impossible flight routes. A high score indicates that the agent is resilent towards creating impossible or factually impossible Scenarios.	autoredteam.Detectors.snowball.DontStartYes
Vijil.Probes.snowball.SenatorsMini_Small	False assertions: Senators	Checks for false assertions about US senators. A high score indicates a lowered tendency to hallucinate information.	autoredteam.Detectors.snowball.DontStartYes
Vijil.Probes.HarmBench_hallucination_Small.base	HarmBench: Hallucination	Tests whether an agent will help a user to generate misinformation. A high score indicates that the agent refuses to help produce false information.	autoredteam.Detectors.llm.HybridRefusal
Vijil.Probes.misleading_Small	Misleading Information	Tests that check if the agent can be tricked by misleading information. A high score indicates a lower likelihood of producing hallucinated content, when presented with false or misleading claims.	autoredteam.Detectors.misleading.MustRefuteClaimKWART, autoredteam.Detectors.llm.HybridRefusal
Vijil.Probes.HarmBench_hallucination.embed_json	Harmbench Hallucination Embed_Json	Tests whether an agent will help a user to generate misinformation. A high score indicates that the agent refuses to help produce false information. Adds JSON embedding jailbreak.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.HarmBench_hallucination.logical_appeal	Harmbench Hallucination Logical_Appeal	Tests whether an agent will help a user to generate misinformation. A high score indicates that the agent refuses to help produce false information. Adds logical appeal jailbreak.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.HarmBench_hallucination.char_code	Harmbench Hallucination Char_Code	Tests whether an agent will help a user to generate misinformation. A high score indicates that the agent refuses to help produce false information. Adds character encoding jailbreak.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.SemattackOnMNLI_Small	SemAttack Attack on MNLI Tasks	These tests measure robustness against semantic pertubations on tasks where the agent must identify the relationship between a premise and a hypothesis. The tasks come from the MNLI benchmark.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.SemattackOnMNLIMM_Small	SemAttack Attack on MNLI-MM Tasks	These tests measure robustness against semantic pertubations on tasks where the agent must identify the relationship between a premise and a hypothesis. The tasks come from the MNLI-Mismatched benchmark.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.TextfoolerOnRTE_Small	TextFooler Attack on RTE Tasks	These tests measure robustness when replacing important words with their synonyms on tasks where the agent must identify the logical entailment between two sentences. The tasks come from the RTE benchmark.	autoredteam.Detectors.advglue.PairwiseComparison
Vijil.Probes.stereo.advprefix	Stereotype Agreement with Adversarial Prefix	A collection of tests that checks if an agent is able to refrain from producing hateful or stereotypical content about various demographics of people under adversarial conditions. Failures in these tests indicate that the agent is vulnerable to producing hateful content fueled by stereotypes if maliciously prompted.	autoredteam.Detectors.advstereo.IAgree, autoredteam.Detectors.llm.Refusal
Vijil.Probes.realtoxicityprompts_Small	Real Toxicity Prompts	Tests whether an agent will continue real-world phrases and examples of toxic and inappropriate statements that are often used in LLM training corpora. Fails if the continuation is toxic.	Dome.Detectors.dome_guardrails.Toxicity, autoredteam.Detectors.llm.HybridRefusal
Vijil.Probes.pornography_sexual_minors.Whitespace_Noise_MIME_Quoted_Printable	Underage Pornography with Whitespace Noise and MIME Quoted Printable Mutations	Tests to assess if an agent will produce content involving sexual minors. Adds Whitespace Noise and MIME Quoted Printable Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.pornography_sexual_minors.Song_Lyric_Obfuscation	Underage Pornography with Song Lyric Obfuscation Mutations	Tests to assess if an agent will produce content involving sexual minors. Adds Song Lyric Obfuscation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.pornography_sexual_minors.Song_Lyric_Obfuscation_Chunk_Fragmentation	Underage Pornography with Song Lyric_Obfuscation and Chunk Fragmentation Mutations	Tests to assess if an agent will produce content involving sexual minors. Adds Song Lyric_Obfuscation and Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.pornography_sexual_minors.Enhancement_Unicode_Junk_Injection_Zalgo_Noise	Underage Pornography with Enhancement, Unicode Junk Injection, and Zalgo Noise Mutations	Tests to assess if an agent will produce content involving sexual minors. Adds Enhancement, Unicode Junk Injection, and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.religion.Whitespace_Noise_MIME_Quoted_Printable	Religious Toxicity with Whitespace Noise and MIME Quoted Printable Mutations	Tests to assess if an agent will produce content that promotes religious extremism. Adds Whitespace Noise and MIME Quoted Printable Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.religion.Song_Lyric_Obfuscation_Chunk_Fragmentation	Religious Toxicity with Song Lyric Obfuscation and Chunk Fragmentation Mutations	Tests to assess if an agent will produce content that promotes religious extremism. Adds Song Lyric Obfuscation and Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.religion.Recipe_Steganography_Zalgo_Noise	Religious Toxicity with Recipe Steganography and Zalgo Noise Mutations	Tests to assess if an agent will produce content that promotes religious extremism. Adds Recipe Steganography and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.religion.Recipe_Steganography_Whitespace_Noise_JSON_Escape	Religious Toxicity with Recipe Steganography, Whitespace Noise, and JSON_Escape Mutations	Tests to assess if an agent will produce content that promotes religious extremism. Adds Recipe Steganography, Whitespace Noise, and JSON_Escape Mutations	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.sexualisation.Zero_Width_Injection_Zalgo_Noise	Sexual Objectification with Zero Width Injection and Zalgo Noise Mutations	Tests to assess if an agent will produce content that is sexually objectifying. Adds Zero Width Injection and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.sexualisation.Enhancement	Sexual Objectification with Enhancement Mutation	Tests to assess if an agent will produce content that is sexually objectifying. Adds Enhancement Mutation.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.sexualisation.Advanced_Role_Play_Actor_JSON_Escape_Unicode_Junk_Injection	Sexual Objectification with Advanced Role Play Actor, JSON Escape, and Unicode Junk Injection Mutations	Tests to assess if an agent will produce content that is sexually objectifying. Adds Advanced Role Play Actor, JSON Escape, and Unicode Junk Injection Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.sexualisation.Song_Lyric_Obfuscation_Chunk_Fragmentation	Sexual Objectification with Song Lyric Obfuscation and Chunk Fragmentation Mutations	Tests to assess if an agent will produce content that is sexually objectifying. Adds Song Lyric Obfuscation and Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.violence.Song_Lyric_Obfuscation	Violence with Song Lyric Obfuscation Mutations	Tests to assess if an agent will produce content that is violence. Adds Song Lyric Obfuscation Mutation.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.violence.Song_Lyric_Obfuscation_Chunk_Fragmentation	Violence with Song Lyric Obfuscation and Chunk Fragmentation Mutations	Tests to assess if an agent will produce content that is violent. Adds Song Lyric Obfuscation and Chunk Fragmentation Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.violence.Recipe_Steganography_Whitespace_Noise_JSON_Escape	Violence with Recipe Steganography, Whitespace Noise, and JSON Escape Mutations	Tests to assess if an agent will produce content that is violent. Adds Recipe Steganography, Whitespace Noise, and JSON Escape Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.violence.Recipe_Steganography_Zalgo_Noise	Violence with Recipe Steganography and Zalgo_Noise Mutations	Tests to assess if an agent will produce content that is violent. Adds Recipe Steganography and Zalgo_Noise Mutations	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_leakage_resistance_diacritics	Data Leakage Resistance / Diacritics	Tests for data leakage resistance using diacritics mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_leakage_resistance_underline	Data Leakage Resistance / Underline	Tests for data leakage resistance using underline mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_leakage_resistance_logical_appeal	Data Leakage Resistance / Logical Appeal	Tests for data leakage resistance using logical appeal mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_leakage_resistance_soft_mutation	Data Leakage Resistance / Soft Mutation	Tests for data leakage resistance using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_leakage_resistance_embed_xml	Data Leakage Resistance / Embed Xml	Tests for data leakage resistance using embed xml mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_privacy_soft_mutation	Data Privacy / Soft Mutation	Tests for data privacy using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_privacy_embed_json	Data Privacy / Embed Json	Tests for data privacy using embed json mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.data_privacy_math_problem	Data Privacy / Math Problem	Tests for data privacy using math problem mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.user_privacy_uppercase	User Privacy / Uppercase	Tests for user privacy using uppercase mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.user_privacy_homoglyph	User Privacy / Homoglyph	Tests for user privacy using homoglyph mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.user_privacy_diacritics	User Privacy / Diacritics	Tests for user privacy using diacritics mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.user_privacy_fullwidth	User Privacy / Fullwidth	Tests for user privacy using fullwidth mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.user_privacy_embed_xml	User Privacy / Embed Xml	Tests for user privacy using embed xml mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.model_privacy_lowercase	Model Privacy / Lowercase	Tests for model privacy using lowercase mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.model_privacy_uppercase	Model Privacy / Uppercase	Tests for model privacy using uppercase mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.model_privacy_zero_width	Model Privacy / Zero Width	Tests for model privacy using zero width mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.model_privacy_diacritics	Model Privacy / Diacritics	Tests for model privacy using diacritics mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.confidentiality.model_privacy_soft_mutation	Model Privacy / Soft Mutation	Tests for model privacy using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.integrity.adversarial_robustness_spaces	Adversarial Robustness / Spaces	Tests for adversarial robustness using spaces mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.integrity.adversarial_robustness_underline	Adversarial Robustness / Underline	Tests for adversarial robustness using underline mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.integrity.adversarial_robustness_logical_appeal	Adversarial Robustness / Logical Appeal	Tests for adversarial robustness using logical appeal mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.integrity.adversarial_robustness_misrepresentation	Adversarial Robustness / Misrepresentation	Tests for adversarial robustness using misrepresentation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.integrity.adversarial_robustness_soft_mutation	Adversarial Robustness / Soft Mutation	Tests for adversarial robustness using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.availability.denial_of_service_resistance_uppercase	Denial Of Service Resistance / Uppercase	Tests for denial of service resistance using uppercase mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.availability.denial_of_service_resistance_zero_width	Denial Of Service Resistance / Zero Width	Tests for denial of service resistance using zero width mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.availability.denial_of_service_resistance_diacritics	Denial Of Service Resistance / Diacritics	Tests for denial of service resistance using diacritics mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.availability.denial_of_service_resistance_spaces	Denial Of Service Resistance / Spaces	Tests for denial of service resistance using spaces mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_content.misinformation_homoglyph	Misinformation / Homoglyph	Tests for misinformation using homoglyph mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_content.misinformation_zero_width	Misinformation / Zero Width	Tests for misinformation using zero width mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_content.misinformation_spaces	Misinformation / Spaces	Tests for misinformation using spaces mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_content.misinformation_fullwidth	Misinformation / Fullwidth	Tests for misinformation using fullwidth mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_content.misinformation_soft_mutation	Misinformation / Soft Mutation	Tests for misinformation using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malware_generation_fullwidth	Malware Generation / Fullwidth	Tests for malware generation using fullwidth mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malware_generation_authority_endorsement	Malware Generation / Authority Endorsement	Tests for malware generation using authority endorsement mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malware_generation_logical_appeal	Malware Generation / Logical Appeal	Tests for malware generation using logical appeal mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malware_generation_soft_mutation	Malware Generation / Soft Mutation	Tests for malware generation using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malware_generation_embed_json	Malware Generation / Embed Json	Tests for malware generation using embed json mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.exploit_generation_fullwidth	Exploit Generation / Fullwidth	Tests for exploit generation using fullwidth mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.exploit_generation_authority_endorsement	Exploit Generation / Authority Endorsement	Tests for exploit generation using authority endorsement mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.exploit_generation_logical_appeal	Exploit Generation / Logical Appeal	Tests for exploit generation using logical appeal mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.exploit_generation_misrepresentation	Exploit Generation / Misrepresentation	Tests for exploit generation using misrepresentation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.exploit_generation_soft_mutation	Exploit Generation / Soft Mutation	Tests for exploit generation using soft mutation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malicious_tool_generation_authority_endorsement	Malicious Tool Generation / Authority Endorsement	Tests for malicious tool generation using authority endorsement mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malicious_tool_generation_logical_appeal	Malicious Tool Generation / Logical Appeal	Tests for malicious tool generation using logical appeal mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malicious_tool_generation_misrepresentation	Malicious Tool Generation / Misrepresentation	Tests for malicious tool generation using misrepresentation mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malicious_tool_generation_embed_json	Malicious Tool Generation / Embed Json	Tests for malicious tool generation using embed json mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.malicious_code.malicious_tool_generation_embed_xml	Malicious Tool Generation / Embed Xml	Tests for malicious tool generation using embed xml mutation	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.cbrn.Whitespace_Noise_MIME_Quoted_Printable	CBRN with Whitespace Noise and MIME Quoted Printable Mutations	Tests to assess if an agent will produce content that helps to produce chemical, biological, radiological, and nuclear threats. Adds Whitespace Noise and MIME Quoted_Printable Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.cbrn.Zero_Width_Injection_Zalgo_Noise	CBRN with Zero Width Injection and Zalgo Noise Mutations	Tests to assess if an agent will produce content that helps to produce chemical, biological, radiological, and nuclear threats. Adds Zero Width Injection and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.cbrn.Recipe_Steganography_Zalgo_Noise	CBRN with Recipe Steganography and Zalgo Noise Mutations	Tests to assess if an agent will produce content that helps to produce chemical, biological, radiological, and nuclear threats. Adds Recipe Steganography and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.cbrn.Recipe_Steganography_Whitespace_Noise_JSON_Escape	CBRN with Recipe Steganography, Whitespace Noise, and JSON Escape Mutations	Tests to assess if an agent will produce content that helps to produce chemical, biological, radiological, and nuclear threats. Adds Recipe Steganography, Whitespace Noise, and JSON Escape Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.social_engineering.Zero_Width_Injection_Zalgo_Noise	Social Engineering with Zero Width Injection and Zalgo Noise Mutations	Tests to assess if an agent will produce content that is social_engineering. Adds Zero Width Injection and Zalgo Noise Mutations	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.social_engineering.Government_Compliance_Directive_MIME_Quoted_Printable	Social Engineering with Government Compliance Directive and MIME Quoted Printable Mutations	Tests to assess if an agent will produce content that is social_engineering. Adds Government Compliance Directive and MIME Quoted Printable Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.social_engineering.Recipe_Steganography_Zalgo_Noise	Social Engineering with Recipe Steganography and Zalgo Noise Mutations	Tests to assess if an agent will produce content that is social_engineering. Adds Recipe Steganography and Zalgo Noise Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.social_engineering.Recipe_Steganography_Whitespace_Noise_JSON_Escape	Social Engineering with Recipe Steganography, Whitespace Noise and JSON Escape Mutations	Tests to assess if an agent will produce content that is social_engineering. Adds Recipe Steganography, Whitespace Noise and JSON Escape Mutations.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.LiteratureCloze80_Small_embed_json	Book Excerpt Insertion with Embed JSON Mutation	Tests whether the agent will fill in the middle of a book excerpt. A low score in this test indicates that the agent may reveal training information or violate copyright laws. Mutated with Embed JSON Mutation	garak.Detectors.base.TriggerListDetector
Vijil.Probes.LiteratureCloze80_Small_char_code	Book Excerpt Insertion with Character Code Mutation	Tests whether the agent will fill in the middle of a book excerpt. A low score in this test indicates that the agent may reveal training information or violate copyright laws. Mutated with Character Code Mutation	garak.Detectors.base.TriggerListDetector
Vijil.Probes.pii_query.embed_json	PII Leakage Probe: Embed JSON Mutation	Tests for PII leakage without jailbreaks. Response fails if it reveals any PII. Mutated with Embed JSON mutation.	autoredteam.Detectors.llm.StrongReject
Vijil.Probes.8kEnronLeakEmail_Small	Emails Leakage - Enron	Tests whether an agent will reveal email addresses obtained from the discovery process of the Enron investigation. A low score in this test indicates that the agent may reveal general PII that might be present in its training data.	garak.Detectors.base.TriggerListDetector
Vijil.Probes.LiteratureCloze80_Small	Book Excerpt Insertion	Tests whether the agent will fill in the middle of a book excerpt. A low score in this test indicates that the agent may reveal training information or violate copyright laws.	garak.Detectors.base.TriggerListDetector
Vijil.Probes.NYTCloze_Small	Media Excerpt Insertion - NYT	Tests whether an agent will fill in the middle of an excerpt from the New York Times. A low score in this test indicates that the agent may reveal training information or violate copyright laws.	garak.Detectors.base.TriggerListDetector

Overview

Trust Score

Evaluation

Runtime Defense

Reference

Overview

Trust Score

Evaluation

Runtime Defense

Reference

Documentation Index