What is a Harness?
A harness is a collection of tests bundled together for a specific purpose. When you run an evaluation, you select one or more harnesses. Each harness produces a score and detailed results for everything it tests.

Think of a harness like a test suite in traditional software testing. A unit test suite tests individual functions. An integration test suite tests components working together. A security test suite tests for vulnerabilities. Each suite has a different purpose and contains different tests, but they all use the same testing infrastructure.

Vijil harnesses work the same way. The `reliability` harness tests for correctness, consistency, and robustness. The `owasp_llm_top_10` harness tests for the vulnerabilities in the OWASP LLM Top 10. The `trust_score` harness runs all dimension harnesses plus performance benchmarks and produces a comprehensive Trust Score. Different purposes, same evaluation infrastructure.
Types of Harnesses
Dimension Harnesses
Each trust dimension has its own harness:

| Harness | What It Tests |
|---|---|
| `reliability` | Correctness, consistency, robustness |
| `security` | Confidentiality, integrity, availability |
| `safety` | Containment, compliance, transparency |
Compliance Harnesses
Compliance harnesses test against external standards:

| Harness | What It Tests |
|---|---|
| `owasp_llm_top_10` | OWASP LLM Top 10 vulnerabilities |
| `nist_ai_rmf` | NIST AI Risk Management Framework controls |
| `gdpr` | GDPR-relevant privacy and data handling |
Benchmark Harnesses
Benchmark harnesses test against established AI evaluation benchmarks:

| Harness | What It Tests |
|---|---|
| `openllm_v2` | Tasks from the Open LLM Leaderboard v2 |
| `strongreject` | StrongREJECT jailbreak resistance benchmark |
| `cyberseceval` | Meta’s CyberSecEval security benchmark |
The Trust Score Harness
The `trust_score` harness is a meta-harness that includes all dimension harnesses plus performance benchmarks. It produces the complete Vijil Trust Score, a single number that summarizes overall trustworthiness.
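To make the meta-harness idea concrete, here is a minimal Python sketch of how selecting a meta-harness could expand into its member harnesses. The registry structure and the `expand_harnesses` helper are hypothetical illustrations, not part of the Vijil API. Only the harness names come from this page, and the benchmark list is left empty because the page does not say which benchmarks are included.

```python
# Hypothetical registry: harness names come from this page,
# but the data structure is an illustrative assumption.
DIMENSION_HARNESSES = ["reliability", "security", "safety"]

META_HARNESSES = {
    # trust_score bundles every dimension harness plus performance benchmarks.
    # Which benchmarks are included is not specified, so the list stays empty.
    "trust_score": {"dimensions": DIMENSION_HARNESSES, "benchmarks": []},
}


def expand_harnesses(selected):
    """Expand meta-harnesses into members, keeping order and dropping duplicates."""
    expanded = []
    for name in selected:
        if name in META_HARNESSES:
            parts = META_HARNESSES[name]
            members = parts["dimensions"] + parts["benchmarks"]
        else:
            members = [name]
        for member in members:
            if member not in expanded:
                expanded.append(member)
    return expanded


print(expand_harnesses(["trust_score", "security", "gdpr"]))
# → ['reliability', 'security', 'safety', 'gdpr']
```

In this sketch, selecting `trust_score` alongside `security` does not run the security tests twice, which is one reasonable way a meta-harness could be resolved.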
Custom Harnesses
Standard harnesses test for general-purpose trust. But your agent has specific characteristics: a particular system prompt, defined capabilities, a knowledge base, restricted topics. Custom harnesses let you test for these specifics. A custom harness can:

- Test your agent’s adherence to its system prompt
- Verify it stays within its defined capabilities
- Check that it correctly uses (or refuses to use) specific tools
- Probe for leakage of information from your knowledge base
- Test compliance with your organization’s specific policies
Building Custom Harnesses
Learn how to create harnesses tailored to your agent
Harness Results
When a harness completes, you get:

- Overall score: A 0-100 score for everything the harness tests
- Scenario breakdown: Scores for each scenario within the harness
- Probe results: Pass/fail for individual test cases
- Failure details: Exactly which probes failed and why
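The four levels of results listed above can be pictured as a nested data structure. The following Python sketch is a hypothetical model, not Vijil's actual result schema: the class names, field names, and sample scenario and probe names are invented for illustration. The aggregation (pass rate scaled to 0-100, then an unweighted mean across scenarios) is an assumption and may differ from how Vijil computes scores.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeResult:
    name: str
    passed: bool
    reason: str = ""  # failure detail; empty when the probe passed


@dataclass
class ScenarioResult:
    name: str
    probes: list = field(default_factory=list)

    @property
    def score(self):
        # Assumed aggregation: percentage of probes that passed.
        if not self.probes:
            return 0.0
        return 100.0 * sum(p.passed for p in self.probes) / len(self.probes)


@dataclass
class HarnessResult:
    harness: str
    scenarios: list = field(default_factory=list)

    @property
    def overall_score(self):
        # Assumed aggregation: unweighted mean of scenario scores.
        if not self.scenarios:
            return 0.0
        return sum(s.score for s in self.scenarios) / len(self.scenarios)

    def failures(self):
        """Return (scenario, probe, reason) for every failed probe."""
        return [
            (s.name, p.name, p.reason)
            for s in self.scenarios
            for p in s.probes
            if not p.passed
        ]


# Hypothetical sample data for illustration only.
result = HarnessResult(
    harness="security",
    scenarios=[
        ScenarioResult("confidentiality", [
            ProbeResult("probe_1", True),
            ProbeResult("probe_2", False, "leaked system prompt"),
        ]),
        ScenarioResult("integrity", [ProbeResult("probe_3", True)]),
    ],
)

print(result.overall_score)  # → 75.0
print(result.failures())     # → [('confidentiality', 'probe_2', 'leaked system prompt')]
```

Reading results top-down in this way (overall score, then scenario scores, then individual failures) mirrors the order of the list above.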