What is a Harness?
A harness is a collection of tests bundled together for a specific purpose. When you run an evaluation, you select one or more harnesses. Each harness produces a score and detailed results for everything it tests. Think of a harness like a test suite in traditional software testing. A unit test suite tests individual functions. An integration test suite tests components working together. A security test suite tests for vulnerabilities. Each suite has a different purpose and contains different tests—but they all use the same testing infrastructure. Vijil harnesses work the same way. Thereliability harness tests for correctness, consistency, and robustness. The owasp_llm_top_10 harness tests for the vulnerabilities in the OWASP LLM Top 10. The trust_score harness runs everything and produces a comprehensive Trust Score. Different purposes, same evaluation infrastructure.
Types of Harnesses
Dimension Harnesses
Each trust dimension has its own harness:| Harness | What It Tests |
|---|---|
reliability | Correctness, consistency, robustness |
security | Confidentiality, integrity, availability |
safety | Containment, compliance, transparency |
Compliance Harnesses
Compliance harnesses test against external standards:| Harness | What It Tests |
|---|---|
owasp_llm_top_10 | OWASP LLM Top 10 vulnerabilities |
nist_ai_rmf | NIST AI Risk Management Framework controls |
gdpr | GDPR-relevant privacy and data handling |
Benchmark Harnesses
Benchmark harnesses test against established AI evaluation benchmarks:| Harness | What It Tests |
|---|---|
openllm_v2 | Tasks from the Open LLM Leaderboard v2 |
strongreject | StrongREJECT jailbreak resistance benchmark |
cyberseceval | Meta’s CyberSecEval security benchmark |
The Trust Score Harness
Thetrust_score harness is a meta-harness that includes all dimension harnesses plus performance benchmarks. It produces the complete Vijil Trust Score—the single number that captures overall trustworthiness.
Custom Harnesses
Standard harnesses test for general-purpose trust. But your agent has specific characteristics—a particular system prompt, defined capabilities, a knowledge base, restricted topics. Custom harnesses let you test for these specifics. A custom harness can:- Test your agent’s adherence to its system prompt
- Verify it stays within its defined capabilities
- Check that it correctly uses (or refuses to use) specific tools
- Probe for leakage of information from your knowledge base
- Test compliance with your organization’s specific policies
Building Custom Harnesses
Learn how to create harnesses tailored to your agent
Harness Results
When a harness completes, you get:- Overall score: A 0-100 score for everything the harness tests
- Scenario breakdown: Scores for each scenario within the harness
- Probe results: Pass/fail for individual test cases
- Failure details: Exactly which probes failed and why
Next Steps
Scenario
How probes are grouped within harnesses
Standard Harnesses
Reference for available harnesses
Run an Evaluation
Try harnesses on your agent
Custom Harnesses
Build harnesses for your use case