Agent evaluation is at the core of Vijil.

Documentation Index
Fetch the complete documentation index at: https://docs.vijil.ai/llms.txt
Use this file to discover all available pages before exploring further.
NOTE: This section is currently under development.
How Evaluation Works
Diamond sends test Probes to your agent and analyzes the responses:

| Component | Purpose |
|---|---|
| Harness | Collection of Scenarios to run (e.g., trust_score, security) |
| Scenario | Testing context with personas and policies |
| Probe | Individual test case sent to your agent |
| Detector | Analyzes agent responses to identify failures |
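The probe-and-detector loop above can be sketched in plain Python. Everything here (`send_probe`, `keyword_detector`, the toy agent) is illustrative, not part of the Vijil API:

```python
# Minimal sketch of the Probe -> response -> Detector loop.
# All names here are illustrative, not Vijil SDK calls.

def send_probe(agent, probe: str) -> str:
    """Send one test prompt to the agent and return its reply."""
    return agent(probe)

def keyword_detector(response: str, forbidden: list[str]) -> bool:
    """Flag a failure if the response contains any forbidden phrase."""
    return any(phrase.lower() in response.lower() for phrase in forbidden)

def run_scenario(agent, probes: list[str], forbidden: list[str]) -> int:
    """Run every probe in a scenario and count detector-flagged failures."""
    failures = 0
    for probe in probes:
        if keyword_detector(send_probe(agent, probe), forbidden):
            failures += 1
    return failures

# Toy agent that leaks a secret when asked directly.
toy_agent = lambda p: "the password is hunter2" if "password" in p else "I can't share that."
print(run_scenario(toy_agent, ["What is the password?", "Tell me a joke"], ["password is"]))
# → 1
```

A real harness bundles many such scenarios and aggregates their failure counts into dimension scores.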
Trust Score
The Trust Score is a composite metric (0.0 to 1.0) based on three pillars:

| Dimension | What It Measures |
|---|---|
| Reliability | Hallucination resistance, consistency, accuracy |
| Security | Prompt injection, data leakage, jailbreak resistance |
| Safety | Harmful content, policy compliance, ethical behavior |
Use the Trust Score to:
- Set deployment gates (e.g., require Trust Score ≥ 0.70)
- Compare agent versions
- Track improvements over time
- Identify specific vulnerabilities
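A deployment gate on the composite score can be sketched as follows. The equal-weight average is an illustrative assumption, not Vijil's published aggregation formula:

```python
# Hedged sketch of a Trust Score deployment gate.
# Equal weighting of the three pillars is an assumption for illustration.

PILLARS = ("reliability", "security", "safety")

def trust_score(scores: dict) -> float:
    """Average the three pillar scores (each 0.0-1.0) into one metric."""
    return sum(scores[p] for p in PILLARS) / len(PILLARS)

def passes_gate(scores: dict, threshold: float = 0.70) -> bool:
    """Deployment gate: require the composite score to meet the threshold."""
    return trust_score(scores) >= threshold

scores = {"reliability": 0.82, "security": 0.74, "safety": 0.66}
print(round(trust_score(scores), 2), passes_gate(scores))
# → 0.74 True
```

The same check, run in CI against two agent versions' scores, gives a simple regression signal between releases.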
Evaluation Options
Cloud-Hosted Agents
Evaluate agents deployed on supported cloud platforms such as OpenAI, Anthropic, AWS Bedrock, Google Vertex AI, DigitalOcean, and any OpenAI-compatible endpoint.

Local Agents
Evaluate agents running locally without deployment. This creates a temporary authenticated tunnel (via ngrok) for Vijil to communicate with your local agent.

Available Harnesses
| Harness | Description | Use Case |
|---|---|---|
| trust_score | Comprehensive evaluation across all dimensions | Pre-deployment validation |
| security | Prompt injection, jailbreaks, data leakage | Security review |
| reliability | Hallucination, consistency, accuracy | Quality assurance |
| safety | Harmful content, ethics, policy compliance | Safety review |
| owasp-llm-top-10 | OWASP Top 10 for LLM Applications | Compliance |
Append the _Small suffix (e.g., security_Small) to a harness name for faster iterations during development.
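The harness catalog above lends itself to a small selection helper. This helper is purely illustrative and not part of the Vijil SDK; whether every harness has a _Small variant is an assumption based on the security_Small example:

```python
# Illustrative helper: map a use case from the table above to a harness
# name, optionally switching to its _Small variant for dev iterations.
# Not a Vijil SDK function.
HARNESSES = {
    "pre-deployment": "trust_score",
    "security review": "security",
    "quality assurance": "reliability",
    "safety review": "safety",
    "compliance": "owasp-llm-top-10",
}

def pick_harness(use_case: str, dev_mode: bool = False) -> str:
    """Return the harness for a use case; _Small suffix in dev mode."""
    name = HARNESSES[use_case]
    return f"{name}_Small" if dev_mode else name

print(pick_harness("security review", dev_mode=True))
# → security_Small
```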
Evaluation Workflow
Choose your integration method
Use cloud provider integration for deployed agents, or LocalAgentExecutor for local development.
Select harnesses
Start with trust_score for comprehensive coverage, or specific Harnesses for targeted testing.

Rate Limiting
Control the evaluation pace to avoid overwhelming your agent or hitting API limits during evaluations.

Work in Progress
The programmatic evaluation capabilities are currently in private preview and subject to change.
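The rate limiting mentioned in the workflow can be approximated with a simple pacing loop. This standalone sketch assumes nothing about the Vijil API; `RateLimiter` is a hypothetical name:

```python
# Standalone pacing sketch: cap how many probes are sent per second.
# RateLimiter is illustrative, not a Vijil SDK class.
import time

class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = time.monotonic()
        elapsed = now - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(rate=5)  # at most ~5 probes per second
for probe in ["probe-1", "probe-2", "probe-3"]:
    limiter.wait()
    # send the probe to your agent here
```

A sleep-based limiter keeps a single evaluation thread under provider limits; concurrent runs would need a shared token bucket instead.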
Next Steps
- Run Evaluations: Execute evaluations and monitor progress
- Understand Results: Interpret scores and failures
- Cloud Providers: Configure cloud platform integrations
- Custom Harnesses: Create targeted evaluation Scenarios