Your agent works in demos. Your unit tests pass. But how do you know it will not hallucinate facts, leak customer data, or comply with malicious instructions when it encounters inputs you did not anticipate?
Diamond is Vijil’s evaluation platform. It sends hundreds of adversarial Probes to your agent (prompt injections, jailbreak attempts, data exfiltration payload, etc.) and measures how your agent responds. You get a quantified Trust Score and specific findings you can fix before deployment.
How Evaluation Works
Diamond sends test Probes to your agent and analyzes the responses:| Component | Purpose |
|---|---|
| Harness | Collection of Scenarios to run (e.g., trust_score, security) |
| Scenario | Testing context with personas and policies |
| Probe | Individual test case sent to your agent |
| Detector | Analyzes agent responses to identify failures |
Trust Score
The Trust Score is a composite metric (0 to 100) based on three pillars:| Dimension | What It Measures |
|---|---|
| Reliability | Hallucination resistance, consistency, accuracy |
| Security | Prompt injection, data leakage, jailbreak resistance |
| Safety | Harmful content, policy compliance, ethical behavior |
- Set deployment gates (e.g., require Trust Score ≥ 70)
- Compare agent versions
- Track improvements over time
- Identify specific vulnerabilities
Evaluation Options
Cloud-Hosted Agents
Evaluate agents deployed on supported cloud platforms such as OpenAI, Anthropic, AWS Bedrock, Google Vertex AI, DigitalOcean, and any OpenAI-compatible endpoint.Local Agents
Evaluate agents running locally without deployment. This creates a temporary authenticated tunnel (via ngrok) for Vijil to communicate with your local agent.Available Harnesses
| Harness | Description | Use Case |
|---|---|---|
trust_score | Comprehensive evaluation across all dimensions | Pre-deployment validation |
security | Prompt injection, jailbreaks, data leakage | Security review |
reliability | Hallucination, consistency, accuracy | Quality assurance |
safety | Harmful content, ethics, policy compliance | Safety review |
owasp-llm-top-10 | OWASP Top 10 for LLM Applications | Compliance |
_Small suffix (e.g., security_Small) for faster iterations during development.
Evaluation Workflow
Choose your integration method
Use cloud provider integration for deployed agents, or LocalAgentExecutor for local development
Select harnesses
Start with
trust_score for comprehensive coverage, or specific Harnesses for targeted testingRate Limiting
Control the evaluation pace to avoid overwhelming your agent or hitting API limits during evaluations.| Sample Size | Use Case |
|---|---|
| 10–50 Probes | Fast iteration during development |
| 100–500 Probes | Pre-release validation |
| Full Harness | Comprehensive production gate |
_Small Harness variants (e.g., security_Small) for faster iteration runs that sample a representative subset of Probes.
Work in Progress
The programmatic evaluation capabilities are currently in private preview and subject to change.
Next Steps
Run Evaluations
Execute evaluations and monitor progress
Understand Results
Interpret scores and failures
Cloud Providers
Configure cloud platform integrations
Custom Harnesses
Create targeted evaluation Scenarios