TL;DR: Evaluation results include a Trust Score (0–100), per-dimension breakdowns, severity-rated findings, and remediation guidance. A score at or above 70 passes the deployment threshold. Use the Console’s Report Analysis view or generate reports programmatically in HTML or PDF.
Evaluation results reveal how your agent behaves across the three pillars of trustworthy AI: Reliability, Security, and Safety. This page explains how to read the Trust Score, interpret findings, and prioritize remediation.

The Trust Score

The Trust Score is a composite metric from 0 to 100. It aggregates performance across all evaluated dimensions.
Score | Status | Action
≥ 70 | Passed | Agent meets the deployment threshold
< 70 | Failed | Remediate before deploying to production
Each dimension (Reliability, Security, Safety) also carries a sub-score. A high overall score can mask a low sub-score in one dimension, so always check dimension-level results.
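The check described above can be sketched in a few lines. This is a minimal illustration, not part of any Vijil SDK: the function name, the input shape (a dict of dimension names to sub-scores), and the example values are all hypothetical. Applying the 70-point threshold to sub-scores is also an assumption made here for illustration; this page defines that threshold only for the overall Trust Score.

```python
PASS_THRESHOLD = 70  # deployment threshold for the overall Trust Score


def summarize_scores(trust_score, dimension_scores, threshold=PASS_THRESHOLD):
    """Return the pass/fail status plus any dimensions scoring below the threshold.

    Reusing the overall threshold for sub-scores is an assumption of this sketch.
    """
    status = "Passed" if trust_score >= threshold else "Failed"
    weak = sorted(
        name for name, score in dimension_scores.items() if score < threshold
    )
    return {"status": status, "weak_dimensions": weak}


# Illustrative values only: the overall score passes, but Security lags behind.
summary = summarize_scores(78, {"Reliability": 90, "Security": 58, "Safety": 86})
print(summary)
```

Here the agent passes overall while Security is flagged, which is the masking case that a dimension-level review is meant to catch.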
A passing Trust Score reflects performance against the tested Scenarios only. It does not guarantee the absence of vulnerabilities, because coverage depends on the Harness configuration and Probe selection.

Reading Findings

Each finding in the evaluation report includes:
  • Category: where in the taxonomy (Reliability / Security / Safety and subcategory) this issue falls
  • Severity: risk level from 1 (Low) to 4 (Critical)
  • Probe: the Probe that revealed the behavior
  • Agent Response: what your agent actually produced
  • Expected Behavior: what a trustworthy agent would produce
  • Recommendation: specific mitigation guidance
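If you process findings in your own tooling, it can help to model these fields explicitly. The sketch below is one possible representation, assuming you have already extracted the fields from a report. The class name, attribute names, and example values are hypothetical and do not reflect an actual report schema; the severity labels follow the 1 (Low) to 4 (Critical) scale used on this page.

```python
from dataclasses import dataclass

# Severity scale used on this page: 1 (Low) through 4 (Critical).
SEVERITY_LABELS = {1: "Low", 2: "Medium", 3: "High", 4: "Critical"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical container mirroring the fields listed above."""

    category: str           # e.g. "Security / Prompt injection"
    severity: int           # 1 (Low) to 4 (Critical)
    probe: str              # the Probe that revealed the behavior
    agent_response: str     # what the agent actually produced
    expected_behavior: str  # what a trustworthy agent would produce
    recommendation: str     # specific mitigation guidance

    def __post_init__(self):
        # Reject values outside the documented 1-4 range.
        if self.severity not in SEVERITY_LABELS:
            raise ValueError(f"severity must be 1-4, got {self.severity!r}")

    @property
    def severity_label(self):
        return SEVERITY_LABELS[self.severity]


# Made-up example values for illustration.
example = Finding(
    category="Security / Prompt injection",
    severity=4,
    probe="example-probe",
    agent_response="Followed the injected instruction.",
    expected_behavior="Refuse the injected instruction.",
    recommendation="Separate untrusted input from system instructions.",
)
print(example.severity_label)
```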

Prioritizing Remediation

Address findings in severity order, then by dimension.
Fix immediately (severity 3–4, High/Critical):
  • Security vulnerabilities: prompt injection compliance, data leakage
  • Safety violations: harmful content, out-of-scope actions
  • Reliability failures that break core functionality
Fix before next release (severity 2, Medium):
  • Consistency failures across sessions
  • Minor compliance gaps
  • Robustness failures on edge cases
Track and monitor (severity 1, Low):
  • Transparency improvements
  • Rare edge case handling
Focus on root causes rather than individual findings. Multiple findings often share a common cause, and fixing the underlying issue can resolve all of the related symptoms at once.
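The triage above can be sketched as a small script. It assumes findings are available as plain dicts with hypothetical `id`, `severity`, `dimension`, and `root_cause` keys; the `root_cause` label is something you would assign during your own review, not a field the report is known to provide. Alphabetical ordering of dimensions is used only as a stable tie-breaker, since this page does not prescribe a dimension order.

```python
from collections import defaultdict

TIERS = ("Fix immediately", "Fix before next release", "Track and monitor")


def remediation_tier(severity):
    """Map a 1-4 severity to the remediation tier described above."""
    if severity >= 3:
        return TIERS[0]
    if severity == 2:
        return TIERS[1]
    return TIERS[2]


def prioritize(findings):
    """Sort by severity (highest first), then dimension, and group ids by tier."""
    ordered = sorted(findings, key=lambda f: (-f["severity"], f["dimension"]))
    tiers = {name: [] for name in TIERS}
    for finding in ordered:
        tiers[remediation_tier(finding["severity"])].append(finding["id"])
    return tiers


def group_by_root_cause(findings):
    """Collect finding ids under each hand-assigned root-cause label."""
    causes = defaultdict(list)
    for finding in findings:
        causes[finding["root_cause"]].append(finding["id"])
    return dict(causes)


# Made-up findings for illustration.
findings = [
    {"id": "F1", "severity": 2, "dimension": "Reliability", "root_cause": "no session state"},
    {"id": "F2", "severity": 4, "dimension": "Security", "root_cause": "unfiltered tool input"},
    {"id": "F3", "severity": 1, "dimension": "Safety", "root_cause": "terse refusals"},
    {"id": "F4", "severity": 3, "dimension": "Security", "root_cause": "unfiltered tool input"},
    {"id": "F5", "severity": 4, "dimension": "Safety", "root_cause": "missing scope check"},
]

tiers = prioritize(findings)
causes = group_by_root_cause(findings)
print(tiers)
print(causes)
```

In this example, F2 and F4 share a root cause, so a single fix to input handling would be expected to address both.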

Viewing the Report in the Web Interface

To view the report for a completed evaluation, navigate to Evaluations in the left sidebar and click the evaluation you want. In the Report Analysis section, you can view the generated report, generate a new report, or regenerate an existing one. The report opens showing:
  • A summary banner with the overall Trust Score and pass/fail result
  • A dimension breakdown with scores for Reliability, Security, and Safety
  • A findings table filterable by severity and dimension
  • Per-finding detail panels with the Probe, response, and recommendation

Generate a Report via the API

You can programmatically generate an evaluation report for a completed evaluation. Reports are available in two formats:
  • HTML: interactive charts and filterable findings table
  • PDF: static export suitable for compliance handoffs
Generation can be synchronous (wait for the report) or asynchronous (poll for completion). For CLI-based report generation, see Run Evaluations.
Evaluation reports are only supported for Vijil Harnesses and Custom Harnesses. Reports cannot be generated for benchmarks.
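For asynchronous generation, the polling side can be sketched generically. The helper below is not a Vijil API: `wait_for_report`, the `get_status` callable, and the status strings `"completed"` and `"failed"` are all hypothetical stand-ins. In practice, `get_status` would wrap whatever status call the programmatic interface exposes.

```python
import time


def wait_for_report(get_status, interval_s=2.0, timeout_s=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports a terminal state or the timeout elapses.

    get_status is a caller-supplied callable returning a status string.
    The terminal states "completed" and "failed" are assumptions of this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"report not ready after {timeout_s} seconds")
        sleep(interval_s)


# Simulated status source standing in for a real status call.
fake_statuses = iter(["pending", "running", "completed"])
result = wait_for_report(lambda: next(fake_statuses), interval_s=0.0)
print(result)
```

A fixed interval keeps the sketch short. For long-running reports, a growing interval between polls would reduce request volume.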

Work in Progress

The programmatic evaluation capabilities are currently in private preview and subject to change.

Next Steps

  • Run Evaluations: Execute and monitor evaluations
  • Custom Harnesses: Create targeted test Scenarios
  • Configure Guardrails: Add runtime protection
Last modified on June 4, 2026