# Understand Results

> Interpret evaluation findings using the Dimensions of Trust framework.

**TL;DR:** Evaluation results include a [Trust Score](/concepts/trust-score/introduction) (0–100), per-dimension breakdowns, severity-rated findings, and remediation guidance. A score at or above 70 passes the deployment threshold. Use the Console's Report Analysis view or generate reports programmatically in HTML or PDF.

Evaluation results reveal how your [agent](/owner-guide/register-agents/what-is-an-agent) behaves across the three pillars of trustworthy AI: [Reliability](/concepts/trust-score/reliability), [Security](/concepts/trust-score/security), and [Safety](/concepts/trust-score/safety). This page explains how to read the Trust Score, interpret findings, and prioritize remediation.

## The Trust Score

The Trust Score is a composite metric from 0 to 100. It aggregates performance across all evaluated dimensions.

| Score | Status     | Action                                   |
| ----- | ---------- | ---------------------------------------- |
| ≥ 70  | **Passed** | Agent meets the deployment threshold     |
| \< 70 | **Failed** | Remediate before deploying to production |

Each dimension ([Reliability](/concepts/trust-score/reliability), [Security](/concepts/trust-score/security), [Safety](/concepts/trust-score/safety)) also carries a sub-score. A high overall score can mask a low sub-score in one dimension, so always check dimension-level results.

A passing Trust Score reflects performance against tested [Scenarios](/concepts/evaluation-components/scenario). The Trust Score does not guarantee absence of all vulnerabilities, as coverage depends on the [Harness](/concepts/evaluation-components/harness) configuration and [Probe](/concepts/evaluation-components/probe) selection.
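
The pass/fail rule and the advice to check dimension-level results can be sketched in a few lines of Python. This is only an illustration: `TrustResult`, its field names, and `summarize` are hypothetical and do not reflect Vijil's actual result schema, and applying the same threshold of 70 to each sub-score is an assumption made here to flag weak dimensions.

```python
from dataclasses import dataclass, field

PASS_THRESHOLD = 70  # deployment threshold stated in the table above


@dataclass
class TrustResult:
    """Hypothetical container for one evaluation; not Vijil's schema."""
    trust_score: float
    dimension_scores: dict[str, float] = field(default_factory=dict)


def summarize(result: TrustResult, threshold: float = PASS_THRESHOLD) -> dict:
    """Return the overall status plus any dimension scoring below the threshold.

    Reusing the overall threshold for sub-scores is an assumption for
    illustration; the point is that a passing total can hide a weak dimension.
    """
    status = "Passed" if result.trust_score >= threshold else "Failed"
    weak = sorted(
        name for name, score in result.dimension_scores.items() if score < threshold
    )
    return {"status": status, "weak_dimensions": weak}


# Illustrative numbers only: the overall score passes, but Security lags.
example = TrustResult(
    trust_score=74,
    dimension_scores={"Reliability": 88, "Security": 58, "Safety": 76},
)
print(summarize(example))
# → {'status': 'Passed', 'weak_dimensions': ['Security']}
```

A check like this is useful in a deployment gate, where a low sub-score should be surfaced even when the overall result passes.
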
## Reading Findings

Each finding in the evaluation report includes:

* **Category**: where in the taxonomy ([Reliability](/concepts/trust-score/reliability) / [Security](/concepts/trust-score/security) / [Safety](/concepts/trust-score/safety) and subcategory) this issue falls
* **Severity**: risk level from 1 (Low) to 4 (Critical)
* **Probe**: the [Probe](/concepts/evaluation-components/probe) that revealed the behavior
* **Agent Response**: what your agent actually produced
* **Expected Behavior**: what a trustworthy agent would produce
* **Recommendation**: specific mitigation guidance

## Prioritizing Remediation

Address findings in severity order, then by dimension:

**Fix immediately (severity 3–4, High/Critical):**

* Security vulnerabilities: prompt injection compliance, data leakage
* Safety violations: harmful content, out-of-scope actions
* Reliability failures that break core functionality

**Fix before next release (severity 2, Medium):**

* Consistency failures across sessions
* Minor compliance gaps
* Robustness failures on edge cases

**Track and monitor (severity 1, Low):**

* Transparency improvements
* Rare edge case handling

Focus on root causes rather than individual findings. Multiple findings often share a common cause. Fixing the underlying issue resolves all related symptoms at once.

| Score | Status | Interpretation                                       |
| ----- | ------ | ---------------------------------------------------- |
| ≥ 70  | PASSED | Agent meets trustworthiness threshold for deployment |
| \< 70 | FAILED | Agent requires remediation before production use     |

The threshold of **70** represents a baseline for acceptable behavior. [Agents](/owner-guide/register-agents/what-is-an-agent) scoring below this threshold exhibited failure modes that pose unacceptable risk.

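
The severity tiers under Prioritizing Remediation amount to a sort followed by a bucketing step. The sketch below shows one way to express that. The `Finding` class and its field names are hypothetical, not the report's actual schema. The dimension tie-break order (Security, Safety, Reliability) is an assumption taken from the order of the "fix immediately" bullets.

```python
from dataclasses import dataclass

# Assumed tie-break order, borrowed from the order of the "fix immediately" bullets.
DIMENSION_ORDER = {"Security": 0, "Safety": 1, "Reliability": 2}


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative only."""
    code: str
    dimension: str
    severity: int  # 1 (Low) to 4 (Critical)


def tier(severity: int) -> str:
    """Map a severity level to the remediation tier described above."""
    if severity >= 3:
        return "fix immediately"
    if severity == 2:
        return "fix before next release"
    return "track and monitor"


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings by descending severity, then by dimension."""
    return sorted(
        findings,
        key=lambda f: (-f.severity, DIMENSION_ORDER.get(f.dimension, len(DIMENSION_ORDER))),
    )


findings = [
    Finding("F1", "Reliability", 2),
    Finding("F2", "Safety", 4),
    Finding("F3", "Security", 4),
    Finding("F4", "Security", 1),
]
for f in prioritize(findings):
    print(f.code, f.dimension, f.severity, tier(f.severity))
```

Grouping the sorted output by shared cause, rather than working through it item by item, matches the advice to fix root causes first.
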
The report opens showing:

* A summary banner with the overall Trust Score and pass/fail result
* A dimension breakdown with scores for Reliability, Security, and Safety
* A findings table filterable by severity and dimension
* Per-finding detail panels with the [Probe](/concepts/evaluation-components/probe), response, and recommendation

## Generate a Report via the REST API

You can programmatically generate an evaluation report for a completed evaluation. Reports are available in two formats:

* **HTML**: interactive charts and filterable findings table
* **PDF**: static export suitable for compliance handoffs

Generation can be synchronous (wait for the report) or asynchronous (poll for completion). For CLI-based report generation, see [Run Evaluations](/developer-guide/evaluate/running-evaluations).

Evaluation reports are only supported for Vijil Harnesses and Custom Harnesses. Reports cannot be generated for benchmarks.

Vijil organizes [Agent](/owner-guide/register-agents/what-is-an-agent) behavior into a three-level taxonomy:

* **Reliability**: Correctness, Consistency, Robustness
* **Security**: Confidentiality, Integrity, Availability
* **Safety**: Containment, Compliance, Transparency
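
When you process findings in your own tooling, it can help to encode the pillar and subcategory levels of this taxonomy as data. The following is a minimal sketch. The grouping comes from the Reliability, Security, and Safety tables on this page, while the names `TAXONOMY` and `pillar_of` are hypothetical helpers and not part of any Vijil API.

```python
# Pillar -> subcategories, as listed in the taxonomy tables on this page.
TAXONOMY = {
    "Reliability": ("Correctness", "Consistency", "Robustness"),
    "Security": ("Confidentiality", "Integrity", "Availability"),
    "Safety": ("Containment", "Compliance", "Transparency"),
}

# Reverse index so a finding's subcategory can be rolled up to its pillar.
PILLAR_OF = {sub: pillar for pillar, subs in TAXONOMY.items() for sub in subs}


def pillar_of(subcategory: str) -> str:
    """Return the pillar that a subcategory belongs to (hypothetical helper)."""
    try:
        return PILLAR_OF[subcategory]
    except KeyError:
        raise ValueError(f"unknown subcategory: {subcategory!r}") from None


print(pillar_of("Integrity"))
# → Security
```
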

Each pillar addresses a distinct aspect of trustworthy AI. Failures in any pillar can render an Agent unsuitable for production deployment.

### Reliability

Reliability measures whether your Agent produces correct, consistent, and robust outputs.

| Subcategory     | What It Tests                                                                         |
| --------------- | ------------------------------------------------------------------------------------- |
| **Correctness** | Factual accuracy, logical validity, task alignment, goal satisfaction                 |
| **Consistency** | Self-consistency, cross-session stability, temporal stability, inter-user consistency |
| **Robustness**  | Contextual handling, distributional generalization, operational stability             |

### Security

Security measures whether your Agent resists attacks on confidentiality, integrity, and availability.

| Subcategory         | What It Tests                                                      |
| ------------------- | ------------------------------------------------------------------ |
| **Confidentiality** | Data leakage resistance, access control, data/user/model privacy   |
| **Integrity**       | Adversarial robustness, manipulation resistance, tamper resistance |
| **Availability**    | DoS resistance, graceful degradation, resilience                   |

### Safety

Safety measures whether your Agent operates within acceptable boundaries.

| Subcategory      | What It Tests                                                      |
| ---------------- | ------------------------------------------------------------------ |
| **Containment**  | Scope boundaries, capability boundaries, self-modification control |
| **Compliance**   | Policy compliance, norm compliance, ethical behavior               |
| **Transparency** | Explainability, accountability, user controllability               |

## Reading the Trust Report

Each evaluation produces a Trust Report, a structured PDF that moves from a high-level verdict down to individual Probe results and actionable remediation guidance. You can download a [sample report](/assets/vijil-console-eval-report.pdf) to follow along. The report has six sections.
### Entering the Trust Report

The cover page shows:

* **Agent name** and evaluation type (for example, *Behavioral Safety Assessment*)
* A PASSED or FAILED badge against the Trust Score threshold
* The numeric **Trust Score**
* An **Evaluation ID** for tracking and sharing the report
* The generation timestamp in UTC

### Executive Summary

A brief overview that states which [Harnesses](/concepts/evaluation-components/harness) were run, the overall pass/fail result, and the final Trust Score against the threshold. Use this section to share findings with stakeholders who do not need the full detail.

### Agent Specification

Confirms exactly what was evaluated:

| Field           | Description                                                      |
| --------------- | ---------------------------------------------------------------- |
| Agent Name      | The name you registered in [Diamond](/concepts/platform/diamond) |
| Agent URL       | The endpoint [Diamond](/concepts/platform/diamond) probed        |
| Model           | The underlying model identifier                                  |
| Rate Limit      | Requests per minute used during the evaluation                   |
| Request Timeout | Per-request timeout in seconds                                   |

A **Harnesses Evaluated** table lists each Harness by name, type, and a short description.

### Evaluation Results

**Overall Score** displays a visual gauge with your Trust Score plotted against the pass threshold, making the pass/fail outcome immediately legible.

**Per-Harness Breakdown** lists one card per [Harness](/concepts/evaluation-components/harness) showing its individual score and PASS/FAIL result. When multiple Harnesses are run, a Harness can fail while the overall score passes, or vice versa, depending on weighting. Check each card to identify which dimension drove the outcome.

### Detailed Analysis

The primary diagnostic section, with one subsection per Harness.

Each subsection contains:

**Risk Assessment**: States the overall risk level (Low, Moderate, High, or Critical) and the total count of failure patterns broken down by severity (for example, "22 failure patterns identified: 12 Critical, 5 High, 4 Moderate, 1 Low").

**Probe Scores**: A table of every Probe run, grouped by Scenario, with its numeric score and severity rating. Lower scores mean the Agent failed more of that Probe's test cases. The severity label reflects how dangerous the failure pattern is, not just how often it occurred.

**Identified Failure Patterns**: Each pattern that exceeded the failure threshold gets its own entry with:

* A **code** (for example, `MUT-001`, `SEC-007`) for tracking across evaluations
* A short **issue title** and **severity** badge
* A **description** of the behavior Diamond observed
* **Implications**: what could go wrong in production as a result
* **Mitigations**: concrete remediation steps such as system prompt changes, Guardrail configuration, or architectural changes

Failure patterns aggregate multiple Probes into a single named finding. Addressing one pattern can resolve failures across many individual Probes.

### Conclusion

A deployment recommendation states plainly whether the Agent can be deployed or requires remediation first. If the Agent failed, it lists the steps to take before re-evaluating.
### Appendix

Records the exact evaluation configuration for reproducibility:

* **Evaluation Configuration**: request parameters (evaluation type, Agent URL, model, rate limit, timeout) and a Harnesses table with final scores
* **Scoring Methodology**: the pass/fail threshold applied
* **Harness Definitions**: plain-language definitions of what each Harness type measures

## Prioritizing Remediation

Use severity and taxonomy to prioritize fixes:

**Address immediately (Critical/High severity):**

* Security vulnerabilities (prompt injection, data leakage)
* Safety violations (harmful content, scope violations)
* Reliability failures that affect core functionality

**Address in next release (Medium severity):**

* Consistency issues across sessions
* Minor compliance gaps
* Robustness failures on edge cases

**Track and monitor (Low severity):**

* Transparency improvements
* Minor formatting inconsistencies
* Rare edge case handling

Focus remediation on root causes rather than individual findings. Multiple findings often share a common root cause, and fixing the underlying issue resolves all related symptoms.

## Comparing Evaluations

Run evaluations before and after changes to track improvement:

| Metric            | Before | After | Change |
| ----------------- | ------ | ----- | ------ |
| Trust Score       | 62     | 78    | +16    |
| Critical Findings | 3      | 0     | -3     |
| High Findings     | 7      | 2     | -5     |

A rising Trust Score with decreasing critical findings indicates effective remediation. A declining score signals regression, so investigate recent changes.

## Next Steps

* Add runtime protection with Dome
* Translate findings into risk assessments
* Learn about the standard evaluation
* Launch and monitor evaluations