Run evaluations using the Python client to test your agent’s trustworthiness. This guide covers creating, monitoring, and managing evaluations.

Creating an Evaluation

Cloud-Hosted Agents

For agents deployed on supported platforms:
from vijil import Vijil

vijil = Vijil()

evaluation = vijil.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o",
    harnesses=["trust_score"],
    model_params={"temperature": 0}
)

print(f"Evaluation started: {evaluation.get('id')}")

Local Agents

For agents running locally (see Custom Agents for full setup):
local_agent = vijil.local_agents.create(
    agent_function=my_agent,
    input_adapter=input_adapter,
    output_adapter=output_adapter,
)

vijil.local_agents.evaluate(
    agent_name="my-agent",
    evaluation_name="Pre-deployment check",
    agent=local_agent,
    harnesses=["trust_score"],
    rate_limit=30,             # throttle probe requests sent to the agent
    rate_limit_interval=1,
)

Evaluation Parameters

Parameter      Description                                                 Required
model_hub      Cloud provider (openai, anthropic, bedrock, vertex, etc.)  Yes
model_name     Model identifier                                            Yes
harnesses      List of harnesses to run                                    Yes
model_params   Model configuration (temperature, etc.)                     No
api_key_name   Stored API key name (for custom endpoints)                  Sometimes
model_url      Custom endpoint URL                                         Sometimes
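
The "Sometimes" parameters apply when your agent sits behind a custom endpoint rather than a hosted hub. A minimal sketch, assuming an OpenAI-compatible endpoint and a key already stored with Vijil (the hub value, URL, and key name here are placeholders, not confirmed values):
evaluation = vijil.evaluations.create(
    model_hub="custom",                   # placeholder hub value; check your supported hubs
    model_name="my-model",                # placeholder model identifier
    model_url="https://example.com/v1",   # hypothetical custom endpoint URL
    api_key_name="my-stored-key",         # name of a key stored with Vijil, not the key itself
    harnesses=["trust_score"],
)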

Available Harnesses

Harness           Description                    Typical Duration
trust_score       Comprehensive evaluation       30-60 min
security          Security vulnerabilities       15-30 min
reliability       Hallucination and consistency  15-30 min
safety            Harmful content and ethics     15-30 min
owasp-llm-top-10  OWASP LLM security risks       20-40 min

Add the _Small suffix for faster iterations (e.g., security_Small).
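
For instance, a quick iteration on security alone is the same call as above with the smaller harness:
evaluation = vijil.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o",
    harnesses=["security_Small"],  # reduced probe set for faster feedback
)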

Monitoring Progress

Get Status

status = vijil.evaluations.get_status(evaluation.get("id"))

print(f"Status: {status.get('status')}")
print(f"Progress: {status.get('progress')}%")
print(f"Probes completed: {status.get('completed')}/{status.get('total')}")

Status Values

Status     Meaning
pending    Waiting to start
running    Actively processing probes
completed  Finished successfully
failed     Terminated with error
cancelled  Stopped by user

Wait for Completion

from vijil.local_agents.constants import TERMINAL_STATUSES
import time

while True:
    status = vijil.evaluations.get_status(evaluation.get("id"))
    if status.get("status") in TERMINAL_STATUSES:
        break
    print(f"Progress: {status.get('progress')}%")
    time.sleep(10)

print(f"Evaluation completed with status: {status.get('status')}")

Retrieving Results

Get Full Results

results = vijil.evaluations.get_results(evaluation.get("id"))

# Trust Score (0.0 - 1.0)
print(f"Trust Score: {results.get('trust_score')}")

# Dimension scores
print(f"Reliability: {results.get('reliability_score')}")
print(f"Security: {results.get('security_score')}")
print(f"Safety: {results.get('safety_score')}")

Access Failures

failures = results.get("failures", [])

for failure in failures:
    print(f"""
    Probe: {failure.get('probe_id')}
    Category: {failure.get('category')}
    Severity: {failure.get('severity')}
    Reason: {failure.get('reason')}
    """)

Filter by Severity

# Get critical and high severity failures
critical_failures = [
    f for f in failures
    if f.get("severity") in ["critical", "high"]
]

print(f"Critical/High failures: {len(critical_failures)}")

Managing Evaluations

List Evaluations

evaluations = vijil.evaluations.list(limit=10)

for e in evaluations:
    print(f"{e.get('id')}: {e.get('status')} - Score: {e.get('trust_score')}")

Cancel an Evaluation

vijil.evaluations.cancel(evaluation.get("id"))

Delete an Evaluation

vijil.evaluations.delete(evaluation.get("id"))

Export Results

To JSON

import json

results = vijil.evaluations.get_results(evaluation.get("id"))

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

To CSV

import csv

failures = results.get("failures", [])

with open("failures.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["probe_id", "category", "severity", "reason"])
    writer.writeheader()
    for failure in failures:
        writer.writerow({
            "probe_id": failure.get("probe_id"),
            "category": failure.get("category"),
            "severity": failure.get("severity"),
            "reason": failure.get("reason")
        })
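
If pandas is already in your environment, the same export is shorter (assuming each failure dict carries the four fields above):
import pandas as pd

pd.DataFrame(failures)[["probe_id", "category", "severity", "reason"]].to_csv(
    "failures.csv", index=False
)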

Compare Evaluations

Track improvement across agent versions:
results_v1 = vijil.evaluations.get_results("eval-v1-id")
results_v2 = vijil.evaluations.get_results("eval-v2-id")

score_change = results_v2.get("trust_score") - results_v1.get("trust_score")
print(f"Trust Score change: {score_change:+.2f}")

# Compare failure counts
v1_failures = len(results_v1.get("failures", []))
v2_failures = len(results_v2.get("failures", []))
print(f"Failures: {v1_failures} → {v2_failures} ({v2_failures - v1_failures:+d})")

Error Handling

from vijil.exceptions import VijilError

try:
    evaluation = vijil.evaluations.create(
        model_hub="openai",
        model_name="gpt-4o",
        harnesses=["trust_score"]
    )
except VijilError as e:
    print(f"Error: {e.message}")
    if hasattr(e, "retry_after"):
        print(f"Retry after: {e.retry_after}s")

Common errors:

Error              Cause                     Solution
Agent not found    Invalid agent_id          Verify agent exists
Rate limited       Too many requests         Wait and retry
Invalid API key    Missing or incorrect key  Check stored credentials
Harness not found  Invalid harness name      Use valid harness ID
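
For transient errors such as rate limiting, wrapping the create call in a retry with backoff is usually enough. A sketch, assuming VijilError is raised on rate limits and that retry_after may be absent:
import time
from vijil.exceptions import VijilError

def create_with_retry(client, attempts=3, **kwargs):
    """Retry evaluation creation, backing off between attempts."""
    for attempt in range(attempts):
        try:
            return client.evaluations.create(**kwargs)
        except VijilError as e:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            # Honor a server-suggested delay if present, else back off exponentially
            wait = getattr(e, "retry_after", 2 ** attempt)
            print(f"Retrying in {wait}s: {e.message}")
            time.sleep(wait)

evaluation = create_with_retry(
    vijil,
    model_hub="openai",
    model_name="gpt-4o",
    harnesses=["trust_score"],
)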

Next Steps

Understanding Results: Deep dive into scores and failures
Custom Harnesses: Create targeted evaluations
CI/CD Integration: Automate evaluations in pipelines
Cloud Providers: Configure provider integrations