Web Interface
In the left sidebar, click on Evaluations (clipboard icon) to view all evaluations. On this page, you can view previous evaluations, rerun them, delete them, or pause/restart an ongoing evaluation.
Create an Evaluation
- On the Evaluations page, click Create Evaluation.
- Select the agent to evaluate. If the agent you want is not yet in the list, click Add Agent.
- Select the Harnesses you want to run. See Harnesses for more information.
- Once you have selected an agent, you can configure some of its runtime parameters, like temperature and maximum completion tokens, in the Run Configuration section.
- Optionally, enter a name for your evaluation, then click Create.
Python Client
You can create, view, summarize, export, cancel, and delete evaluations with the Vijil Python client. Before doing any of this, you will need to instantiate your Vijil client. In this topic, we will assume you have instantiated a Vijil client called `client`.
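For reference, a minimal setup sketch is shown below. The import path, class name, and constructor argument here are assumptions; see the client setup documentation for the authoritative instructions.

```python
# Hypothetical setup sketch: assumes the client class is named Vijil,
# lives in the `vijil` package, and accepts an API key argument.
from vijil import Vijil

client = Vijil(api_key="your-vijil-api-key")
```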
Create an Evaluation
You can use the `evaluations.create` method to create an evaluation:
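The sketch below assumes `client` is your instantiated Vijil client; the harness name and the model and parameter values are illustrative placeholders, not prescriptions.

```python
# Illustrative sketch: substitute the hub, model, and harnesses you actually want.
evaluation = client.evaluations.create(
    model_hub="openai",                # hub the model is hosted on
    model_name="gpt-4o-mini",          # model identifier on that hub
    model_params={"temperature": 0},   # inference parameters
    harnesses=["trust_score"],         # placeholder harness name
    harness_params={"is_lite": True},  # run the lighter, faster variant
)
```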
- `model_hub`: The model hub your model is on. Currently Vijil supports `openai`, `octo`, and `together` as model hubs. Make sure you have an API key stored for the hub you want to use.
- `model_name`: The name of the model you want to use on the hub. You can get this from the relevant hub's documentation.
- `model_params`: Inference parameters like `temperature` and `top_p`.
- `harnesses`: Harnesses determine which Probes you want to run, which determines what makes up your trust score.
- `harness_params`: `is_lite` determines whether you are running a "light" version of the Harness, which will be cheaper and faster. Set this to `False` if you want to run the full Harness.
View, Describe, and Summarize Evaluations
List Evaluations
List all evaluations with the `evaluations.list` method:
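For example (assuming `client` is your instantiated Vijil client):

```python
# List recent evaluations; `limit` caps how many are returned.
evaluations = client.evaluations.list(limit=10)
```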
If you do not specify a `limit`, it will return only the 10 most recent evaluations.
If you do not know an evaluation's ID, the `list` method lets you find it; you need the ID in order to get more details about that evaluation.
Get Evaluation Status
You can view the status of an evaluation with the `evaluations.get_status` method:
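A sketch of such a call follows; the `evaluation_id` argument name is an assumption on our part, and the ID value is a placeholder.

```python
# Check whether the evaluation is still running or has completed.
status = client.evaluations.get_status(evaluation_id="YOUR-EVALUATION-ID")
print(status)
```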
Summarize a Completed Evaluation
Get summary scores for a completed evaluation, including scores at the overall, Harness, Scenario, and Probe levels, with the `evaluations.summarize` method:
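A sketch, under the same assumption that the evaluation is identified by an `evaluation_id` argument:

```python
# Summary scores at the overall, Harness, Scenario, and Probe levels.
summary = client.evaluations.summarize(evaluation_id="YOUR-EVALUATION-ID")
```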
Get Prompt-level Details
Get prompt-level details for a completed evaluation with the `evaluations.describe` method:
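For example (again assuming an `evaluation_id` argument and a placeholder ID):

```python
# Prompt-level records for a completed evaluation.
details = client.evaluations.describe(evaluation_id="YOUR-EVALUATION-ID")
```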
You can restrict the number of records returned with the `limit` argument.
By default, the output is a pandas DataFrame, but if you prefer a list of dictionaries, specify `list` as the `format`.
Get a Hits-Only List
If you want a list of only the prompts/responses that led to hits (responses deemed undesirable), you can use the `hits_only` argument. By default, all prompts and responses will be returned.
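For example (placeholder ID, assumed `evaluation_id` argument name):

```python
# Return only the prompts/responses that were flagged as hits.
hits = client.evaluations.describe(
    evaluation_id="YOUR-EVALUATION-ID",
    hits_only=True,
)
```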
Export Evaluations
You can export both the summary- and prompt-level evaluation results.
Export Summary
Export the summary of an evaluation with the `evaluations.export_summary` method:
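A sketch, with the usual assumptions about the ID argument and placeholder values:

```python
# Write a PDF summary report to the current directory.
client.evaluations.export_summary(
    evaluation_id="YOUR-EVALUATION-ID",
    format="pdf",
    output_dir=".",
)
```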
The format can be `pdf` or `html`. `output_dir` defaults to the current directory unless otherwise specified.
Export Prompt-level Details
Export the prompt-level details of an evaluation with the `evaluations.export_report` method:
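For example (same assumptions as above):

```python
# Write the full prompt-level report as a CSV file.
client.evaluations.export_report(
    evaluation_id="YOUR-EVALUATION-ID",
    format="csv",
    output_dir=".",
)
```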
The format can be `csv`, `parquet`, `json`, or `jsonl`. `output_dir` defaults to the current directory unless otherwise specified.
See the glossary to understand what the Probe or Detector modules in the report do.
Export Hits Only
To export only the prompts/responses that led to hits (responses deemed undesirable), you can use the `hits_only` argument. By default, all prompts and responses will be returned.
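A sketch, assuming `hits_only` is passed to `evaluations.export_report` alongside the other arguments:

```python
# Export only the flagged (hit) prompts and responses.
client.evaluations.export_report(
    evaluation_id="YOUR-EVALUATION-ID",
    format="csv",
    hits_only=True,
)
```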
Cancel or Delete Evaluations
You can cancel an in-progress evaluation or delete evaluations to unclutter your dashboard.
Cancel an Evaluation
Cancel an in-progress evaluation with the `evaluations.cancel` method:
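For example (placeholder ID, assumed `evaluation_id` argument name):

```python
# Stop an evaluation that is still running.
client.evaluations.cancel(evaluation_id="YOUR-EVALUATION-ID")
```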
Delete an Evaluation
Delete an evaluation with the `evaluations.delete` method:
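For example (same assumptions):

```python
# Remove an evaluation from your dashboard.
client.evaluations.delete(evaluation_id="YOUR-EVALUATION-ID")
```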