Vijil Evaluate is a quality-assurance framework that automates the testing of LLM applications. An Evaluation in Vijil is an automated test run in which you select one or more AI agents and a test Harness (covering Security, Reliability, and Safety) to systematically assess the quality, safety, and reliability of an LLM application.

## Documentation Index
Fetch the complete documentation index at: https://docs.vijil.ai/llms.txt
Use this file to discover all available pages before exploring further.
## Prerequisites
Before setting up an Evaluation, you must have:

- Access to your Vijil Console deployment
- Pointed your CLI to your deployment using `vijil init` (if using the CLI)
- Registered an Agent
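If you plan to use the CLI, pointing it at your deployment is a one-time setup step. A minimal sketch (the exact prompts and any flags may vary by CLI version):

```shell
# Point the Vijil CLI at your Console deployment.
# The command walks you through entering your deployment details interactively.
vijil init
```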
## Setting up via Console UI
- Navigate to the Evaluations section in the Vijil Console.
- From the Create Evaluation > Select Agents section, choose one or more Agents you have previously created. If you do not have an Agent, register one first.
- In the Select Harness section, you can configure:
  - Trust Scores - Choose between Security, Reliability, and Safety, or select all.
  - Custom - Select your custom Harness.
  - Benchmarks - Select specific benchmarks from the Trust Scores.
  - Garak - Select Garak Scenarios.
- Under Run Configuration, you will see your selected Agent(s). Click the dropdown icon to configure model parameters:
  - Temperature - Controls randomness; higher values (e.g., 1.0) produce more varied output, lower values (e.g., 0.1) are more deterministic.
  - Top P - Restricts sampling to higher-probability tokens.
  - Max Completion Tokens - Maximum number of tokens the model can generate in a single response.
  - Requests Timeout - Maximum time to wait for a response from the Agent.
- Enter an Evaluation name in the prompt field.
- Press Create.
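The selections above map naturally onto a single request payload. The sketch below is purely illustrative: the field names and structure are assumptions for exposition, not the actual Vijil API schema.

```python
# Hypothetical evaluation request payload; field names are illustrative
# assumptions, not the real Vijil API schema.
evaluation = {
    "name": "chatbot-trust-check",
    "agents": ["my-support-agent"],          # previously registered Agent(s)
    "harness": {
        "trust_scores": ["Security", "Reliability", "Safety"],
    },
    "run_configuration": {
        "temperature": 0.1,                  # lower = more deterministic
        "top_p": 0.95,                       # restrict to high-probability tokens
        "max_completion_tokens": 512,        # cap on generated tokens per response
        "request_timeout_seconds": 60,       # max wait per Agent response
    },
}

# Sanity-check the parameter ranges described above.
assert 0.0 <= evaluation["run_configuration"]["temperature"] <= 2.0
assert 0.0 < evaluation["run_configuration"]["top_p"] <= 1.0
print(evaluation["name"])
```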
Evaluations can take some time to complete depending on the sample size and model responsiveness. You can monitor the progress from the Evaluations list.
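Because evaluations run asynchronously, monitoring typically means polling for a terminal status. This generic helper is a sketch of that pattern (the status names and the helper itself are hypothetical, not part of any Vijil SDK):

```python
import time


def poll_until_complete(fetch_status, interval=1.0, timeout=600.0):
    """Poll fetch_status() until it returns a terminal state.

    fetch_status is any callable returning a status string; the state
    names used here ("COMPLETED", "FAILED", etc.) are illustrative.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(interval)
    raise TimeoutError("evaluation did not finish within the timeout")


# Stub demonstrating the loop: pretends the evaluation finishes on the third check.
_statuses = iter(["CREATED", "IN_PROGRESS", "COMPLETED"])
print(poll_until_complete(lambda: next(_statuses), interval=0.0))  # → COMPLETED
```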
## Work in Progress
The programmatic evaluation capabilities are currently in private preview and subject to change.
## Next Steps
- Understand Results - Deep dive into scores and failures.
- Custom Harnesses - Create targeted evaluations.
- Cloud Providers - Configure provider integrations.