From Trust to Testing

The Trust Score measures reliability, security, and safety. But how do you actually test for these properties? You can’t just ask an agent “are you trustworthy?” You need to probe its behavior systematically, across hundreds of scenarios, looking for specific failure modes.

Vijil’s evaluation architecture is designed for this kind of systematic testing. It’s a pipeline that flows down from abstract test definitions to concrete prompts, across through agent interaction, and back up through analysis to a Trust Score.

Evaluation flow (diagram): Harness → Scenario → Probe → Prompt → Agent → Response → Detector → Pass Rate → Trust Score

The flow has three phases:

Test Definition (left side, descending): What are we testing for, and how?
  • Harness → A collection of tests for a specific purpose (security, compliance, full trust score)
  • Scenario → A group of related tests targeting one attack vector or failure mode
  • Probe → A single test case with its detection criteria
  • Prompt → The actual text sent to the agent
Execution (bottom, horizontal): The agent interaction
  • Prompt → Agent → Response: The probe’s prompt goes to your agent; you get back a response
Aggregation (right side, ascending): What did we learn?
  • Response → The agent’s output to analyze
  • Detector → Analyzes the response for specific patterns or behaviors
  • Pass Rate → The percentage of probes the agent handled correctly
  • Trust Score → The final measure of trustworthiness
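To make the three phases concrete, here is a minimal sketch in Python of how the pieces might fit together. The class names, fields, and the run_harness function are illustrative assumptions for this page, not Vijil’s actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative data model -- names and structure are assumptions, not Vijil's API.

@dataclass
class Probe:
    """A single test case: one prompt plus a detector for the response."""
    prompt: str
    detector: Callable[[str], bool]  # returns True if the response passes

@dataclass
class Scenario:
    """A group of probes targeting one attack vector or failure mode."""
    name: str
    probes: List[Probe]

@dataclass
class Harness:
    """A collection of scenarios assembled for a specific purpose."""
    name: str
    scenarios: List[Scenario]

def run_harness(harness: Harness, agent: Callable[[str], str]) -> dict:
    """Send each probe's prompt to the agent, score responses, and aggregate upward."""
    pass_rates = {}
    for scenario in harness.scenarios:
        passed = 0
        for probe in scenario.probes:
            response = agent(probe.prompt)   # Execution: Prompt -> Agent -> Response
            if probe.detector(response):     # Aggregation: Response -> Detector
                passed += 1
        pass_rates[scenario.name] = passed / len(scenario.probes)  # -> Pass Rate
    return pass_rates  # per-scenario pass rates roll up into a Trust Score
```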

Why This Architecture?

The pipeline exists because trust evaluation has competing requirements.

Coverage vs. specificity: You need broad coverage, hundreds of test cases across multiple attack vectors, but you also need to understand exactly what failed and why. The hierarchy gives you both: aggregate scores at the top, individual probe results at the bottom.

Standardization vs. customization: Standard harnesses ensure consistent, comparable results. But every agent is different: different system prompts, different use cases, different risk profiles. Custom harnesses let you test for your specific concerns while maintaining the same evaluation infrastructure.

Reusability: Scenarios and probes can be composed into multiple harnesses. A prompt injection scenario appears in both the security harness and the OWASP LLM Top 10 harness. You don’t duplicate tests; you compose them.
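A hedged sketch of what that reuse might look like: the same scenario definition is referenced by two harness definitions instead of being copied into each. The scenario, probe, and harness names here are hypothetical.

```python
# Hypothetical composition: one scenario definition, referenced by two harnesses.
prompt_injection = {
    "name": "prompt_injection",
    "probes": ["probe_encoded_instruction", "probe_fake_email_instruction"],
}

harnesses = {
    "security": {"scenarios": [prompt_injection]},
    "owasp_llm_top_10": {"scenarios": [prompt_injection]},
}

# Both harnesses point at the same scenario object: update it once, both pick it up.
assert harnesses["security"]["scenarios"][0] is harnesses["owasp_llm_top_10"]["scenarios"][0]
```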

The Components

| Component | Role | Example |
| --- | --- | --- |
| Harness | Defines what you’re measuring | security, owasp_llm_top_10, trust_score |
| Scenario | Groups tests by attack vector | Prompt injection, Hallucination, Jailbreaking |
| Probe | Individual test case | “Embed instruction X in fake email” |
| Prompt | Text sent to agent | The actual prompt string |
| Response | Agent’s output | What the agent returned |
| Detector | Analyzes response | Check for trigger string, classify toxicity |
| Pass Rate | Aggregated results | 94% of probes passed |
| Trust Score | Final metric | 0-100 score across dimensions |
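As one concrete illustration of the last three rows, a detector might check a response for a trigger string, and the resulting pass rates might be averaged into a 0-100 score. The trigger string and the unweighted average below are assumptions for illustration, not Vijil’s actual detection or scoring logic.

```python
# Hypothetical detector: fail the probe if the response contains the trigger
# string that the injected instruction was trying to elicit.
def trigger_string_detector(response: str, trigger: str = "TRIGGER") -> bool:
    return trigger not in response

# Hypothetical aggregation: average per-dimension pass rates into a 0-100 score.
def trust_score(pass_rates: dict) -> float:
    return 100.0 * sum(pass_rates.values()) / len(pass_rates)

print(trust_score({"reliability": 0.92, "security": 0.87, "safety": 0.95}))  # ~91.3
```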

Reading Results

Results are available at every level of the hierarchy. You can:
  • See the overall Trust Score
  • Drill into harness scores (reliability: 92, security: 87, safety: 95)
  • Examine scenario pass rates (prompt injection: 78%, hallucination: 96%)
  • View individual probe results with the exact prompt, response, and detection evidence
This drill-down is how you move from “my agent scored 87” to “my agent is vulnerable to base64-encoded prompt injections in customer support contexts.”
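A sketch of what that drill-down might look like if run results were exposed as a nested report; the field names and structure are illustrative assumptions, not the actual report schema.

```python
# Hypothetical nested report for drill-down (field names are assumptions).
report = {
    "trust_score": 87,
    "harnesses": {
        "security": {
            "score": 87,
            "scenarios": {
                "prompt_injection": {
                    "pass_rate": 0.78,
                    "probes": [
                        {
                            "prompt": "...base64-encoded injection...",
                            "response": "...",
                            "detector": "trigger_string",
                            "passed": False,
                        },
                    ],
                },
            },
        },
    },
}

# Walk from the aggregate score down to the failing probes and their evidence.
for harness_name, harness in report["harnesses"].items():
    for scenario_name, scenario in harness["scenarios"].items():
        failures = [p for p in scenario["probes"] if not p["passed"]]
        if failures:
            print(f"{harness_name}/{scenario_name}: {len(failures)} failing probe(s)")
```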

Next Steps