
Evaluation Hierarchy at a Glance
| Layer | What It Is | Your Role |
|---|---|---|
| Harness | Top-level collection of Scenarios that produces a Trust Score | Select before running |
| Scenario | Group of related Probes targeting one failure category | Defined in the Harness |
| Probe | One or more adversarial prompts sent to your agent | Generated per Scenario |
| Detector | Response analyzer that marks a Probe as pass or fail | Runs automatically |
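
The nesting in the table above can be pictured as plain data structures. The following is a minimal sketch, not Diamond's actual API: the class names mirror the table, but every field, the `run_harness` function, and the idea of scoring by simple pass rate are assumptions made for illustration. The source does not state how a Trust Score is computed.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types for illustration only; these are not Diamond's API.


@dataclass
class Probe:
    prompts: List[str]               # one or more adversarial prompts
    detector: Callable[[str], bool]  # returns True when a response passes


@dataclass
class Scenario:
    failure_category: str            # the single failure category targeted
    probes: List[Probe] = field(default_factory=list)


@dataclass
class Harness:
    name: str
    scenarios: List[Scenario] = field(default_factory=list)


def run_harness(harness: Harness, agent: Callable[[str], str]) -> float:
    """Return the fraction of Probes that pass.

    Assumption: pass rate stands in for a Trust Score here; the real
    scoring method is not described in this document.
    """
    passed = 0
    total = 0
    for scenario in harness.scenarios:
        for probe in scenario.probes:
            total += 1
            # A Probe passes only if every response passes its Detector.
            if all(probe.detector(agent(prompt)) for prompt in probe.prompts):
                passed += 1
    return passed / total if total else 0.0
```

In this sketch, `agent` is any callable that maps a prompt string to a response string, so a stub function is enough to exercise the hierarchy end to end.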
Where Red Team Fits
Standard Diamond evaluations use the Harness hierarchy to run a known set of test cases and produce a Trust Score or custom evaluation report. Red Team is also part of Diamond, but it uses an adaptive campaign loop instead of a fixed Harness. A Red Team campaign starts from a risk taxonomy and the registered agent context, generates attack seeds for each wave, runs attackers against the target, judges the transcripts, reflects on what worked, and uses those observations to plan later waves.

Use standard evaluations when you need reproducible score evidence. Use Red Team when you need deeper adversarial exploration of security, safety, policy, and data leakage risks.

Learn more about Trust Score components: Harness, Scenario, Probe, Detector, Guard, and Guardrail.
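
The Red Team campaign loop described in "Where Red Team Fits" can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not Diamond's implementation: `CampaignState`, `Finding`, `run_campaign`, and the `generate_seeds`, `target`, and `judge` callables are all hypothetical names, and the attacker is reduced to sending each seed as a single message.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# All names here are hypothetical; this is not Diamond's Red Team API.


@dataclass
class Finding:
    seed: str
    transcript: str
    succeeded: bool  # True when the judge decided the attack worked


@dataclass
class CampaignState:
    risk_taxonomy: List[str]
    agent_context: str
    observations: List[str] = field(default_factory=list)
    findings: List[Finding] = field(default_factory=list)


def run_campaign(
    state: CampaignState,
    target: Callable[[str], str],
    generate_seeds: Callable[[List[str], str, List[str]], List[str]],
    judge: Callable[[str], bool],
    waves: int = 3,
) -> CampaignState:
    for _ in range(waves):
        # Plan the wave from the taxonomy, agent context, and prior observations.
        seeds = generate_seeds(
            state.risk_taxonomy, state.agent_context, state.observations
        )
        wave_findings = []
        for seed in seeds:
            # Simplification: the "attacker" sends the seed as one message.
            transcript = target(seed)
            wave_findings.append(Finding(seed, transcript, judge(transcript)))
        # Reflect: remember which seeds worked so later waves can build on them.
        for finding in wave_findings:
            if finding.succeeded:
                state.observations.append(finding.seed)
        state.findings.extend(wave_findings)
    return state
```

The point of the sketch is the feedback edge: unlike a fixed Harness, each wave's seeds depend on observations from earlier waves, which is why results vary between campaigns.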
Next Steps
| Component | Description |
|---|---|
| Harness | Collections of tests that produce a score |
| Scenario | Groups of related test cases |
| Probe | Individual test prompts |
| Detector | Response analyzers that determine pass/fail |
| Guard | Specialized protection modules |
| Guardrail | Configurable protection pipelines |