Evaluating Custom Agents¶
In addition to being compatible with popular LLM and Agent cloud service providers, the Vijil Python Client enables users to evaluate any agent they have access to with just a few lines of code. The Vijil client can automatically spin up an endpoint for your agent, allowing you to test it just like you would any other model.
Prerequisites¶
To get started, make sure you have the following:
- A Vijil client. In this topic, we'll assume you've instantiated a Vijil client called vijil.
- An agent you want to evaluate.
- If you are not a Vijil Premium user, an Ngrok authorization token.
Note
We use Ngrok to create private, protected endpoints to your agent. If you're on a Free plan, you'll need to get an Ngrok authorization token. If you're subscribed to the premium version of Vijil, you don't need to worry about this - we take care of it for you.
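On a Free plan, make sure the token is available to your script before you create the agent executor. Here is a minimal sketch; the placeholder value is hypothetical, and the examples later on this page read the token from the NGROK_AUTHTOKEN environment variable.

import os

# Hypothetical placeholder: replace with the token from your Ngrok dashboard,
# or export NGROK_AUTHTOKEN in your shell before running the script.
os.environ.setdefault("NGROK_AUTHTOKEN", "<your-ngrok-auth-token>")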
Important
Due to how Jupyter handles event loops, we do NOT recommend running this code in a Jupyter notebook. Please run it as a regular Python .py script.
Step 1 - Create a Local Agent Executor¶
In order to make your agent compatible with Vijil's APIs, you need to create an input_adapter and an output_adapter. As the names imply, the input_adapter transforms a ChatCompletionRequest from Vijil into the input that your agent expects, while the output_adapter converts your agent's output into a response that Vijil expects.
from vijil.agents.models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionChoice,
    ChatMessage,
)

# The expected output signature of this function depends on what your agent needs
def example_input_adapter(request: ChatCompletionRequest):
    # Extract whatever data you need from the request
    # Here we just take the last message content as the prompt
    return request.messages[-1]["content"]

# The expected input signature of this function depends on your agent's output
def example_output_adapter(agent_output: str) -> ChatCompletionResponse:
    # First create a message object
    # You can populate tool call and retrieval context if needed
    agent_response_message = ChatMessage(
        role="assistant",
        content=agent_output,
        tool_calls=None,
        retrieval_context=None
    )

    # Next create a choice object to support multiple completions if needed
    choice = ChatCompletionChoice(
        index=0,
        message=agent_response_message,
        finish_reason="stop"
    )

    # Finally, return the response
    return ChatCompletionResponse(
        model="my-agent",
        choices=[choice],
        usage=None,  # You can track usage as well
    )
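If you want to sanity-check your adapters before wiring them up, you can call them directly. The snippet below is only a local testing sketch: it uses a SimpleNamespace stand-in for the incoming request (real requests arrive as ChatCompletionRequest objects), which works here because example_input_adapter only reads request.messages.

from types import SimpleNamespace

# Stand-in request with the same messages shape the adapter reads
fake_request = SimpleNamespace(messages=[{"role": "user", "content": "Hello there!"}])
print(example_input_adapter(fake_request))            # -> "Hello there!"
print(example_output_adapter("Hi! How can I help?"))  # -> a ChatCompletionResponse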
Once you have an input and output adapter, create an instance of the LocalAgentExecutor class using the client's agents.create method, your adapters, and your agent's main function.
import os
from vijil import Vijil

vijil = Vijil(
    api_key=os.getenv("VIJIL_API_KEY"),
)

local_agent = vijil.agents.create(
    agent_function=my_agent_function,
    input_adapter=example_input_adapter,
    output_adapter=example_output_adapter,
)
Note that the LocalAgentExecutor can support any function if the input and output adapters are built correctly, so you can use it to evaluate an agent written in any framework, or hosted on any platform!
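For instance, the agent function doesn't have to come from an agent framework at all. The sketch below (hypothetical names and URL) wraps an agent you already host behind your own HTTP API; any callable that maps a prompt string to a response string works, and the examples later on this page use async functions.

import httpx

# Hypothetical agent function: forwards the prompt to an agent hosted elsewhere.
# The endpoint URL and the "reply" response field are placeholders for your own service.
async def my_agent_function(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60) as http:
        resp = await http.post(
            "https://my-agent.example.com/chat",
            json={"prompt": prompt},
        )
        resp.raise_for_status()
        return resp.json()["reply"]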
Step 2 - Evaluate!¶
After creating the LocalAgentExecutor, use the agents.evaluate method to evaluate your agent. We spin up an authenticated ephemeral endpoint for your agent that can only communicate with Vijil Evaluate. This enables us to evaluate your agent without you needing to deploy it beforehand.
vijil.agents.evaluate(
    agent_name="My Agent",  # This is the name of your agent to use in the evaluation
    evaluation_name="Evaluating my agent on Ethics",  # The name of your evaluation
    agent=local_agent,  # The LocalAgentExecutor you created earlier
    harnesses=["ethics_Small"],  # The harnesses you wish to run
    rate_limit=30,  # Maximum number of requests in the interval to send to your agent
    rate_limit_interval=1,  # The size of the interval for the rate_limit (in minutes)
)
This method will automatically create the endpoint, register it, and begin the evaluation. You will see live progress while the evaluation runs, and you can cancel it at any time by pressing Ctrl + C.
If you're a power user, you can register your agent with Evaluate, trigger an evaluation via the registered URL, and then shut down the server when you're done. This approach is not recommended for most users because it requires you to manage the server lifecycle yourself.
from vijil.agents.constants import TERMINAL_STATUSES
import time

server, api_key_name = vijil.agents.register(
    agent_name="my-agent",
    evaluator=local_agent,
    rate_limit=30,
    rate_limit_interval=10,
)

evaluation = vijil.evaluations.create(
    model_hub="custom",
    model_name="local-agent",
    name="Test local agent",
    api_key_name=api_key_name,
    model_url=f"{server.url}/v1",
    harnesses=["trust_score"],
)

# Wait for a bit to let the evaluation start
time.sleep(5)
print(f"Evaluation {evaluation.get('id')} started.")

# Keep your server alive till the evaluation is done
while True:
    status_data = vijil.evaluations.get_status(evaluation.get("id"))
    status = status_data.get("status")
    if status in TERMINAL_STATUSES:
        print(f"Evaluation {evaluation.get('id')} finished with status: {status}")
        break
    time.sleep(5)

# Don't forget to shut down the server when you're done :)
vijil.agents.deregister(server, api_key_name)
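If you do manage the lifecycle yourself, one practical refinement (using the same calls as above, just reordered) is to wrap the wait loop in try/finally so the ephemeral server and API key are always cleaned up, even if the script raises or is interrupted:

server, api_key_name = vijil.agents.register(
    agent_name="my-agent",
    evaluator=local_agent,
    rate_limit=30,
    rate_limit_interval=10,
)
try:
    evaluation = vijil.evaluations.create(
        model_hub="custom",
        model_name="local-agent",
        name="Test local agent",
        api_key_name=api_key_name,
        model_url=f"{server.url}/v1",
        harnesses=["trust_score"],
    )
    # Poll until the evaluation reaches a terminal status
    while vijil.evaluations.get_status(evaluation.get("id")).get("status") not in TERMINAL_STATUSES:
        time.sleep(5)
finally:
    # Runs even on Ctrl + C or an exception, so the server never outlives the script
    vijil.agents.deregister(server, api_key_name)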
Examples¶
Here are some examples of how you can use this feature to evaluate agents built with popular frameworks.
LangChain¶
The code snippet below shows how you can evaluate an agent built using LangChain.
# These first imports are just what your agent requires
import os
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

# These are the imports you need to add for Evaluations
from vijil.agents.models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionChoice,
    ChatMessage,
)
from vijil import Vijil

# Let's make a simple agent using LangChain for this example
chat = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, streaming=False)

# This is the agent you want to evaluate
async def cool_langchain_agent(prompt: str) -> str:
    messages = [
        SystemMessage(
            content="You are a cool assistant 😎. Always be cool and use cool emojis like 😎"
        ),
        HumanMessage(content=prompt),
    ]
    response = await chat.ainvoke(messages)
    return response.content

# Now we define the input and output adapters for your agent to work with Vijil Evaluate
def example_input_adapter(request: ChatCompletionRequest):
    # Extract whatever data you need from the request
    # Here we just take the last message content as the prompt
    return request.messages[-1]["content"]

def example_output_adapter(agent_output: str):
    # First create a message object
    # You can populate tool call and retrieval context if needed
    agent_response_message = ChatMessage(
        role="assistant", content=agent_output, tool_calls=None, retrieval_context=None
    )

    # Next create a choice object to support multiple completions if needed
    choice = ChatCompletionChoice(
        index=0, message=agent_response_message, finish_reason="stop"
    )

    # Finally, return the response
    return ChatCompletionResponse(
        model="my-new-model",
        choices=[choice],
        usage=None,  # You can track usage if needed
    )

# That's all you need! Now just connect your agent and run the evaluation.
# Before you begin, sign up for an ngrok account at https://ngrok.com/
# Set your ngrok auth token as NGROK_AUTHTOKEN in your environment variables
# (see https://dashboard.ngrok.com/get-started/setup/python to find your token).
# You can skip this step if you're a Premium user.
if __name__ == "__main__":
    if not os.getenv("NGROK_AUTHTOKEN"):
        raise ValueError(
            "Please set your ngrok auth token as NGROK_AUTHTOKEN in your environment variables."
        )

    # Step 1: Create a LocalAgentExecutor instance to run your agent locally
    vijil = Vijil(
        api_key=os.getenv("VIJIL_API_KEY"),
    )
    local_agent = vijil.agents.create(
        agent_function=cool_langchain_agent,
        input_adapter=example_input_adapter,
        output_adapter=example_output_adapter,
    )

    # Step 2: Evaluate your agent! Let's see how ethical our Cool Agent is!
    vijil.agents.evaluate(
        agent_name="local-cool-agent",
        evaluation_name="test local cool agent",
        agent=local_agent,
        harnesses=["ethics_Small"],
        rate_limit=30,
        rate_limit_interval=1,
    )
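Note that the adapters in this script are essentially the same as the generic ones from Step 1 (only the model name in the response differs); only the agent function changes, so the same pattern carries over to other frameworks with minimal edits.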
Google Agent Development Kit (ADK)¶
Google ADK is a popular and flexible new framework for developing agents. Evaluating agents built using ADK is very straightforward and follows the same steps as before.
In this example, we'll use the sample Travel Concierge agent from ADK's sample agents. This is a multi-agent workflow involving multiple agent-to-agent interactions, but we can test the entire workflow's reliability, safety, and security without needing to worry about the agent's underlying components. We just need to create an instance of the LocalAgentExecutor class that we can run an evaluation on.
To create the agent executor, we first create an ADK runner, which lets us interact with the agent as a standalone function.
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner

# This is the agent we want to use for the demo
# You can find it at https://github.com/google/adk-samples/tree/main/agents/travel-concierge
from travel_concierge.agent import root_agent as travel_agent
from google.genai import types
import os
from dotenv import load_dotenv

# Load up your API keys
load_dotenv()

# We'll now create a local session for the agent to run in
session_service = InMemorySessionService()

# These variables identify your app and session - the exact values do not matter
APP_NAME = "Travel Concierge Agent"
USER_ID = "user_1"
SESSION_ID = "session_001"

# Create the specific session where the conversation will happen
session = session_service.create_session(
    app_name=APP_NAME,
    user_id=USER_ID,
    session_id=SESSION_ID
)
print(f"Session created: App='{APP_NAME}', User='{USER_ID}', Session='{SESSION_ID}'")

# Now create your runner
runner = Runner(
    agent=travel_agent,              # The agent we want to run
    app_name=APP_NAME,               # Associates runs with our app
    session_service=session_service  # Uses our session manager
)
print(f"Runner created for agent '{runner.agent.name}'.")

# This function is what runs a single query string on the agent
async def call_agent_async(query: str, runner, user_id, session_id):
    """Sends a query to the agent and returns the final response."""
    # Prepare the user's message in ADK format
    content = types.Content(role='user', parts=[types.Part(text=query)])
    final_response_text = ""  # Default

    # Key concept: run_async executes the agent logic and yields Events.
    # We iterate through events to find the final answer.
    async for event in runner.run_async(user_id=user_id, session_id=session_id, new_message=content):
        # is_final_response() marks the concluding message for a turn.
        # Some agents, like this one, can have multiple final response messages,
        # so we concatenate them in this example.
        if event.is_final_response():
            if event.content and event.content.parts:
                final_response_text += event.content.parts[0].text
            elif event.actions and event.actions.escalate:  # Handle potential errors/escalations
                final_response_text = f"Agent escalated: {event.error_message or 'No specific message.'}"

    return final_response_text or "The agent did not respond."

# Finally, we create our agent function
# This function is now a standalone function that can be used with Vijil Evaluate
async def run_agent(query: str):
    return await call_agent_async(query, runner, USER_ID, SESSION_ID)
We can now create the LocalAgentExecutor using the run_agent function and run our evaluation.
from vijil.agents.models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionChoice,
    ChatMessage,
)
from vijil import Vijil

def input_adapter(request: ChatCompletionRequest):
    # Since our agent doesn't support or expect system prompts, let's combine all the prompts into one
    message_str = ""
    for message in request.messages:
        message_str += message.get("content", "")
    return message_str

def output_adapter(agent_output: str):
    # Our agent produces a lot of tracing and extra data, but the only user-facing message is the final response text
    agent_response_message = ChatMessage(
        role="assistant",
        content=agent_output,
        tool_calls=None,
        retrieval_context=None
    )

    # Next create a choice object to support multiple completions if needed
    choice = ChatCompletionChoice(
        index=0,
        message=agent_response_message,
        finish_reason="stop"
    )

    # Finally, return the response
    return ChatCompletionResponse(
        model="travel-concierge",
        choices=[choice],
        usage=None,  # You can track usage if needed
    )

vijil = Vijil(
    api_key=os.getenv("VIJIL_API_KEY"),
)
local_agent = vijil.agents.create(
    agent_function=run_agent,
    input_adapter=input_adapter,
    output_adapter=output_adapter,
)

# Evaluate your agent!
vijil.agents.evaluate(
    agent_name="ADK Travel Concierge",
    evaluation_name="ADK Travel Concierge Security Testing",
    agent=local_agent,
    harnesses=["security_Small"],
    rate_limit=30,
    rate_limit_interval=1,
)