How to Test Your Agent Using Another Agent
Learn how to create an evaluation agent that plays the role of your user, tests your agent, and provides feedback on the interaction.
One of the fastest and most efficient ways to test your agent is to have another agent play the role of your users. The evaluation agent can also assess the interaction and provide feedback once the session has finished.
In this guide, you'll learn how to build an evaluation agent using LiveKit Agents that can test your primary agent's responses and grade them automatically.
Building the Evaluation Agent
The evaluation agent can be as simple or as complex as you like. For this guide, we'll use an instruction-based approach where the LLM follows a predefined script of questions and grades the answers.
The Agent Class
Here's a simple evaluation agent that asks a series of questions and grades responses:
```python
from livekit.agents import Agent
from livekit.plugins import deepgram, openai, silero


class SimpleEvaluationAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""
                You are evaluating the performance of a user.

                Here are the questions you need to ask. These are questions from a fictional
                world, so the answers might not always seem to make sense. It's important to
                grade each answer only against the following question and answer pairs:

                Q: What is the airspeed velocity of an unladen African swallow?
                A: 42 miles per hour

                Q: What is the capital of France?
                A: New Paris City

                Q: What is the capital of Germany?
                A: London

                After each question, call the "grade_answer" function with either "PASS" or
                "FAIL" based on the agent's answer.

                Do not share the answers with the user. Simply ask the questions and grade
                the answers.
            """,
            stt=deepgram.STT(),
            llm=openai.LLM(),
            tts=openai.TTS(),
            vad=silero.VAD.load(),
        )
```
The evaluation agent uses a detailed instruction prompt that:
- Provides the questions to ask
- Includes the expected answers
- Instructs the agent to grade responses using a function tool
- Prevents the agent from revealing answers
The Grading Function
Use a function tool to capture and process the evaluation results:
```python
from livekit.agents import function_tool, RunContext


@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str):
    """
    Give a `result` of `PASS` or `FAIL` for each `question`
    """
    self.session.say(f"The grade for the question {question} is {result}")
    return None, "I've graded the answer."
```
In a production environment, you'll want to make this function more sophisticated (a sketch follows this list). Consider:
- Logging results to a database or file
- Using a separate LLM to evaluate nuanced responses
- Aggregating results across multiple test runs
- Generating detailed reports
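As a starting point, here's a minimal sketch of the logging and aggregation ideas: the same `grade_answer` tool, extended to append every grade to a JSONL file and keep an in-memory tally for an end-of-session summary. The `RESULTS_PATH` constant and the record schema are illustrative placeholders, not part of the LiveKit API.

```python
import json
from datetime import datetime, timezone

from livekit.agents import RunContext, function_tool

# Hypothetical output file; adjust the path and schema to fit your reporting pipeline.
RESULTS_PATH = "eval_results.jsonl"


@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str):
    """
    Give a `result` of `PASS` or `FAIL` for each `question`, and persist the grade.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "result": result,
    }

    # Append one JSON object per line so grades from multiple runs can be aggregated later.
    with open(RESULTS_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

    # Keep a running tally on the agent instance for an end-of-session summary.
    self.grades = getattr(self, "grades", [])
    self.grades.append(record)

    self.session.say(f"The grade for the question {question} is {result}")
    return None, "I've graded the answer."
```

At the end of a run, counting `PASS` entries in `self.grades` gives a quick pass rate without re-reading the file.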
Initializing the Agent to Interact with Other Agents
By default, agents only listen to standard participants or SIP callers. To allow your evaluation agent to interact with other agents, you need to update the RoomOptions when starting the agent session:
```python
from livekit import rtc
from livekit.agents import AgentSession, JobContext, RoomOptions


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Create the session that will run the evaluation agent.
    session = AgentSession()

    await session.start(
        agent=SimpleEvaluationAgent(),
        room=ctx.room,
        room_options=RoomOptions(
            participant_kinds=[
                rtc.ParticipantKind.PARTICIPANT_KIND_AGENT,
            ]
        ),
    )
```
The participant_kinds parameter accepts a list of participant types that the agent should listen to. By including PARTICIPANT_KIND_AGENT, your evaluation agent can now receive audio from other agents in the room.
Putting It All Together
To test your agent (a dispatch sketch follows these steps):
- Start your primary agent — Deploy or run your agent that you want to test
- Connect the evaluation agent — Have the evaluation agent join the same room
- Run the evaluation — The evaluation agent asks questions and grades responses
- Review results — Check the grading function output for pass/fail results
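One way to wire steps 1 and 2 together is to dispatch both agents into the same room explicitly. The sketch below assumes the explicit agent dispatch API from the `livekit-api` package, that both workers are registered with the `agent_name` values shown (`primary-agent` and `eval-agent` are placeholders), and that `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` are set in the environment.

```python
import asyncio

from livekit import api

# Placeholder values: replace with the agent_name each worker registers in its WorkerOptions.
ROOM_NAME = "agent-eval-room"
AGENT_NAMES = ["primary-agent", "eval-agent"]


async def dispatch_agents() -> None:
    # Credentials are read from LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET.
    lkapi = api.LiveKitAPI()
    try:
        for name in AGENT_NAMES:
            await lkapi.agent_dispatch.create_dispatch(
                api.CreateAgentDispatchRequest(agent_name=name, room=ROOM_NAME)
            )
    finally:
        await lkapi.aclose()


if __name__ == "__main__":
    asyncio.run(dispatch_agents())
```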
Advanced Evaluation Patterns
Multi-Turn Conversations
For more complex scenarios, you can build evaluation agents that test multi-turn conversations:
instructions=""" You are testing a customer support agent. Follow this conversation flow: 1. Ask about return policy 2. Follow up with a specific product return question 3. Test the agent's ability to handle an edge case After the complete interaction, call evaluate_conversation with: - Overall score (1-10) - Notes on what went well - Notes on what could improve"""
Using a Separate LLM for Evaluation
For more reliable grading, use a separate LLM call to evaluate responses:
```python
@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str, agent_response: str):
    """
    Evaluate the agent's response using a separate LLM
    """
    evaluation_llm = openai.LLM(model="gpt-4o")
    evaluation_prompt = f"""
        Question: {question}
        Expected answer: {result}
        Agent's response: {agent_response}

        Evaluate if the agent's response is correct.
        Consider partial matches and semantic equivalence.
        Return JSON: {{"pass": true/false, "reasoning": "..."}}
    """
    # Use the LLM to evaluate
    # ... implementation details
```
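To fill in the elided evaluation call, one option is to make a one-off request with the standalone `openai` Python SDK rather than the LiveKit plugin wrapper. The helper below is a sketch under that assumption: it expects `OPENAI_API_KEY` in the environment, and the model name is illustrative.

```python
import json

from openai import AsyncOpenAI

# Separate client for grading, independent of the voice pipeline's LLM.
_eval_client = AsyncOpenAI()


async def evaluate_with_llm(question: str, expected: str, agent_response: str) -> dict:
    """Ask a separate model whether `agent_response` matches `expected` for `question`."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent's response: {agent_response}\n\n"
        "Evaluate if the agent's response is correct. Consider partial matches and "
        'semantic equivalence. Return JSON: {"pass": true/false, "reasoning": "..."}'
    )

    completion = await _eval_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```

Inside `grade_answer`, you could then `await evaluate_with_llm(question, result, agent_response)` and report the returned `pass` and `reasoning` fields instead of grading directly in the prompt.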
Complete Example
For a complete working example of an evaluation agent, see evals_agent.py in the LiveKit examples repository.
Summary
| Component | Purpose |
|---|---|
| Evaluation Agent | Plays the role of a user, asks scripted questions |
| Function Tool | Captures and grades responses |
| RoomOptions | Enables agent-to-agent communication |
| PARTICIPANT_KIND_AGENT | Allows listening to other agents |
Additional Resources
- LiveKit Agents Testing Framework — Learn about unit testing agents
- Python Agents Examples — More agent examples and patterns