How to Test Your Agent Using Another Agent
Learn how to create an evaluation agent that plays the role of your user, tests your agent, and provides feedback on the interaction.
One of the fastest and most efficient ways to test your agent is to have another agent play the role of your users. The evaluation agent can also assess the interaction and provide feedback once the session has finished.
In this guide, you'll learn how to build an evaluation agent using LiveKit Agents that can test your primary agent's responses and grade them automatically.
Building the Evaluation Agent
The evaluation agent can be as simple or as complex as you like. For this guide, we'll use an instruction-based approach where the LLM follows a predefined script of questions and grades the answers.
The Agent Class
Here's a simple evaluation agent that asks a series of questions and grades responses:
```python
from livekit.agents import Agent
from livekit.plugins import deepgram, openai, silero


class SimpleEvaluationAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""
                You are evaluating the performance of a user.

                Here are the questions you need to ask. These are questions from a fictional
                world, so the answers might not always seem to make sense. It's important to
                grade each answer only against the following question and answer pairs:

                Q: What is the airspeed velocity of an unladen African swallow?
                A: 42 miles per hour

                Q: What is the capital of France?
                A: New Paris City

                Q: What is the capital of Germany?
                A: London

                After each question, call the "grade_answer" function with either "PASS" or
                "FAIL" based on the agent's answer.

                Do not share the answers with the user. Simply ask the questions and grade
                the answers.
            """,
            stt=deepgram.STT(),
            llm=openai.LLM(),
            tts=openai.TTS(),
            vad=silero.VAD.load(),
        )
```
The evaluation agent uses a detailed instruction prompt that:
- Provides the questions to ask
- Includes the expected answers
- Instructs the agent to grade responses using a function tool
- Prevents the agent from revealing answers
The Grading Function
Use a function tool to capture and process the evaluation results:
```python
from livekit.agents import function_tool, RunContext


@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str):
    """
    Give a `result` of `PASS` or `FAIL` for each `question`
    """
    self.session.say(f"The grade for the question {question} is {result}")
    return None, "I've graded the answer."
```
In a production environment, you'll want to make this function more sophisticated (a sketch follows this list). Consider:
- Logging results to a database or file
- Using a separate LLM to evaluate nuanced responses
- Aggregating results across multiple test runs
- Generating detailed reports
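As a starting point, here's a minimal sketch of the logging and aggregation ideas: the same `grade_answer` tool, extended to append every grade to a JSONL file and keep an in-memory tally for an end-of-session summary. The `RESULTS_PATH` constant and the record schema are illustrative placeholders, not part of the LiveKit API.

```python
import json
from datetime import datetime, timezone

from livekit.agents import RunContext, function_tool

# Hypothetical output file; adjust the path and schema to fit your reporting pipeline.
RESULTS_PATH = "eval_results.jsonl"


@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str):
    """
    Give a `result` of `PASS` or `FAIL` for each `question`, and persist the grade.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "result": result,
    }

    # Append one JSON object per line so grades from multiple runs can be aggregated later.
    with open(RESULTS_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

    # Keep a running tally on the agent instance for an end-of-session summary.
    self.grades = getattr(self, "grades", [])
    self.grades.append(record)

    self.session.say(f"The grade for the question {question} is {result}")
    return None, "I've graded the answer."
```

At the end of a run, counting `PASS` entries in `self.grades` gives a quick pass rate without re-reading the file.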
Initializing the Agent to Interact with Other Agents
By default, agents only listen to standard participants or SIP callers. To allow your evaluation agent to interact with other agents, you need to update the RoomOptions when starting the agent session:
```python
from livekit import rtc
from livekit.agents import AgentSession, JobContext, RoomOptions


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Create the session that will run the evaluation agent.
    session = AgentSession()

    await session.start(
        agent=SimpleEvaluationAgent(),
        room=ctx.room,
        room_options=RoomOptions(
            participant_kinds=[
                rtc.ParticipantKind.PARTICIPANT_KIND_AGENT,
            ]
        ),
    )
```
The participant_kinds parameter accepts a list of participant types that the agent should listen to. By including PARTICIPANT_KIND_AGENT, your evaluation agent can now receive audio from other agents in the room.
Putting It All Together
To test your agent (a dispatch sketch follows these steps):
- Start your primary agent — Deploy or run your agent that you want to test
- Connect the evaluation agent — Have the evaluation agent join the same room
- Run the evaluation — The evaluation agent asks questions and grades responses
- Review results — Check the grading function output for pass/fail results
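One way to wire steps 1 and 2 together is to dispatch both agents into the same room explicitly. The sketch below assumes the explicit agent dispatch API from the `livekit-api` package, that both workers are registered with the `agent_name` values shown (`primary-agent` and `eval-agent` are placeholders), and that `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` are set in the environment.

```python
import asyncio

from livekit import api

# Placeholder values: replace with the agent_name each worker registers in its WorkerOptions.
ROOM_NAME = "agent-eval-room"
AGENT_NAMES = ["primary-agent", "eval-agent"]


async def dispatch_agents() -> None:
    # Credentials are read from LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET.
    lkapi = api.LiveKitAPI()
    try:
        for name in AGENT_NAMES:
            await lkapi.agent_dispatch.create_dispatch(
                api.CreateAgentDispatchRequest(agent_name=name, room=ROOM_NAME)
            )
    finally:
        await lkapi.aclose()


if __name__ == "__main__":
    asyncio.run(dispatch_agents())
```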
Advanced Evaluation Patterns
Multi-Turn Conversations
For more complex scenarios, you can build evaluation agents that test multi-turn conversations:
instructions=""" You are testing a customer support agent. Follow this conversation flow: 1. Ask about return policy 2. Follow up with a specific product return question 3. Test the agent's ability to handle an edge case After the complete interaction, call evaluate_conversation with: - Overall score (1-10) - Notes on what went well - Notes on what could improve"""
Using a Separate LLM for Evaluation
For more reliable grading, use a separate LLM call to evaluate responses:
```python
@function_tool()
async def grade_answer(self, context: RunContext, result: str, question: str, agent_response: str):
    """
    Evaluate the agent's response using a separate LLM
    """
    evaluation_llm = openai.LLM(model="gpt-4o")
    evaluation_prompt = f"""
        Question: {question}
        Expected answer: {result}
        Agent's response: {agent_response}

        Evaluate if the agent's response is correct.
        Consider partial matches and semantic equivalence.
        Return JSON: {{"pass": true/false, "reasoning": "..."}}
    """
    # Use the LLM to evaluate
    # ... implementation details
```
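To fill in the elided evaluation call, one option is to make a one-off request with the standalone `openai` Python SDK rather than the LiveKit plugin wrapper. The helper below is a sketch under that assumption: it expects `OPENAI_API_KEY` in the environment, and the model name is illustrative.

```python
import json

from openai import AsyncOpenAI

# Separate client for grading, independent of the voice pipeline's LLM.
_eval_client = AsyncOpenAI()


async def evaluate_with_llm(question: str, expected: str, agent_response: str) -> dict:
    """Ask a separate model whether `agent_response` matches `expected` for `question`."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent's response: {agent_response}\n\n"
        "Evaluate if the agent's response is correct. Consider partial matches and "
        'semantic equivalence. Return JSON: {"pass": true/false, "reasoning": "..."}'
    )

    completion = await _eval_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```

Inside `grade_answer`, you could then `await evaluate_with_llm(question, result, agent_response)` and report the returned `pass` and `reasoning` fields instead of grading directly in the prompt.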
Complete Example
For a complete working example of an evaluation agent, see evals_agent.py in the LiveKit examples repository.
Summary
| Component | Purpose |
|---|---|
| Evaluation Agent | Plays the role of a user, asks scripted questions |
| Function Tool | Captures and grades responses |
| RoomOptions | Enables agent-to-agent communication |
| PARTICIPANT_KIND_AGENT | Allows listening to other agents |
Additional Resources
- LiveKit Agents Testing Framework — Learn about unit testing agents
- Python Agents Examples — More agent examples and patterns