PytestAgentsSDK
This project provides a sample test harness for evaluating Copilot Studio agents using Pytest and DeepEval. It uses the Microsoft 365 Agents SDK to communicate with Copilot Studio and focuses on semantic evaluation of agent responses using DeepEval’s GEval metric.
Features
- Multi-turn conversation testing against a Copilot Studio agent
- Semantic response evaluation using DeepEval’s GEval metric (see the sketch after this list)
- Loads test cases from a CSV file
- Custom HTML reporting with detailed metadata (user input, actual and expected output, score, reason)
- Authentication via MSAL, supporting “Authenticate with Microsoft” in Copilot Studio
- Easily extensible for use with additional metrics and long-term result tracking using DeepEval and Pytest plugins
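The GEval-based evaluation mentioned above follows a standard DeepEval pattern. As a minimal sketch (the criteria string and threshold here are illustrative assumptions, not the harness's exact configuration):

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative metric; the harness may use different criteria and threshold.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is semantically consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",  # response captured from the agent
    expected_output="The capital of France is Paris.",  # ideal answer from the CSV
)
assert_test(test_case, [correctness])  # fails the pytest test if the score falls below the threshold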
Setup
1. Clone the repository
git clone https://github.com/microsoft/CopilotStudioSamples.git
cd CopilotStudioSamples/FunctionalTesting/PytestAgentsSDK
2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
3. Install required dependencies
pip install -r requirements.txt
4. Create an app registration
You will need to register an application in Azure for the SDK to authenticate with Copilot Studio:
- Create a single-tenant app registration in Azure
- Under Authentication → Platform configurations, click Add a platform, and select Mobile and desktop applications
- Add these redirect URIs:
  - msal40347a26-35bb-48f3-bdc4-7f4f209aecb1://auth (MSAL only)
  - http://localhost
- Under API permissions, click Add a permission
- Choose APIs my organization uses, then search for Power Platform API
- Choose Delegated permissions, then add CopilotStudio.Copilots.Invoke
Note: If the Power Platform API doesn’t appear in the list, its visibility in your tenant may be stale; run the refresh script from the Microsoft documentation to make it available.
5. Authentication and Agent details
Create a .env file (you can copy from .env.template) and populate it with your MSAL and Copilot Studio agent configuration:
APP_CLIENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
ENVIRONMENT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AGENT_IDENTIFIER=cr26e_dMyAgent # This is the schema name, found under Settings > Advanced > Metadata
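At runtime the harness reads these values from the environment. As a rough sketch of how they might be loaded, assuming python-dotenv is available (a common companion to .env files; the exact loading code in this repository may differ):

import os
from dotenv import load_dotenv  # python-dotenv; assumed here, check requirements.txt

load_dotenv()  # reads key=value pairs from .env into the process environment

required = ("APP_CLIENT_ID", "TENANT_ID", "ENVIRONMENT_ID", "AGENT_IDENTIFIER")
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required .env values: {', '.join(missing)}")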
6. Configure Azure OpenAI or OpenAI details
You can use either OpenAI or Azure OpenAI with DeepEval.
To configure Azure OpenAI using the DeepEval CLI:
deepeval set-azure-openai \
--openai-endpoint=<endpoint> \ # e.g. https://example-resource.openai.azure.com/
--openai-api-key=<api_key> \
--openai-model-name=<model_name> \ # e.g. gpt-4o
--deployment-name=<deployment_name> \ # e.g. Test Deployment
--openai-api-version=<openai_api_version> # e.g. 2025-01-01-preview
These values will be stored in a local .deepeval configuration file.
Alternatively, if you’re using OpenAI (not Azure), set the following environment variable:
export OPENAI_API_KEY=<your-openai-key>
7. Publish and set agent authentication
Before running tests, ensure that your Copilot Studio agent is:
- Published in the Copilot Studio portal
- Configured to use Authenticate with Microsoft under Settings > Security > Authentication
8. Prepare Test Cases (CSV Input)
Before running the tests, populate the CSV file at input/test_cases.csv with your test cases.
The CSV file must contain two columns:
- input_text: The message sent to the Copilot Studio agent
- expected_output: The ideal response you’d expect from the agent
Example:
input_text,expected_output
What is the capital of France?,"The capital of France is Paris, which is known for its historical landmarks like the Eiffel Tower and the Louvre Museum."
Who wrote 'Hamlet'?,"William Shakespeare wrote the play 'Hamlet', which is considered one of the greatest works of English literature."
What is the chemical symbol for water?,H2O is the chemical symbol for water.
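The harness parses this file into individual test cases. As a sketch of reading the two columns with Python's standard csv module (the function name is illustrative, not taken from the repository):

import csv

def load_test_cases(path="input/test_cases.csv"):
    """Yield (input_text, expected_output) pairs from the CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["input_text"], row["expected_output"]

Note that DictReader handles the quoted fields in the example above, so expected answers may contain commas as long as the field is wrapped in double quotes.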
Running the Tests
From the PytestAgentsSDK directory, run:
pytest tests/multi_turn_eval_openai.py --html=reports/multi_turn_eval_openai.html --self-contained-html
This will:
- Start a conversation with your Copilot Studio agent
- Send test questions and capture responses
- Evaluate the responses using DeepEval
- Generate a self-contained HTML report in the reports/ folder
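Put together, the test module follows roughly this shape. This is a hedged sketch: ask_agent stands in for the Microsoft 365 Agents SDK call that sends a message to the Copilot Studio agent and returns its reply, and load_test_cases is the CSV reader sketched earlier; neither name is taken from the repository.

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is semantically consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

@pytest.mark.parametrize("input_text,expected_output", list(load_test_cases()))
def test_agent_response(input_text, expected_output):
    actual_output = ask_agent(input_text)  # hypothetical helper wrapping the Agents SDK call
    assert_test(
        LLMTestCase(
            input=input_text,
            actual_output=actual_output,
            expected_output=expected_output,
        ),
        [correctness],
    )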
Output
The HTML report includes:
- Pass/Fail status based on semantic threshold
- User message and expected answer
- Actual response from the agent
- DeepEval score
- Explanation for the result
- Conversation ID (for debugging)
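This per-test metadata is injected into the report through pytest-html hooks. A minimal conftest.py sketch, assuming pytest-html 4.x (where the report attribute is named extras) and a hypothetical eval_metadata dict that each test stashes on its item:

# conftest.py
import pytest
from pytest_html import extras  # provided by the pytest-html plugin

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call":
        metadata = getattr(item, "eval_metadata", None)  # hypothetical dict set by the test body
        if metadata:
            rows = "".join(f"<p><b>{key}</b>: {value}</p>" for key, value in metadata.items())
            report.extras = getattr(report, "extras", []) + [extras.html(rows)]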