Post

Bulk File-Based Testing for Copilot Studio: Beyond Standard Evals

How to build a scalable bulk file-based testing framework for Copilot Studio agents using Dataverse, SharePoint, Power Automate, and Power BI when standard evals aren't enough.

Bulk File-Based Testing for Copilot Studio: Beyond Standard Evals

We deployed an invoice-processing agent that scored beautifully on our Copilot Studio eval suite. voluminous real vendor invoices arrived: different templates, different scan qualities, different languages. a good percentage of them failed silently. The eval metrics hadn’t changed, but the agent wasn’t production-ready.

If you’ve built agents that process enterprise documents (PDFs, contracts, spreadsheets, forms) you’ve likely hit the same wall. Standard evals measure prompt-response quality on curated samples. They don’t tell you whether your agent can reliably process thousands of real-world files with the consistency your business requires.

This post walks through a practical architecture for bulk file-based testing using components you probably already have: Dataverse, SharePoint, Power Automate, Copilot Studio, and Power BI.

This approach complements standard evals, it doesn’t replace them. Use evals for prompt quality and response behavior. Use bulk testing for file processing and structured output validation.

Where Standard Evals Fall Short

Copilot Studio evaluations excel at prompt-response testing, regression testing, quality scoring, and multi-turn conversation simulation. If you’re already using them in your CI/CD pipeline (see Quality Gates for Copilot Studio), that’s great, keep doing it.

But when your agent processes enterprise documents as part of operational workflows, you need answers to different questions:

  • Can it handle thousands of invoices from different vendors with different layouts?
  • Does it consistently extract the correct fields across scan qualities and languages?
  • Which specific failure patterns emerge at scale (missing IDs, incorrect totals, format mismatches)?
  • How does accuracy change across prompt versions or model updates?

A robust bulk-testing approach needs to:

  • Execute large volumes of file-based test cases against a defined test matrix
  • Compare every extracted field against gold-standard expected results
  • Support scenario-specific prompts and business rules
  • Track regressions over time with full traceability

The Architecture

A practical solution uses components you probably already have in your tenant:

Component Role
Dataverse Control plane — test definitions, execution history, pass/fail outcomes
SharePoint Configuration layer — input files, gold-standard outputs, prompt assets
Power Automate Orchestration — retrieves tests, invokes agent, compares results
Copilot Studio Execution — processes files and returns structured output
Power BI Reporting — pass rates, error patterns, regression trends

Overall Architecture End-to-end bulk file-based testing architecture showing how test definitions, orchestration, copilot execution, and result comparison work together.

Dataverse as the Control Plane

Each test case is a Dataverse row containing the scenario definition, input file reference, expected output, prompt version, and metadata (status, category, ownership). Execution history captures timestamps, output locations, comparison results, and pass/fail outcomes.

This gives you version-over-version comparison, regression detection, and end-to-end traceability from input to outcome, similar to what you’d get with the Dataverse retrieval patterns post but applied to test management.

Dataverse control plane Dataverse control-plane view: structured test cases, execution history, metadata, and pass/fail outcomes in a centralized system of record.

SharePoint as the Configuration Layer

SharePoint It stores the file-based assets including test inputs, gold-standard expected outputs, prompt assets, and scenario-specific reference content. By separating artifacts from orchestration, prompts, datasets, and expected results can be updated independently, allowing non-developer and business users to easily manage changes through SharePoint, a platform they are already familiar and comfortable with.

SharePoint configuration layer SharePoint configuration layer: input files, expected outputs, and prompt assets organized for reusable bulk test execution.

Orchestration with Power Automate

This is where it all comes together. Power Automate connects Dataverse, SharePoint, and Copilot into a closed-loop testing system:

  1. Read the next eligible test case from Dataverse
  2. Retrieve the input file, expected output, and prompt from SharePoint
  3. Invoke the Copilot agent with the assembled payload
  4. Compare the response against the gold-standard output (field-level)
  5. Write the execution record (pass/fail, mismatches, timestamps) back to Dataverse

Power Automate orchestration flow Power Automate orchestration flow: test retrieval → file loading → copilot execution → result comparison → outcome recording.

At scale, the flow supports batching, parallel execution with controlled concurrency (to stay within service limits), retry logic, and exception handling. Because everything is configuration-driven, you can rerun scenarios across different prompt versions or model variants and compare results over time.

If you’re already using Power Automate to orchestrate agents, the patterns from Combining Workflows and Agents apply here too, especially error handling and concurrency management.

Power BI Reporting Layer

Power BI turns execution data into decision-ready dashboards: pass/fail rates, error concentration by vendor template, mismatch trends across prompt versions, and regression detection over time.

Automation Report Automation report with detailed run statistics.

What This Enables

With this architecture in place, your team gains:

  • Scale — validation across thousands of real-world file scenarios
  • Precision — deterministic field-level comparison against expected results
  • Traceability — every run tied to a scenario, version, and outcome history
  • Regression detection — automatic identification of quality drops across versions
  • Operational visibility — Power BI dashboards for stakeholder confidence

Putting It All Together

Standard evals and bulk file testing aren’t competing approaches. They’re complementary layers:

Layer Validates Tool
Prompt quality Response accuracy, safety, relevance Copilot Studio Evals
Integration quality CI/CD gates, regression blocking Eval API in pipelines
File processing quality Bulk document handling at scale This architecture

A downloadable sample solution implementing this architecture is coming soon. It will provide a working starting point for building your own enterprise-grade bulk file validation framework.


Have you hit similar challenges testing document-heavy agents? What approaches have worked (or failed spectacularly) in your environment? Let us know in the comments.

This post is licensed under CC BY 4.0 by the author.