How to generate test data in the cloud based on documents#

This guide shows how to generate test data on Azure AI, so that you can integrate the flow you created and process large amounts of data.

Prerequisites#

  1. Go through the local test data generation guide and prepare your test data generation flow.

  2. Go to the example_gen_test_data folder and run the command pip install -r requirements_cloud.txt to prepare the local environment.

  3. Prepare the cloud environment.

    • Navigate to the file conda.yml.

    • For specific document file types, you may need to install extra packages:

      • .docx - pip install docx2txt

      • .pdf - pip install pypdf

      • .ipynb - pip install nbconvert

      Note: We use the LlamaIndex SimpleDirectoryReader to load documents. For the latest information on required packages, check the SimpleDirectoryReader documentation.

  4. Prepare Azure AI resources in the cloud.

  5. Create a cloud AzureOpenAI or OpenAI connection.

  6. Prepare the test data generation settings.
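
For step 3, it can help to check which of the extra packages your documents actually need before you edit the environment. The following is a minimal sketch, not part of the example flow: the function name required_extra_packages and the folder name "documents" are hypothetical, and the extension-to-package mapping is copied from the list in step 3.

```python
from pathlib import Path

# Mapping copied from the list in step 3 (file extension -> extra package).
EXTRA_PACKAGES = {
    ".docx": "docx2txt",
    ".pdf": "pypdf",
    ".ipynb": "nbconvert",
}


def required_extra_packages(documents_folder: str) -> list:
    """Return the sorted extra packages needed for the files under documents_folder."""
    found = {
        EXTRA_PACKAGES[path.suffix.lower()]
        for path in Path(documents_folder).rglob("*")
        if path.is_file() and path.suffix.lower() in EXTRA_PACKAGES
    }
    return sorted(found)


if __name__ == "__main__":
    # "documents" is a placeholder; point this at your own documents folder.
    for package in required_extra_packages("documents"):
        print(f"pip install {package}")
```

The sketch only inspects file extensions; it does not install anything or modify conda.yml.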

Generate test data in the cloud#

To handle larger volumes of test data, you can use the PRS (parallel run step) component to run the flow in the cloud.

  • Navigate to the example_gen_test_data folder.

  • After configuration, run the following command to generate the test data set:

    python -m generate-test-data.run --cloud
    
  • The generated test data is a data asset, which you can find in the output of the last node. You can register this data asset for future use.
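
Registering the data asset can be done in the studio UI or from code. The following is a minimal sketch using the Azure ML Python SDK v2 (the azure-ai-ml and azure-identity packages, which this guide does not list and which are assumed to be installed). The helper names, the asset name, the asset path, and the uri_folder asset type are all assumptions for illustration; replace them with the values from your own pipeline output.

```python
def build_data_asset_kwargs(name, path, version, asset_type="uri_folder"):
    """Collect the fields used to describe the data asset.

    asset_type defaults to "uri_folder", which is an assumption; use
    "uri_file" if the output of your last node is a single file.
    """
    return {
        "name": name,
        "path": path,
        "version": version,
        "type": asset_type,
        "description": "Generated test data",
    }


def register_data_asset(subscription_id, resource_group, workspace, asset_kwargs):
    """Register the asset in the workspace and return the registered entity."""
    # Imported here so the helper above can be used without the Azure SDK installed.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import Data
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )
    return ml_client.data.create_or_update(Data(**asset_kwargs))
```

A call could look like register_data_asset("<subscription-id>", "<resource-group>", "<workspace>", build_data_asset_kwargs("generated-test-data", "<path-to-last-node-output>", "1")), where every argument is a placeholder to fill in from your own workspace and job.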