How to generate test data in the cloud based on documents#
This guide shows how to generate test data on Azure AI so that you can integrate the flow you created and process large amounts of data.
Prerequisites#
Go through the local test data generation guide and prepare your test data generation flow.
Go to the example_gen_test_data folder and run the command
pip install -r requirements_cloud.txt
to prepare the local environment.
Prepare the cloud environment.
Navigate to the conda.yml file.
For specific document file types, you may need to install extra packages:
.docx -
pip install docx2txt
.pdf -
pip install pypdf
.ipynb -
pip install nbconvert
Note: We use the LlamaIndex SimpleDirectoryReader to load documents. For the latest information on required packages, please check the LlamaIndex documentation for SimpleDirectoryReader.
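The extension-to-package list above can be checked against a document folder before installing anything. The following is a minimal sketch using only the Python standard library; the function names and the folder argument are hypothetical and not part of the example project, and the mapping only covers the three file types listed in this guide.

```python
import importlib.util
from pathlib import Path

# Mapping taken from the list above: file extension -> pip package name.
# For these three packages the pip name and the import name are the same.
EXTRA_PACKAGES = {
    ".docx": "docx2txt",
    ".pdf": "pypdf",
    ".ipynb": "nbconvert",
}


def required_packages(documents_folder):
    """Return the extra packages needed for the file types found in the folder."""
    extensions = {
        path.suffix.lower()
        for path in Path(documents_folder).rglob("*")
        if path.is_file()
    }
    return [pkg for ext, pkg in sorted(EXTRA_PACKAGES.items()) if ext in extensions]


def missing_packages(documents_folder):
    """Return the required packages that are not importable in this environment."""
    return [
        pkg
        for pkg in required_packages(documents_folder)
        if importlib.util.find_spec(pkg) is None
    ]
```

Calling `missing_packages("path/to/documents")` returns the names to pass to `pip install`. For the cloud run, the same packages would presumably also need to be listed in conda.yml.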
Prepare Azure AI resources in the cloud.
An Azure AI ML workspace - Create workspace resources you need to get started with Azure AI.
A compute target - Learn more about compute cluster.
Prepare the test data generation settings.
Navigate to the example_gen_test_data folder.
Prepare config.yml by copying config.yml.example.
Fill in the configurations in config.yml by following the inline comment instructions.
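The copy step can also be scripted so that an existing config.yml is never overwritten. This is a small sketch, assuming it is run against the example_gen_test_data folder; `prepare_config` is a hypothetical helper, not something the example project provides.

```python
import shutil
from pathlib import Path


def prepare_config(folder="."):
    """Copy config.yml.example to config.yml unless config.yml already exists."""
    folder = Path(folder)
    example = folder / "config.yml.example"
    target = folder / "config.yml"
    if target.exists():
        # Keep any values that were already filled in.
        return target
    shutil.copyfile(example, target)
    return target
```

After the copy, the values in config.yml still have to be filled in by hand, following the inline comments.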
Generate test data in the cloud#
To handle larger volumes of test data, you can use the PRS (parallel run step) component to run the flow in the cloud.
Navigate to the example_gen_test_data folder.
After configuration, run the following command to generate the test data set:
python -m generate-test-data.run --cloud
The generated test data will be a data asset which can be found in the output of the last node. You can register this data asset for future use.
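Registering the output as a named data asset could look roughly like the sketch below. It assumes the Azure ML Python SDK v2 (`azure-ai-ml`) is installed and that you already have an authenticated `MLClient`. The job name, output name, asset name, and asset type are placeholders you would need to look up for your own run; none of them come from this guide.

```python
def build_job_output_uri(job_name, output_name):
    """Build an azureml:// URI that points at a named output of a job."""
    return f"azureml://jobs/{job_name}/outputs/{output_name}/paths/"


def register_test_data(ml_client, job_name, output_name, asset_name,
                       asset_type="uri_folder"):
    """Register a job output as a data asset (all argument values are assumptions).

    ml_client: an authenticated azure.ai.ml.MLClient for your workspace.
    asset_type: "uri_folder" or "uri_file", depending on what the last node
    of your pipeline actually writes.
    """
    # Imported lazily so the helper above works without the SDK installed.
    from azure.ai.ml.entities import Data

    data = Data(
        name=asset_name,
        path=build_job_output_uri(job_name, output_name),
        type=asset_type,
        description="Test data generated from documents",
    )
    return ml_client.data.create_or_update(data)
```

The same registration can be done from the job's output view in the Azure ML studio UI, so the scripted version is mainly useful for repeated runs.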