hi-ml

First steps: How to run your Python code in the cloud

The simplest use case for the hi-ml toolbox is taking a script that you developed and running it inside of Azure Machine Learning (AML) services. This can be helpful because the cloud gives you access to massive GPU resources, lets you consume vast datasets, and provides multiple machines at the same time for distributed training.

Setting up AzureML

To run your code in the cloud, you need to have an AzureML workspace in your Azure subscription. Please follow the instructions here to create an AzureML workspace if you don’t have one yet.

Download the config file from your AzureML workspace, as described here. Put this file (it should be called config.json) into the folder where your script lives, or one of its parent folders. You can use parent folders up to the last parent that is still included in the PYTHONPATH environment variable: hi-ml will try to be smart and search through all folders that it thinks belong to your current project.
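
For reference, the downloaded config.json contains the coordinates of your AzureML workspace; the values below are placeholders for your own subscription details:

    {
        "subscription_id": "your-subscription-id",
        "resource_group": "your-resource-group",
        "workspace_name": "your-workspace-name"
    }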

Using the AzureML integration layer

Consider a simple use case, where you have a Python script that does something - this could be training a model, or pre-processing some data. The hi-ml package helps you run that easily on Azure Machine Learning (AML) services.

Here is an example script that reads images from a folder, resizes them, and saves them to an output folder:

from pathlib import Path
if __name__ == '__main__':
    input_folder = Path("/tmp/my_dataset")
    output_folder = Path("/tmp/my_output")
    for file in input_folder.glob("*.jpg"):
        # read_image, resize, and write_image stand in for your own image handling code
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_folder / file.name)

Doing that at scale can take a long time. We’d like to run that script in AzureML, consume the data from a folder in blob storage, and write the results back to blob storage, so that we can later use it as an input for model training.

You can achieve that by adding a call to submit_to_azure_if_needed from the hi-ml package:

from pathlib import Path
from health_azure import submit_to_azure_if_needed
if __name__ == '__main__':
    current_file = Path(__file__)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=["images123"],
                                         # Omit this line if you don't create an output dataset (for example, in
                                         # model training scripts)
                                         output_datasets=["images123_resized"],
                                         default_datastore="my_datastore")
    # When running in AzureML, run_info.input_datasets and run_info.output_datasets will be populated,
    # and point to the data coming from blob storage. For runs outside AML, the paths will be None.
    # Replace the None with a meaningful path, so that we can still run the script easily outside AML.
    input_dataset = run_info.input_datasets[0] or Path("/tmp/my_dataset")
    output_dataset = run_info.output_datasets[0] or Path("/tmp/my_output")
    files_processed = []
    for file in input_dataset.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        write_image(resized, output_dataset / file.name)
        files_processed.append(file.name)
    # Any other files that you would not consider an "output dataset", like metrics, should be written to
    # a folder "./outputs". Any files written into that folder will later be visible in the AzureML UI.
    # run_info.output_folder already points to the correct folder.
    stats_file = run_info.output_folder / "processed_files.txt"
    stats_file.write_text("\n".join(files_processed))

Once these changes are in place, you can submit the script to AzureML by supplying an additional --azureml flag on the commandline, like python myscript.py --azureml.

Note that you do not need to modify the argument parser of your script to recognize the --azureml flag.
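
For illustration, here is a sketch of a script that has its own argument parser (the --scale argument is purely illustrative): the hi-ml layer takes care of the --azureml flag, so the parser below never sees it.

    from argparse import ArgumentParser
    from health_azure import submit_to_azure_if_needed

    if __name__ == '__main__':
        run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12")
        parser = ArgumentParser()
        # An argument of the script itself; nothing needs to change here to
        # make the script submittable via the --azureml flag
        parser.add_argument("--scale", type=float, default=0.5)
        args = parser.parse_args()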

Essential arguments to submit_to_azure_if_needed

When calling submit_to_azure_if_needed, you need to supply at least the compute_cluster_name argument, which names the AzureML compute cluster that should run the job (see the example above).

You can also supply an input dataset. For data pre-processing scripts, you can add an output dataset (omit this for ML training scripts).
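
For finer control over how a dataset is accessed, the dataset name can be replaced by a DatasetConfig object. A minimal sketch, where the dataset and datastore names are illustrative:

    from health_azure import DatasetConfig, submit_to_azure_if_needed

    # Specify the datastore and the access mode explicitly, instead of relying on defaults
    input_dataset = DatasetConfig(name="images123", datastore="my_datastore", use_mounting=True)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=[input_dataset])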

For more examples, please see examples.md. For more details about datasets, see here.

Additional arguments you should know about

submit_to_azure_if_needed has a large number of arguments; please check the API documentation for an exhaustive list. Particularly helpful ones are listed below.

Conda environments, Alternate pips, Private wheels

The function submit_to_azure_if_needed tries to locate a Conda environment file named environment.yml in the current folder or in the Python path. Alternatively, the Conda environment file to use can be specified directly with:

    run_info = submit_to_azure_if_needed(
        ...
        conda_environment_file=conda_environment_file,
    )

where conda_environment_file is a pathlib.Path or a string identifying the Conda environment file to use.
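
For reference, a minimal environment.yml could look like the sketch below; the package list is purely illustrative:

    name: my-project
    channels:
      - defaults
    dependencies:
      - python=3.9
      - pip
      - pip:
          - hi-ml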

The basic use of Conda assumes that packages listed are published Conda packages or published Python packages on PyPI. However, during development, the Python package may be on Test.PyPI, or in some other location, in which case the alternative package location can be specified directly with:

    run_info = submit_to_azure_if_needed(
        ...
        pip_extra_index_url="https://test.pypi.org/simple/",
    )

Finally, if the package is only available locally, it is possible to use a private wheel:

    run_info = submit_to_azure_if_needed(
        ...
        private_pip_wheel_path=private_pip_wheel_path,
    )

where private_pip_wheel_path is a pathlib.Path or a string identifying the wheel package to use. This wheel will then be copied to the AzureML environment and installed as a private wheel.