# Multi-node Search
This notebook and the accompanying code show how to run an Archai Neural Architecture Search (NAS) in an Azure Machine Learning workspace, with partial training of models (on a GPU cluster) providing the validation accuracies that guide the search. This notebook requires that you have the Azure CLI installed and logged in; otherwise the create-compute-cluster cell will fail. You will also need to create an Azure ML workspace using https://ml.azure.com and an Azure storage account. The storage account does not need to be in the same resource group as the workspace.
This notebook also assumes you have a Python environment set up using `pip install -e .[aml]` in your Archai repository root. This example requires a `config.json` file containing information about your Azure subscription, the Azure ML workspace name and resource group, and the Azure storage account name and key:
```json
{
    "subscription_id": "...",
    "resource_group": "...",
    "workspace_name": "...",
    "storage_account_key": "...",
    "storage_account_name": "..."
}
```
See:
- Set up a Python development environment for Azure Machine Learning
- Get your Storage Account keys
[2]:
import os
import json
import archai.common.azureml_helper as aml_helper
import archai.common.notebook_helper as nb_helper
from azure.ai.ml.entities import UserIdentityConfiguration
# locate the code that we will use in this notebook.
scripts_dir = "./scripts"
data_scripts_dir = "./data_prep"
config_file = "../.azureml/config.json"
config = json.load(open(config_file, 'r'))
for required_key in ['subscription_id', 'resource_group', 'workspace_name', 'storage_account_key', 'storage_account_name']:
    if required_key not in config:
        print(f"### Error: please add a {required_key} to {config_file}")
storage_account_key = config['storage_account_key']
storage_account_name = config['storage_account_name']
workspace_name = config['workspace_name']
print(f'Using workspace {workspace_name} and storage account: {storage_account_name}')
# This is the AML experiment name
experiment_name = 'mnist_test_run'
Using workspace archai-aml-test and storage account: archaimnistmodels
Our search.py will use a ConfigSearchSpace based on the following parameters; the model defined in model.py takes each configuration and builds a CNN of a different shape, as shown below. With 12 choices of nb_layers, 4 kernel sizes, and 4 hidden dimensions, the space contains 12 × 4 × 4 = 192 possible configurations. Each time you execute this cell it generates a new random CNN model.
[4]:
from archai.discrete_search.search_spaces.config import ArchParamTree, DiscreteChoice, ArchConfig
from scripts.model import MyModel
arch_param_tree = ArchParamTree({
    'nb_layers': DiscreteChoice(list(range(1, 13))),
    'kernel_size': DiscreteChoice([1, 3, 5, 7]),
    'hidden_dim': DiscreteChoice([16, 32, 64, 128])
})
arch_config = arch_param_tree.sample_config()
print(arch_config)
MyModel(arch_config)
ArchConfig({
    "nb_layers": 2,
    "kernel_size": 7,
    "hidden_dim": 16
})
[4]:
MyModel(
  (cross_entropy_loss): CrossEntropyLoss()
  (model): Sequential(
    (0): Conv2d(1, 16, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(16, 16, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (4): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): AdaptiveAvgPool2d(output_size=(1, 1))
    (7): Conv2d(16, 10, kernel_size=(1, 1), stride=(1, 1))
  )
)
[5]:
ml_client = aml_helper.get_aml_client_from_file(config_path=config_file)
print(f'Using workspace "{ml_client.workspace_name}" in resource group "{ml_client.resource_group_name}"')
Found the config file in: ..\.azureml\config.json
Using workspace "archai-aml-test" in resource group "snpe-aml-rg"
Create the compute clusters that we need. If this cell fails with `DefaultAzureCredential failed to retrieve a token from the included credentials`, you might need to run `az login` from the command line using the Azure CLI. If it then fails with `ResourceNotFoundError: (ParentResourceNotFound)`, you may need to run `az account set --subscription ...` with the subscription id you specified in the above `config.json` file, and check that the resource group you specified really does contain the Azure ML workspace you named.
[6]:
# Create cpu cluster for running the search
cpu_compute_name = "nas-cpu-cluster-D14-v2"
aml_helper.create_compute_cluster(ml_client, cpu_compute_name, size="Standard_D14_v2", location="westus2")

# Create gpu cluster for running the partial training jobs
gpu_compute_name = "nas-gpu-cluster-NC6"
aml_helper.create_compute_cluster(ml_client, gpu_compute_name, size="Standard_NC6", location="westus2", max_instances=8)
You already have a cluster named nas-cpu-cluster-D14-v2, we'll reuse it as is.
You already have a cluster named nas-gpu-cluster-NC6, we'll reuse it as is.
[6]:
AmlCompute({'type': 'amlcompute', 'created_on': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'name': 'nas-gpu-cluster-NC6', 'description': None, 'tags': None, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/c8b7f913-60fb-4759-a310-fc5630e56f99/resourceGroups/snpe-aml-rg/providers/Microsoft.MachineLearningServices/workspaces/archai-aml-test/computes/nas-gpu-cluster-NC6', 'Resource__source_path': None, 'base_path': 'd:\\git\\microsoft\\archai\\archai\\docs\\advanced_guide\\cloud\\azure\\notebooks\\multi_node_search', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x000001E4CB5ACFD0>, 'resource_id': None, 'location': 'westus2', 'size': 'STANDARD_NC6', 'min_instances': 0, 'max_instances': 8, 'idle_time_before_scale_down': 180.0, 'identity': None, 'ssh_public_access_enabled': True, 'ssh_settings': None, 'network_settings': <azure.ai.ml.entities._compute.compute.NetworkSettings object at 0x000001E4CB4A7610>, 'tier': 'dedicated', 'enable_node_public_ip': True, 'subnet': None})
Create the AML Environment from our conda.yaml file. This ensures our conda environment contains the Archai framework that we are using here.
[7]:
archai_job_env = aml_helper.create_environment_from_file(
    ml_client,
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
    conda_file="conda.yaml",
    version='1.0.0')
environment_name = f"{archai_job_env.name}:{archai_job_env.version}"
print(environment_name)
Environment with name aml-archai is registered to workspace, the environment version is 1.0.0
aml-archai:1.0.0
Ensure our storage account is set up with a `models` blob store container for storing the final ONNX models, a `datasets` blob store container for our training dataset, and a `status` storage table.
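The blob containers are created below; the `status` table is used by the training jobs to report progress. As a hypothetical illustration (the method names `get_status` and `merge_status_entity` are assumed from `archai.common.store`, and `my_model_id` is a placeholder), a training job might record its result like this:

```python
from archai.common.store import ArchaiStore

# Hypothetical sketch: a training job posting its validation accuracy
# to the shared status table so the monitor can collect it later.
store = ArchaiStore(storage_account_name, storage_account_key)
e = store.get_status('my_model_id')  # fetches (or creates) the row for this model
e['val_acc'] = 0.9812
e['status'] = 'completed'
store.merge_status_entity(e)
```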
[8]:
from archai.common.store import ArchaiStore
# Register the datastore with AML
data_store_name = "datasets"
data_container_name = "datasets"
model_store_name = "models"
model_container_name = "models"
root_folder = experiment_name
# make sure the datasets container exists
store = ArchaiStore(storage_account_name, storage_account_key, blob_container_name=data_container_name)
store.upload_blob(root_folder, config_file)
# make sure the models container exists
store = ArchaiStore(storage_account_name, storage_account_key, blob_container_name=model_container_name)
store.upload_blob("config", config_file)
datastore_path = f'azureml://datastores/{data_store_name}/paths/{root_folder}'
results_path = f'azureml://datastores/{model_store_name}/paths/{root_folder}'
Register the `models` blob store container in the ML workspace datastores.
[9]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities._credentials import AccountKeyConfiguration
try:
    model_store = ml_client.datastores.get(model_store_name)
except Exception:
    # The datastore is not registered yet, so create it.
    model_store = AzureBlobDatastore(
        name=model_store_name,
        description="Datastore pointing to our models blob container.",
        account_name=storage_account_name,
        container_name=model_container_name,
        credentials=AccountKeyConfiguration(
            account_key=storage_account_key
        ),
    )
    ml_client.create_or_update(model_store)
Register the `datasets` blob store container in the ML workspace datastores.
[10]:
try:
    data_store = ml_client.datastores.get(data_store_name)
except Exception:
    # The datastore is not registered yet, so create it.
    data_store = AzureBlobDatastore(
        name=data_store_name,
        description="Datastore pointing to our dataset container.",
        account_name=storage_account_name,
        container_name=data_container_name,
        credentials=AccountKeyConfiguration(
            account_key=storage_account_key
        ),
    )
    ml_client.create_or_update(data_store)
Create a pipeline command that prepares our training dataset using `prep_data_store.py`. This pipeline writes its output to our `datasets` blob store so the training jobs can find the dataset there, ready to go. That way each training job doesn't have to repeat the download and preparation of the dataset.
[11]:
from azure.ai.ml import command
from azure.ai.ml import Input, Output
data_prep_component = command(
    name="data_prep2",
    display_name="Data preparation for training",
    description="Downloads the remote dataset to our blob store.",
    inputs={
        "name": Input(type='string')
    },
    outputs={
        "data": Output(type="uri_folder", path=datastore_path, mode="rw_mount")
    },
    # The source folder of the component
    code=data_scripts_dir,
    command="""python3 prep_data_store.py \
    --path ${{outputs.data}} \
    """,
    environment=environment_name,
)
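The actual script lives in `./data_prep`; roughly speaking, all it has to do is download MNIST into the mounted output folder. A minimal sketch, assuming `MnistDatasetProvider` accepts a root directory (the `--path` argument name matches the command above):

```python
# prep_data_store.py (sketch): download MNIST into the mounted blob store folder.
import argparse

from archai.datasets.cv.mnist_dataset_provider import MnistDatasetProvider


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', required=True, help='mounted output folder')
    args = parser.parse_args()
    # Fetching both splits populates the folder; later jobs mount it read-only.
    provider = MnistDatasetProvider(root=args.path)
    provider.get_train_dataset()
    provider.get_val_dataset()


if __name__ == '__main__':
    main()
```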
Create a command that kicks off the Archai search using `search.py`. It takes the dataset as input so it can pass it along to the training jobs later on. It also produces some search output files, .png charts, and JSON results files, which go into our `models` blob store under a folder named after our `experiment_name`.
[12]:
hex_config = bytes(json.dumps(config), encoding='utf-8').hex()
partial_epochs = 0.1
max_unseen_population = 16  # best if this is an even multiple of our gpu cluster size, we'll get much better throughput.
search_iterations = 5  # for quick debugging.
init_num_models = 10
output_path = results_path + '/' + experiment_name

fixed_args = f'--data_dir {datastore_path} ' + \
             f'--output_dir {output_path} ' + \
             f'--environment "{environment_name}" ' + \
             f'--experiment "{experiment_name}" ' + \
             f'--compute "{gpu_compute_name}" ' + \
             f'--config "{hex_config}" ' + \
             f'--search_iterations {search_iterations} ' + \
             f'--init_num_models {init_num_models} ' + \
             f'--max_unseen_population {max_unseen_population} ' + \
             f'--partial_training_epochs {partial_epochs} '
search_component = command(
    name="search",
    display_name="The Archai NAS search",
    description="Runs the NAS search algorithm.",
    is_deterministic=False,
    inputs={
        "data": Input(type="uri_folder")
    },
    outputs={
        "results": Output(type="uri_folder", path=output_path, mode="rw_mount")
    },
    code=scripts_dir,
    identity=UserIdentityConfiguration(),
    command='python3 search.py --local_output ${{outputs.results}} ' + \
            fixed_args,
    environment=environment_name,
)
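The `--config` value is the `config.json` contents hex-encoded so the JSON survives command-line quoting. On the receiving end, `search.py` can invert the encoding in one line; a minimal sketch:

```python
import json

def decode_hex_config(hex_config: str) -> dict:
    """Invert bytes(json.dumps(config), encoding='utf-8').hex() from above."""
    return json.loads(bytes.fromhex(hex_config).decode('utf-8'))
```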
This creates a command that will do full training on the final list of best models produced by the above search command. It also creates a monitor command that watches all the parallel training jobs and gathers the results when they are all complete, updating our final `models.json` file.
[13]:
from scripts.commands import make_training_pipeline_command, make_monitor_command
full_epochs = 10
timeout = 3600
hex_config = bytes(json.dumps(config), encoding='utf-8').hex()
full_training_component = make_training_pipeline_command(
    "Full Training Pipeline", hex_config, scripts_dir, gpu_compute_name,
    datastore_path, output_path, experiment_name, environment_name, full_epochs, save_models=True)
keys = ['val_acc']
monitor_component = make_monitor_command(hex_config, scripts_dir, results_path, environment_name, keys, timeout)
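Conceptually, the monitor just polls the Azure `status` table until every model has reported the requested metrics (here `val_acc`) or the timeout expires. A hypothetical sketch of that loop, assuming `ArchaiStore.get_all_status_entities` returns one dict per model keyed by `name` (the helpers in `scripts/commands.py` are authoritative):

```python
import time

from archai.common.store import ArchaiStore


def wait_for_training(store: ArchaiStore, model_ids, keys, timeout=3600):
    """Poll the status table until every model reports all metrics in `keys`."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        entities = {e['name']: e for e in store.get_all_status_entities()}
        done = [m for m in model_ids
                if m in entities and all(k in entities[m] for k in keys)]
        if len(done) == len(model_ids):
            return [entities[m] for m in model_ids]
        time.sleep(60)  # full training takes minutes, so poll slowly
    raise TimeoutError(f'Training did not finish within {timeout} seconds')
```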
Create an AML pipeline that pipes the output of the data prep job to the search component; when the search is finished, it starts full training of the top models and then waits for all that training to finish.
[14]:
from azure.ai.ml import dsl
@dsl.pipeline(
    compute=cpu_compute_name,
    description="Archai search pipeline",
)
def archai_search_pipeline():
    data_prep_job = data_prep_component(
        name="MNIST"
    )

    search_job = search_component(
        data=data_prep_job.outputs.data
    )

    training_job = full_training_component(
        models=search_job.outputs.results,
        data=data_prep_job.outputs.data
    )

    monitor_job = monitor_component(
        models=search_job.outputs.results,
        training_results=training_job.outputs.results
    )

    return {
        "results": monitor_job.outputs.results
    }
Submit the pipeline job so it starts running in your Azure ML workspace.
[13]:
pipeline_job = ml_client.jobs.create_or_update(
    archai_search_pipeline(),
    # Project's name
    experiment_name=experiment_name,
)
Open the pipeline in the Azure ML studio portal in your web browser (this works when you are running this notebook in VS Code).
[14]:
import webbrowser
webbrowser.open(pipeline_job.services["Studio"].endpoint)
job_name = pipeline_job.name
print(f'Started pipeline: {job_name}')
Started pipeline: amusing_monkey_2248p6qm47
[15]:
# you can fetch any pipeline job again if you needed to continue this notebook later:
from azure.ai.ml import dsl, Input, Output
# job_name = 'amusing_monkey_2248p6qm47'
pipeline_job = ml_client.jobs.get(job_name)
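If you would rather block in the notebook until the pipeline finishes instead of watching the portal, the v2 SDK can stream the job logs:

```python
# Optional: block until the pipeline completes, echoing its logs into the notebook.
ml_client.jobs.stream(job_name)
```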
## Plots
This cell can be run multiple times and you will see updates as each iteration finishes. You can even run it later, after restarting the Jupyter notebook, because it does not depend on any variable state; it depends only on the persistent 'models' blob store.
[ ]:
from scripts.utils import get_results, show_results, download_best_models
from archai.common.store import ArchaiStore
store = ArchaiStore(storage_account_name, storage_account_key, blob_container_name=model_container_name)
print(f'Fetching results for {experiment_name}...')
blob_path = root_folder + '/' + experiment_name
output_folder = experiment_name
get_results(store, blob_path, output_folder)
download_best_models(store, experiment_name, output_folder)
Take a look at the Pareto curve plots.
[17]:
# Now show the png plots produced by the search
show_results(output_folder)
[31]:
from IPython.display import display
from IPython.core.display import HTML
df = nb_helper.get_search_csv(output_folder)
df.drop(columns=['parent', 'parents'], inplace=True)
df = df[df["AmlTrainingValAccuracy"] > 0.97]
df[df["is_pareto"] == True]
[31]:
| | archid | AmlTrainingValAccuracy | ONNX Latency (ms) | FLOPs | iteration_num | search_walltime_hours | is_pareto |
|---|---|---|---|---|---|---|---|
| 0 | 00499976e367af150716d050c76b2ce013f77d0b | 0.9891 | 0.000970 | 465182986.0 | 0 | 0.130325 | True |
| 1 | 195dd9f8e55b4441ecc584ef98fd897c6b5fee92 | 0.9874 | 0.000672 | 396466314.0 | 0 | 0.130325 | True |
| 2 | 9d53f0479e4fb954fbd9d8105bba31c3849dcaaf | 0.9781 | 0.000363 | 99888202.0 | 0 | 0.130325 | True |
| 9 | 96fb7db976ef30e2bd9c65e9657b7c7a61e7eee4 | 0.9782 | 0.000400 | 131537034.0 | 0 | 0.130325 | True |
| 11 | cc3bfa8581c8aec6f6faba274e0dba8b66c7286c | 0.9783 | 0.000385 | 162345098.0 | 1 | 0.206381 | True |
| 12 | c11b3549c49c1c58af2d6b7e2ed8e4f929ba4123 | 0.9824 | 0.000632 | 159045706.0 | 1 | 0.206381 | True |
| 22 | 821b0e0fcbdc144ae4c223c7c9291727e2413485 | 0.9808 | 0.000425 | 91458634.0 | 1 | 0.206381 | True |
| 30 | 8b0140dbdce869bd1f06def5e837c88c8944f6c7 | 0.9727 | 0.000266 | 87883914.0 | 2 | 0.299725 | True |
| 43 | 5d3ab570fb0955bf587eddb16cc2bc7501f835f2 | 0.9759 | 0.000336 | 61202506.0 | 3 | 0.410679 | True |
| 52 | c834473070cd3911e1003b1231c709f9f4461377 | 0.9914 | 0.001342 | 967344394.0 | 3 | 0.410679 | True |
| 55 | 0580140451595a97cba5014fe430dce1f6924e2e | 0.9844 | 0.000663 | 363550858.0 | 3 | 0.410679 | True |
| 71 | 33e4047803c5dbfe2cd0af9ca86529a31f1a34cc | 0.9813 | 0.000423 | 160639114.0 | 4 | 0.476453 | True |
## Test ONNX Runtime Inference on the Best Model
When the search pipeline completes you should have a `models.json` file in the `experiment_name` output folder, and you can use it to find the most accurate model and run it through the ONNX runtime to see whether ONNX inference gives the same accuracy.
[24]:
# find the top model in the json results.
filename = os.path.join(output_folder, "models.json")
best_of_the_best = None
top_accuracy = -1
row = None
if not os.path.isfile(filename):
    raise Exception(f"Could not find {filename} file. Please wait for job to finish.")

results = json.load(open(filename, "r"))
models = results['models']
for a in models:
    if isinstance(a, dict) and 'val_acc' in a:
        val_acc = a['val_acc']
        if val_acc > top_accuracy:
            top_accuracy = val_acc
            best_of_the_best = a['id']
            row = a

model = MyModel(ArchConfig(row))
arch = f"nb_layers={model.nb_layers}, kernel_size={model.kernel_size}, hidden_dim={model.hidden_dim}"
print(f"The top model is {best_of_the_best} with accuracy {top_accuracy} and architecture {arch}")

blob_path = root_folder + '/' + best_of_the_best
model_output = os.path.join(output_folder, 'top_model')
get_results(store, blob_path, model_output)
model_path = os.path.join(model_output, 'model.onnx')
The top model is id_ea147000_e08a_4ff6_b3a9_e1e1a9f1a211 with accuracy 0.9907000064849854 and architecture nb_layers=7, kernel_size=5, hidden_dim=64
[26]:
# Great, now let's test if this model works as advertised.
from archai.datasets.cv.mnist_dataset_provider import MnistDatasetProvider
import onnxruntime as ort
import numpy as np

dataset_provider = MnistDatasetProvider()
val_data = dataset_provider.get_val_dataset()
count = val_data.data.shape[0]
test = np.random.choice(count, 1)[0]
data = val_data.data[test]

import matplotlib.pyplot as plt

# check what the images look like.
plt.figure(figsize=(2, 2))
plt.imshow(data, cmap='gray')
print(f'data has shape: {data.shape}')
plt.axis('off')
plt.show()

# Now run the ONNX runtime on the validation set.
# You can change this to `CUDAExecutionProvider` if you have a GPU and have
# installed the CUDA runtime.
ort_sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
for i in ort_sess.get_inputs():
    print(f'input: {i.name}, {i.shape}, {i.type}')

print(f'Testing {count} rows')
failed = 0
for i in range(val_data.data.shape[0]):
    data = val_data.data[i]
    # train_labels is a deprecated torchvision alias for targets.
    expected = int(val_data.train_labels[i])
    while len(data.shape) < 4:
        # The model expects a (batch, channel, height, width) tensor.
        data = np.expand_dims(data, axis=0)
    outputs = ort_sess.run(None, {'input': data.astype(np.float32) / 255.0})
    result = outputs[0]
    index = np.argmax(result)
    label = val_data.classes[index]
    if expected != index:
        # print(f'### Failed: expected {expected} and got {label}')
        failed += 1

rate = (count - failed) * 100 / count
print(f"Failed {failed} out of {count} rows")
print(f'Inference pass rate is {rate} %.')
print(f'How does this compare with the training validation accuracy of {top_accuracy}')
if np.isclose(rate, top_accuracy * 100, atol=0.1):
    print('Success! The model is working as expected.')
else:
    print('The onnx runtime is giving different results.')
data has shape: torch.Size([28, 28])
input: input, [1, 1, 28, 28], tensor(float)
Testing 10000 rows
Failed 93 out of 10000 rows
Inference pass rate is 99.07 %.
How does this compare with the training validation accuracy of 0.9907000064849854
Success! The model is working as expected.