DLWorkspace

Configuration: Azure Cluster

For more customized configuration, please refer to the Configuration Section and Azure doc.

Azure Cluster specific configuration

Azure cluster configuration has been greatly simplified. At a minimum, you only need to create a config.yaml file under src/ClusterBootstrap that specifies the cluster name.

Cluster Name

The cluster name must be unique and is specified as:

cluster_name: <your cluster name>
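
As a rough sketch, the minimal file could be generated with a small shell helper. The function name write_minimal_config and the example cluster name are hypothetical, not part of DLWorkspace; the default path assumes you run it from the repository root.

```shell
# write_minimal_config NAME [DIR]  (hypothetical helper, not part of DLWorkspace)
# Writes DIR/config.yaml containing only the cluster name.
# DIR defaults to src/ClusterBootstrap (relative to the repository root).
# Refuses to overwrite an existing config.yaml.
write_minimal_config() {
    name="$1"
    dir="${2:-src/ClusterBootstrap}"
    if [ -e "$dir/config.yaml" ]; then
        echo "$dir/config.yaml already exists; not overwriting" >&2
        return 1
    fi
    mkdir -p "$dir"
    printf 'cluster_name: %s\n' "$name" > "$dir/config.yaml"
}
```

For example, write_minimal_config mydltscluster would create src/ClusterBootstrap/config.yaml with the single line cluster_name: mydltscluster.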

Authentication

If you are not building a cluster for use by Microsoft employees, you will also need to configure Authentication.

Additional Configuration

You may provide or change the specification of the deployed Azure cluster by editing config.yaml. Here is an example:

cluster_name: <unique cluster name, e.g. useanothername>

azure_cluster:
    infra_node_num: 1
    infra_vm_size: <az vm size, such as Standard_B2s>
    azure_location: eastus
    worker_node_num: 2
    nfs_node_num: 1
    nfs_data_disk_sz: 31
    nfs_data_disk_num: 2
    worker_vm_size: <az vm size, such as Standard_B2s>
    nfs_vm_size: <az vm size, such as Standard_B2s>
    nfs_local_storage_sz: 1023
    vm_image: Canonical:UbuntuServer:18.04-LTS:18.04.201910030
    nfs_vm:
    - suffix: toad
      data_disk_num: 2
      data_disk_sz_gb: 31
      data_disk_sku: Premium_LRS
      data_disk_mnt_path: /data

nfs_mnt_setup:
  - server_suffix: toad
    mnt_point:
      firstshare:
        curphysicalmountpoint: /mntdlws/nfs
        filesharename: /data/share
        mountpoints: ''

datasource: MySQL
mysql_password: <password, e.g. useanotherpw!>
WinbindServers: []

priority: regular

nfs_client_CIDR:
  node_range:
    - "192.168.0.0/16"
  samba_range:
    - "s.a.m.0/24"

master_token: <DLTS master token for generating user passwords>
activeDirectory:
  tenant: <tenant ID, usually associated with a corp, such as Microsoft>
  clientId: <AAD app ID>
  clientSecret: <AAD app secret>

domain-offset:
  <url1>: <value1>
  <url2>: <value2>
  <can also set '*'>: <value0>

repair-manager:
  portal_url:  <a domain name, e.g. dltshub.mydomain.com>
  ecc_rule:
    cordon_dry_run: False
    reboot_dry_run: True
    alert_job_owners: True
    days_until_node_reboot: 5
    time_sleep_after_pausing: 30
    attempts_for_pause_resume_jobs: 10
  rest_url: http://localhost:5000
  restore_from_rule_cache_dump: True
  rule_cache_dump: /etc/RepairManager/rule-cache.json
  job_owner_email_domain: <an email domain name like microsoft.com>
  latency_rule:
    alert_expiry: 4 # In hours

smtp:
  smtp_url: <smtp, like xxx.com:587>
  smtp_from: <email address that is used to send alert emails>
  smtp_auth_username: <username used for authentication, e.g. same as smtp_from>
  smtp_auth_password: <password for the username above>
  default_recipients: <email address that receives alert emails>
  cc: <email address to CC on alert emails>

WebUIauthorizedGroups: []
WebUIadminGroups: ["CCSAdmins"]
WebUIregisterGroups: ["MicrosoftUsers"]

DeployAuthentications : ["Corp"]

webuiport: 80

cloud_config_nsg_rules:
  default_admin_username: core
  dev_network:
    source_addresses_prefixes:
    # Dev boxes for the cluster; only machines in the address ranges below will have access to the cluster.
    - "b.a.0.0/16"
    - "z.x.0.0/16"
  nfs_share:
    source_ips:
      # IP ranges that are allowed to access the NFS storage
      - "x.y.z.0/24"
      - "a.b.0.0/16"
  nfs_ssh:
    source_ips:
      # IP ranges that are allowed to ssh to the NFS nodes
      - "q.w.e.0/24"
      - "r.f.0.0/16"
    port: "22"

alert-manager:
  configured: True
  alert_users: False # set to True to send alert emails to users; default is False
  smtp_url: <smtp url>
  smtp_from: <email address used to send alert emails, e.g. 'dlts-bot@microsoft.com'>
  smtp_auth_username: <email account that would send email to receivers, such as 'dlts-bot@microsoft.com'>
  smtp_auth_password: <password for the email account above>
  receiver: <email address to send alert email to>

  reaper:
    dry-run: True # change to False to kill idle jobs
    restful-url: http://localhost:5000

prometheus:
  cluster_name: <the unique cluster name> # will be used in link to job detail page

watchdog:
  vc_url: <url used for listing vc, e.g. http://localhost:5000/ListVCs?userName=Administrator>

job-manager:
  notifier:
    cluster: <cluster name>
    alert-manager-url: <url like http://localhost:9093/alert-manager>

registry_credential:
  <docker registry name 1>:
    username: <docker registry username 1>
    password: <docker registry password 1>
  <docker registry name 2>:
    username: <docker registry username 2>
    password: <docker registry password 2>
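
The example above is a template: it contains many <...> placeholders and letter-based sample address ranges such as "x.y.z.0/24" that must be replaced before deployment. A quick way to spot leftovers might be a grep-based check like the sketch below. The function name find_placeholders is hypothetical, and the pattern is only a heuristic (it could also flag a legitimately quoted value that begins with two single-letter dotted labels).

```shell
# find_placeholders FILE  (hypothetical helper, heuristic only)
# Prints, with line numbers, every line that still contains <...> template
# text or a letter-based sample CIDR such as "x.y.z.0/24".
# Exit status follows grep: 0 if leftovers were found, 1 if none were found.
find_placeholders() {
    grep -nE '<[^>]+>|"[a-z]\.[a-z]\.' "$1"
}
```

Running find_placeholders src/ClusterBootstrap/config.yaml before deployment should print nothing once every template value has been filled in.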

Use the following command to list all available Azure locations:

az account list-locations
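
The default output is verbose JSON. Assuming azure_location expects the short region name (such as eastus in the example above), a small wrapper that prints only those names might look like this; the function name list_location_names is hypothetical, while --query and --output are standard Azure CLI options.

```shell
# list_location_names  (hypothetical wrapper around the Azure CLI)
# Prints one short region name per line, e.g. eastus.
# Requires the az CLI and a logged-in account.
list_location_names() {
    az account list-locations --query "[].name" --output tsv
}
```

The output can be piped through grep to narrow it down, for instance list_location_names | grep us.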

Usually, a CPU VM size is used for infra_vm_size and a GPU VM size for worker_vm_size. To list all VM sizes available in a specific region (for example, West US 2), run:

az vm list-sizes --location <location, e.g. westus2>
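
The full list is long, so it may help to filter it. The sketch below narrows the list by a name prefix and shows cores and memory next to each size. The function name list_vm_sizes is hypothetical; the filter relies on the Azure CLI's JMESPath --query option. It assumes GPU sizes belong to the N-series, whose names start with Standard_N, which you should verify against the Azure VM size documentation.

```shell
# list_vm_sizes LOCATION [PREFIX]  (hypothetical wrapper around the Azure CLI)
# Tabulates the VM sizes in LOCATION whose name starts with PREFIX.
# PREFIX defaults to "Standard_", which matches every standard size.
list_vm_sizes() {
    location="$1"
    prefix="${2:-Standard_}"
    az vm list-sizes --location "$location" \
        --query "[?starts_with(name, '$prefix')].{Name:name, Cores:numberOfCores, MemoryMB:memoryInMb}" \
        --output table
}
```

For example, list_vm_sizes westus2 Standard_N lists candidates for worker_vm_size, and list_vm_sizes westus2 Standard_B lists burstable CPU sizes such as the Standard_B2s used in the example configuration.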