For more customized configuration, please refer to the Configuration Section and Azure doc.
Azure cluster configuration has been greatly simplified. At a minimum, you only need to create a config.yaml file under src/ClusterBootstrap that contains the cluster name.
The cluster name must be unique and should be specified as:
cluster_name: <your cluster name>
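As a minimal sketch, the file described above could also be generated from Python. The helper names `render_minimal_config` and `write_minimal_config` are hypothetical and not part of DLWorkspace; only the file name and the src/ClusterBootstrap directory come from the text.

```python
from pathlib import Path


def render_minimal_config(cluster_name: str) -> str:
    """Return the contents of a minimal config.yaml (hypothetical helper)."""
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster_name must not be empty")
    return f"cluster_name: {name}\n"


def write_minimal_config(cluster_name: str,
                         bootstrap_dir: str = "src/ClusterBootstrap") -> Path:
    """Write config.yaml under the bootstrap directory and return its path."""
    target = Path(bootstrap_dir) / "config.yaml"
    target.write_text(render_minimal_config(cluster_name))
    return target
```

Calling `write_minimal_config("mycluster")` from the repository root would overwrite any existing config.yaml there, so keep a copy of the old file if it already holds other settings.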
If you are not building a cluster for Microsoft employee usage, you will also need to configure Authentication.
You can set or change the specification of the deployed Azure cluster by editing config.yaml. Here is an example:
cluster_name: <unique cluster name, e.g. useanothername>
azure_cluster:
infra_node_num: 1
infra_vm_size : <az vm size, such as Standard_B2s>
azure_location: eastus
worker_node_num: 2
nfs_node_num: 1
nfs_data_disk_sz : 31
nfs_data_disk_num: 2
worker_vm_size: <az vm size, such as Standard_B2s>
nfs_vm_size: <az vm size, such as Standard_B2s>
nfs_local_storage_sz: 1023
vm_image: Canonical:UbuntuServer:18.04-LTS:18.04.201910030
nfs_vm:
- suffix: toad
data_disk_num: 2
data_disk_sz_gb: 31
data_disk_sku: Premium_LRS
data_disk_mnt_path: /data
nfs_mnt_setup:
- server_suffix: toad
mnt_point:
firstshare:
curphysicalmountpoint: /mntdlws/nfs
filesharename: /data/share
mountpoints: ''
datasource: MySQL
mysql_password: <password, e.g. useanotherpw!>
WinbindServers: []
priority: regular
nfs_client_CIDR:
node_range:
- "192.168.0.0/16"
samba_range:
- "s.a.m.0/24"
master_token: <DLTS master token for generating user passwords>
activeDirectory:
tenant: <tenant ID, usually associated with a corp, such as Microsoft>
clientId: <AAD app ID>
clientSecret: <AAD app secret>
domain-offset:
<url1>: <value1>
<url2>: <value2>
<can also set '*'>: <value0>
repair-manager:
portal_url: <a domain name, e.g. dltshub.mydomain.com>
ecc_rule:
cordon_dry_run: False
reboot_dry_run: True
alert_job_owners: True
days_until_node_reboot: 5
time_sleep_after_pausing: 30
attempts_for_pause_resume_jobs: 10
rest_url: http://localhost:5000
restore_from_rule_cache_dump: True
rule_cache_dump: /etc/RepairManager/rule-cache.json
job_owner_email_domain: <an email domain name like microsoft.com>
latency_rule:
alert_expiry: 4 # In hours
smtp:
smtp_url: <smtp, like xxx.com:587>
smtp_from: <email address that is used to send alert emails>
smtp_auth_username: <username used for authentication, e.g. same as smtp_from>
smtp_auth_password: <password for the username above>
default_recipients: <email address that would receive alert email>
cc: <email address that alert email would be cc to>
WebUIregisterGroups: ["MicrosoftUsers"]
WebUIauthorizedGroups: []
WebUIadminGroups: ["CCSAdmins"]
DeployAuthentications : ["Corp"]
webuiport: 80
cloud_config_nsg_rules:
default_admin_username: core
dev_network:
source_addresses_prefixes:
# Dev boxes of the cluster: only machines in the IP ranges below will have access to the cluster.
- "b.a.0.0/16"
- "z.x.0.0/16"
nfs_share:
source_ips:
# IPs that we want to share NFS storage to
- "x.y.z.0/24"
- "a.b.0.0/16"
nfs_ssh:
source_ips:
# IPs that we want to use to ssh to NFS nodes
- "q.w.e.0/24"
- "r.f.0.0/16"
port: "22"
alert-manager:
configured: True
alert_users: False # True if we want to send out alert email to users, default False
smtp_url: <smtp url>
smtp_from: <email address used to send alert emails, e.g. 'dlts-bot@microsoft.com'>
smtp_auth_username: <email account that would send email to receivers, such as 'dlts-bot@microsoft.com'>
smtp_auth_password: <password for the email account above>
receiver: <email address to send alert email to>
reaper:
dry-run: True # change to False if we want to kill idle jobs
restful-url: http://localhost:5000
prometheus:
cluster_name: <the unique cluster name> # will be used in link to job detail page
watchdog:
vc_url: <url used for listing vc, e.g. http://localhost:5000/ListVCs?userName=Administrator>
prometheus:
cluster_name: <the unique cluster name> # will be used in link to job detail page
job-manager:
notifier:
cluster: <cluster name>
alert-manager-url: <url like http://localhost:9093/alert-manager>
registry_credential:
<docker registry name 1>:
username: <docker registry username 1>
password: <docker registry password 1>
<docker registry name 2>:
username: <docker registry username 2>
password: <docker registry password 2>
cluster_name: A name without underscores or numbers (consisting purely of lowercase letters) is recommended.
infra_node_num: Number of infrastructure nodes for the deployment. It should be odd (1, 3 or 5). 3 infrastructure nodes tolerate 1 failure, and 5 infrastructure nodes tolerate 2 failures. However, more infrastructure nodes (and thus more failure tolerance) will reduce node performance.
worker_node_num: Number of worker nodes used for the deployment.
vm_image: Pins the image version, for cases where a changing LTS image breaks the consistency of the deployment.
nfs_vm: Each item, identified by its suffix, specifies one NFS node and overrides the default NFS specs. A server_suffix entry in nfs_mnt_setup should map to this item.
azure_location: Azure location of the cluster.
Use the following command to list all available Azure locations:
az account list-locations
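The recommendations for cluster_name and infra_node_num given above can be checked before deployment. The following is a sketch only: both function names are hypothetical, and the failure-tolerance formula assumes a majority quorum, which matches the two figures quoted above (3 nodes tolerate 1 failure, 5 tolerate 2) but is not confirmed by this document.

```python
import re


def cluster_name_is_recommended(name: str) -> bool:
    """True if the name consists purely of lowercase ASCII letters."""
    return re.fullmatch(r"[a-z]+", name) is not None


def tolerated_infra_failures(infra_node_num: int) -> int:
    """Failures tolerated by the infra nodes, assuming a majority quorum."""
    if infra_node_num < 1 or infra_node_num % 2 == 0:
        raise ValueError("infra_node_num should be odd (1, 3 or 5)")
    # A majority must survive, so (n - 1) // 2 nodes may fail.
    return (infra_node_num - 1) // 2
```

For example, `cluster_name_is_recommended("my_cluster1")` is false because of the underscore and the digit, and `tolerated_infra_failures(1)` is 0, meaning a single infrastructure node offers no failure tolerance.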
Usually, a CPU VM is used for infra_vm_size and a GPU VM for worker_vm_size. To list all Azure VM sizes available in a specific region (e.g. West US 2), use the command below:
az vm list-sizes --location <location, e.g. westus2>
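The list returned by the command above can be long, so it may help to narrow it down. Here is a sketch that filters the JSON output, assuming it was saved with `--output json` and that each entry carries the fields `name`, `numberOfCores` and `memoryInMb`. The function name `filter_vm_sizes` is hypothetical, and the field names should be checked against the output of your Azure CLI version.

```python
import json
from typing import List


def filter_vm_sizes(sizes_json: str, min_cores: int = 1,
                    min_memory_mb: int = 0) -> List[str]:
    """Return the sorted names of VM sizes meeting the given minimums."""
    sizes = json.loads(sizes_json)
    return sorted(
        size["name"]
        for size in sizes
        # Field names are assumed; verify them against real CLI output.
        if size["numberOfCores"] >= min_cores
        and size["memoryInMb"] >= min_memory_mb
    )
```

With the output saved to a file such as sizes.json, `filter_vm_sizes(open("sizes.json").read(), min_cores=4)` would list only the sizes with at least 4 cores. These are the candidates to consider for worker_vm_size or infra_vm_size.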