Suppose your job needs to launch another job (e.g. a library you want to use must launch Docker containers), or to kill a job it has created. To allow this, DLWorkspace provides a RESTful API at $DLWS_REST_API/child, where $DLWS_REST_API is an environment variable passed to every job. This document describes that API and gives examples of how to use it in continuation-passing style.
To guard against malicious requests, DLWS passes every job a unique ID in the $FAMILY_TOKEN environment variable, which is used for authentication. Every request to the /child API must provide the query parameter familyToken=$FAMILY_TOKEN.
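As a minimal sketch of the authentication rule above, the helper below builds the URL and query parameters for a /child request and always attaches the family token. The function name `child_api_request_args` and the optional `env` argument are hypothetical conveniences for illustration and are not part of DLWorkspace.

```python
import os


def child_api_request_args(endpoint, params=None, env=None):
    """Return (url, query) for a request to the /child API.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise outside of a running job (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    base = env["DLWS_REST_API"].rstrip("/")
    # Copy the caller's parameters so the input dict is not mutated.
    query = dict(params or {})
    # Every /child request must carry the family token.
    query["familyToken"] = env["FAMILY_TOKEN"]
    return base + "/child/" + endpoint.lstrip("/"), query
```

Inside a job, the result could be passed straight to `requests.get(url, params=query, timeout=3)`.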
The path to submit a job is $DLWS_REST_API/child/SubmitJob. As noted above, the family token must be passed as a query parameter, and the newly created job will be given this same token. The userName and userId parameters are inherited from the submitting job; all other parameters work the same as for a normally submitted job.
The body of the response is JSON and contains either an error message at the key error or the new job's ID at the key jobId. This is the same as the normal API.
Here is an example of how to submit a child job from within a Python script:
def submit_child(error_handler, success_handler):
    import os
    import requests

    # Settings for the child job; familyToken authenticates the request.
    jobSettings = {
        "resourcegpu": 0,
        "jobName": "tensorflow-ipython-cpu",
        "dataPath": "imagenet",
        "image": "tensorflow/tensorflow:latest",
        "cmd": "export HOME=/job && jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --notebook-dir=/",
        "interactivePort": "8888",
        "jobType": "training",
        "jobtrainingtype": "RegularJob",
        "runningasroot": "0",
        "familyToken": os.environ["FAMILY_TOKEN"],
    }
    api_url = os.environ["DLWS_REST_API"] + "/child"
    submit_child_resp = requests.get(api_url + "/SubmitJob", params=jobSettings, timeout=3)
    submit_child_resp.raise_for_status()
    submit_child_resp = submit_child_resp.json()
    # The body holds either an "error" message or the new job's "jobId".
    if "error" in submit_child_resp:
        error_handler(submit_child_resp["error"])
    else:
        success_handler(submit_child_resp["jobId"])
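For callers that prefer exceptions to callbacks, the response format described above (an error key or a jobId key) can be interpreted by a small pure function. This is only a sketch: `ChildJobError` and `job_id_from_submit_response` are hypothetical names introduced here, and the function expects the already decoded JSON body.

```python
class ChildJobError(Exception):
    """Raised when the /child API reports an error (hypothetical helper type)."""


def job_id_from_submit_response(body):
    """Return the jobId from a decoded SubmitJob response body, or raise."""
    if "error" in body:
        # The API reports failures as a message under the "error" key.
        raise ChildJobError(body["error"])
    return body["jobId"]
```

Keeping the parsing separate from the HTTP call makes it possible to check this logic without a running cluster.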
Note: If a job submitted by a user is killed, all jobs submitted by that original job will also be killed.
To kill a job, pass the ID of the job to be killed (as jobId) and the current job's familyToken to $DLWS_REST_API/child/KillJob. The response body is JSON with a single string at the key result, which is either “Success, the job is scheduled to be terminated.” or “Cannot Kill the job. Job ID: $jobId”, where $jobId is the ID of the job that could not be killed.
def kill_job(jobId, error_handler, success_handler):
    import os
    import requests

    api_url = os.environ["DLWS_REST_API"] + "/child"
    family_token = os.environ["FAMILY_TOKEN"]
    kill_resp = requests.get(api_url + "/KillJob", timeout=3,
                             params=dict(familyToken=family_token, jobId=jobId))
    kill_resp.raise_for_status()
    result = kill_resp.json()["result"]
    # The result string starts with "Success" when the kill was scheduled.
    if "Success" in result:
        success_handler()
    else:
        error_handler(jobId)
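The two result strings documented for KillJob can also be decoded into a structured value. The sketch below assumes the messages have exactly the wording quoted above; `parse_kill_result` and `KILL_FAILURE_PREFIX` are hypothetical names, and any unrecognised message is treated as a failure with no job ID.

```python
KILL_FAILURE_PREFIX = "Cannot Kill the job. Job ID: "


def parse_kill_result(body):
    """Return (succeeded, failed_job_id) for a decoded KillJob response body."""
    result = body.get("result", "")
    if result.startswith("Success"):
        return True, None
    if result.startswith(KILL_FAILURE_PREFIX):
        # The remainder of the message is the ID of the job that was not killed.
        return False, result[len(KILL_FAILURE_PREFIX):].strip()
    # Unknown or missing message: report failure without a job ID.
    return False, None
```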
TODO: Explain what a job detail is
The API takes the current job's familyToken and the jobId of the job whose detail is requested, and returns that job's detail as a JSON object. It can be accessed at $DLWS_REST_API/child/JobDetail.
def job_detail(jobId, success_handler):
    import os
    import requests

    api_url = os.environ["DLWS_REST_API"] + "/child"
    family_token = os.environ["FAMILY_TOKEN"]
    detail_resp = requests.get(api_url + "/JobDetail", timeout=3,
                               params=dict(familyToken=family_token, jobId=jobId))
    detail_resp.raise_for_status()
    # The decoded JSON object is the job's detail.
    detail = detail_resp.json()
    success_handler(detail)
TODO: Check the accuracy of this section w.r.t. networking
DLWorkspace supports Kubernetes networking, which means each job has its own virtual IP. A running job can look up a job's IP by passing its own familyToken and the target job's ID to $DLWS_REST_API/child/GetJobIP. The response body is JSON of the form {"error": "Could not find job with id $jobId"} or {"IP": $jobIP}, where $jobId and $jobIP are respectively the ID and IP of the job whose IP was requested.
def job_ip(jobId, success_handler, error_handler):
    import os
    import requests

    api_url = os.environ["DLWS_REST_API"] + "/child"
    family_token = os.environ["FAMILY_TOKEN"]
    ip_resp = requests.get(api_url + "/GetJobIP", timeout=3,
                           params=dict(familyToken=family_token, jobId=jobId))
    ip_resp.raise_for_status()
    ip_resp = ip_resp.json()
    # The body holds either an "error" message or the job's "IP".
    if "error" in ip_resp:
        error_handler(jobId)
    else:
        success_handler(ip_resp["IP"])
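Because every function above takes its continuations as arguments, calls can be chained by passing the next step as the success handler. The sketch below submits a child job and then looks up its IP. `launch_and_locate` is a hypothetical name, and the `submit` and `get_ip` arguments are assumed to follow the positional argument order of submit_child and job_ip above, which also lets them be replaced by stand-ins when no cluster is available. It does not retry if the IP is not yet available.

```python
def launch_and_locate(submit, get_ip, on_ready, on_error):
    """Submit a child job, then fetch its IP, in continuation-passing style.

    submit(error_handler, success_handler) and
    get_ip(job_id, success_handler, error_handler) mirror the argument
    order of submit_child and job_ip. on_error receives whatever the
    failing step passes to its error handler (an error message from the
    submit step, a job ID from the IP lookup).
    """
    def after_submit(job_id):
        # Continuation of a successful submit: look up the new job's IP.
        get_ip(job_id, lambda ip: on_ready(job_id, ip), on_error)

    submit(on_error, after_submit)
```

Inside a job, this could be invoked as `launch_and_locate(submit_child, job_ip, on_ready, on_error)` with handlers of your own.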