Azure Batch AI is a PaaS offering that lets us use GPU resources in the cloud. The idea is to run jobs as you see fit on virtual machines in a managed cluster (i.e. you don’t have to maintain them). For my use case, the option of low-priority VMs to reduce the cost of GPU machines is also particularly promising. In this post I’ll run through submitting our first job to Azure Batch AI. For setup we’ll use the Azure CLI, as I find it easier and quicker than the portal UI; that said, everything here can also be achieved by point and click at portal.azure.com. I’m assuming you already have the CLI installed and are logged in with it.
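
If you haven’t yet logged in, or want to double-check which subscription the CLI is pointed at, the following will do it (the subscription name is just a placeholder):

az login

# Show the active subscription
az account show -o table

# Optionally switch to a specific subscription
az account set --subscription 'My Subscription'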

Resource Group

As with all Azure resources, we group them in a resource group, and you’ll need to know your desired location. Wanting a GPU machine restricts this a bit, but you can find which regions are supported at https://azure.microsoft.com/en-us/regions/services/ by looking for the NC, NCv2 and NCv3 virtual machine series. For my purposes, my resource group will be batch-rg and will be in the westus2 location.

# Configure CLI defaults so we don't have to pass them to all commands
az configure --defaults group='batch-rg' location='westus2'

# Create our resource group
az group create -n 'batch-rg'

Storage

Azure Storage is where the cluster will access your code, read in data, and write model checkpoints and any other output. Setting a few environment variables also simplifies CLI usage, so we create an account named pontifexml and then set the relevant variables as you’ll see below.

az storage account create \
    -n pontifexml \
    --sku Standard_LRS

# Set our storage account name for CLI and Batch AI CLI
export {AZURE_BATCHAI_STORAGE_ACCOUNT,AZURE_STORAGE_ACCOUNT}=pontifexml

# Set our storage account key for CLI and Batch AI CLI
export {AZURE_BATCHAI_STORAGE_KEY,AZURE_STORAGE_KEY}=$(az storage account keys list --account-name ${AZURE_STORAGE_ACCOUNT} --resource-group batch-rg --query '[0].value' -o tsv)

We will also create a file share named machinelearning that the cluster will use:

az storage share create \
    -n machinelearning
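
To sanity check the storage setup, listing the shares on the account (using the environment variables we just set) should show machinelearning; this is purely a verification step:

az storage share list -o table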

Batch AI Cluster

Now we finally get to the point: the Batch AI cluster. The pieces of information we really need here are:

  1. The image we want our VMs based on: here we will use the Ubuntu Data Science Virtual Machine (DSVM)
  2. The VM size, which we have chosen as Standard_NC6
  3. Scale: we aren’t enabling auto-scaling, instead setting both min and max machines to 1

az batchai cluster create \
    --name dsvm \
    --image UbuntuDSVM \
    --vm-size Standard_NC6 \
    --min 1 --max 1 \
    --afs-name machinelearning \
    --user-name $USER --ssh-key ~/.ssh/id_rsa.pub \
    -c clusterconfig.json

clusterconfig.json

{
  "properties": {
    "vmPriority": "lowpriority"
  }
}

Here we have created a cluster without auto-scaling, using a GPU machine, that we can submit jobs to. Provisioning will take a little while, but it can be monitored either from the portal or with the CLI by inspecting the ‘AllocationState’:

az batchai cluster show -n dsvm -o table
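
If you only want that one field, for example to poll from a script, a JMESPath query works as well. I believe the property is allocationState in the JSON output, but it’s worth confirming against your CLI version:

az batchai cluster show -n dsvm --query allocationState -o tsv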

Batch AI Job

To ensure our cluster is working correctly, our job will be a TensorFlow ‘hello world’ in a file ‘hello-tf.py’ with the following contents:

"""
TensorFlow 'Hello World'

Author: Damien Pontifex
"""

import tensorflow as tf

def run_training():
    """Run a 'training' sample"""

    hello = tf.constant('Hello, TensorFlow!')

    with tf.Session() as sess:
        print(sess.run(hello))

if __name__ == '__main__':
    run_training()
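
Assuming you have TensorFlow 1.4 installed locally, it’s worth a quick sanity check of the script before uploading it:

python hello-tf.py
# Under Python 3 this prints the bytes value b'Hello, TensorFlow!'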

We need this code to be in our Azure file share and we will upload it using the CLI:

# Create a directory in our share
az storage directory create \
    --share-name machinelearning \
    --name helloworld

# Upload our hello-tf.py file to that directory
az storage file upload \
    --share-name machinelearning \
    --path helloworld \
    --source hello-tf.py

# Upload any data files
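
For a real job you would push any training data into the share in the same way; the data directory and file names below are purely placeholders for illustration:

az storage directory create \
    --share-name machinelearning \
    --name helloworld/data

az storage file upload \
    --share-name machinelearning \
    --path helloworld/data \
    --source train.csv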

Now, to define how our job should run, we use a JSON file (job.json), which for this job is:

{
  "properties": {
    "nodeCount": 1,
    "tensorFlowSettings": {
      "pythonScriptFilePath": "$AZ_BATCHAI_INPUT_SCRIPT/hello-tf.py",
      "masterCommandLineArgs": "-p"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/afs/helloworld",
    "inputDirectories": [
      {
        "id": "SCRIPT",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/afs/helloworld"
      }
    ],
    "outputDirectories": [
      {
        "id": "DEFAULT",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/afs/helloworld/out"
      }
    ],
    "containerSettings": {
      "imageSourceRegistry": {
        "image": "tensorflow/tensorflow:1.4.0-gpu-py3"
      }
    }
  }
}

Recipes for other configurations can be found at https://github.com/Azure/BatchAI/tree/master/recipes, but to go over a few points from the configuration above:

  • The environment variable $AZ_BATCHAI_MOUNT_ROOT is the root path where the cluster’s file systems are mounted; the file share we specified when creating the cluster sits underneath it at afs
  • Each input directory we defined is exposed as $AZ_BATCHAI_INPUT_{id}, so our id of SCRIPT gives us $AZ_BATCHAI_INPUT_SCRIPT; likewise, the output directory with id DEFAULT is available as $AZ_BATCHAI_OUTPUT_DEFAULT
  • We are running our code inside the TensorFlow 1.4 Python 3 GPU Docker image

To create the job in our cluster run:

az batchai job create \
    --config job.json \
    --name hello \
    --cluster-name dsvm
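
As with the cluster, the job can be monitored from the portal or the CLI. The exact columns vary by CLI version, but something along these lines shows the execution state:

az batchai job show -n hello -o table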

Once the job completes, you should see ‘Hello, TensorFlow!’ in the stdout file written to the file share.
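
Batch AI writes the stdout/stderr files into a job-specific subfolder under the stdOutErrPathPrefix, so the exact path will vary. One way to find and fetch them is to browse the share with the storage CLI (or in the portal); the download path below is only a placeholder:

# List what has been written under our helloworld directory
az storage file list \
    --share-name machinelearning \
    --path helloworld \
    -o table

# Download a file once you know its full path within the share
az storage file download \
    --share-name machinelearning \
    --path 'helloworld/<job-output-subfolder>/stdout.txt' \
    --dest .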

Conclusion

Even though this is a ‘hello world’ example, the only change to this workflow for an actual ML job would be the training code inside the Python script; from here we could run any job we like. As a side note, if you’re using Jupyter notebooks for development locally, you can run these from the command line with jupyter nbconvert --to notebook --execute mynotebook.ipynb, and the notebook will be executed much like a Run All operation inside Jupyter. Just make sure to use the environment variables as appropriate for any data locations.
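
For example, a notebook-based job could use a command line like the one below, where mynotebook.ipynb is a placeholder and any data paths inside the notebook would come from the same $AZ_BATCHAI_* variables rather than being hard-coded:

# Execute the notebook from the mounted input directory and write the
# executed copy to the job's default output directory
jupyter nbconvert --to notebook --execute \
    $AZ_BATCHAI_INPUT_SCRIPT/mynotebook.ipynb \
    --output-dir $AZ_BATCHAI_OUTPUT_DEFAULT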