In 2017, there seems to be no doubt that if you aren’t running your ML training on a GPU you just aren’t doing things right. At home, my only computer is my MacBook Pro, which is great to develop on but would take an extremely long time to train something such as an image classification task. That said, I’d love to have a GPU machine at home, but I also love the opportunity to use this hardware in the cloud without having to power it, upgrade it and generally take care of something I’d otherwise own. Thus I was led to investigate using Google’s ML Engine to train my models.

Changes in your TensorFlow code

Filesystem Access

The first problems I came across were in how I would access files. I had been using a local disk with the built-in open or the Python os module for filesystem interaction. When training in ML Engine, these locations change to storage buckets, with paths like gs://my-bucket, and as such you need a filesystem object that knows how to interact with these paths. Solution: use the file_io module in TensorFlow; see below for a few sample replacements. This code can then be utilised whether reading from a local data_dir on disk or from Google Cloud Storage.

import os

from tensorflow.python.lib.io import file_io

# Instead of: with open(path, 'rb') as p:
with file_io.FileIO(os.path.join(data_dir, 'pickle_object.p'), 'rb') as p:
    data = p.read()  # use the file pointer as you would with open()

# Instead of glob.glob
tf_record_filepaths = file_io.get_matching_files(os.path.join(data_dir, '*.tfrecords'))

Runtime Parameters

To have the flexibility to run both locally and in Google Cloud, any parameters that need to change (i.e. data_dir) should be settable via command line arguments. This may seem obvious if you develop using Python from the terminal, but prior to this I had done development in Jupyter notebooks where I would define constants linearly with the flow of the notebook. Defining these is pretty easy in Python using argparse.ArgumentParser as shown below, but ML Engine will also pass the --job-dir parameter to your program at runtime. If you use parser.parse_args() without defining it, this will throw an error. You can either define it as an argument yourself, or use parser.parse_known_args() to only parse those you define.

from argparse import ArgumentParser

if __name__ == '__main__':
    parser = ArgumentParser()
    # argparse converts dashes to underscores, so this is read as args.data_dir
    parser.add_argument('--data-dir')
    # ML Engine passes --job-dir to the program at runtime
    parser.add_argument('--job-dir')
    # parse_known_args returns a (known_args, unknown_args) tuple
    args, _ = parser.parse_known_args()
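
With the arguments defined this way, the same module can be exercised locally before anything goes near the cloud. As a rough sketch (the trainer.preprocess module name and the ./data and ./output paths are just placeholders matching the layout described later), a local run from the terminal could look like:

# Run the module locally against a directory on disk
python -m trainer.preprocess \
    --data-dir ./data \
    --job-dir ./output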

gcloud CLI

You can find the Cloud SDK Quickstarts for your platform. The only thing missing from the macOS guide was where to put things. For me, I extracted the files to ~/.gcloud and updated my ~/.bash_profile with the lines

source ~/.gcloud/path.bash.inc
source ~/.gcloud/completion.bash.inc
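
After opening a new shell (or sourcing ~/.bash_profile again), it’s worth confirming the SDK is picked up and authenticated before going further:

# Confirm the CLI is on the PATH and list installed components
gcloud --version

# Authenticate and set a default project
gcloud init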

Finally Running Something

My starter task was preprocessing images with labels into TFRecords: basically a task I knew wouldn’t take too long and that could validate everything was working. I’ll be using the CLI and you can follow the ML Engine Quickstart to get the basics of that set up. This should create a project, enable billing, enable the ML Engine APIs and leave you with a storage bucket.
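
If you prefer to stay in the terminal, most of that setup can also be sketched out with the CLI. The project id and bucket name below are just placeholders, and billing still has to be linked to the project in the Cloud Console:

# Create a project and make it the default
gcloud projects create my-plants-project
gcloud config set project my-plants-project

# Enable the ML Engine API for the project
gcloud services enable ml.googleapis.com

# Create the storage bucket in the region we will train in
gsutil mb -l us-central1 gs://my-bucket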

Project Directory Structure

The trainer requires a Python module-based folder structure and a setup.py script. We will be running a ‘preprocess’ module and the directory structure should be

Project
  __init__.py
  setup.py
  mlengine
    preproc-config.yaml
  trainer
    __init__.py
    preprocess.py

The __init__.py files can be empty. First, the setup.py file defines a few pip dependencies we want:

from setuptools import find_packages, setup

REQUIRED_PACKAGES = [
    'tensorflow==1.4.0',
    'h5py',
    'numpy',
    'image'
]

setup(
    name='plant_classifer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Plant classifier application'
)

Secondly, the mlengine/preproc-config.yaml file holds a little Job resource configuration. This is very minimal, just defining that we only need the BASIC scale tier, but it can be used to set much more complex run properties in ML Engine.

trainingInput:
  scaleTier: BASIC
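
As a sketch of something more involved, a custom cluster with GPU machines could be requested with a config along these lines (the machine types and counts are just illustrative trainingInput options, not something this preprocessing job needs):

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 2
  parameterServerType: standard
  parameterServerCount: 1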

The preprocess.py file is our Python program and can be viewed on GitHub with the rest of the source, but next is how we would kick this off in ML Engine. Using the CLI, with the current directory at the root of the project in the terminal, run:

gcloud ml-engine jobs submit training plants_preproc \
    --package-path trainer \
    --module-name trainer.preprocess \
    --job-dir gs://my-bucket \
    --region us-central1 \
    --config mlengine/preproc-config.yaml \
    -- \
    --data-dir gs://my-bucket/data

Here we are submitting the job and defining that the trainer.preprocess module should be run. Arguments after the isolated -- will be passed on to your program. This should queue the job, and the run logs and job status can be viewed in the cloud project.
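
You can also keep an eye on it from the terminal; for example, with the plants_preproc job name used above:

# Check the job state (QUEUED, RUNNING, SUCCEEDED, ...)
gcloud ml-engine jobs describe plants_preproc

# Tail the logs as the job runs
gcloud ml-engine jobs stream-logs plants_preproc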


The code used can be found at https://github.com/damienpontifex/plant-classification-ml-engine