This tutorial shows you how to run benchmarking workloads in Clear Linux* OS using TensorFlow* and Kubeflow with the Deep Learning Reference Stack.

The Deep Learning Reference Stack is available in two versions. The first is Eigen, which includes TensorFlow optimized for Intel® architecture. The second is Intel MKL-DNN, which includes the TensorFlow framework optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives.

Release notes

View current release notes for the Deep Learning Reference Stack.

View current benchmark results for the Deep Learning Reference Stack.

Note

Performance test numbers in the Deep Learning Reference Stack were obtained using runc as the runtime.

Prerequisites

In Clear Linux OS, containers-basic provides Docker*, which is required for TensorFlow benchmarking. Use the swupd utility to check if containers-basic and cloud-native-basic are present:

sudo swupd bundle-list

If you need to install either bundle, enter:

sudo swupd bundle-add containers-basic cloud-native-basic
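The check and the install can be combined; as a minimal sketch, assuming `swupd bundle-list` prints one installed bundle name per line:

```shell
# Sketch: install each required bundle only if swupd does not already
# list it. Assumes `swupd bundle-list` prints one bundle name per line.
for bundle in containers-basic cloud-native-basic; do
    if ! sudo swupd bundle-list | grep -q "^${bundle}$"; then
        sudo swupd bundle-add "$bundle"
    fi
done
```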

To ensure that Kubernetes is correctly installed and configured, follow the Run Kubernetes* on Clear Linux* OS tutorial.

We have validated these steps against the following software package versions:

  • Clear Linux OS 26240 (lowest permissible version)
  • Docker 18.06.1
  • Kubernetes 1.11.3
  • Go 1.11.12

TensorFlow single and multi-node benchmarks

This section describes running the TensorFlow benchmarks on a single node. For multi-node testing, replicate these steps on each node. These steps provide a template for running other benchmarks, provided they can invoke TensorFlow.

  1. Download and run either the Eigen or the Intel MKL-DNN docker image from Docker Hub.

    Note

    You will enter the following commands in the running container.

    Replace <docker_name> with the name of the image.

  2. Clone the benchmark repository:

    docker exec -t <docker_name> bash -c 'git clone https://github.com/tensorflow/benchmarks -b cnn_tf_v1.12_compatible'
    
  3. Execute the benchmark script:

    docker exec -i <docker_name> bash -c 'python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --device=cpu --model=resnet50 --data_format=NHWC'
    

Note

You can replace resnet50 with any model supported by the TensorFlow benchmarks.
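As a sketch, the same invocation can be adapted to another model and to the explicit CPU threading flags that tf_cnn_benchmarks.py accepts; the container name, batch size, and thread counts below are placeholders to tune for your machine:

```shell
# Sketch: run a different model (inception3) with explicit thread and
# batch-size settings. <docker_name> is a placeholder for your
# running container's name.
docker exec -i <docker_name> bash -c \
  'python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
     --device=cpu --model=inception3 --data_format=NHWC \
     --batch_size=32 \
     --num_intra_threads=4 --num_inter_threads=2'
```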

Kubeflow multi-node benchmarks

The benchmark workload will run in a Kubernetes cluster. We will use Kubeflow for the Machine Learning workload deployment on three nodes.

Kubernetes setup

Follow the instructions in the Run Kubernetes* on Clear Linux* OS tutorial to get set up on Clear Linux OS. The Kubernetes community also provides instructions for creating a cluster.

Kubernetes networking

We used flannel as the network provider for these tests. If you are comfortable with another network layer, refer to the Kubernetes networking documentation for setup.
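For reference, flannel can be applied from its upstream manifest. This is a sketch: the URL below is flannel's upstream default at the time of writing and may move between releases, so pin a specific release for anything long-lived.

```shell
# Sketch: deploy flannel as the cluster network layer. Run once, on the
# control-plane node, after the cluster is initialized. The manifest URL
# is upstream's default and may change between flannel releases.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```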

Images

To build on the Deep Learning Reference Stack, we need to add launcher.py to our Docker image and put the benchmarks repository in the correct location. Inside the container, run the following:

mkdir -p /opt
git clone https://github.com/tensorflow/benchmarks.git /opt/tf-benchmarks
cp launcher.py /opt
chmod u+x /opt/*

Your entry point now becomes "/opt/launcher.py".

This builds an image that can be consumed directly by TFJob from Kubeflow. We are working to create these images as part of our release cycle.
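The steps above can also be captured in a Dockerfile build. This is a sketch: the base image name (clearlinux/stacks-dlrs-mkl) is an assumption, so substitute the Eigen or MKL-DNN image you actually pulled, and launcher.py must be in the build context.

```shell
# Sketch: build a TFJob-ready image on top of the Deep Learning
# Reference Stack. The base image name is an assumption; launcher.py
# must exist in the build context directory.
docker build -t dlrs-tfjob:local -f- . <<'EOF'
FROM clearlinux/stacks-dlrs-mkl
RUN mkdir -p /opt && \
    git clone https://github.com/tensorflow/benchmarks.git /opt/tf-benchmarks
COPY launcher.py /opt/launcher.py
RUN chmod u+x /opt/launcher.py
ENTRYPOINT ["/opt/launcher.py"]
EOF
```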

ksonnet*

Kubeflow uses ksonnet* to manage deployments, so we need to install that before setting up Kubeflow. On Clear Linux OS, follow these steps:

sudo swupd bundle-add go-basic-dev
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
go get github.com/ksonnet/ksonnet
cd $GOPATH/src/github.com/ksonnet/ksonnet
make install

After the ksonnet installation is complete, ensure that the ks binary is accessible from your PATH across the environment.

Kubeflow

Once you have Kubernetes running on your nodes, you can set up Kubeflow by following these instructions from their quick start guide.

export KUBEFLOW_SRC=$HOME/kflow
export KUBEFLOW_TAG="v0.3.2"
export KFAPP="kflow_app"
export K8S_NAMESPACE="kubeflow"

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}
ks init ${KFAPP}
cd ${KFAPP}
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${KUBEFLOW_TAG}/kubeflow
ks pkg install kubeflow/core

Now you have all the required Kubeflow packages, and you can deploy the primary one for our purposes: tf-job-operator.

ks env rm default
kubectl create namespace ${K8S_NAMESPACE}
ks env add default --namespace "${K8S_NAMESPACE}"
ks generate tf-job-operator tf-job-operator
ks apply default -c tf-job-operator

This creates the CustomResourceDefinition (CRD) endpoint to launch a TFJob.
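Before submitting jobs, it is worth confirming the CRD landed. As a sketch, the CRD name below matches Kubeflow v0.3, and the operator pod name pattern is an assumption; adjust if your deployment differs:

```shell
# Sketch: confirm the TFJob CRD exists and the operator pod is up.
# The CRD name matches Kubeflow v0.3; the pod name pattern is an
# assumption about the default deployment.
kubectl get crd tfjobs.kubeflow.org
kubectl -n kubeflow get pods | grep tf-job-operator
```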

Run a TFJob

  1. Add the ksonnet registry for deploying TFJobs and install the TFJob components:

      ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob
      
      ks pkg install dlrs-tfjob/dlrs-bench
      
  2. Generate Kubernetes manifests for the workloads and apply them to create and run the benchmarks:

    ks generate dlrs-resnet50 dlrsresnet50 --name=dlrsresnet50
    ks generate dlrs-alexnet dlrsalexnet --name=dlrsalexnet
    ks apply default -c dlrsresnet50
    ks apply default -c dlrsalexnet
    

This will replicate and deploy three test setups in your Kubernetes cluster.

Results of running this tutorial

You need to parse the logs of the Kubernetes pods to get the performance numbers. The pods remain after completion in the 'Completed' state, and you can retrieve the logs from any of them to inspect the benchmark results. More information about Kubernetes logging is available from the Kubernetes community.