In this demo we will assume the role of a Data Scientist and use JupyterLab to preprocess and analyze datasets with Spark and TensorFlow. The technologies used in the demo are as follows:
- Marathon-LB to expose JupyterLab externally

Estimated time for completion:

- Manual install: 20min

Target audience: Anyone interested in Data Analytics.
Prerequisites:

- A running DC/OS 1.11 or higher cluster with at least 6 private agents and 1 public agent. Each agent should have 2 CPUs and 5 GB of RAM available. The DC/OS CLI also needs to be installed. If you plan to use GPU support, we recommend using the dcos-terraform project to provision DC/OS. Please refer to the GPU Cluster Provisioning section in the README for more details.
You can install the HDFS service from the DC/OS UI or directly from the CLI:
$ dcos package install hdfs
Note that if you want to learn more about the HDFS service or advanced installation options, you can check out the HDFS Service Docs.
In order to expose JupyterLab externally, we install Marathon-LB using:
$ dcos package install marathon-lb
We can also install JupyterLab from the UI or CLI. In both cases we need to change two parameters:
- First, the VHOST for exposing the service externally on a public agent. This means changing networking.external_access.external_public_agent_hostname to the externally reachable VHOST (e.g., the Public Agent ELB in an AWS environment). We can do that via the UI as shown here:
Or using a jupyterlab_options.json file where we need to configure the following setting:
"external_access": {
"enabled": true,
"external_public_agent_hostname": "<ADD YOUR VHOST NAME HERE *WITHOUT* trailing / NOR http://>"
}
- Secondly, as we want to access datasets in HDFS during this demo, we need to configure access to HDFS by exposing the necessary config files from the previously installed HDFS service.
Again this can be done with the UI install:
Or using the jupyterlab_options.json with the following setting:
"jupyter_conf_urls": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints",
- After configuring those two settings we can install JupyterLab either by clicking Run Service in the UI:
Or with the CLI and dcos package install jupyterlab --options=jupyterlab_options.json.
For more options on installing JupyterLab, please refer to the installation section in the README.
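Putting both settings together, a complete jupyterlab_options.json might look like the sketch below. The nesting under networking follows the networking.external_access.* path mentioned above; the exact top-level section holding jupyter_conf_urls may differ between package versions, so check the package's configuration schema, and replace the VHOST placeholder with your own value:

```json
{
  "networking": {
    "external_access": {
      "enabled": true,
      "external_public_agent_hostname": "<ADD YOUR VHOST NAME HERE *WITHOUT* trailing / NOR http://>"
    }
  },
  "service": {
    "jupyter_conf_urls": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
  }
}
```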
The first step is to log into JupyterLab. If we have used the default name and VHOST setting above, it should be reachable via <VHOST>/jupyterlab-notebook.

The default password with the above settings is jupyter.

Once logged in, you should be able to see the JupyterLab Launcher:
As a first test let us run the SparkPi example job.
For this we simply launch a Terminal and then use the following command:
eval \
spark-submit \
${SPARK_OPTS} \
--verbose \
--class org.apache.spark.examples.SparkPi \
/opt/spark/examples/jars/spark-examples_2.11-2.2.1.jar 100
You should then see Spark spinning up tasks and computing Pi.
If you want, you can check the Mesos UI via <cluster>/mesos
and see the Spark tasks being spawned there.
Once the Spark job has finished you should be able to see output similar to Pi is roughly 3.1416119141611913
(followed by the Spark teardown log messages).
Let us also run the SparkPi example directly from an Apache Toree notebook. Launch a new notebook with an Apache Toree Scala kernel and use the Scala code below to compute Pi once more:
val NUM_SAMPLES = 10000000
val count2 = spark.sparkContext.parallelize(1 to NUM_SAMPLES).map{ i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)
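The same Monte Carlo estimate can be reproduced in plain Python as a quick sanity check of the logic before running it on the cluster; this is a local sketch, not part of the demo's cluster setup:

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of Pi: the fraction of uniform random points
    in the unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

print("Pi is roughly", estimate_pi(100_000))
```

With 100,000 samples the estimate typically lands within a few hundredths of 3.14159; Spark's version simply distributes the same per-sample test across executors.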
- Launch a new notebook with a Python 3 kernel and use the following Python code to show the available GPUs.
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
Next let us use TensorFlowOnSpark and the MNIST database to train a network recognizing handwritten digits.
- Clone the Yahoo TensorFlowOnSpark Github Repo using the Terminal:
git clone https://github.com/yahoo/TensorFlowOnSpark
- Retrieve and extract raw MNIST Dataset using the Terminal:
cd $MESOS_SANDBOX
curl -fsSL -O https://s3.amazonaws.com/vishnu-mohan/tensorflow/mnist/mnist.zip
unzip mnist.zip
- Check HDFS
Let us briefly check from the Terminal that HDFS is working as expected and that the mnist directory does not exist yet:
nobody@2442bc8f-94d4-4f74-8321-b8b8b40436d7:~$ hdfs dfs -ls mnist/
ls: `mnist/': No such file or directory
- Prepare the MNIST Dataset in CSV format and store it on HDFS from the Terminal
eval \
spark-submit \
${SPARK_OPTS} \
--verbose \
$(pwd)/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv
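For intuition, mnist_data_setup.py writes the images out as CSV: conceptually, each 28×28 MNIST image is flattened into one row of 784 comma-separated pixel values. The helper below is a hypothetical illustration of that flattening, not the actual code from the script:

```python
import csv
import io

def image_to_csv_row(image):
    """Flatten a 2D image (nested lists of pixel intensities) into one CSV row."""
    flat = [px for row in image for px in row]
    buf = io.StringIO()
    csv.writer(buf).writerow(flat)
    return buf.getvalue().strip()

# A blank 28x28 "image" yields one row of 784 values.
blank = [[0] * 28 for _ in range(28)]
print(len(image_to_csv_row(blank).split(",")))  # 784
```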
- Check for mnist directory in HDFS from Terminal
nobody@2442bc8f-94d4-4f74-8321-b8b8b40436d7:~$ hdfs dfs -ls -R mnist/
drwxr-xr-x - nobody supergroup 0 2018-08-08 01:33 mnist/csv
drwxr-xr-x - nobody supergroup 0 2018-08-08 01:33 mnist/csv/test
drwxr-xr-x - nobody supergroup 0 2018-08-08 01:33 mnist/csv/test/images
-rw-r--r-- 3 nobody supergroup 0 2018-08-08 01:33 mnist/csv/test/images/_SUCCESS
-rw-r--r-- 3 nobody supergroup 1810248 2018-08-08 01:33 mnist/csv/test/images/part-00000
-rw-r--r-- 3 nobody supergroup 1806102 2018-08-08 01:33 mnist/csv/test/images/part-00001
-rw-r--r-- 3 nobody supergroup 1811128 2018-08-08 01:33 mnist/csv/test/images/part-00002
-rw-r--r-- 3 nobody supergroup 1812952 2018-08-08 01:33 mnist/csv/test/images/part-00003
-rw-r--r-- 3 nobody supergroup 1810946 2018-08-08 01:33 mnist/csv/test/images/part-00004
-rw-r--r-- 3 nobody supergroup 1835497 2018-08-08 01:33 mnist/csv/test/images/part-00005
...
- Train MNIST model with CPUs from Terminal
eval \
spark-submit \
${SPARK_OPTS} \
--verbose \
--conf spark.mesos.executor.docker.image=dcoslabs/dcos-jupyterlab:1.2.0-0.33.7 \
--py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
$(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--cluster_size 5 \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist/mnist_csv_model
If you want to use GPUs, you can use the following instead (but make sure the cluster size matches the number of GPU instances):
eval \
spark-submit \
${SPARK_OPTS} \
--verbose \
--conf spark.mesos.executor.docker.image=dcoslabs/dcos-jupyterlab:1.2.0-0.33.7-gpu \
--conf spark.mesos.gpus.max=2 \
--conf spark.mesos.executor.gpus=1 \
--py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
$(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist/mnist_csv_model
- Check trained model on HDFS using the Terminal
nobody@2442bc8f-94d4-4f74-8321-b8b8b40436d7:~$ hdfs dfs -ls -R mnist/mnist_csv_model
-rw-r--r-- 3 nobody supergroup 128 2018-08-08 02:37 mnist/mnist_csv_model/checkpoint
-rw-r--r-- 3 nobody supergroup 4288367 2018-08-08 02:37 mnist/mnist_csv_model/events.out.tfevents.1533695777.ip-10-0-7-250.us-west-2.compute.internal
-rw-r--r-- 3 nobody supergroup 40 2018-08-08 02:36 mnist/mnist_csv_model/events.out.tfevents.1533695778.ip-10-0-7-250.us-west-2.compute.internal
-rw-r--r-- 3 nobody supergroup 156424 2018-08-08 02:36 mnist/mnist_csv_model/graph.pbtxt
-rw-r--r-- 3 nobody supergroup 814168 2018-08-08 02:36 mnist/mnist_csv_model/model.ckpt-0.data-00000-of-00001
-rw-r--r-- 3 nobody supergroup 408 2018-08-08 02:36 mnist/mnist_csv_model/model.ckpt-0.index
-rw-r--r-- 3 nobody supergroup 69583 2018-08-08 02:36 mnist/mnist_csv_model/model.ckpt-0.meta
-rw-r--r-- 3 nobody supergroup 814168 2018-08-08 02:37 mnist/mnist_csv_model/model.ckpt-600.data-00000-of-00001
-rw-r--r-- 3 nobody supergroup 408 2018-08-08 02:37 mnist/mnist_csv_model/model.ckpt-600.index
-rw-r--r-- 3 nobody supergroup 74941 2018-08-08 02:37 mnist/mnist_csv_model/model.ckpt-600.meta