Some months ago, I published a post on how to run BigDL distributed deep learning jobs on our MRS Open Cloud service: https://email@example.com/using-bigdl-for-distributed-deep-learning-on-telef%C3%B3nica-open-clouds-mrs-service-c6837115ade3
Going a step forward, now I present how to run Analytics Zoo jobs on top of the MRS Service. But, what is Analytics Zoo?
Borrowing the definition from their github page:
“Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline; the entire pipeline can then transparently scale out to a large Hadoop/Spark cluster for distributed training or inference.”
All that means that leveraging this platform, you can execute complex analytics and AI jobs, even with a few lines of code. See for example the object detection jobs that we are going to use as an example here.
The intention of this post is to show how to leverage this great technology on top of your Telefónica Open Cloud MRS instance. As a quick reference: MRS is the Telefónica Open Cloud Map Reduce Service that enables you to quickly set up a fully functional Hadoop or Spark cluster including several other Big Data products (Hive, HDFS, etc.). You can find the complete information in the corresponding help center page.
Because I think that the best way to show something is using an example, let us execute some object detection using Analytics Zoo on top of MRS. The easy way is, as stated before, to use the object detection example provided in the analytics-zoo github. This example will help us to illustrate some of the general tasks you would need to execute in order to run analytics-zoo jobs on MRS.
First of all, we need to install the dependencies required, in this case the python libraries that are not installed by default in the MRS virtual machines, as for example OpenCV in this case (see the predict.py file in the object detection example).
(Of course, you can automate the following steps). We need to deploy the corresponding packages and dependencies in all the MRS nodes.
An easy way to download the required packages and dependencies to all the nodes is to configure the NAT Gateway to the VPC subnet where you deployed the MRS cluster (see the help center); this will provide access to the internet to all the nodes in the cluster.
The next step is to download and install the python dependencies. We will use pip for this task. First, we need to download the pip tool. The most suitable form to deal with all of this is to run pip form the pip wheel itself.
Accessing via ssh to some on the nodes (the master would be recommended) as showed in the previous post (here) as the user linux, we can obtain the wheel.
[linux]$ curl -O https://files.pythonhosted.org/packages/5f/25/e52d3f31441505a5f3af41213346e5b6c221c9e086a166f3703d2ddaf940/pip-18.0-py2.py3-none-any.whl
(You can find the wheels in https://pythonwheels.com/)
In order to get the information and the wheels for all the required dependencies you can create a file that contains the dependencies, OpenCV in this case, because the argparse dependency is already fulfilled by default in the MRS nodes:
[linux]$ cat requirements.txt
opencv-python[linux]$ python pip-18.0-py2.py3-none-any.whl/pip wheel -r requirements.txt -w dist
The last command will download all the required (non satisfied dependencies) wheels to the dist folder.
It is not straightforward to know that you need to install several OS libraries before make these libraries work. I found that after trying to deploy and use the python libraries.
To be able to install the OS libraries via rpm, please check that the yum repositories are correctly configured (in /etc/yum.repos.d). The url should point to
developer.huawei.com . Once this is verified, we can proceed to install the required libraries:
[linux]$ sudo yum install libICE-1.0.9–9.x86_64 libX11–1.6.3–2.x86_64 libXau-1.0.8–2.1.x86_64 libXext-1.3.3–3.x86_64 libSM-1.2.2–2.x86_64 libX11-common-1.6.3–2.noarch libxcb-1.11–4.x86_64 libXrender-0.9.8–2.1.x86_64
Now we are prepared to install the python wheels:
[linux]$ umask 0022[linux]$ sudo python pip-18.0-py2.py3-none-any.whl/pip install dist/numpy-1.15.4-cp27-cp27mu-manylinux1_x86_64.whl[linux]$ sudo python pip-18.0-py2.py3-none-any.whl/pip install dist/opencv_python-184.108.40.206-cp27-cp27mu-manylinux1_x86_64.whl
umask instruction is required to allow regular users to access to the libraries.
We need to deploy all these packages in all the nodes of the cluster.
Analytics Zoo libraries, models, scripts and images
Now is time to download the analytics zoo framework. As you know, you can run spark jobs in different ways. Here we will run the jobs from one of the master nodes. Therefore we will access via ssh to one of the master nodes (refer to previous notes) to download the analytics zoo framework. You can find the releases to download and their URLs in the Analytics Zoo documentation page: https://analytics-zoo.github.io/master/#release-download/.
In this case, we will download one of the nightly build versions for Spark 2.1.1.
Once we have login as user linux into the master node we can proceed as follow
[linux]$ sudo su — omm[omm]$ mkdir ZOO[omm]$ cd ZOO[omm]$ culr -O https://oss.sonatype.org/content/repositories/snapshots/com/intel/analytics/zoo/analytics-zoo-bigdl_0.6.0-spark_2.1.1/0.4.0-SNAPSHOT/analytics-zoo-bigdl_0.6.0-spark_2.1.1-0.4.0-20181227.180249-48-dist-all.zip
We need to unpack just the required files:
[omm]$ unzip analytics-zoo-bigdl_0.6.0-spark_2.1.1–0.4.0–20181227.180249–48-dist-all.zip lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1–0.4.0-SNAPSHOT-jar-with-dependencies.jar lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1–0.4.0-SNAPSHOT-python-api.zip
In addition, we need to download the model to be used for the inference/object detection (in this example), the images to test and the python script that will make use of the analytics zoo framework to do the object detection.
Starting with the model: we will use one of the several models listed in https://analytics-zoo.github.io/master/#ProgrammingGuide/object-detection/#download-link for object detection, specifically the SSD 300x300 VGG:
[omm]$ curl -O https://s3-ap-southeast-1.amazonaws.com/analytics-zoo-models/object-detection/analytics-zoo_ssd-vgg16-300x300_COCO_0.1.0.model
Once downloaded we will deploy the model in the HDFS cluster in order to make it available to all the nodes:
[omm]$ source /opt/client/bigdata_env[omm]$ hdfs dfs -mkdir /user/ZOO[omm]$ hdfs dfs -mkdir /user/ZOO/models[omm]$ hdfs dfs -put analytics-zoo_ssd-vgg16–300x300_COCO_0.1.0.model /user/ZOO/models/
Same for the images were we want to detect the objects. In this case, we will use a set from the Open Images Dataset . In this site we can find instructions on how to download different image sets. For all the cases we will need to download them form AWS S3 service, therefore the easy way is to use the AWS cli tool. To install the tool as user linux:
[linux]$ umask 0022[linux]$ sudo python pip-18.0-py2.py3-none-any.whl/pip install awscli
just remember the previous notes on the umask command and for installing python modules using pip. The previous command will install the awscli tool and all the required packages
…Successfully installed awscli-1.16.93 botocore-1.12.83 colorama-0.3.9 docutils-0.14 futures-3.2.0 jmespath-0.9.3 pyasn1–0.4.5 python-dateutil-2.7.5 rsa-3.4.2 s3transfer-0.1.13 urllib3–1.24.1
Now we can use the awscli tool as omm user (see previous notes to see how to login as omm user from linux user) to download the images. Let us choose the images in the challenge 2018
[omm]$ aws s3 — no-sign-request sync s3://open-images-dataset/challenge2018 ./ZOO/OpenImage/Challenge
This command will download ~10GB of images to the ZOO/OpenImage/Challenge folder
Before running a complete object detection job in the spark cluster over all the images in the dataset, let us do a simple test running the object detection locally in the master node over a few images from the set. By running this job we will obtain the same images with the principal detected objects in a box, including the detected object category and the probability of success in the detection. Let us select a few images and store them in a folder in the HDFS system:
[omm]$ source /opt/client/bigdata_env[omm]$ hdfs dfs -mkdir /user/ZOO/images_test[omm]$ ls ZOO/OpenImage/Challenge/0ddd*
ZOO/OpenImage/Challenge/0ddd29426bd65ab1.jpg ZOO/OpenImage/Challenge/0ddd61facdee395c.jpg[omm]$ hdfs dfs -put ZOO/OpenImage/Challenge/0ddd* /user/ZOO/images_test/
As it can be seen in this example we have selected the four images whose name starts with “0ddd”
We are almost prepared to run the example. We just need to obtain the python script that implements the object detection from the analytics zoo github https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/objectdetection/predict.py
[omm]$ cd ZOO/lib[omm]$ curl -O https://raw.githubusercontent.com/intel-analytics/analytics-zoo/master/pyzoo/zoo/examples/objectdetection/predict.py
Object detection on four images
Now we can run this first job. Let us create a local folder to store the resulting images with the objects detected and after that, we can launch the job
[omm]$ cd ~[omm]$ mkdir ZOO/images_out/[omm]$ spark-submit --master local[*] --executor-cores 4 --num-executors 3 --driver-memory 6g --executor-memory 6g --py-files /home/omm/ZOO/lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1-0.4.0-SNAPSHOT-python-api.zip --jars /home/omm/ZOO/lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1-0.4.0-SNAPSHOT-jar-with-dependencies.jar ./ZOO/lib/predict.py hdfs://hacluster/user/ZOO/models/analytics-zoo_ssd-vgg16–300x300_COCO_0.1.0.model hdfs://hacluster/user/ZOO/images_test ./ZOO/images_out/ -partition_num 3
2019–01–23 09:06:46,588 | INFO | Thread-3 | Running Spark version 2.1.0 | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)…BigDLBasePickler registering: bigdl.util.common Sample
BigDLBasePickler registering: bigdl.util.common EvaluatedResult
BigDLBasePickler registering: bigdl.util.common JTensor
BigDLBasePickler registering: bigdl.util.common JActivity
In the command before you can easily locate how to use the analytics zoo framework (jar and python API zip), the python script implementing the image detection (predict.py), the model to be used in the prediction (analytics-zoo_ssd-vgg16–300x300_COCO_0.1.0.model) and finally the image folder where we stored the images we will use to predict the objects in there (hdfs://hacluster/user/ZOO/images_test).
Once the job is completed, we can look into the output folder and check the object detection in the image, for example the image corresponding to the original 0ddd61facdee395c.jpg
This looks great, but for an automatic object detection of thousands of images this method seems useless.
Object detection on thousands of images
Some simple modifications into the original predict.py will make the trick. Just to change the Visalizer by a DecodeOutput, save the output of the main object detected into a new RDD (resilient distributed dataset) and store the results in parallel from the different worker nodes (thanks to the RDD) in the HDFS.
config = model.get_config()
textizer = DecodeOutput(config.label_map())
textized = textizer(output).get_predict()
rdd2 = textized.map(lambda x: (x, max(x, key=lambda y: y)))
You can download the modified predict.py from the github:
[omm]$ cd ZOO/lib[omm]$ curl -O https://raw.githubusercontent.com/fernandodelaiglesia/cajondesastre/master/predict_noimage_clean.py
Before analyzing the images, we need to execute some tasks in order to see the advantage of using Analytics Zoo. First, we need to upload the set of images to the HDFS.
[omm]$ cd ~[omm]$ hdfs dfs -mkdir /user/ZOO/images[omm]$ hdfs dfs -put ZOO/OpenImage/Challenge/* /user/ZOO/images/
For a test, we do not need to upload the 10K+ images in the set. For this test, we have used 2900 of them, including the ones whose name starts with 0ddd (to compare with the previous test).
Second, in order to be able to access the Hadoop web console included in the service, we need to configure the security group that the master nodes of the cluster are using to allow access to port 26000. Once we launch the job, we can see in this console the status and evolution of the executors, jobs, tasks, etc. for the application running.
To make easy the access I have created and entry in my hosts file pointing the name assigned (see the previous post) to the public address of the master, as you can see in the url in the image above.
Now we are ready to run the analytics zoo job for detecting objects in a lot of images, taking advantage of the parallel processing that the spark cluster is offering us, and storing the result of the analysis in the HDFS system in the folder hdfs://hacluster/user/ZOO/output1.
[omm]$ spark-submit --master yarn --deploy-mode cluster --executor-cores 8 --num-executors 6 --driver-memory 6g --executor-memory 7g --py-files /home/omm/ZOO/lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1-0.4.0-SNAPSHOT-python-api.zip --jars /home/omm/ZOO/lib/analytics-zoo-bigdl_0.6.0-spark_2.1.1-0.4.0-SNAPSHOT-jar-with-dependencies.jar ./ZOO/lib/predict_noimage_clean.py hdfs://hacluster/user/ZOO/models/analytics-zoo_ssd-vgg16-300x300_COCO_0.1.0.model hdfs://hacluster/user/ZOO/images hdfs://hacluster/user/ZOO/output1 --partition_num 32019-02-05 03:05:11,660 | WARN | main | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)
2019-02-05 03:05:12,853 | INFO | main | Requesting a new application from cluster | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 03:05:12,905 | INFO | main | Failing over to 40 | org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.performFailover(ConfiguredRMFailoverProxyProvider.java:100)
2019-02-05 03:05:12,967 | INFO | main | Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 03:05:12,968 | INFO | main | Will allocate AM container, with 6758 MB memory including 614 MB overhead | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 03:05:12,968 | INFO | main | Setting up container launch context for our AM | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)...2019-02-05 04:13:43,738 | INFO | main | Application report for application_1545388617376_0014 (state: RUNNING) | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:44,740 | INFO | main | Application report for application_1545388617376_0014 (state: RUNNING) | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:45,741 | INFO | main | Application report for application_1545388617376_0014 (state: FINISHED) | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:45,742 | INFO | main |
client token: N/A
ApplicationMaster host: 10.13.3.77
ApplicationMaster RPC port: 0
queue user: omm
start time: 1549357518794
final status: SUCCEEDED
tracking URL: http://node-master1-Xhyyh:26000/proxy/application_1545388617376_0014/
user: omm | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:45,758 | INFO | pool-1-thread-1 | Shutdown hook called to kill application. | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:45,765 | INFO | pool-1-thread-1 | Killed application application_1545388617376_0014 | org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.killApplication(YarnClientImpl.java:415)
2019-02-05 04:13:45,765 | INFO | pool-1-thread-1 | Shutdown hook called | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-02-05 04:13:45,766 | INFO | pool-1-thread-1 | Deleting directory /tmp/spark-f7aa62ac-f6bb-4ea1-bfde-650c4285072a | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)[omm]$
While the job is running, we can see the details of the stages of the Application in the Hadoop web console and verify that the work is executed in parallel with different tasks running in different core nodes
And when the job is finished we can see the output in the HDFS:
[omm]$ hdfs dfs -ls /user/ZOO/output1
Found 4 items
-rw-rw-rw- 3 omm hadoop 0 2019–02–05 04:13 /user/ZOO/output1/_SUCCESS
-rw-rw-rw- 3 omm hadoop 225670 2019–02–05 04:13 /user/ZOO/output1/part-00000
-rw-rw-rw- 3 omm hadoop 220755 2019–02–05 04:10 /user/ZOO/output1/part-00001
-rw-rw-rw- 3 omm hadoop 63161 2019–02–05 03:25 /user/ZOO/output1/part-00002
and the information of the main object detected into the images we tested previously:
[omm]$ hdfs dfs -cat /user/ZOO/output1/part-00000 | grep ‘/0ddd’
[omm]$ hdfs dfs -cat /user/ZOO/output1/part-00001 | grep ‘/0ddd’
(u’hdfs://hacluster/user/ZOO/images/0ddd018f6cf866ff.jpg’, array([ 1. , 0.9903376, 312.17386 , 303.02774 , 858.29724 ,
(u’hdfs://hacluster/user/ZOO/images/0ddd29426bd65ab1.jpg’, array([ 1. , 0.82304555, 526.93805 , 236.68103 ,
(u’hdfs://hacluster/user/ZOO/images/0ddd484cd2e0bbf2.jpg’, array([ 1. , 0.9958578, 628.4286 , 52.384884 , 984.40686 ,
(u’hdfs://hacluster/user/ZOO/images/0ddd61facdee395c.jpg’, array([ 3. , 0.9370283, 14.101074 , 288.6838 , 151.26463 ,
[omm]$ hdfs dfs -cat /user/ZOO/output1/part-00002 | grep ‘/0ddd’
(as we can see in this case all the four images were processed by the same job) or in any other image:
[omm]$ hdfs dfs -cat /user/ZOO/output1/part-00000 | head -6
(u'hdfs://hacluster/user/ZOO/images/00000b4dcff7f799.jpg', array([ 8. , 0.9821875, 0. , 77.09592 , 884.97925 ,
600.04706 ], dtype=float32))
(u'hdfs://hacluster/user/ZOO/images/00001a21632de752.jpg', array([ 1. , 0.98545676, 553.1527 , 430.21088 ,
825.5608 , 676.8691 ], dtype=float32))
(u'hdfs://hacluster/user/ZOO/images/0000d67245642c5f.jpg', array([9.0000000e+00, 1.8018720e-01, 5.3552136e+02, 5.7251709e+02,
6.4719580e+02, 6.2008929e+02], dtype=float32))
Let us focus on the image hdfs://hacluster/user/ZOO/images/0ddd61facdee395c.jpg, that is the one that we show in the previous test, the one with the cars. The main object detected record for this image as you can see is:
(u'hdfs://hacluster/user/ZOO/images/0ddd61facdee395c.jpg', array([ 3. , 0.9370283, 14.101074 , 288.6838 , 151.26463 ,
682. ], dtype=float32))
the important parts of this record for us are:
- The name of the image file (hdfs://hacluster/user/ZOO/images/0ddd61facdee395c.jpg)
- The first number in the array (3), that describes the main object detected and that you can locate in the corresponding classname file, in this case for the COCO model that we used https://github.com/intel-analytics/analytics-zoo/blob/e5b81233f5862c4007f5e099eba82a6632c8ea13/zoo/src/main/resources/coco_classname.txt. In the third line (starting in 0), you can see the name:
in this case, a car
- The second number in the array (0.9370283), that corresponds to the probability of success in the detection, 93.7%.
As we can see this matchs exactly with the image we show in the first test, in addition we were able to analyze 2900 images more with parallel processing with a very simple script.