Running Analytics Zoo jobs on Telefónica Open Cloud’s MRS Service

by Fernando de la Iglesia, Technology Expert at Telefónica I+D

Image for post
Image for post

Some months ago, I published a post on how to run BigDL distributed deep learning jobs on our MRS Open Cloud service:

Going a step forward, now I present how to run Analytics Zoo jobs on top of the MRS Service. But, what is Analytics Zoo?

Borrowing the definition from their github page:

“Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline; the entire pipeline can then transparently scale out to a large Hadoop/Spark cluster for distributed training or inference.”

All that means that leveraging this platform, you can execute complex analytics and AI jobs, even with a few lines of code. See for example the object detection jobs that we are going to use as an example here.

The intention of this post is to show how to leverage this great technology on top of your Telefónica Open Cloud MRS instance. As a quick reference: MRS is the Telefónica Open Cloud Map Reduce Service that enables you to quickly set up a fully functional Hadoop or Spark cluster including several other Big Data products (Hive, HDFS, etc.). You can find the complete information in the corresponding help center page.

Because I think that the best way to show something is using an example, let us execute some object detection using Analytics Zoo on top of MRS. The easy way is, as stated before, to use the object detection example provided in the analytics-zoo github. This example will help us to illustrate some of the general tasks you would need to execute in order to run analytics-zoo jobs on MRS.

First of all, we need to install the dependencies required, in this case the python libraries that are not installed by default in the MRS virtual machines, as for example OpenCV in this case (see the file in the object detection example).

(Of course, you can automate the following steps). We need to deploy the corresponding packages and dependencies in all the MRS nodes.

An easy way to download the required packages and dependencies to all the nodes is to configure the NAT Gateway to the VPC subnet where you deployed the MRS cluster (see the help center); this will provide access to the internet to all the nodes in the cluster.

The next step is to download and install the python dependencies. We will use pip for this task. First, we need to download the pip tool. The most suitable form to deal with all of this is to run pip form the pip wheel itself.

Accessing via ssh to some on the nodes (the master would be recommended) as showed in the previous post (here) as the user linux, we can obtain the wheel.

(You can find the wheels in

In order to get the information and the wheels for all the required dependencies you can create a file that contains the dependencies, OpenCV in this case, because the argparse dependency is already fulfilled by default in the MRS nodes:

The last command will download all the required (non satisfied dependencies) wheels to the dist folder.

It is not straightforward to know that you need to install several OS libraries before make these libraries work. I found that after trying to deploy and use the python libraries.

To be able to install the OS libraries via rpm, please check that the yum repositories are correctly configured (in /etc/yum.repos.d). The url should point to . Once this is verified, we can proceed to install the required libraries:

Now we are prepared to install the python wheels:

The umask instruction is required to allow regular users to access to the libraries.

We need to deploy all these packages in all the nodes of the cluster.

Now is time to download the analytics zoo framework. As you know, you can run spark jobs in different ways. Here we will run the jobs from one of the master nodes. Therefore we will access via ssh to one of the master nodes (refer to previous notes) to download the analytics zoo framework. You can find the releases to download and their URLs in the Analytics Zoo documentation page:

In this case, we will download one of the nightly build versions for Spark 2.1.1.

Once we have login as user linux into the master node we can proceed as follow

We need to unpack just the required files:

In addition, we need to download the model to be used for the inference/object detection (in this example), the images to test and the python script that will make use of the analytics zoo framework to do the object detection.

Starting with the model: we will use one of the several models listed in for object detection, specifically the SSD 300x300 VGG:

Once downloaded we will deploy the model in the HDFS cluster in order to make it available to all the nodes:

Same for the images were we want to detect the objects. In this case, we will use a set from the Open Images Dataset . In this site we can find instructions on how to download different image sets. For all the cases we will need to download them form AWS S3 service, therefore the easy way is to use the AWS cli tool. To install the tool as user linux:

just remember the previous notes on the umask command and for installing python modules using pip. The previous command will install the awscli tool and all the required packages

Now we can use the awscli tool as omm user (see previous notes to see how to login as omm user from linux user) to download the images. Let us choose the images in the challenge 2018

This command will download ~10GB of images to the ZOO/OpenImage/Challenge folder

Before running a complete object detection job in the spark cluster over all the images in the dataset, let us do a simple test running the object detection locally in the master node over a few images from the set. By running this job we will obtain the same images with the principal detected objects in a box, including the detected object category and the probability of success in the detection. Let us select a few images and store them in a folder in the HDFS system:

As it can be seen in this example we have selected the four images whose name starts with “0ddd”

We are almost prepared to run the example. We just need to obtain the python script that implements the object detection from the analytics zoo github

Now we can run this first job. Let us create a local folder to store the resulting images with the objects detected and after that, we can launch the job

In the command before you can easily locate how to use the analytics zoo framework (jar and python API zip), the python script implementing the image detection (, the model to be used in the prediction (analytics-zoo_ssd-vgg16–300x300_COCO_0.1.0.model) and finally the image folder where we stored the images we will use to predict the objects in there (hdfs://hacluster/user/ZOO/images_test).

Once the job is completed, we can look into the output folder and check the object detection in the image, for example the image corresponding to the original 0ddd61facdee395c.jpg

Image for post
Image for post

This looks great, but for an automatic object detection of thousands of images this method seems useless.

Some simple modifications into the original will make the trick. Just to change the Visalizer by a DecodeOutput, save the output of the main object detected into a new RDD (resilient distributed dataset) and store the results in parallel from the different worker nodes (thanks to the RDD) in the HDFS.

You can download the modified from the github:

Before analyzing the images, we need to execute some tasks in order to see the advantage of using Analytics Zoo. First, we need to upload the set of images to the HDFS.

For a test, we do not need to upload the 10K+ images in the set. For this test, we have used 2900 of them, including the ones whose name starts with 0ddd (to compare with the previous test).

Second, in order to be able to access the Hadoop web console included in the service, we need to configure the security group that the master nodes of the cluster are using to allow access to port 26000. Once we launch the job, we can see in this console the status and evolution of the executors, jobs, tasks, etc. for the application running.

Image for post
Image for post

To make easy the access I have created and entry in my hosts file pointing the name assigned (see the previous post) to the public address of the master, as you can see in the url in the image above.

Now we are ready to run the analytics zoo job for detecting objects in a lot of images, taking advantage of the parallel processing that the spark cluster is offering us, and storing the result of the analysis in the HDFS system in the folder hdfs://hacluster/user/ZOO/output1.

While the job is running, we can see the details of the stages of the Application in the Hadoop web console and verify that the work is executed in parallel with different tasks running in different core nodes

Image for post
Image for post

And when the job is finished we can see the output in the HDFS:

and the information of the main object detected into the images we tested previously:

(as we can see in this case all the four images were processed by the same job) or in any other image:

Let us focus on the image hdfs://hacluster/user/ZOO/images/0ddd61facdee395c.jpg, that is the one that we show in the previous test, the one with the cars. The main object detected record for this image as you can see is:

the important parts of this record for us are:

in this case, a car

  • The second number in the array (0.9370283), that corresponds to the probability of success in the detection, 93.7%.

As we can see this matchs exactly with the image we show in the first test, in addition we were able to analyze 2900 images more with parallel processing with a very simple script.

I love to learn, specially how nature works, and this is why I studied physics and love quantum “things”.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store