Detailed Guide to Setting up Scalable Apache Spark Infrastructure on Docker - Standalone Cluster With History Server
This post is a complete guide to building a scalable Apache Spark infrastructure using Docker. We will see how to enable the History Server for log persistence. The ability to scale up and down is one of the key requirements of today’s distributed infrastructure. By the end of this guide, you should have a fair understanding of setting up Apache Spark on Docker, and we will run a sample program on the cluster.
Prerequisites:
- Docker (Installation Instructions Here)
- Eclipse (download from here)
- Scala (Read this to Install Scala)
- Gradle (Read this to Install Gradle)
- Apache Spark (Read this to Install Spark)
GitHub Repos:
- docker-spark-image - This repo contains the Dockerfile required to build the base image for the containers.
- create-and-run-spark-job - This repo contains all the necessary files required to build a scalable infrastructure.
- Docker_WordCount_Spark - This repo contains the source code that I will be running as part of the demo. You can clone this repo and run a `gradle clean build` to generate an executable jar.
Let’s Begin ….!!!
Setting up Apache Spark in Docker gives us the flexibility to scale the infrastructure with the complexity of the project. This way:
- We neither under-utilize nor over-utilize the power of Apache Spark.
- We neither under-allocate nor over-allocate resources to the cluster.
- In a shared environment, we have some liberty to spawn our own clusters and bring them down.
So, here’s what I will be covering in this tutorial:
- Create a base image for all the Spark nodes.
- Create a bridged network to connect all the containers internally.
- Add shared volumes across all the containers for data sharing.
- Submit a job to cluster.
- Finally, monitor the job for performance optimization.
Let’s go over each one of these above steps in detail.
1. Creating a base image for all our Spark nodes.
- Create a directory `docker-spark-image` that will contain the following files - Dockerfile, master.conf, slave.conf, history-server.conf and spark-defaults.conf.
- master.conf - This configuration file is used to start the master node on the container. This is started in supervisord mode (a sketch of the supervisord files follows this list).
From the Docker docs :
supervisord - Use a process manager like supervisord. This is a moderately heavy-weight approach that requires you to package supervisord and its configuration in your image (or base your image on one that includes supervisord), along with the different applications it manages. Then you start supervisord, which manages your processes for you.
- slave.conf - This configuration file is used to start the slave node on the container and allow it to connect to the master node. This is also started in supervisord mode. Make sure to note down the name `master` - we will be referencing the master node as `master` throughout this post.
- history-server.conf - This configuration file is used to start the history server on the container. This is also started in supervisord mode. I build a separate container just to persist the application history logs of Spark jobs. The history log location specified in spark-defaults.conf is a shared volume between all containers, mounted to our local filesystem for log persistence. We will see below how this works in action.
- spark-defaults.conf - This configuration file is used to enable and set the log locations used by the history server.

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:///opt/spark-events
spark.history.fs.logDirectory    file:///opt/spark-events
```
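To make the supervisord pieces concrete, here is a minimal sketch of what master.conf and slave.conf might contain. The program sections and options are assumptions on my part; only the /opt/spark install path and the `master` hostname come from this post, so treat this as a starting point rather than the exact files from the repo.

```
; master.conf (sketch) - launch the standalone master in the foreground
[supervisord]
nodaemon=true

[program:spark-master]
command=/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master --host master
autorestart=true
```

```
; slave.conf (sketch) - the worker registers with the master at spark://master:7077
[supervisord]
nodaemon=true

[program:spark-worker]
command=/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
autorestart=true
```

history-server.conf follows the same pattern, launching org.apache.spark.deploy.history.HistoryServer instead.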
Finally, the Dockerfile - lines 6:31 update the image and install Java 8, supervisord and Apache Spark 2.2.1 with Hadoop 2.7. Then we copy all the configuration files to the image and create the log location as specified in `spark-defaults.conf`. All the required ports are exposed for proper communication between the containers and for job monitoring using the WebUI.

```dockerfile
# adding conf files to all images. This will be used in supervisord for running spark master/slave
COPY master.conf /opt/conf/master.conf
COPY slave.conf /opt/conf/slave.conf
COPY history-server.conf /opt/conf/history-server.conf

# Adding configurations for history server
COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf
RUN mkdir -p /opt/spark-events

# expose port 8080 for spark UI
EXPOSE 4040 6066 7077 8080 18080 8081
```
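For completeness, the install portion of the Dockerfile (lines 6:31) looks roughly like the sketch below. It is reconstructed from the build output shown further down; the Spark download URL and the exact layout are assumptions.

```dockerfile
FROM ubuntu:14.04

# install Java 8 (via the webupd8team PPA, as seen in the build log) and supervisord
RUN apt-get update -y \
    && apt-get install -y software-properties-common wget \
    && apt-add-repository ppa:webupd8team/java -y \
    && apt-get update -y \
    && echo debconf shared/accepted-oracle-license-v1-1 select true | debconf-set-selections \
    && apt-get install -y oracle-java8-installer supervisor

# download and unpack Apache Spark 2.2.1 with Hadoop 2.7 (URL assumed from the Apache archives)
RUN wget -q https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz \
    && tar -xzf spark-2.2.1-bin-hadoop2.7.tgz -C /opt \
    && ln -s /opt/spark-2.2.1-bin-hadoop2.7 /opt/spark \
    && rm spark-2.2.1-bin-hadoop2.7.tgz
```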
Additionally, you can start a dummy process in the container so that the container does not exit unexpectedly after creation. This, in combination with the supervisord daemon, ensures that the container stays alive until killed or stopped manually.
```dockerfile
# default command: this is just an option
CMD ["/opt/spark/bin/spark-shell", "--master", "local[*]"]
```
Create an image by running the below command from the `docker-spark-image` directory.

```
Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker build -t pavanpkulkarni/spark_image .
Sending build context to Docker daemon  81.92kB
[WARNING]: Empty continuation line found in: RUN apt-get install software-properties-common -y && apt-add-repository ppa:webupd8team/java -y && apt-get update -y && echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections && echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections && apt-get install -y oracle-java8-installer supervisor
[WARNING]: Empty continuation lines will become errors in a future release.
Step 1/16 : FROM ubuntu:14.04
14.04: Pulling from library/ubuntu
.
.
.
Removing intermediate container 2d860633548a
 ---> bb560415c8bf
Step 16/16 : CMD ["/opt/spark/bin/spark-shell", "--master", "local[*]"]
 ---> Running in 4d354e2b3984
Removing intermediate container 4d354e2b3984
 ---> 4c1113febcc4
Successfully built 4c1113febcc4
Successfully tagged pavanpkulkarni/spark_image:latest

Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker images
REPOSITORY                   TAG      IMAGE ID       CREATED          SIZE
pavanpkulkarni/spark_image   latest   4c1113febcc4   46 seconds ago   1.36GB
ubuntu                       14.04    8cef1fa16c77   13 days ago      223MB

Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker tag 4c1113febcc4 pavanpkulkarni/spark_image:2.2.1

Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker images
REPOSITORY                   TAG      IMAGE ID       CREATED         SIZE
pavanpkulkarni/spark_image   2.2.1    4c1113febcc4   4 minutes ago   1.36GB
pavanpkulkarni/spark_image   latest   4c1113febcc4   4 minutes ago   1.36GB
ubuntu                       14.04    8cef1fa16c77   13 days ago     223MB
```
OR
You can pull this image from my Docker Hub:

```
docker pull pavanpkulkarni/spark_image:2.2.1
```
Creating a cluster
To create a cluster, I make use of the `docker-compose` utility.
From the docker-compose docs:
docker-compose - Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Create a new directory `create-and-run-spark-job`. This directory will contain - docker-compose.yml, Dockerfile, the executable jar and/or any supporting files required for execution.
We start by creating docker-compose.yml. Let’s create 3 sections, one each for the master, the slave and the history-server. The `image` needs to be specified for each container. The `ports` field specifies port binding between the host and container as `HOST_PORT:CONTAINER_PORT`. Under the slave section, port `8081` is exposed to the host (`expose` can be used instead of `ports`); here 8081 is free to bind with any available port on the host side. `command` is used to run a command in the container. The `volumes` field creates and mounts volumes between container and host, and follows the `HOST_PATH:CONTAINER_PATH` format. These are the minimum configurations we need to have in docker-compose.yml.
master node:

```yaml
master:
  build: .
  image: pavanpkulkarni/spark_image:2.2.1
  container_name: master
  ports:
    - 4040:4040
    - 7077:7077
    - 8080:8080
    - 6066:6066
  command: ["/usr/bin/supervisord", "--configuration=/opt/conf/master.conf"]
```
slave node:

```yaml
slave:
  image: pavanpkulkarni/spark_image:2.2.1
  depends_on:
    - master
  ports:
    - "8081"
  command: ["/usr/bin/supervisord", "--configuration=/opt/conf/slave.conf"]
  volumes:
    - ./docker-volume/spark-output/:/opt/output
    - ./docker-volume/spark-events/:/opt/spark-events
```
history-server container:

```yaml
history-server:
  image: pavanpkulkarni/spark_image:2.2.1
  container_name: history-server
  depends_on:
    - master
  ports:
    - "18080:18080"
  command: ["/usr/bin/supervisord", "--configuration=/opt/conf/history-server.conf"]
  volumes:
    - ./docker-volume/spark-events:/opt/spark-events
```
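Stitched together, the three sections above make up the complete docker-compose.yml. One caveat worth flagging: `depends_on` requires Compose file format 2 or later, so when the file carries a `version` key, the sections nest under `services:`. A skeleton of the assembled file, with the wrapper shown as an assumption:

```yaml
version: "3"        # assumed; any 2.x/3.x format that supports depends_on works
services:
  master:
    # ... master section from above ...
  slave:
    # ... slave section from above ...
  history-server:
    # ... history-server section from above ...
```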
Executable jar - I have built the project using `gradle clean build`. I will be using the Docker_WordCount_Spark-1.0.jar for the demo. This jar is an application that performs a simple WordCount on sample.txt and writes the output to a directory. The jar takes 2 arguments as shown below; `output_directory` is the mounted volume of the worker nodes (slave containers).

```
Docker_WordCount_Spark-1.0.jar [input_file] [output_directory]
```
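For context, here is a minimal sketch of what such a WordCount application might look like in Scala. The package and object names match the `--class` used in the spark-submit further down, but the body is my own reconstruction, not the actual source from the Docker_WordCount_Spark repo.

```scala
package com.pavanpkulkarni.dockerwordcount

import org.apache.spark.sql.SparkSession

object DockerWordCount {
  def main(args: Array[String]): Unit = {
    // args: [input_file] [output_directory]
    val Array(inputFile, outputDir) = args

    val spark = SparkSession.builder().appName("DockerWordCount").getOrCreate()
    import spark.implicits._

    spark.read.textFile(inputFile)     // one record per line
      .flatMap(_.split("\\s+"))        // split lines into words
      .filter(_.nonEmpty)              // drop empty tokens
      .groupByKey(identity)            // group identical words
      .count()                         // Dataset[(String, Long)]
      .write.csv(outputDir)            // matches the part-*.csv output seen later

    spark.stop()
  }
}
```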
Dockerfile - This is an application-specific Dockerfile that contains only the jar and application-specific files. `docker-compose` uses this Dockerfile to build the containers.

```dockerfile
FROM pavanpkulkarni/spark_image:2.2.1

LABEL authors="pavanpkulkarni@pavanpkulkarni.com"

COPY Docker_WordCount_Spark-1.0.jar /opt/Docker_WordCount_Spark-1.0.jar
COPY sample.txt /opt/sample.txt
```
To build the 3-node cluster:

- cd to `create-and-run-spark-job`.
- Run the command `docker ps -a` to check the status of containers. We start with one image and no containers.

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker images
REPOSITORY                   TAG      IMAGE ID       CREATED             SIZE
pavanpkulkarni/spark_image   2.2.1    4c1113febcc4   About an hour ago   1.36GB
pavanpkulkarni/spark_image   latest   4c1113febcc4   About an hour ago   1.36GB
ubuntu                       14.04    8cef1fa16c77   13 days ago         223MB

Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker ps -a
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
```
Build the services with docker-compose using the application-specific Dockerfile.

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker-compose build
Building master
Step 1/4 : FROM pavanpkulkarni/spark_image:2.2.1
 ---> 4c1113febcc4
Step 2/4 : LABEL authors="pavanpkulkarni@pavanpkulkarni.com"
 ---> Running in 8d4cce9730cb
Removing intermediate container 8d4cce9730cb
 ---> 0e0f1aba18ed
Step 3/4 : COPY Docker_WordCount_Spark-1.0.jar /opt/Docker_WordCount_Spark-1.0.jar
 ---> 215e22127d54
Step 4/4 : COPY sample.txt /opt/sample.txt
 ---> a79cd3fb5e33
Successfully built a79cd3fb5e33
Successfully tagged pavanpkulkarni/spark_image:2.2.1
slave uses an image, skipping
history-server uses an image, skipping
```
Spawn the 3-node cluster.

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker-compose up -d --scale slave=3
Creating master ... done
Creating history-server ... done
Creating create-and-run-spark-job_slave_1 ... done
Creating create-and-run-spark-job_slave_2 ... done
Creating create-and-run-spark-job_slave_3 ... done

Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS          PORTS                                                                                                                 NAMES
043cf5a74586   pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago   Up 24 seconds   4040/tcp, 6066/tcp, 7077/tcp, 8080-8081/tcp, 0.0.0.0:18080->18080/tcp                                                history-server
bd762af3600e   pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago   Up 25 seconds   4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32770->8081/tcp                                           create-and-run-spark-job_slave_3
ee254a16787f   pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago   Up 24 seconds   4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32769->8081/tcp                                           create-and-run-spark-job_slave_1
463edf008d05   pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago   Up 24 seconds   4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32768->8081/tcp                                           create-and-run-spark-job_slave_2
ad6a781d9437   pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago   Up 25 seconds   0.0.0.0:4040->4040/tcp, 0.0.0.0:6066->6066/tcp, 0.0.0.0:7077->7077/tcp, 8081/tcp, 0.0.0.0:8080->8080/tcp, 18080/tcp   master
```
This gives us a 3-node cluster:
- Master - master
- Workers - create-and-run-spark-job_slave_1, create-and-run-spark-job_slave_2, create-and-run-spark-job_slave_3
- History Server - history-server
The mounted volumes will now be visible on your host. In my case, I can see 2 directories created in my current directory:

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ ls -ltr docker-volume/
total 0
drwxr-xr-x  3 pavanpkulkarni  staff  96 May 10 09:57 spark-output
drwxr-xr-x  2 pavanpkulkarni  staff  64 May 11 15:51 spark-events
```
Note on docker-compose networking, from the docker-compose docs:

docker-compose - By default Compose sets up a single network for your app. Each container for a service joins the default network and is both reachable by other containers on that network, and discoverable by them at a hostname identical to the container name.

In our case, we have a bridged network called `create-and-run-spark-job_default`. The name of the network is the same as the name of your parent directory; this can be changed by setting the COMPOSE_PROJECT_NAME variable. A deeper inspection can be done by running the `docker inspect create-and-run-spark-job_default` command.

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker network ls
NETWORK ID     NAME                               DRIVER   SCOPE
dc9ce7304889   bridge                             bridge   local
a5bd0ff97b90   create-and-run-spark-job_default   bridge   local
8433fa00d5d8   host                               host     local
fbbb577e1d8e   none                               null     local
```
The Spark cluster can be verified to be up and running via the WebUIs:

- Master - localhost:8080
- History Server - localhost:18080
- Workers - localhost:32769, localhost:32768, localhost:32770. Check the port bindings under the `PORTS` field of the `docker ps -a` output.
The cluster can be scaled up or down by replacing n with the desired number of worker nodes.

```
docker-compose up -d --scale slave=n
```
Let’s submit a job to this 3-node cluster from the `master` node. This is a simple `spark-submit` command that will produce the output in the `/opt/output/wordcount_output` directory.

```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker exec master /opt/spark/bin/spark-submit --class com.pavanpkulkarni.dockerwordcount.DockerWordCount --master spark://master:6066 --deploy-mode cluster --verbose /opt/Docker_WordCount_Spark-1.0.jar /opt/sample.txt /opt/output/wordcount_output
Using properties file: /opt/spark/conf/spark-defaults.conf
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.eventLog.dir=file:///opt/spark-events
Adding default property: spark.history.fs.logDirectory=file:///opt/spark-events
Parsed arguments:
  master                  spark://master:6066
  deployMode              cluster
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /opt/spark/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.pavanpkulkarni.dockerwordcount.DockerWordCount
  primaryResource         file:/opt/Docker_WordCount_Spark-1.0.jar
  name                    com.pavanpkulkarni.dockerwordcount.DockerWordCount
  childArgs               [/opt/sample.txt /opt/output/wordcount_output]
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from the properties file /opt/spark/conf/spark-defaults.conf:
  (spark.eventLog.enabled,true)
  (spark.history.fs.logDirectory,file:///opt/spark-events)
  (spark.eventLog.dir,file:///opt/spark-events)

Running Spark using the REST application submission protocol.
Main class:
org.apache.spark.deploy.rest.RestSubmissionClient
Arguments:
file:/opt/Docker_WordCount_Spark-1.0.jar
com.pavanpkulkarni.dockerwordcount.DockerWordCount
/opt/sample.txt
/opt/output/wordcount_output
System properties:
(spark.eventLog.enabled,true)
(SPARK_SUBMIT,true)
(spark.history.fs.logDirectory,file:///opt/spark-events)
(spark.driver.supervise,false)
(spark.app.name,com.pavanpkulkarni.dockerwordcount.DockerWordCount)
(spark.jars,file:/opt/Docker_WordCount_Spark-1.0.jar)
(spark.submit.deployMode,cluster)
(spark.eventLog.dir,file:///opt/spark-events)
(spark.master,spark://master:6066)
Classpath elements:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/05/11 20:06:54 INFO RestSubmissionClient: Submitting a request to launch an application in spark://master:6066.
18/05/11 20:06:55 INFO RestSubmissionClient: Submission successfully created as driver-20180511200654-0000. Polling submission state...
18/05/11 20:06:55 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20180511200654-0000 in spark://master:6066.
18/05/11 20:06:55 INFO RestSubmissionClient: State of driver driver-20180511200654-0000 is now RUNNING.
18/05/11 20:06:55 INFO RestSubmissionClient: Driver is running on worker worker-20180511194135-172.18.0.4-33221 at 172.18.0.4:33221.
18/05/11 20:06:55 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180511200654-0000",
  "serverSparkVersion" : "2.2.1",
  "submissionId" : "driver-20180511200654-0000",
  "success" : true
}
```
Output is available on the mounted volume on host -
```
Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ ls -ltr docker-volume/spark-output/wordcount_output/
total 32
-rw-r--r--  1 pavanpkulkarni  staff  12324 May 11 16:07 part-00000-a4225163-c8dd-4a51-9ef4-44e085e553e4-c000.csv
-rw-r--r--  1 pavanpkulkarni  staff      0 May 11 16:07 _SUCCESS
```
Automation
Should the Ops team choose to put the job on a scheduler for daily processing, or simply for the ease of developers, I have created a simple script to take care of the above steps - RunSparkJobOnDocker.sh. This script alone can be used to scale the cluster up or down per requirement.
TIP: Using the `spark-submit` REST API, we can monitor the job and bring down the cluster after job completion.
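For example, polling the status of the driver we submitted above (the status endpoint is part of Spark's standalone REST submission API on port 6066; the teardown step shown is one option, not the only one):

```
# poll the state of the submitted driver
curl http://localhost:6066/v1/submissions/status/driver-20180511200654-0000

# once the response reports "driverState" : "FINISHED", bring the cluster down
docker-compose down
```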