Install Apache Spark 2.3

This post will guide you through the steps to install Spark

Pavan Kulkarni

2 minute read

This post will guide you through the installation of Apache Spark 2.3.

  1. Download the latest version of Apache Spark to your local machine from the official downloads page (https://spark.apache.org/downloads.html). This will download spark-x.x.x-bin-hadoop2.7.tgz.
  2. Un-compress the .tgz archive to a directory of your choice. For the purpose of this post, I will extract it to /Users/pavanpkulkarni/Documents/spark, as in the sketch below.
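
    A minimal terminal sketch of steps 1 and 2, assuming the 2.3.0 release from the Apache archive (adjust the URL, version, and target directory to match what you actually downloaded):

    # download the 2.3.0 pre-built package (illustrative URL; use the link from the downloads page)
    curl -O https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
    # extract the archive into the chosen directory
    mkdir -p /Users/pavanpkulkarni/Documents/spark
    tar -xzf spark-2.3.0-bin-hadoop2.7.tgz -C /Users/pavanpkulkarni/Documents/spark
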
  3. Add the following entries to your ~/.bash_profile

    #Spark Home
    export SPARK_HOME=/Users/pavanpkulkarni/Documents/spark/spark-2.3.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
  4. Source the ~/.bash_profile file so the changes take effect in the current shell.

    source ~/.bash_profile
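
    As a quick sanity check (optional), confirm that SPARK_HOME is set and that the Spark scripts are now on your PATH:

    # both commands should point at the directory configured above
    echo $SPARK_HOME
    which spark-shell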
    
  5. Verify installation

    Pavans-MacBook-Pro:~ pavanpkulkarni$ spark-shell
    2018-04-09 14:00:15 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://10.0.0.67:4040
    Spark context available as 'sc' (master = local[*], app id = local-1523296821403).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> 
    

    The Spark Web UI should be available at http://localhost:4040/
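
    To confirm the UI is up from a second terminal while the shell is still running (optional, and assuming curl is available), a request to port 4040 should return HTTP 200:

    # print only the HTTP status code of the Web UI's landing page
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:4040/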

  6. Run some sample code in spark-shell

    scala> val rdd = sc.parallelize(1 to 1000000, 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    
    scala> rdd.count()
    res0: Long = 1000000
    
    scala> rdd.take(20)
    res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
    
    scala> val rdd1 = rdd.map( _ + 1 )
    rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25
    
    scala> rdd1.count()
    res2: Long = 1000000
    
    scala> rdd1.take(20)
    res3: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
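
    To leave the shell, type :quit at the scala> prompt. If you also want to verify that jobs can be launched from the command line, Spark ships with runnable examples; a minimal sketch, assuming the setup above:

    # run the bundled SparkPi example locally via the run-example script
    $SPARK_HOME/bin/run-example SparkPi 10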
    
    