This tutorial describes how to install, configure, and run Apache Spark on Clear Linux* OS. Apache Spark is a fast general-purpose cluster computing system with the following features:
- Provides high-level APIs in Java*, Scala*, Python*, and R*.
- Includes an optimized engine that supports general execution graphs.
- Supports high-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.
In this tutorial, you will install Spark on a single machine running the master daemon and a worker daemon.
This tutorial assumes you have installed Clear Linux OS on your host system. For detailed instructions on installing Clear Linux OS on a bare metal system, visit the bare metal installation tutorial.
Before you install any new packages, update Clear Linux OS with the following command:
sudo swupd update
Install Apache Spark
Apache Spark is included in the big-data-basic bundle. To install the framework, enter:
sudo swupd bundle-add big-data-basic
Configure Apache Spark
Create the configuration directory with the command:
sudo mkdir /etc/spark
Copy the default templates from /usr/share/defaults/spark to /etc/spark with the command:
sudo cp /usr/share/defaults/spark/* /etc/spark
Since Clear Linux OS is a stateless system, you should never modify the files under the /usr/share/defaults directory. The software updater overwrites those files.
Copy the template files below to create custom configuration files:
sudo cp /etc/spark/spark-defaults.conf.template /etc/spark/spark-defaults.conf
sudo cp /etc/spark/spark-env.sh.template /etc/spark/spark-env.sh
sudo cp /etc/spark/log4j.properties.template /etc/spark/log4j.properties
Edit the /etc/spark/spark-env.sh file and add the SPARK_MASTER_HOST variable, set to your host's IP address. You can view your IP address with the hostname -I command.
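For example, assuming your host's IP address is 10.30.200.100 (a placeholder value), the added line in /etc/spark/spark-env.sh would look like:

```shell
# Bind the Spark master to this host's IP address.
# 10.30.200.100 is an example; substitute the address reported by `hostname -I`.
SPARK_MASTER_HOST=10.30.200.100
```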
This optional step enables the master's web user interface, which displays information used later in this tutorial.
Edit the /etc/spark/spark-defaults.conf file and update the spark.master variable with the SPARK_MASTER_HOST address and port 7077.
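Assuming the same example address as above, the resulting line in /etc/spark/spark-defaults.conf would be:

```
spark.master spark://10.30.200.100:7077
```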
Start the master server and a worker daemon
Start the master server using:

sudo /usr/share/apache-spark/sbin/start-master.sh
Start one worker daemon and connect it to the master using the same address and port you set in the spark.master variable earlier. Replace the example address below with your own:

sudo /usr/share/apache-spark/sbin/start-slave.sh spark://10.30.200.100:7077
Open a web browser and view the worker daemon information using the master's IP address and port 8080, for example: http://10.30.200.100:8080
Run the Spark wordcount example
Run the wordcount example using a file on your local host and output the results to a new file with the following command:
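If you do not already have an input file, you can create a small one first. The path and contents here are only placeholders; any text file works:

```shell
# Create a sample input file for the wordcount example.
# ~/Documents/example_file is a placeholder path matching the command below.
mkdir -p ~/Documents
printf 'apache spark\nclear linux\napache spark\n' > ~/Documents/example_file
```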
sudo spark-submit /usr/share/apache-spark/examples/src/main/python/wordcount.py ~/Documents/example_file > ~/Documents/results
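The bundled wordcount.py distributes this computation across the cluster with Spark. Conceptually, it tallies whitespace-separated words; a plain-Python sketch (not part of the tutorial, for illustration only) of the same logic:

```python
from collections import Counter

def wordcount(text: str) -> Counter:
    # Split on whitespace and count each word, mirroring the
    # flatMap -> map -> reduceByKey pipeline in Spark's wordcount.py.
    return Counter(text.split())

print(wordcount("apache spark clear linux apache spark"))
```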
Open a web browser and view the application information using the master's IP address and port 8080.
View the results of the wordcount application in the ~/Documents/results file.
You have successfully installed and set up a standalone Apache Spark cluster, and run a simple wordcount example.