Apache Spark example with Java and Maven

Apache Spark has become one of the most popular Big Data technologies. It is designed to run either on top of Hadoop or as a standalone application. The big advantage of running Spark standalone is its ease of use, especially for evaluating models.

Project structure

This example consists of a pom.xml file and a WordCountTask.java file. An additional test resource with some sample data is also present, named loremipsum.txt. The files can be found on GitHub at https://github.com/melphi/spark-examples/tree/master/first-example.

spark-examples
|-- pom.xml
`-- src
    |-- main/java/org/sparkexample/WordCountTask.java
    |-- test/java/org/sparkexample/WordCountTaskTest.java
    `-- test/resources/loremipsum.txt

Maven configuration

This is the Maven pom.xml configuration file. As you can see, we only need to import the spark-core library; slf4j and JUnit are added for logging and testing. The maven.compiler.source and maven.compiler.target properties tell the Maven compiler plugin to build the project for Java 8.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.sparkexamples</groupId>
    <artifactId>first-example</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- Spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.22</version>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

Java application

The WordCountTask.java class is a simple Java Spark application which counts the occurrences of each word in the file passed as the first input argument.

The full source is here https://github.com/melphi/spark-examples/blob/master/first-example/src/main/java/org/sparkexample/WordCountTask.java

package org.sparkexample;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import scala.Tuple2;

import java.util.Arrays;

import static com.google.common.base.Preconditions.checkArgument;

public class WordCountTask {
  private static final Logger LOGGER = LoggerFactory.getLogger(WordCountTask.class);

  public static void main(String[] args) {
    checkArgument(args.length > 0, "Please provide the path of the input file as the first parameter.");
    new WordCountTask().run(args[0]);
  }

  public void run(String inputFilePath) {
    // Run Spark embedded in this JVM, using all available CPU cores.
    String master = "local[*]";

    SparkConf conf = new SparkConf()
        .setAppName(WordCountTask.class.getName())
        .setMaster(master);
    JavaSparkContext context = new JavaSparkContext(conf);

    context.textFile(inputFilePath)
        // Split each line into words on spaces.
        .flatMap(text -> Arrays.asList(text.split(" ")).iterator())
        // Pair each word with a count of one.
        .mapToPair(word -> new Tuple2<>(word, 1))
        // Sum the counts of each word.
        .reduceByKey((a, b) -> a + b)
        .foreach(result -> LOGGER.info(
            String.format("Word [%s] count [%d].", result._1(), result._2())));

    context.close();
  }
}
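
Note that setMaster("local[*]") hard-codes the master inside the application. A value set programmatically on SparkConf takes precedence over the --master flag passed later to spark-submit, which is convenient for local testing; for a real cluster deployment you would typically drop the setMaster call and let spark-submit provide the master.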

Run as standalone local application

Running the application locally is just a matter of running "mvn test" from the project folder.

mvn test

This will run an embedded version of Spark for testing purposes.
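
The WordCountTaskTest class referenced in the project structure above is what "mvn test" executes. A minimal sketch of such a test (the actual file is in the GitHub repository linked above) could look like this:

package org.sparkexample;

import org.junit.Test;

public class WordCountTaskTest {
  @Test
  public void shouldCountTheWordsOfTheSampleFile() {
    // Resolve the sample file bundled under src/test/resources.
    String inputFile = getClass().getResource("/loremipsum.txt").getPath();

    // Run the word count on the embedded local[*] Spark context.
    new WordCountTask().run(inputFile);
  }
}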

Setup the Spark environment

Apache Spark can run as a standalone service or inside a Hadoop environment. Spark can be downloaded at https://spark.apache.org/downloads.html. For this example we will use the latest available release with the package type "Pre-built for Hadoop 2.7", but any other version should work without changes. This is how the download page looks:

[Image: the Spark download page]

Once downloaded and decompressed, the package is ready to use: a Hadoop installation is not required for standalone execution, the only requirement is an installed Java virtual machine.
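
For example, assuming the Spark 2.1.0 "Pre-built for Hadoop 2.7" archive (the file name depends on the release you picked), the installation can be verified from the shell:

tar -xzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
./bin/spark-submit --version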

Run the Java application on Apache Spark cluster

The Java application needs to be compiled before it can be executed on Apache Spark. To compile it with Maven:

  1. Open the command line and move to the root of the Maven project with "cd /<path to the project root>".
  2. Execute the command "mvn package". Maven must be on the system path, otherwise the mvn command will not be recognized; refer to the Maven documentation on how to set it up properly.
  3. Maven will build the application and save it in the target directory as /<path to the project root>/target/first-example-1.0-SNAPSHOT.jar (see the example commands below).
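
For example, assuming the repository was cloned to ~/spark-examples (the path is just a placeholder):

cd ~/spark-examples/first-example
mvn package
ls target/first-example-1.0-SNAPSHOT.jar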

[Image: Maven build output]

Once we have built the Java application first-example-1.0-SNAPSHOT.jar we can execute it locally on Apache Spark, which makes the entire testing process very easy.

On a command shell, move to the Spark installation directory and use the following command:

./bin/spark-submit --class org.sparkexample.WordCountTask --master local[2] /<path to maven project>/target/first-example-1.0-SNAPSHOT.jar /<path to a demo test file>

where 

  • "--class org.sparkexample.WordCount" is the main Java class with the public static void main method
  • "--master local[2]" starts the cluster locally using 2 CPU cores
  • <path to maven project> is the path to our maven project
  • <path to a demo test file> is a demo local file which contains some words, an example file can be downloaded at https://github.com/melphi/spark-examples/blob/master/first-example/src/test/resources/loremipsum.txt
  • <path to output directory> is the directory where the resuls should be saved
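
For example, assuming Spark was unpacked to ~/spark-2.1.0-bin-hadoop2.7 and the project to ~/spark-examples (both paths are placeholders), a complete invocation run from the Spark directory would be:

./bin/spark-submit --class org.sparkexample.WordCountTask --master local[2] ~/spark-examples/first-example/target/first-example-1.0-SNAPSHOT.jar ~/spark-examples/first-example/src/test/resources/loremipsum.txt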

If everything is fine, the output should be similar to the following image and the word count results should be printed to the console.

[Image: console output of the word count execution]

All the files used in this tutorial can be found at https://github.com/melphi/spark-examples/tree/master/first-example.