Cut Apache Spark costs with Google Cloud preemptible instances

Apache Spark, the popular open source tool for large-scale machine learning and data processing, is well known for its demand for memory and CPU cores, which are among the most expensive resources in a cloud computing environment.

Apache Spark cluster costs can be greatly reduced by using Google Cloud preemptible instances. Similar functionality exists in Amazon Web Services as well (Spot Instances), but Google makes it particularly straightforward to use preemptible instances in a Spark cluster thanks to Google Dataproc.

Preemptible instances

Preemptible instances are virtual machines that last at most one day: they can be automatically destroyed at any time within 24 hours of starting. Google Cloud provides them to optimise the workload of its data centres, and the good news is that they are really cheap!

A preemptible instance costs up to 80% less than a normal instance, but it will be automatically shut down within 24 hours. So why use them for a Spark cluster?

  • Most Apache Spark tasks run within minutes or a few hours, well within the 24-hour limit
  • Spark is designed to recover from node failures, so tasks will automatically restart if a node disappears
  • Google Cloud will automatically and immediately replace a lost instance with a new preemptible instance
  • Preemptible instances are very cheap! This means that cluster costs can be dramatically reduced (see the cost sketch after this list)
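
To get a feel for the savings at cluster scale, here is a back-of-the-envelope comparison in Python. The hourly rate is a placeholder, not an actual Google Cloud price; only the "up to 80% less" ratio comes from the figures above.

    # Back-of-the-envelope cluster cost comparison.
    # NOTE: regular_rate is a hypothetical placeholder, not a real GCP price.
    regular_rate = 0.20                    # $/hour per regular worker (placeholder)
    preemptible_rate = regular_rate * 0.2  # "up to 80% less" than a regular worker

    regular_workers = 2       # minimum non-preemptible workers required by Dataproc
    preemptible_workers = 8   # extra capacity added as preemptible instances

    all_regular = (regular_workers + preemptible_workers) * regular_rate
    mixed = (regular_workers * regular_rate
             + preemptible_workers * preemptible_rate)

    print(f"10 regular workers:        ${all_regular:.2f}/hour")
    print(f"2 regular + 8 preemptible: ${mixed:.2f}/hour")
    # -> $2.00/hour vs $0.72/hour: the same capacity at roughly a third of the cost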

There are also some drawbacks:

  • It is not possible to run a fully preemptible cluster: at least 1 master and 2 workers must be non-preemptible. This is absolutely reasonable, as it avoids failures of the master node and guarantees a minimum of stable nodes.
  • Very long-running tasks can be penalised by preemptible instances shutting down mid-job, making the cluster much less performant. Spark tasks lasting several hours or days are, however, not that common.

Set up a Google Cloud preemptible cluster

With Google Dataproc, setting up a preemptible Apache Spark cluster is extremely easy (a scripted alternative is sketched after these steps):

  1. Go to https://console.cloud.google.com/dataproc/clusters or log in to the Cloud Console and select Dataproc from the menu.
  2. Press the Create cluster button.

    [Image: Google Cloud Spark cluster]
  3. Set the instance type of the master node and of at least 2 worker nodes.
    [Image: Google Cloud Dataproc Apache Spark setup]
  4. Click on "Preemptible workers, bucket, network, version, initialisation & access options" and configure the number of preemptible nodes. Their instance type will be the same as that of the normal worker nodes. Optionally, enable "Allow API access to all Google Cloud services in the same project" if the cluster needs to use other cloud resources.

    [Image: Google Cloud Apache Spark preemptible node]
  5. The cluster will be ready in a few minutes, and the preemptible workers will be shown in the cluster details.

    [Image: Google Cloud Apache Spark preemptible nodes]
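
For scripted or repeatable setups, the same cluster can also be created programmatically. Below is a minimal sketch using the google-cloud-dataproc Python client; the project ID, region, cluster name and machine types are placeholder values, and it relies on the fact that Dataproc secondary workers are preemptible by default.

    # Minimal sketch: create a Dataproc cluster with preemptible secondary
    # workers using the google-cloud-dataproc Python client.
    # Project ID, region, names and machine types below are placeholders.
    from google.cloud import dataproc_v1

    project_id = "my-project"  # placeholder project
    region = "europe-west1"    # placeholder region

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "spark-preemptible",
        "config": {
            "master_config": {"num_instances": 1,
                              "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2,
                              "machine_type_uri": "n1-standard-4"},
            # Secondary workers are preemptible by default in Dataproc.
            "secondary_worker_config": {"num_instances": 4},
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")

When the job is done, the cluster can be torn down with client.delete_cluster so that idle nodes stop generating costs.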

Conclusion

Because Apache Spark can automatically recover from node failures and Google Cloud Dataproc immediately replaces shut-down preemptible instances with new preemptible workers, this option is a valid choice to scale up clusters and greatly reduce cloud computing costs.