Beginner's guide to Apache Spark, a lightning-fast unified analytics engine

Overview

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It was originally developed at the AMPLab at UC Berkeley in 2009 and is fully open source under the Apache Software Foundation.

This blog points you to some good resources that will make you well-versed in Apache Spark.

Prerequisites

  • SQL and data background
    With a data background, you will already know how to do transformations, joins, and so on. You just need to perform the same operations in a distributed way using the Spark APIs in your language of choice.
  • Any programming knowledge
    Even though you can write SQL queries to perform data transformations, this is not the most common approach in industry. Familiarity with a programming language such as Scala, Python, Java, or R will give you an edge. If you don’t have any programming background, I would recommend learning Python.
  • Spark Environment
    You need a Spark cluster to perform Spark operations. There are several ways to get one, such as:
    • Installing Spark on your local machine
    • Using a managed service from a cloud provider (e.g., Amazon EMR)
    • Using Databricks Community Edition

    I would recommend Databricks Community Edition, which is free of charge and gives you the additional benefits of the cloud and the Databricks unified platform.

Video resources

Watch the playlist below in order.

Do hands-on

Perform the exercises discussed in the video sessions in your Spark environment.

Books

Every Spark developer should have at least one of the books below. You will enjoy reading them even more if you do so after watching the videos.

  • Spark: The Definitive Guide – Big Data Processing Made Simple
  • Learning Spark: Lightning-Fast Data Analytics

More hands-on…

Try out all the code samples given in the books.

Familiarize yourself with Spark Documentation

When you use a specific Spark transformation or action, dig into its details in the Spark documentation.

And that’s a wrap! You have survived the first waves of learning Apache Spark.
More to follow. Good luck!


If you found this blog post helpful or informative, please consider sharing it with your friends and followers on social media. Thank you for your support!