Apache Spark 3 playground — Back to school with Zeppelin notebooks! Pt.3

In this third and final part of our “Apache Spark 3 playground” guide we will be adding a Zeppelin notebook into the mix.

Apache Zeppelin — from Zeppelin assets

Apache Zeppelin, according to the Zeppelin home page, is a:

Web-based notebook that enables data-driven,
interactive data analytics and collaborative documents with SQL, Scala and more.

This gives us a sandbox environment where we can experiment and learn without needing to manage things like building, packaging, or deploying our code.

You can save these environments and share them with other developers, engineers, and analysts.

In this part we will cover:

  • Setting up a Zeppelin notebook in a local environment
  • Using Spark as a backend to explore data
  • General guidance for solving an introductory data question (word count)

Jupyter is usually the go-to notebook solution for data engineers; it has amazing community support and is widely used.

I found Zeppelin to be easier and more intuitive to get up and running and decided to go with it, but both are great applications to know.
It would be interesting to try to recreate this guide using Jupyter for the sake of comparison and deciding what flavor you prefer.

In the previous parts we covered:

  • Setting up a local Spark environment
  • Adding an S3 mock and automating its creation

The only step you need to complete from those parts is cloning the guide’s GitHub repository, which we did in Part 1.

If you’re ready to go, let’s start by spinning up our playground.

Launching Zeppelin

If you haven’t yet, clone our Playground by running

git clone https://github.com/omrisk/spark-local-env.git

Since we’re big on automation, run the following in the repository directory:

docker-compose -f docker-compose.yml -f docker-compose-zeppelin.yml up

We’re now launching the full playground: the Spark cluster (master and worker), our MinIO mock S3, and the Zeppelin notebook server.

This may take a few moments if you’re running it for the first time; the upside is that this single command is all you need to run before you’re ready to go.

It provides an easy way to get a local “notebook” testing environment up and running, and it can also be used to expose data and tools to data engineers and analysts.

If you visit localhost:9090, you should be met with the Zeppelin UI.

Zeppelin splash screen

You now have a local Spark cluster up and running, with a Zeppelin notebook you can use to run Spark commands.

I’ve prepared a small airline-notebooks directory containing a tutorial notebook to start testing out Spark.

We’ll walk step by step through solving an airline-count exercise using the flight information we loaded into our mock S3 in the previous part.

Notebook basics

A Zeppelin notebook is composed of paragraphs, which is where we write code. Each paragraph has an input section for writing code and an output section for showing the result.

The code in each paragraph is executed by an “interpreter”, and the interpreter used is decided by the first word in the paragraph.

So, for example, this paragraph is written in markdown, specified by %md, and when clicking the Run button, Zeppelin will launch the markdown interpreter to render the markdown output.

A Zeppelin markdown paragraph

This means that if we want to execute Spark code, we’ll need to use %spark as the first word in the paragraph so that Zeppelin will know what interpreter to use.
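
For instance, a minimal Spark paragraph might look like the sketch below (Zeppelin’s Spark interpreter pre-creates the spark session and sc context variables for us):

%spark
// The %spark directive tells Zeppelin to hand this paragraph to the Spark interpreter.
// Zeppelin pre-creates `spark` (a SparkSession) and `sc` (a SparkContext) for us.
println(s"Running Spark ${spark.version}")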

Spark Configuration

Spark exposes a ton of configuration options that let engineers tune it and switch between different work modes.

Running the following paragraph will set the Spark configuration needed for this guide.

See the comments in the snippet for explanations; if you’re following along with an open notebook, click Run before proceeding.
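
If you only have the text of this post, the paragraph amounts to something like the sketch below: it points Spark’s S3A filesystem at the MinIO container instead of real S3. The minio hostname and the abc / xyzxyzxyz credentials are assumptions based on the compose setup from the previous part, so adjust them if your values differ.

%spark
// Point the S3A filesystem at our MinIO container instead of AWS S3.
// Hostname and credentials are assumptions taken from the Part 2 compose setup.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://minio:9000")
sc.hadoopConfiguration.set("fs.s3a.access.key", "abc")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xyzxyzxyz")
// MinIO is served over plain HTTP and uses path-style bucket addressing
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")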

Sanity testing

Now that we have set the needed configuration, we can do a simple sanity test to see that everything is “wired up” correctly.

Let’s try running this paragraph to load a small sample parquet file that is part of the Apache Spark example files.

Reading a simple parquet file
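
In rough terms the paragraph does the following; the exact path is an assumption that SPARK_HOME is set inside the container and points at Spark’s bundled examples directory:

%spark
// Read a tiny sample parquet file that ships with the Spark distribution.
// The path assumes SPARK_HOME is set in the container; adjust it if needed.
val users = spark.read.parquet(
  s"${sys.env("SPARK_HOME")}/examples/src/main/resources/users.parquet")

// Print the inferred schema and the handful of rows it contains
users.printSchema()
users.show()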

A few things are happening now:

  • The Zeppelin notebook has launched a Spark Driver that will run the paragraph above on our local Spark cluster.
    If you open localhost:8080 you will see that it’s connected and running.
  • The Spark Driver, available at localhost:4040, has received 1 executor to run our “mini-application”.
    The tabs at the top hold more detailed information regarding the job.
  • Once our job completes you will see the results in the output section of the paragraph.
    We have read the users.parquet file and printed the schema and contents of it.

Reading our Airlines data

The next paragraph will read our airlines.csv file from our mock S3, MinIO, and create a Spark DataFrame.
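
Roughly speaking, it does something like the sketch below; the bucket name airlines-bucket is only a placeholder for whichever bucket the file landed in during the previous part:

%spark
// Read the raw CSV from MinIO through the s3a:// scheme we configured earlier.
// "airlines-bucket" is a placeholder; use the bucket created in Part 2.
val rawAirlines = spark.read.csv("s3a://airlines-bucket/airlines.csv")

rawAirlines.printSchema()
rawAirlines.show(5)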

The CSV we’ve loaded has no header row, so we can see that Spark has assigned the columns generic names (_c0, _c1, …), but since we have access to the openflights documentation we can improve on this.
Let’s look at what a schema and a DataFrame are in the next sections.

What’s in a schema?

A schema is the description of the structure of your data; it defines how Spark treats each column of the CSV.
It’s what determines whether a column containing `123` is treated as a string, an integer, or a double.

Reading data this way creates a DataFrame: a collection of rows whose columns follow the schema we defined.
Let’s define a schema in the next %spark paragraph and use it to read our flight data.

Since we know a bit about our data, we can create a schema with named columns by building a StructType, describing its fields, and then passing the airlineSchema to the read function.
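
A sketch of what that paragraph could look like is below; the field names are my reading of the openflights airlines documentation, and the bucket name is the same placeholder as before:

%spark
import org.apache.spark.sql.types._

// Named, typed columns based on the openflights airlines documentation
val airlineSchema = StructType(Seq(
  StructField("airlineId", IntegerType),
  StructField("name",      StringType),
  StructField("alias",     StringType),
  StructField("iata",      StringType),
  StructField("icao",      StringType),
  StructField("callsign",  StringType),
  StructField("country",   StringType),
  StructField("active",    StringType)
))

// Pass the schema to the reader instead of letting Spark fall back to _c0, _c1, ...
val airlinesDF = spark.read
  .schema(airlineSchema)
  .csv("s3a://airlines-bucket/airlines.csv")

airlinesDF.printSchema()
airlinesDF.show(5)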

After running this we can see that the output has the data with the column names we defined.

DataFrames vs. DataSets

Spark has a few ways to represent data; we introduced the DataFrame in the previous section, and now we’ll bring in DataSets.

A Spark DataSet is strongly typed: each row is mapped to a Scala object, so we can easily work with it as if it were regular Scala code.
Using a DataSet ensures that all the columns are named and have the correct types.

Let’s define an AirlineDS case class in Scala and use it as our schema to load our information.
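
As a sketch (again with the placeholder bucket, and with fields mirroring the schema above):

%spark
import spark.implicits._

// A strongly-typed representation of a single airline row
case class AirlineDS(
  airlineId: Option[Int],  // Option in case the id is missing for a row
  name: String,
  alias: String,
  iata: String,
  icao: String,
  callsign: String,
  country: String,
  active: String
)

// Read the CSV with the schema from before and map each row to an AirlineDS
val airlineDS = spark.read
  .schema(airlineSchema)
  .csv("s3a://airlines-bucket/airlines.csv")
  .as[AirlineDS]

airlineDS.show(5)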

After running this we can see we receive a similar result as we did when using a schema to define our column names and types.

Count all the things!

A common problem we usually need to solve is aggregation: how many times does something appear, and in what groups?
In this case, we may want to know how many airlines are based in the United States, or how many in Russia.
Since we’ve created the airlineDS dataset, let’s count which countries have the most active airlines.
Notice the comments explaining, line by line, how we do so in the sketch below.
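
%spark
// Group the dataset by country, count the airlines in each group,
// give the count column a friendlier name, and sort the biggest groups first.
val airlinesPerCountry = airlineDS
  .groupBy($"country")                         // one group per country
  .count()                                     // number of airlines in each group
  .withColumnRenamed("count", "airline-count") // nicer column name for the output
  .orderBy($"airline-count".desc)              // countries with the most airlines first

airlinesPerCountry.show()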

We’re greeted by the results:

+--------------+-------------+
|       country|airline-count|
+--------------+-------------+
| United States|         1099|
|        Mexico|          440|
|United Kingdom|          414|
|        Canada|          323|
|        Russia|          238|
...

Great! You’ve just solved your first aggregation problem.

Persist it back

Usually, we run a Spark job so we can save the results somewhere; we may be updating a database, creating a report, or producing anything else we want to refer back to later.

So the last thing we’ll do now is to save our results back to MinIO by running:
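
The write itself is roughly the following sketch; the output bucket matches the one we browse in the MinIO UI below, while the airline-counts path and the CSV format are my own choices for the example:

%spark
// Write the aggregated results back to MinIO.
// coalesce(1) squeezes everything into a single file so it's easy to inspect.
airlinesPerCountry
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("s3a://output/airline-counts")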

If you open the MinIO UI at localhost:9000 (user: abc, password: xyzxyzxyz) and browse to the output bucket you should see our results saved there.

That’s it!

In this guide we’ve shown how to:

  1. Load data from S3
  2. Represent it in different ways (DataFrames vs. DataSets)
  3. Solve a basic aggregation and count problem
  4. Save the results back to MinIO
  5. Do all of this in a Zeppelin notebook environment with no local installations

Zeppelin provides additional notebooks and interpreters to experiment with.
I think it’s a good way to get a “taste” of how to work with the different data engines and an easy way to get started quickly.
You can also try going through the different example notebooks in Zeppelin and explore the different interpreter options.

This was the final part of our Apache Spark Playground; I tried to provide a brief overview of what you need to get up and running quickly and start learning.
There’s a great deal more to learn about Spark and big data (it’s a deep lake), and in the future I plan to write about whatever I find interesting or deserving of a guide.

Thanks for making it this far!

