In this third and final part of our “Apache Spark 3 playground” guide, we will be adding a Zeppelin notebook into the mix.
Zeppelin describes itself as a “web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.”
This gives us a sandbox environment where we can experiment and learn without needing to manage things like building, packaging, or deploying our code.
You can save these environments and share them with other developers, engineers, and analysts.
This guide will explain:
- Setting up a Zeppelin notebook in a local environment
- Using Spark as a backend to explore data
- General guidance for solving an introductory data question (word count)
Zeppelin vs Jupyter?
Jupyter is usually the go-to notebook solution for data engines; it has amazing community support and is widely used.
I found Zeppelin easier and more intuitive to get up and running, so I went with it, but both are great applications to know.
It would be interesting to recreate this guide using Jupyter for the sake of comparison, to decide which flavor you prefer.
The first two parts of this short series covered:
- Setting up a local Spark environment
- Adding an S3 mock and automating its creation
If you’re ready to go, let's start with spinning up our playground.
If you haven’t yet, clone our Playground by running
Since we’re big on automation, run the following in the repository directory:
docker-compose -f docker-compose.yml -f docker-compose-zeppelin.yml up
We’re now launching the following:
- The Spark cluster, set up in Part 1 — Available at localhost:8080
- The MinIO S3 Mock, prepared in Part 2 — Available at localhost:9000
- The Zeppelin Notebook, which this guide introduces — Available at localhost:9090
This may take a few moments if you’re running it for the first time; the upside is that the single command above is all you need to run before you’re ready to go.
It provides an easy way to get a local “notebook” testing environment up and running, and it can also be used to expose data and tools to data engineers and analysts.
If you visit localhost:9090, you should be met with the Zeppelin UI.
You now have a local Spark cluster up and running with a Zeppelin notebook you can use to run Spark commands.
I’ve prepared a small tutorial notebook in the airline-notebooks directory to start testing out Spark.
We’ll walk step by step through solving an airline-count exercise using the flight information we loaded into our mock S3 in the previous part.
A Zeppelin notebook is composed of paragraphs, which is where we write code. Each paragraph has an input section for writing code and an output section for showing results.
The code in each paragraph is executed by an “interpreter”, and the interpreter used is decided by the first word in the paragraph.
This means that if we want to execute Spark code, we’ll need to put %spark as the first word of the paragraph so that Zeppelin knows which interpreter to use.
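As a minimal sketch of what such a paragraph looks like (this assumes the playground’s Spark interpreter is up; any trivial expression would do):

```scala
%spark
// The "%spark" on the first line tells Zeppelin to hand this paragraph to the
// Spark interpreter; everything after it is ordinary Scala with the `spark`
// session and `sc` context already available.
println(spark.version)
```

Running it should simply print the Spark version in the paragraph’s output section.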
Spark has a ton of configurations available for engineers to allow tuning and different work modes.
Running the following paragraph will set the Spark configurations needed for this guide.
See the comments in the snippet for explanations; if you’re following along with an open notebook, click run before proceeding.
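The configuration paragraph looks roughly like this. The endpoint and credentials below are assumptions taken from the MinIO values used elsewhere in this guide; depending on your Docker networking, the endpoint may need to be the compose service name rather than localhost:

```scala
%spark
// Point the S3A connector at our MinIO mock instead of real S3.
// NOTE: endpoint and credentials are assumptions matching Part 2 of this guide;
// inside the Docker network you may need the MinIO service name instead of localhost.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://localhost:9000")
sc.hadoopConfiguration.set("fs.s3a.access.key", "abc")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xyzxyzxyz")
// MinIO serves buckets at the path level rather than as subdomains.
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
```

These are standard Hadoop S3A properties; nothing here is specific to Zeppelin.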
Now that we have set the needed configuration, we can do a simple sanity test to see that everything is “wired up” correctly.
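A sanity-check paragraph along these lines reads one of the example files shipped with Spark. The exact path is an assumption; adjust it to wherever users.parquet lives in your image:

```scala
%spark
// Sanity check: read an example file shipped with Spark and display it.
// The path below is an assumption; adjust to your Spark installation.
val users = spark.read.parquet("/opt/spark/examples/src/main/resources/users.parquet")
users.printSchema()  // the column names and types Spark read from the file
users.show()         // the file's contents
```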
A few things are happening now:
- The Zeppelin notebook has launched a Spark Driver that will run the paragraph above on our local Spark cluster.
If you open localhost:8080 you will see that it’s connected and running.
- The Spark Driver, available at localhost:4040, has received 1 executor to run our “mini-application”.
The tabs at the top hold more detailed information regarding the job.
- Once our job completes you will see the results in the output section of the paragraph.
We have read the users.parquet file and printed its schema and contents.
Reading our Airlines data
The next paragraph will read our airlines.csv file from our mock S3, MinIO, and create a Spark DataFrame.
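A sketch of that paragraph (the bucket and key below are assumptions; use the path your airlines.csv was uploaded to in Part 2):

```scala
%spark
// Read the raw CSV from the MinIO bucket we populated in Part 2.
// The s3a:// path is an assumption; match it to your bucket layout.
val airlinesDF = spark.read.csv("s3a://airline-data/airlines.csv")
airlinesDF.printSchema()  // columns get generic names: _c0, _c1, ...
airlinesDF.show(5)
```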
The CSV we’ve loaded has no header row, so we can see that Spark has assigned the columns generic names (_c0, _c1, and so on), but since we have access to the OpenFlights documentation we can improve on this.
Let’s explain what a schema and Dataframe are in the next sections.
What’s in a schema?
A schema is the description of the structure of your data; it defines how Spark treats each column of the CSV.
It’s what defines if we treat a column containing a `123` as a string, integer, or double.
Reading data this way creates a DataFrame: a collection of rows whose columns follow the schema we defined.
Let’s define a schema in the next %spark paragraph and read our flight data with it.
Since we know a bit about our data, we can create a schema with named columns by defining a StructType that describes the fields, and then passing the airlineSchema to the read function.
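A sketch of the schema paragraph; the field names are taken from the OpenFlights airline documentation, and the s3a path is an assumption:

```scala
%spark
import org.apache.spark.sql.types._

// Schema for the OpenFlights airlines file; field names follow the
// OpenFlights documentation.
val airlineSchema = StructType(Seq(
  StructField("airlineId", IntegerType, nullable = true),
  StructField("name",      StringType,  nullable = true),
  StructField("alias",     StringType,  nullable = true),
  StructField("iata",      StringType,  nullable = true),
  StructField("icao",      StringType,  nullable = true),
  StructField("callsign",  StringType,  nullable = true),
  StructField("country",   StringType,  nullable = true),
  StructField("active",    StringType,  nullable = true)
))

// Pass the schema to the reader instead of letting Spark generate _c0, _c1, ...
val airlinesDF = spark.read
  .schema(airlineSchema)
  .csv("s3a://airline-data/airlines.csv")  // path is an assumption; match your bucket

airlinesDF.show(5)
```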
After running this we can see that the output has the data with the column names we defined.
DataFrames vs. DataSets
Spark has a few ways to represent data; we introduced the DataFrame in the previous section, and now we’ll bring in DataSets.
A Spark DataSet is a strongly typed collection of rows, which means we can work with each row as if it were a Scala object.
Using a DataSet ensures that all the columns are named and have the correct types.
Let’s define an AirlineDS case class in Scala and use it as our schema to load our data.
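One way to sketch this paragraph: the case class doubles as the schema, and its fields assume the same OpenFlights layout as before (the s3a path remains an assumption):

```scala
%spark
import org.apache.spark.sql.Encoders

// The case class gives us typed fields instead of untyped Row columns.
// Field names and types assume the OpenFlights airline layout.
case class AirlineDS(
  airlineId: Option[Int],
  name:      String,
  alias:     String,
  iata:      String,
  icao:      String,
  callsign:  String,
  country:   String,
  active:    String
)

val airlineDS = spark.read
  .schema(Encoders.product[AirlineDS].schema)  // reuse the case class as the schema
  .csv("s3a://airline-data/airlines.csv")      // path is an assumption
  .as[AirlineDS]

airlineDS.show(5)
```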
After running this we can see we receive a similar result as we did when using a schema to define our column names and types.
Count all the things!
A common problem we usually need to solve is aggregation: how many times does something appear, and in which groups?
In this case, we may be interested in how many airlines are based in the United States, and how many in Russia.
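Before running this on Spark, it can help to see the shape of the computation in plain Scala collections. The tiny sample below is made up purely for illustration:

```scala
// Group-and-count in plain Scala, on a made-up sample, to show the shape
// of the aggregation before we run the real thing on Spark.
val airlines = Seq(
  ("Delta Air Lines", "United States"),
  ("United Airlines", "United States"),
  ("Aeromexico",      "Mexico"),
  ("Aeroflot",        "Russia")
)

// Group by country, count each group, and sort with the largest first.
val countsByCountry: Seq[(String, Int)] = airlines
  .groupBy { case (_, country) => country }
  .map     { case (country, rows) => country -> rows.size }
  .toSeq
  .sortBy  { case (_, n) => -n }

println(countsByCountry.head)  // → (United States,2)
```

Spark’s groupBy/count does the same thing, just distributed across the cluster.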
Since we’ve created the airlineDS DataSet, let’s count which countries have the most active airlines.
Please notice the comments explaining, line by line, how we did so.
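The counting paragraph might look like this (assuming the airlineDS from the previous section, with a country column named per the OpenFlights docs):

```scala
%spark
airlineDS
  .groupBy($"country")     // bucket the rows by their country column
  .count()                 // count the rows in each bucket
  .orderBy($"count".desc)  // countries with the most airlines first
  .show(5)                 // print the top five
```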
We’re greeted by the results:
+--------------+-----+
|       country|count|
+--------------+-----+
| United States| 1099|
|        Mexico|  440|
|United Kingdom|  414|
|        Canada|  323|
|        Russia|  238|
+--------------+-----+
Great! You’ve solved your first aggregation problem.
Persist it back
Usually, we run a Spark job so we can save the results somewhere; we may be updating a database, creating a report, or producing anything else we want to refer back to later.
So the last thing we’ll do now is to save our results back to MinIO by running:
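A sketch of the write-back paragraph; the bucket name and output prefix below are assumptions, so match them to the bucket you created in MinIO:

```scala
%spark
// Persist the aggregated counts back to our MinIO mock as CSV.
// The "output" bucket is an assumption; create it in MinIO first if needed.
airlineDS
  .groupBy($"country")
  .count()
  .orderBy($"count".desc)
  .write
  .mode("overwrite")               // re-running the paragraph replaces old results
  .csv("s3a://output/airline-counts")
```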
If you open the MinIO UI at localhost:9000 (user: abc, password: xyzxyzxyz) and browse to the output bucket you should see our results saved there.
In this guide we’ve shown how to:
- Load data from S3
- Represent it in different ways (DataFrames vs. DataSets)
- Solve a basic aggregation and count problem
- Save the results back to MinIO
- Do all of this in a Zeppelin notebook environment with no local installations
Where do you go from here?
Zeppelin provides additional notebooks and interpreters to experiment with.
I think it’s a good way to get a “taste” of how to work with the different data engines and an easy way to get started quickly.
You can also try going through the different example notebooks in Zeppelin and explore the different interpreter options.
This was the final part of our Apache Spark Playground; I tried to provide a brief overview of what you need to get up and running quickly to start learning.
There’s a great deal more to learn about Spark and big data (it’s a deep lake), and in the future I plan to write about whatever I find interesting or deserving of a guide.
Thanks for making it this far!