Apache Spark 3 playground — Mock S3! Pt. 2

In the first part of our Apache Spark 3 playground guide, we covered setting up a Spark local environment that allowed you to start experimenting, and even run through the basic “Getting Started” tutorial.

If you haven’t done Part 1 yet, I suggest you start there as I’ll assume you’ve already gone through it and are familiar with the stages leading up to this guide, and are ready to dive in.


First, let’s talk a bit about what S3 is and why we’re interested in mocking it in the first place.

What’s the ‘S’ about?

From the AWS S3 documentation:

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

There are a few important takeaways here. Let’s explain:

  • Object Storage Service — S3 is often treated as a remote file system, but that is just a testament to its flexibility. It’s actually an object store that allows users to store files (objects) for later use or retrieval.
  • Scalability and performance — S3 is great at serving fluctuating demand, scaling from zero up to many thousands of requests per second for a single user, without any preparation on your part.
  • Data Availability — according to the S3 site, S3 is designed for eleven 9’s of data durability by replicating your stored data across multiple systems, so it’s less likely to be lost to system failure or human error.
  • Security — objects on S3 can be made publicly available, but S3 also has many features in place to ensure that only authorized users can read or write data, and it can even encrypt the data stored on it.

So S3 is pretty useful. At Dynamic Yield, we use it as a one-stop-shop to store all our data, from the moment we ingest it to the moment we’re done processing and extracting insights from it.

We also use it for various other uses, even hosting our internal websites.

Why do we want to mock S3?

When developing locally rather than in a cloud environment like AWS, we are, by definition, working with different tools than what is available in the cloud.

When developing new applications we want to ensure that our local setup is as close as possible to the cloud environment, or the “real world”.
We also want the development cycle to be as short as possible. Imagine you’re working on a new piece of code and want to test it.

If you want a fast and efficient development experience, you need short and effective iterations that allow you to quickly and easily write new code and test it.
Otherwise, you’ll probably find yourself doing something like:

  1. Compile locally
  2. Open a pull request to merge it into a staging/dev branch
  3. Wait for the code to be reviewed
  4. Merge once approved
  5. Wait for it to be deployed to the relevant environment
  6. Find out you forgot to update some random variable, update it and return to step 1

Closely simulating the needed cloud services locally allows quick and simple development iterations.

Mocking lets us closely follow the behavior of the real S3 without needing the real deal, which makes developing, testing, and debugging code much more effective.


Why MinIO?

There are two main contenders in the S3 mocking ring:

LocalStack

LocalStack provides an easy-to-use test/mocking framework for developing Cloud applications.

LocalStack is super useful: it aims to mock as many AWS services as it can, as closely as possible, and currently supports over 20 of them.
The problem is that, being a Swiss army knife, it can be slow to start up and less approachable for beginners, since it has many more options and features than simpler alternatives.
Try running this in your terminal and see how long it takes to spin up:

docker run -it --rm -p 4566:4566 -p 4571:4571 localstack/localstack

On my machine, this took about 2 minutes and the image takes up 658MB.
Localstack is an amazing project, but it’s a sledgehammer and more than we need for now.

MinIO

MinIO is a High Performance Object Storage released under Apache License v2.0. It is API compatible with Amazon S3 cloud storage service.

MinIO provides a much simpler solution, as it tries to do only one thing and do it well.
It only provides object storage, and is therefore a much more lightweight option.
Try running this in your terminal to compare:

docker run -p 9000:9000 minio/minio server /data

On my machine, this took about 25 seconds and the image takes up 184MB.
So, in this case, I preferred a simpler and faster solution for testing.

Setting up MinIO

Up to now, we’ve talked a little about:

  • What S3 is
  • Why we want to mock it
  • What the popular options for that are and why we chose MinIO

You should still have the spark-local-env git repository; if not, see Part 1 for setting it up.
Once cloned, run the following in the terminal from the repository location:

docker-compose run minio-setup no-data

Open localhost:9000 in another browser tab and you should see this:

MinIO login page

Great! You now have a local MinIO running. Log in using the following credentials (they’re set in the docker/.env file, in case you’re curious):

  • User: abc
  • Password: xyzxyzxyz

You will be met with this:

This is the MinIO UI. It looks pretty empty, so let’s fill it up.

  • Go to the “openFlights” GitHub repository
  • Download a few files
  • Click the red plus sign and create a bucket named word-count
  • Create directories by clicking the (+) directory icon
  • Upload files one at a time

When you’re done it should look something like this:

Amazing! We now have a local “cloud” object-store with data ready to experiment with.
But take a moment to imagine doing that every time you wanted to run a test or experiment with something new. It can quickly become tiring.

So what we’d want to do is automate this. We’ll do that in the next section.
But first, run the following to stop the MinIO container and reset it so it won’t hold on to any of the files you’ve uploaded.

docker-compose down -v --rmi local

You can restart the container to confirm the files are now gone.

Automate all the things!

Since we want an easy way to get up and running, let’s automate the entire last section so we can run a single command and be ready to rock.

Try running the following and opening the MinIO UI once again:

docker-compose run minio-setup

Presto! You should see a bucket with directories and files all set up and ready to go!
Let’s explain how:

This is part of the docker-compose.yml we worked with in Part 1.
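For reference, the two relevant services look roughly like the sketch below. The real file lives in the spark-local-env repository (the line numbers mentioned next refer to it), so the minio/mc image, the entrypoint, and the exact layout here are illustrative assumptions:

services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
    env_file:
      - docker/.env            # MinIO credentials (abc / xyzxyzxyz)
    command: server /data

  minio-setup:
    image: minio/mc            # assumption: an image that ships the mc CLI
    depends_on:
      - minio                  # minio-setup won't start before minio does
    env_file:
      - docker/.env
    volumes:
      - ./setup.sh:/setup.sh   # the setup script is mounted into the container
    entrypoint: ["/bin/sh", "/setup.sh"]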
When you ran docker-compose run minio-setup the following things happened:

  • The minio-setup container wants to start, but it depends (lines 14–15) on the minio service
  • docker-compose launched minio first, since the minio-setup container depends on it to work
  • After minio started, minio-setup ran the ./setup.sh script that was mounted into it (lines 19–22).
  • Running ./setup.sh created the needed credentials, created a bucket, downloaded the files, and uploaded them to the minio container.

Let’s take a closer look at the ./setup.sh script. We’re using mc, the MinIO CLI tool, to automate everything we did earlier.
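The script itself ships with the repository; a minimal sketch of what it does with mc could look like this (the environment variable names, the openFlights URL, and the file layout are illustrative, and it assumes curl is available in the container):

#!/bin/sh
# Point mc at the MinIO container using the credentials from docker/.env
mc alias set local http://minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"

# Skip the data setup when the script is called with the no-data argument
[ "$1" = "no-data" ] && exit 0

# Create the bucket we created by hand earlier
mc mb --ignore-existing local/word-count

# Download a sample openFlights file and upload it into the bucket
curl -sL -o /tmp/airports.dat \
  https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
mc cp /tmp/airports.dat local/word-count/data/airports.dat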

Using methods like this, which let you quickly set up test environments according to your preferences and requirements, can greatly improve your speed and development experience.
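As a small taste of where this is heading: once the data is in MinIO, pointing Spark at it is mostly a matter of a few s3a settings. A minimal sketch, assuming you have spark-shell installed locally and the hadoop-aws (plus matching AWS SDK) jars on the classpath; we’ll do this properly with Zeppelin in the next part:

# Point the s3a connector at the local MinIO instead of the real S3
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:9000 \
  --conf spark.hadoop.fs.s3a.access.key=abc \
  --conf spark.hadoop.fs.s3a.secret.key=xyzxyzxyz \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false

Inside the shell, s3a://word-count/ paths will then resolve against your local MinIO rather than the real S3.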

Summary

In this part we discussed:

  • Object storage — S3 in particular
  • Why we should mock it for local development
  • How to automate setting up your environment

You now have all the basic building blocks to experiment with Spark locally while closely simulating an AWS cloud environment.

Next Steps

Try experimenting with mc and see how to interact with MinIO.
Read the docs to better understand what we did in the ./setup.sh script.

Since MinIO is S3 compatible, you can also experiment with the AWS CLI tool and see that it works with your local MinIO; these MinIO examples show how.
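For instance, the AWS CLI lets you override the endpoint, so something along these lines (using the credentials from docker/.env) should list your local buckets:

# Use the local MinIO credentials from docker/.env
export AWS_ACCESS_KEY_ID=abc
export AWS_SECRET_ACCESS_KEY=xyzxyzxyz

# Point the CLI at the local MinIO endpoint instead of the real S3
aws --endpoint-url http://localhost:9000 s3 ls
aws --endpoint-url http://localhost:9000 s3 ls s3://word-count/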

In the next part, we will show how you can use an Apache Zeppelin notebook to experiment with Spark and do the infamous “Word-Count” exercise.
