In the first part of our Apache Spark 3 playground guide, we covered setting up a Spark local environment that allowed you to start experimenting, and even run through the basic “Getting Started” tutorial.
If you haven’t done Part 1 yet, I suggest you start there; from here on, I’ll assume you’ve already gone through it, are familiar with the steps leading up to this guide, and are ready to dive in.
First, let’s talk a bit about what S3 is and why we’re interested in mocking it in the first place.
What’s the ‘S’ about?
From the AWS S3 documentation:
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
There are a few important takeaways here. Let’s explain:
- Object Storage Service — S3 is often treated as a remote file system, but that’s just a testament to its flexibility. It’s actually an object store that lets users store files (objects) under keys for later use or retrieval (see the quick example after this list).
- Scalability and performance — S3 handles fluctuating demand gracefully, scaling from many thousands of requests per second from a single user all the way down to zero, without any preparation on your part.
- Data Availability — According to the S3 site, S3 is designed for eleven 9’s (99.999999999%) of data durability: your stored data is replicated across multiple systems, so it’s far less likely to be lost to system failure or human error.
- Security — Objects on S3 can be made publicly available, but S3 also has many features in place to ensure that only authorized users can read or write data, and it can even encrypt data at rest.
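To make the object-store idea concrete, storing and retrieving an object with the AWS CLI looks roughly like this (the bucket and key names here are made up for illustration):
# Upload a local file as an object under a key, then fetch it back
aws s3 cp report.csv s3://my-example-bucket/reports/2020/report.csv
aws s3 cp s3://my-example-bucket/reports/2020/report.csv ./report-copy.csv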
We also use it for various other purposes, such as hosting our internal websites.
Why do we want to mock S3?
When developing locally for a cloud environment like AWS, we, by definition, work with different tools than those available in the cloud.
When developing new applications we want to ensure that our local setup is as close as possible to the cloud environment or the “real world”.
We also want the development cycle to be as short as possible. Imagine you’re working on a new piece of code and want to test it.
If you want a fast and efficient development experience, you need short and effective iterations that allow you to quickly and easily write new code and test it.
Otherwise, you’ll probably find yourself doing something like:
- Compile locally
- Open a pull request to merge it to a staging/dev development branch
- Wait for the code to be reviewed
- Merge once approved
- Wait for it to be deployed to the relevant environment
- Find out you forgot to update some random variable, update it and return to step 1
Closely simulating the needed cloud services locally allows for quick and simple development iterations.
Mocking will allow us to closely follow the behavior of the real S3, but without needing the real deal.
This makes developing, testing, and debugging code much more effective.
There are two main contenders in the S3 mocking ring:
LocalStack provides an easy-to-use test/mocking framework for developing Cloud applications.
LocalStack is super useful: it aims to mock as many AWS services as it can, as closely as possible, and currently supports over 20 of them.
The catch is that, being a Swiss army knife, it can be slow to start up and isn’t the simplest option for beginners, since it has many more options and features than smaller, more focused solutions.
Try running this in your terminal and see how long it takes to spin up:
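For example, spinning up the all-in-one LocalStack image with Docker looks roughly like this (port 4566 is LocalStack’s default edge port; your exact command may differ):
# Pull and start the all-in-one LocalStack image, timing how long it takes
time docker run --rm -p 4566:4566 localstack/localstack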
On my machine, this took about 2 minutes and the image takes up 658MB.
Localstack is an amazing project, but it’s a sledgehammer and more than we need for now.
MinIO is a High Performance Object Storage released under Apache License v2.0. It is API compatible with Amazon S3 cloud storage service.
MinIO provides a much simpler solution as it tries to do only one thing, but do it well.
It only provides object storage and is therefore a much more lightweight solution.
Try running this in your terminal to compare:
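A comparable single-container MinIO run looks something like this (MinIO’s S3-compatible API listens on port 9000 by default):
# Start a standalone MinIO server backed by a directory inside the container
time docker run --rm -p 9000:9000 minio/minio server /data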
On my machine, this took about 25 seconds and the image takes up 184MB.
So, in this case, I preferred a simpler and faster solution for testing.
Setting up MinIO
Up to now, we’ve talked a little about:
- What S3 is
- Why we want to mock it
- What the popular options for that are and why we chose MinIO
Now let’s get hands-on. From the playground project you set up in Part 1, run the following to bring up MinIO without loading any data:
docker-compose run minio-setup no-data
Open localhost:9000 in another browser tab and you should see the MinIO login screen.
Great! You now have a local MinIO running. Log in using the following credentials (they’re set in the docker/.env file, in case you’re curious):
- User: abc
- Password: xyzxyzxyz
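For reference, the credentials likely come from a couple of lines like these in docker/.env (the variable names below are MinIO’s standard environment variables and are an assumption; only the values are taken from above):
# Assumed contents of docker/.env (variable names are MinIO's defaults, not verified against the playground)
MINIO_ACCESS_KEY=abc
MINIO_SECRET_KEY=xyzxyzxyz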
You will be met with the MinIO browser UI. It looks pretty empty, so let’s fill it up:
- Go to the “openFlights” Github repository
- Download a few files
- Click the red plus sign and create a bucket
- Create directories by clicking the (+) directory icon
- Upload files one at a time
When you’re done, the bucket should contain the directories and files you’ve just uploaded.
Amazing! We now have a local “cloud” object-store with data ready to experiment with.
But take a moment to imagine doing that every time you wanted to run a test or experiment with something new. It can quickly become tiring.
So what we’d want to do is automate this. We’ll do that in the next section.
But first, run the following to stop the MinIO container and reset it so it won’t hold on to any of the files you’ve uploaded.
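With docker-compose, something along these lines does the job (the -v flag also removes the associated volumes, assuming that’s where the playground keeps MinIO’s data):
# Stop the playground containers and remove their volumes so MinIO starts fresh next time
docker-compose down -v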
You can restart the container to confirm the files are now gone.
Automate all the things!
Since we want to have an easy way to get up and running, we want to automate the entire last section so we can run a command and be ready to rock.
Try running the following and opening the MinIO UI once again:
docker-compose run minio-setup
Presto! You should see a bucket with directories and files all set up and ready to go!
Let’s explain how it works:
- The minio-setup container wants to start, but it depends (lines 14-15) on the minio container, so docker-compose brings minio up first, since minio-setup needs it in order to work.
- Once minio is running, the minio-setup container runs the ./setup.sh script that was mounted to it (lines 19-22).
- ./setup.sh creates the needed credentials, creates a bucket, downloads the files, and uploads them to the bucket (a sketch of such a script follows).
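For the curious, a script in the spirit of ./setup.sh, built on the MinIO client (mc), might look something like the sketch below; the alias, bucket name, and sample file are illustrative assumptions, not the playground’s actual values:
# Hypothetical sketch of a setup script like ./setup.sh (names are illustrative)
mc alias set local http://minio:9000 abc xyzxyzxyz            # register the MinIO server with the client
mc mb local/flights-data                                      # create a bucket
wget -q https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
mc cp airports.dat local/flights-data/airports/airports.dat   # upload the file into the bucket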
Methods like this, which let you quickly set up test environments according to your preferences and requirements, can greatly improve your speed and overall development experience.
In this part we discussed:
- Object storage — S3 in particular
- Why we want to mock it for local development
- How to automate setting up your environment
You now have all the basic building blocks to experiment with Spark locally while closely simulating an AWS cloud environment.
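To give a taste of how the pieces connect, here is roughly how a Spark shell can be pointed at the local MinIO instead of the real S3 through the s3a connector (a sketch only; it assumes the hadoop-aws package is on the classpath, and the playground’s actual configuration may differ):
# Point Spark's s3a filesystem at the local MinIO using the credentials from above
spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:9000 \
  --conf spark.hadoop.fs.s3a.access.key=abc \
  --conf spark.hadoop.fs.s3a.secret.key=xyzxyzxyz \
  --conf spark.hadoop.fs.s3a.path.style.access=true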
In the next part, we will show how you can use an Apache Zeppelin notebook to experiment with Spark and do the infamous “Word-Count” exercise.