PolarSPARC

Apache Spark 4.x Quick Notes :: Part - 5


Bhaskar S 12/12/2025


Overview

In a typical Enterprise, the most important asset (the data) is spread across various data stores. Nowadays, more and more of those important data asset(s) end up in an S3 Object Store.

In the article Hands-on with Garage, we introduced how one can set up and use an S3 Object Store on a single node.

In this part of the series, we will demonstrate how one can access and use the Iris dataset (in parquet format) from an S3 bucket. The Iris dataset (in parquet format) can be downloaded from HERE and stored in S3.


Hands-on PySpark using S3 (Local Mode)

Before proceeding, ensure Garage is set up and running. In addition, ensure that the Iris parquet file iris.parquet is stored in the S3 bucket named spark-datasets.
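
In case the Iris parquet file still needs to be uploaded, the following is a minimal sketch (assuming the boto3 package is installed, a local copy of iris.parquet in the current directory, and the endpoint/credentials from your own Garage setup) of how one could upload it to the spark-datasets bucket:


import boto3

# NOTE: replace the endpoint and the placeholder credentials with the values from your own Garage setup
s3_client = boto3.client('s3',
                         endpoint_url='http://192.168.1.25:3900',
                         aws_access_key_id='<GARAGE_ACCESS_KEY>',
                         aws_secret_access_key='<GARAGE_SECRET_KEY>',
                         region_name='garage')

# Upload the local iris.parquet file into the spark-datasets bucket
s3_client.upload_file('iris.parquet', 'spark-datasets', 'iris.parquet')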

To launch the pyspark shell, execute the following docker command in the terminal window:


docker run --rm --name pyspark-local --network host -u $(id -u $USER):$(id -g $USER) -v $HOME/spark/conf:/opt/spark/conf -v $HOME/spark/data:/opt/spark/data -it ps-spark:v4.0.1 /opt/spark/bin/pyspark --master local[1]


The following would be the typical output:


Output.1

Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/09 01:25:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting Spark log level to "INFO".
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://kailash:4040
Spark context available as 'sc' (master = local[1], app id = local-1765243559026).
SparkSession available as 'spark'.
>>>

To load the iris dataset from the S3 bucket and create a pyspark dataframe named iris_df, execute the following code snippet:


iris_df = spark.read.parquet('s3://spark-datasets/iris.parquet')

Executing the above Python code would generate the following typical output:


Output.2

25/12/13 00:09:02 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3://spark-datasets/iris.parquet.
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

OOPs - what happened here ???

Digging into the Hadoop AWS documentation HERE, we learn that we need to use the scheme s3a to access any S3 resource.

Once again, to load the iris dataset from the S3 bucket and create a pyspark dataframe named iris_df, execute the following code snippet:


iris_df = spark.read.parquet('s3a://spark-datasets/iris.parquet')

Executing the above Python code would generate the following typical output:


Output.3

25/12/13 00:09:59 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://spark-datasets/iris.parquet.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Hmm - looks like we are missing some jar(s) ???

Once again, from the Hadoop AWS documentation, we learn that we need the jar(s) hadoop-aws and the AWS V2 SDK bundle.

Before we proceed further, we need to determine the version of the hadoop client jar(s) in the docker base image.

To find the hadoop version, execute the following command in a terminal window:


docker run --rm --name temp -it spark:4.0.1-scala2.13-java21-python3-ubuntu ls /opt/spark/jars/ | grep hadoop-client

The following would be the typical output:


Output.4

hadoop-client-api-3.4.1.jar
hadoop-client-runtime-3.4.1.jar

From Output.4 above, it is clear that the desired Hadoop version is 3.4.1.

Also, searching the web, we find that the compatible AWS V2 SDK version is 2.29.52.


!!! ATTENTION !!!

If we do NOT pick the right version of Hadoop and the corresponding compatible version of AWS V2 SDK, we will encounter ClassNotFound issues !!!
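
As a quick sanity check, one can also confirm the Hadoop version from within a running pyspark shell. The following is a minimal sketch that queries the Hadoop VersionInfo class via the py4j JVM gateway (an internal, non-public API):


# Query the Hadoop version via the internal py4j JVM gateway
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())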

To add the missing jar(s), we will have to modify the dockerfile that we created in Part-1 with the following contents:


dockerfile
FROM spark:4.0.1-scala2.13-java21-python3-ubuntu

### Complete the necessary installation as root

USER root

ARG user=alice
ARG group=alice
ARG uid=1000
ARG gid=1000

RUN groupadd -g ${gid} ${group}
RUN useradd -u ${uid} -g ${gid} -M -s /sbin/nologin ${user}

RUN usermod -a -G spark ${user}
RUN usermod -a -G ${user} spark

### Setup S3 as root

ENV HADOOP_AWS_VERSION=3.4.1
ENV AWS_JAVA_SDK_VERSION=2.29.52

RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar -Lo /opt/spark/jars/hadoop-aws-${HADOOP_AWS_VERSION}.jar
RUN curl https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/${AWS_JAVA_SDK_VERSION}/bundle-${AWS_JAVA_SDK_VERSION}.jar -Lo /opt/spark/jars/bundle-${AWS_JAVA_SDK_VERSION}.jar

### Change user - has to be last

USER ${user}

To build our custom docker image tagged ps-spark-s3:v4.0.1, execute the following command in the terminal window:


docker build -t 'ps-spark-s3:v4.0.1' .


The following would be the typical trimmed output:


Output.5

[+] Building 8.8s (11/11) FINISHED                                                                                   docker:default
 => [internal] load build definition from dockerfile.2                                                                         0.0s
 => => transferring dockerfile: 922B                                                                                           0.0s
 => [internal] load metadata for docker.io/library/spark:4.0.1-scala2.13-java21-python3-ubuntu                                 0.0s
 => [internal] load .dockerignore                                                                                              0.0s
 => => transferring context: 2B                                                                                                0.0s
 => [1/7] FROM docker.io/library/spark:4.0.1-scala2.13-java21-python3-ubuntu                                                   0.0s
 => CACHED [2/7] RUN groupadd -g 1000 alice                                                                                    0.0s
 => CACHED [3/7] RUN useradd -u 1000 -g 1000 -M -s /sbin/nologin alice                                                         0.0s
 => CACHED [4/7] RUN usermod -a -G spark alice                                                                                 0.0s
 => CACHED [5/7] RUN usermod -a -G alice spark                                                                                 0.0s
 => [6/7] RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.4.1/hadoop-aws-3.4.1.jar -Lo /opt/spark/jars  0.3s
 => [7/7] RUN curl https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.29.52/bundle-2.29.52.jar -Lo /opt/spark/jar  7.5s
 => exporting to image                                                                                                         1.0s 
 => => exporting layers                                                                                                        1.0s 
 => => writing image sha256:cfff0540f92a56fba3f37da87629ac5d7f0fa241383b472b9a817a3cc46e58e5                                   0.0s 
 => => naming to docker.io/library/ps-spark-s3:v4.0.1                                                                          0.0s

Once again, to launch the pyspark shell using the new docker image, execute the following docker command in the terminal window:


docker run --rm --name pyspark-local --network host -u $(id -u $USER):$(id -g $USER) -v $HOME/spark/conf:/opt/spark/conf -v $HOME/spark/data:/opt/spark/data -it ps-spark-s3:v4.0.1 /opt/spark/bin/pyspark --master local[1]


The output will be similar to that of Output.1 from above.

To load the iris dataset from the S3 bucket and create a pyspark dataframe named iris_df, execute the following code snippet:


iris_df = spark.read.parquet('s3a://spark-datasets/iris.parquet')

Executing the above Python code would generate the following typical output:


Output.6

25/12/13 00:21:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://spark-datasets/iris.parquet.
java.nio.file.AccessDeniedException: s3a://spark-datasets/iris.parquet: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId).

Aargh - what happened here ???

We need to provide Spark with the S3 endpoint and credential details of our running Garage instance.

To add the desired configuration parameters, we will have to modify the spark-defaults.conf file that we created in Part-2 with the following contents:


spark-defaults.conf
# This is useful for setting default environmental settings.

spark.log.level                                 INFO

# Enable S3 via s3a://..

spark.hadoop.fs.s3a.impl                        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key                  GK0cda775ac76876cd4317d737
spark.hadoop.fs.s3a.secret.key                  ae9c072c1edbb4b9018dc7c3f5fa63de1e9bd1526c26292cfb6a2b86f66f3e1a
spark.hadoop.fs.s3a.endpoint                    http://192.168.1.25:3900
spark.hadoop.fs.s3a.endpoint.region             garage
spark.hadoop.fs.s3a.ssl.enabled                 false
spark.hadoop.fs.s3a.path.style.access           true
spark.hadoop.fs.s3a.aws.credentials.provider    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider


!!! ATTENTION !!!

If we do NOT set the AWS Region parameter spark.hadoop.fs.s3a.endpoint.region, we will encounter connection issues !!!
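
As an alternative to editing spark-defaults.conf, the same fs.s3a options could be set programmatically when building the SparkSession in a standalone pyspark script. The following is a minimal sketch using placeholder credentials; note that in the interactive pyspark shell a SparkSession already exists, so the spark-defaults.conf route is the simpler one there:


from pyspark.sql import SparkSession

# NOTE: replace the endpoint and the placeholder credentials with the values from your own Garage setup
spark = (SparkSession.builder
         .appName('iris-s3')
         .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
         .config('spark.hadoop.fs.s3a.access.key', '<GARAGE_ACCESS_KEY>')
         .config('spark.hadoop.fs.s3a.secret.key', '<GARAGE_SECRET_KEY>')
         .config('spark.hadoop.fs.s3a.endpoint', 'http://192.168.1.25:3900')
         .config('spark.hadoop.fs.s3a.endpoint.region', 'garage')
         .config('spark.hadoop.fs.s3a.ssl.enabled', 'false')
         .config('spark.hadoop.fs.s3a.path.style.access', 'true')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
         .getOrCreate())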

Once again, to load the iris dataset from the S3 bucket and create a pyspark dataframe named iris_df, execute the following code snippet:


iris_df = spark.read.parquet('s3a://spark-datasets/iris.parquet')

Executing the above Python code generates no output.
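
Since the read succeeds silently, one could optionally inspect the schema of the dataframe to confirm that the parquet file was loaded correctly:


# Display the schema inferred from the parquet file
iris_df.printSchema()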

To check the count of rows in the iris_df dataframe, execute the following code snippet:


iris_df.count()

Executing the above Python code generates the following typical output:


Output.7

150

To display the top 10 rows of the iris_df dataframe, execute the following code snippet:


iris_df.show(10)

Executing the above Python code generates the following typical output:


Output.8

+------------+-----------+------------+-----------+-------+
|sepal.length|sepal.width|petal.length|petal.width|variety|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| Setosa|
|         4.9|        3.0|         1.4|        0.2| Setosa|
|         4.7|        3.2|         1.3|        0.2| Setosa|
|         4.6|        3.1|         1.5|        0.2| Setosa|
|         5.0|        3.6|         1.4|        0.2| Setosa|
|         5.4|        3.9|         1.7|        0.4| Setosa|
|         4.6|        3.4|         1.4|        0.3| Setosa|
|         5.0|        3.4|         1.5|        0.2| Setosa|
|         4.4|        2.9|         1.4|        0.2| Setosa|
|         4.9|        3.1|         1.5|        0.1| Setosa|
+------------+-----------+------------+-----------+-------+
only showing top 10 rows

BAM - we are in business now as everything seems to be working as expected !!!
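
One small caveat worth noting: the column names in this dataset contain dots (e.g. sepal.length), which Spark normally interprets as nested field access. To reference such columns by name, wrap them in backticks. The following is a minimal sketch that computes the average petal.length per variety:


from pyspark.sql import functions as F

# Columns with dots in their names must be escaped with backticks
iris_df.groupBy('variety') \
       .agg(F.avg(F.col('`petal.length`')).alias('avg_petal_length')) \
       .show()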

With this, we conclude the demonstration of how one can leverage pyspark to access dataset(s) in S3 !!!


References

Hands-on with Garage

Hadoop AWS Module

Apache Spark 4.x Quick Notes :: Part - 4

Apache Spark 4.x Quick Notes :: Part - 3

Apache Spark 4.x Quick Notes :: Part - 2

Apache Spark 4.x Quick Notes :: Part - 1



© PolarSPARC