How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit-test a function that writes data to S3 and then reads that same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service responds that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it has not been resolved yet.

Has anyone ever successfully tested mocked S3 reads/writes in PySpark and can share some insights?

[1]

import pytest
import boto
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper (not shown) that silences noisy py4j logging
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)

    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')

    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    # this read is what fails with the 403 shown in [2]
    df = (spark
          .read
          .csv(s3_uri))

[2]

(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)

[3] https://github.com/spulec/moto/issues/1543



Solution 1:[1]

moto is a library used to mock AWS resources.
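
To see what that means in practice, here is a minimal, self-contained sketch (the bucket name demo-bucket is an arbitrary placeholder, not from the original answer): inside the mock, boto3 talks to moto's in-memory S3 backend, so no real AWS account is touched.

import boto3
from moto import mock_s3

@mock_s3
def demo():
    client = boto3.client('s3', region_name='us-east-1')
    client.create_bucket(Bucket='demo-bucket')  # exists only in moto's in-memory backend
    names = [b['Name'] for b in client.list_buckets()['Buckets']]
    assert names == ['demo-bucket']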

1. Create the resource:

If you try to access an S3 bucket that doesn't exist, AWS will return a Forbidden error.

Usually, we need these resources created even before our tests run, so create a pytest fixture with autouse set to True:

import pytest
import boto3
from moto import mock_s3

@pytest.fixture(autouse=True)
def fixture_mock_s3():
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='MYBUCKET') # an empty test bucket is created
        yield

  • The above code creates a mock S3 bucket named "MYBUCKET". The bucket is empty.
  • The name of the bucket should be the same as that of the original bucket.
  • With autouse, the fixture is automatically applied to every test.
  • You can run your tests with confidence, as they will never touch the original bucket (see the credentials sketch after this list for an extra safeguard).
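
As an extra safeguard, not part of the original answer but a pattern recommended in moto's own documentation, you can point boto3 at dummy credentials in another autouse fixture, so that a misconfigured test can never fall through to a real AWS account:

import os
import pytest

@pytest.fixture(autouse=True)
def aws_credentials():
    # Dummy credentials; the values are arbitrary placeholders
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'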

2. Define and run tests involving the resource:

Suppose you have code that writes a file to the S3 bucket:

import boto3

def write_to_s3(filepath: str):
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.Bucket('MYBUCKET').upload_file(filepath, 'A/B/C/P/data.txt')

This can be tested the following way:

import boto3
from botocore.exceptions import ClientError

def test_write_to_s3():
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"  # TEST_DIR is defined elsewhere in the test suite
    # The S3 bucket is created by the fixture and starts out empty;
    # test for emptiness first
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket("MYBUCKET")
    assert list(bucket.objects.all()) == []
    # Now, let's write a file to S3
    write_to_s3(dummy_file_path)
    # head_object raises ClientError if the key is absent, so this
    # assert passing means the upload worked
    s3_client = boto3.client('s3', region_name='us-east-1')
    assert s3_client.head_object(Bucket='MYBUCKET', Key='A/B/C/P/data.txt')
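
For completeness, here is a sketch of the negative case (the test name is hypothetical and not from the original answer; it relies on the autouse fixture above having created MYBUCKET). A HEAD request on a key that was never written raises ClientError with a 404 inside the mock:

import boto3
import pytest
from botocore.exceptions import ClientError

def test_head_of_missing_key_raises():
    # HEAD on an absent key fails; botocore reports the code as '404'
    client = boto3.client('s3', region_name='us-east-1')
    with pytest.raises(ClientError) as exc_info:
        client.head_object(Bucket='MYBUCKET', Key='does/not/exist.txt')
    assert exc_info.value.response['Error']['Code'] == '404'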

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 whitehat