How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit-test a function that writes data to S3 and then reads that same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service responds that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it has not been resolved yet.

Has anyone ever successfully tested mocked S3 reads/writes in PySpark and can share some insights?

[1]

import pytest
import boto
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper (not shown) that silences noisy py4j logging
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)

    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')

    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    # this read is what fails with the 403 shown in [2]
    df = (spark
          .read
          .csv(s3_uri))

[2]

(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)

[3] https://github.com/spulec/moto/issues/1543



Solution 1:[1]

moto is a library used to mock AWS resources.
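
To see what that means in practice, here is a minimal, self-contained sketch (the bucket name demo-bucket is an arbitrary placeholder, not from the original answer): inside the mock, boto3 talks to moto's in-memory S3 backend, so no real AWS account is touched.

import boto3
from moto import mock_s3

@mock_s3
def demo():
    client = boto3.client('s3', region_name='us-east-1')
    client.create_bucket(Bucket='demo-bucket')  # exists only in moto's in-memory backend
    names = [b['Name'] for b in client.list_buckets()['Buckets']]
    assert names == ['demo-bucket']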

1. Create the resource:

If you try to access an S3 bucket that doesn't exist, AWS will return a Forbidden error.

Usually, we need these resources created even before our tests run, so create a pytest fixture with autouse set to True:

import pytest
import boto3
from moto import mock_s3

@pytest.fixture(autouse=True)
def fixture_mock_s3():
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='MYBUCKET') # an empty test bucket is created
        yield

  • The above code creates a mock S3 bucket named "MYBUCKET". The bucket is empty.
  • The name of the bucket should be the same as that of the original bucket.
  • With autouse, the fixture is automatically applied to every test.
  • You can run your tests with confidence, as they will never touch the original bucket (see the credentials sketch after this list for an extra safeguard).
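
As an extra safeguard, not part of the original answer but a pattern recommended in moto's own documentation, you can point boto3 at dummy credentials in another autouse fixture, so that a misconfigured test can never fall through to a real AWS account:

import os
import pytest

@pytest.fixture(autouse=True)
def aws_credentials():
    # Dummy credentials; the values are arbitrary placeholders
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'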

2. Define and run tests involving the resource:

Suppose you have code that writes a file to the S3 bucket:

import boto3

def write_to_s3(filepath: str):
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.Bucket('MYBUCKET').upload_file(filepath, 'A/B/C/P/data.txt')

This can be tested the following way:

import boto3
from botocore.exceptions import ClientError

def test_write_to_s3():
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"  # TEST_DIR is defined elsewhere in the test suite
    # The S3 bucket is created by the fixture and starts out empty;
    # test for emptiness first
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket("MYBUCKET")
    assert list(bucket.objects.all()) == []
    # Now, let's write a file to S3
    write_to_s3(dummy_file_path)
    # head_object raises ClientError if the key is absent, so this
    # assert passing means the upload worked
    s3_client = boto3.client('s3', region_name='us-east-1')
    assert s3_client.head_object(Bucket='MYBUCKET', Key='A/B/C/P/data.txt')
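
For completeness, here is a sketch of the negative case (the test name is hypothetical and not from the original answer; it relies on the autouse fixture above having created MYBUCKET). A HEAD request on a key that was never written raises ClientError with a 404 inside the mock:

import boto3
import pytest
from botocore.exceptions import ClientError

def test_head_of_missing_key_raises():
    # HEAD on an absent key fails; botocore reports the code as '404'
    client = boto3.client('s3', region_name='us-east-1')
    with pytest.raises(ClientError) as exc_info:
        client.head_object(Bucket='MYBUCKET', Key='does/not/exist.txt')
    assert exc_info.value.response['Error']['Code'] == '404'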

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 whitehat