AWS Glue 2.0, local PySpark development, testing confusion
I'm new to Glue jobs and want to use Glue 2.0 to run PySpark (Python 3) jobs that require the following Python libraries, as defined in my requirements.txt. I'm at a loss about where to start so that I can develop locally and then publish jobs successfully, rather than getting stuck in a long cycle of trial and error.
python-dateutil
pyyaml
jsonpath-ng
jinja2
dateparser
pymmh3 (pure python)
How should I be developing locally on my machine before even attempting to submit PySpark jobs? When I read the docs on AWS's site and the various AWS blog articles, I come away pretty confused.
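To make the question concrete, this is the sort of thing I want to be able to run and test locally before touching Glue at all: plain pyspark plus my pure-python dependencies. Everything in this sketch (function, column names, data) is made up for illustration:

```python
# Minimal local-dev sketch: plain pyspark + python-dateutil, no Glue pieces.
# All names and data here are illustrative, not from a real job.
from dateutil import parser as date_parser
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = (
    SparkSession.builder.master("local[2]").appName("local-dev").getOrCreate()
)

# Hypothetical transform that leans on one of my requirements.txt libraries
to_iso = udf(lambda s: date_parser.parse(s).date().isoformat(), StringType())

df = spark.createDataFrame([("Jan 5 2021",), ("2021-02-03",)], ["raw"])
df.withColumn("day", to_iso("raw")).show()
```

If a loop like that (ideally under pytest) carries over to Glue 2.0, great, but that's exactly what I can't confirm from the docs.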
Should I use development endpoints? This doc says they are not supported for Glue 2.0, and the link points to an article titled "Running Spark ETL Jobs with Reduced Startup Times," which says nothing about development endpoints or local development at all: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html and https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html
Next I checked "Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library" at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html, but it seems to mention only Glue 1.0. Will this work for 2.0? Looking at https://github.com/awslabs/aws-glue-libs, again all I see are references to 1.0 and nothing about 2.0, and this issue is also concerning: https://github.com/awslabs/aws-glue-libs/issues/51. The 1.0 artifacts reference Spark 2.4.3, which is the same Spark version Glue 2.0 uses, but there are numerous references to how the runtime environment for Glue 2.0 jobs is different. So should I expect that time invested in local development with the 1.0 aws-glue-libs will carry over to 2.0, especially when testing with extra modules?
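My unverified assumption is that, because the Spark versions match, a script like the following would run under the 1.0 aws-glue-libs locally and then unchanged on Glue 2.0, but I'd love confirmation:

```python
# Sketch of a script I *assume* works under the 1.0 aws-glue-libs locally,
# since both Glue 1.0 and 2.0 report Spark 2.4.3 -- unverified assumption.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session  # a plain SparkSession underneath

# From here, ordinary pyspark code plus my pure-python libraries
print(spark.version)
```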
The AWS blog article "Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0" looks helpful for packaging modules for job submission, but it does not help with local testing/development: https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/
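If I'm reading that article right, the piece I could use even without building a wheel is Glue 2.0's --additional-python-modules job parameter, since all of my dependencies are pip-installable. A sketch of what I think job registration would look like via boto3 (the job name, role ARN, and S3 path below are placeholders, not real resources):

```python
# Sketch: registering a Glue 2.0 job whose pure-python deps are pulled from
# PyPI via --additional-python-modules. Name/Role/ScriptLocation are
# placeholders for illustration only.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "PythonVersion": "3",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
    },
    DefaultArguments={
        "--additional-python-modules": (
            "python-dateutil,pyyaml,jsonpath-ng,jinja2,dateparser,pymmh3"
        ),
    },
)
```

But that only answers the packaging half; it still leaves me guessing on the local dev half.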
The AWS blog article "Developing AWS Glue ETL jobs locally using a container" also seems promising, but it again references the aws-glue-libs project and a corresponding Docker image for 2.0, "amazon/aws-glue-libs:glue_libs_2.0.0_image_01". Alas, that image does not exist, and, again, the GitHub project does not mention 2.0. See https://github.com/awslabs/aws-glue-libs and https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/, and also https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/
At the end of the day, I'm confused about how to get started with local Python development and packaging for AWS Glue 2.0 (Python 3, PySpark). Any suggestions out there? Thanks!