Category "aws-glue"

How to create External Table without specifying columns in Redshift?

I have a folder containing files in parquet format. I used a crawler to create a table defined in the Glue Data Catalog, which came to 2500+ columns. I want to create
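
Possibly useful here: the usual way to avoid hand-declaring 2500+ columns is to not write the external table DDL at all, and instead mount the Glue database as an external schema so Spectrum reads the column list from the crawler-built table. A minimal sketch, assuming psycopg2 connectivity and placeholder cluster, database, and role names:

```python
import psycopg2

# Hypothetical endpoint and credentials; substitute your own.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...",
)
conn.autocommit = True  # external DDL is best run outside a transaction
with conn.cursor() as cur:
    # Map the whole Glue database; no column list is ever written by hand.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
        FROM DATA CATALOG
        DATABASE 'my_glue_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    """)
    # The crawler-defined parquet table is now queryable with all its columns.
    cur.execute("SELECT * FROM spectrum_schema.my_parquet_table LIMIT 10")
    print(cur.fetchall())
```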

Glue 3.0 has Psycopg2 but "No module named 'psycopg2._psycopg'"?

I'm using AWS Glue 3.0 and am trying to connect to Redshift using Psycopg2. At first I was uploading a .whl file version of it, and it would give me the error abo
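
That traceback usually means the wheel's compiled `_psycopg` extension was built for a different Python/glibc than the Glue 3.0 workers run. One hedged workaround (job name, role, and paths below are placeholders) is to skip the .whl upload and let Glue pip-install a prebuilt binary wheel at job start via `--additional-python-modules`:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job definition; DefaultArguments is the relevant part.
glue.update_job(
    JobName="my-redshift-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/MyGlueRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/job.py",
            "PythonVersion": "3",
        },
        "GlueVersion": "3.0",
        "DefaultArguments": {
            # Glue installs this at startup with a build matching the worker.
            "--additional-python-modules": "psycopg2-binary==2.9.9",
        },
    },
)
```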

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata i

Is version control possible in AWS Glue ETL jobs?

I am fairly new to AWS Glue. I have tried creating some jobs and it works fine; now I want to take it a step further. Say we have other developers working and n
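
Glue has no version control of its own, but a job's script is just an S3 object, so one hedged pattern is to keep the script in Git and have CI publish it. A sketch of that deploy step, with placeholder bucket and job names:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# 1. Push the reviewed script from the repo checkout to S3.
s3.upload_file("jobs/etl_job.py", "my-artifacts-bucket", "glue/etl_job.py")

# 2. Re-point the Glue job at the freshly uploaded script.
job = glue.get_job(JobName="my-etl-job")["Job"]
glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": {**job["Command"],
                    "ScriptLocation": "s3://my-artifacts-bucket/glue/etl_job.py"},
        "GlueVersion": job.get("GlueVersion", "3.0"),
    },
)
```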

Get all tables and fields from glue data catalog

Thanks for taking the time to read this! I have multiple tables within an AWS Glue catalog database and want to create an ER diagram from that database. It sho
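
The catalog API hands over everything a diagram needs. A minimal sketch (database name is a placeholder) that paginates over every table and prints its columns:

```python
import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")

for page in paginator.paginate(DatabaseName="my_glue_db"):
    for table in page["TableList"]:
        print(table["Name"])
        for col in table["StorageDescriptor"]["Columns"]:
            print(f"  {col['Name']}: {col['Type']}")
```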

SPARK SQL create table does not show / read all columns as expected

I am trying to create a table in Spark SQL by providing the schema and giving the location. However, when I run a SELECT on the table, I see only half the columns. (

Does AWS Glue support positional arguments

How to capture a Glue job's arguments by position rather than using the getResolvedOptions function and passing the arguments as key value pairs?
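
As far as I can tell there is no positional API: Glue hands everything to the script through sys.argv, so positional handling is plain Python. The wrinkle is that Glue injects its own --key value pairs alongside yours, so something like this illustrative (not official) filter is needed:

```python
import sys

# Collect bare values, skipping each injected "--key value" pair.
positional = []
args = iter(sys.argv[1:])
for arg in args:
    if arg.startswith("--"):
        next(args, None)  # drop the value that belongs to this --key
    else:
        positional.append(arg)

print(positional)  # the bare arguments that were passed, in order
```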

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second. I want to use AWS Athena to query the files by using an AWS Glue Datasource and
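
Athena degrades badly on millions of 5KB objects, so the usual move is a periodic compaction job that rewrites them as fewer, larger files. A sketch with placeholder buckets, runnable as a Glue/Spark job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-json").getOrCreate()

# Read every tiny JSON object under the source prefix...
df = spark.read.json("s3://source-bucket/events/")

# ...and rewrite it as a handful of larger files for the Athena/Glue table.
(df.coalesce(8)
   .write.mode("append")
   .json("s3://merged-bucket/events/"))
```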

How to copy data from Amazon S3 to DDB using AWS Glue

I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps: export the DDB table into Amazon S3, then use a Glue job t
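
The Glue half of that transfer is a short job. A sketch assuming a JSON export at a placeholder path and a placeholder target table; dynamodb.throughput.write.percent throttles how much write capacity the import consumes:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the exported table data from S3.
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-export-bucket/ddb-export/"]},
    format="json",
)

# Write it into the destination-account table via the DynamoDB connector.
glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "target-table",
        "dynamodb.throughput.write.percent": "0.5",
    },
)
```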

Non-Partitioned Table Schema not updated with Glue ETL Job

We have an ETL job that uses the below code snippet to update the catalog table: sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc']
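
For comparison, the documented getSink recipe only rewrites the catalog schema when enableUpdateCatalog and setCatalogInfo are both supplied; without them the write succeeds but the table definition never changes. A sketch with placeholder names standing in for the config values:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/bc/",          # stands in for config['glue_s3_path_bc']
    enableUpdateCatalog=True,           # omitted => catalog schema stays stale
    updateBehavior="UPDATE_IN_DATABASE",
)
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
sink.setFormat("glueparquet")
sink.writeFrame(dynamic_frame)          # the job's output DynamicFrame
```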

Having trouble setting up multiple tables in AWS glue from a single bucket

So, I've used Glue before, but it's been with a single file <> single folder relationship. What I'm trying to do now is to have a structure like this crea
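
One hedged fix is to register each top-level folder as its own include path, which steers the crawler toward one table per prefix instead of one merged table for the bucket. A sketch with a placeholder layout:

```python
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="per-folder-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueRole",
    DatabaseName="my_glue_db",
    Targets={"S3Targets": [
        # One include path per intended table.
        {"Path": "s3://my-bucket/orders/"},
        {"Path": "s3://my-bucket/customers/"},
        {"Path": "s3://my-bucket/invoices/"},
    ]},
)
```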

Spark Catalog w/ AWS Glue: database not found

I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s
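
A sketch of the configuration that usually matters here (not necessarily this asker's root cause): Glue databases only resolve when the session is built with the Glue metastore factory and Hive support, and some entry points fall back silently to the default in-memory catalog, which has no such database:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Route the Hive metastore client at the Glue Data Catalog.
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
print(spark.catalog.listDatabases())
```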

Not able to populate AWS Glue ETL Job metrics

I am trying to populate the maximum possible Glue job metrics for some testing; below is the setup I have created: a crawler reads data (dummy customer data of 500
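
One prerequisite that is easy to miss: job metrics only flow to CloudWatch when the job itself carries the --enable-metrics argument. A sketch that patches an existing job in place (job name is a placeholder):

```python
import boto3

glue = boto3.client("glue")
job = glue.get_job(JobName="my-etl-job")["Job"]

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": {
            **job.get("DefaultArguments", {}),
            "--enable-metrics": "true",  # presence of the flag is what matters
        },
    },
)
```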

exclusions doesn't work in AWS Glue ETL job S3 connection

According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-
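
In my experience the usual trip-up is the value's type: "exclusions" has to be a JSON-encoded string of glob patterns, and a plain Python list tends not to take effect. A sketch with placeholder paths:

```python
import json

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/data/"],
        # A string containing a JSON list, not the list itself.
        "exclusions": json.dumps(["**/_metadata", "**.tmp"]),
    },
    format="parquet",
)
```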

Trino iceberg connector "getTablesWithParameter for GlueHiveMetastore is not implemented"

I'm running Trino on EMR version 6.5 and have added the Iceberg connector for Trino, and I want it to use a Glue catalog. These are the configurations under

AWS Glue Jupyter Notebook Failed to authenticate user

When I started a job with the IAM role AWSGlueServiceNotebookRoleDefault, I got this error: Failed to authenticate user due to missing information in request. No info

AWS Glue Crawler Not Creating Table

I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds

AWS Glue Job extracts columns that are not present in Catalog table

It looks like my earlier post was not clear. Here is what I am looking for: I have an AWS Glue catalog table consisting of 29 columns; the source table has 31 columns.
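
That behavior is expected in the sense that from_catalog projects whatever the underlying files contain; the catalog's 29 columns are not enforced on read. A sketch that trims the frame back to the catalog's declared columns (names are placeholders):

```python
import boto3

from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

# Ask the catalog which columns it actually declares (the 29)...
table = boto3.client("glue").get_table(
    DatabaseName="my_db", Name="my_table")["Table"]
wanted = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]

# ...and drop the extra source columns from the frame.
dyf = SelectFields.apply(frame=dyf, paths=wanted)
```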

Load data from S3 into Aurora Serverless using AWS Glue

According to Moving data from S3 -> RDS using AWS Glue, I found that an instance is required to add a connection to a data target. However, my RDS is a serve
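
Since a serverless cluster exposes no public endpoint, the job needs a Glue network/JDBC connection in the same VPC; the write itself is then ordinary JDBC. A sketch with placeholder endpoint, credentials, and table, assuming a MySQL-compatible Aurora cluster:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Source data from S3 (CSV with a header row, as an example).
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# JDBC write into the Aurora Serverless cluster.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://my-cluster.cluster-abc123.us-east-1"
               ".rds.amazonaws.com:3306/mydb",
        "user": "admin",
        "password": "...",
        "dbtable": "target_table",
    },
)
```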

Redshift Spectrum table doesn't recognize array

I ran a crawler on a JSON S3 file to update an existing external table. Once finished, I checked SVL_S3LOG to see the structure of the external table a
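
If the array column exists but ordinary SELECTs choke on it, that is normal Spectrum behavior: nested types are only reachable through the nested-data syntax, where the array is unnested by aliasing it in FROM. A sketch with placeholder schema, table, and column names:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...",
)
with conn.cursor() as cur:
    # Each row of t fans out into one row per element of t.my_array_col.
    cur.execute("""
        SELECT t.id, elem
        FROM spectrum_schema.my_table AS t, t.my_array_col AS elem
        LIMIT 10
    """)
    print(cur.fetchall())
```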