Category "parquet"

Is there any way to read multiple parquet paths from s3 in parallel using spark?

My data is stored in s3 (parquet format) under different paths and I'm using spark.read.parquet(pathes:_*) in order to read all the paths into one dataframe. Un
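For the parallel-read question above, the usual pattern is to hand every path to a single spark.read.parquet call (the Scala `pathes:_*` splat corresponds to `*paths` in Python); Spark then distributes the file listing and scan across the executors on its own. A minimal PySpark sketch, with bucket and prefixes assumed for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical S3 prefixes; spark.read.parquet accepts any number of paths
# and plans one scan over all of them, so the executors read the objects
# in parallel without extra orchestration on the driver.
paths = [
    "s3a://my-bucket/events/2023/01/",
    "s3a://my-bucket/events/2023/02/",
]

df = spark.read.parquet(*paths)
df.printSchema()
```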

Py4JJavaError: An error occurred while calling o143.parquet

So I have parquet files in an S3 bucket and I want to load them using pyspark in Python, but I'm getting an error. Here's what I have tried so far. I'm using Juput
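A Py4JJavaError on o143.parquet when reading from S3 is often a missing or mismatched hadoop-aws dependency or unset s3a credentials. A hedged sketch of a session setup that commonly works, where the package version, bucket, and key placeholders are assumptions to adapt:

```python
from pyspark.sql import SparkSession

# The hadoop-aws version must match the Hadoop version bundled with Spark;
# 3.3.4 is only an example and may need to change for your install.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# Use the s3a:// scheme (not s3:// or s3n://) when going through hadoop-aws.
df = spark.read.parquet("s3a://my-bucket/path/to/data/")
df.show(5)
```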

When writing parquet files to s3: NoSuchMethodError: void org.apache.hadoop.util.SemaphoredDelegatingExecutor

When I try to write the dataframe to s3 as parquet, I always get an error like below. In the s3 bucket, an empty folder is generated automatically every time, b

AWS Athena table from python output with dates - dates get wrongly converted

I have a pandas DataFrame containing a date column ("2022-02-02"). I write this table to parquet using pyarrow. df[col] = df[col].astype(str) df.to_parquet(loc)
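A common cause here is writing the dates as plain strings (the astype(str) line above), which Athena then interprets against whatever type the table declares. One hedged workaround is to keep the column as real dates so pyarrow writes a Parquet DATE logical type; the column and file names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"event_date": ["2022-02-02", "2022-02-03"]})

# Convert the strings to datetime.date objects; pyarrow maps these to the
# Parquet DATE (date32) logical type, which Athena can read as DATE directly.
df["event_date"] = pd.to_datetime(df["event_date"]).dt.date

df.to_parquet("dates.parquet", engine="pyarrow")
```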

Parquet Binary storing value in encoded format

I'm creating a parquet file from Java using org.apache.parquet.*. No issues with other data types, but when a binary value is written and I cat the parquet file

Spatial database architecture with Apache Parquet, PostgreSQL and PostGIS on an on-premises bare-metal S3/MinIO cluster

Designing a storage architecture for petabyte-scale geospatial data, starting from scratch. Creating a MinIO cluster to store the objects in S3 buckets. To store

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large Data
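to_parquet itself writes one file per call, so a straightforward approach is to slice the frame and call it once per chunk; the chunk size and naming scheme below are assumptions to tune:

```python
import pandas as pd

def write_in_chunks(df: pd.DataFrame, prefix: str, rows_per_file: int = 1_000_000):
    """Write df as several parquet files of roughly rows_per_file rows each."""
    for i, start in enumerate(range(0, len(df), rows_per_file)):
        chunk = df.iloc[start:start + rows_per_file]
        chunk.to_parquet(f"{prefix}_part{i:05d}.parquet", index=False)

# Example: ~1M rows per output file under a hypothetical prefix.
# write_in_chunks(big_df, "out/data", rows_per_file=1_000_000)
```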

Can I store a Parquet file with a dictionary column having mixed types in their values?

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as
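Arrow's struct/map types expect consistent value types, so a dictionary column whose values mix types usually breaks type inference. One hedged workaround is to serialize the dictionaries to JSON strings before writing and decode them after reading; the column names are illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "attrs": [{"a": 1, "b": "x"}, {"a": 2.5, "c": True}],  # mixed value types
})

# Store the dictionaries as JSON text so the column is a plain string column.
df["attrs"] = df["attrs"].apply(json.dumps)
df.to_parquet("mixed.parquet", engine="pyarrow")

# Restore the dictionaries after reading back.
restored = pd.read_parquet("mixed.parquet")
restored["attrs"] = restored["attrs"].apply(json.loads)
```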

includeOpForFullLoad property on Parquet files - AWS DMS

I have tried the includeOpForFullLoad property for csv files while reading from Oracle and writing to S3. I wanted to use this property for parquet files. I can

Presto fails to import PARQUET files from S3

I have a presto table that imports PARQUET files based on partitions from s3 as follows: create table hive.data.datadump ( tUnixEpoch varchar, tDateTi

sparkR sql() returns string

We have parquet data saved on a server and I am trying to use the SparkR sql() function in the following way: df <- sql("SELECT * FROM parquet.`<path to parq

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(
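For the column-pruning question, selecting the columns right after the read is normally enough: Parquet is columnar and Spark pushes the projection down so only those column chunks are scanned. A short PySpark sketch with assumed paths and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only the selected columns are read from the Parquet files; the projection
# is pushed down to the scan, so unrelated columns are never materialized.
df = (
    spark.read.parquet("s3a://my-bucket/wide-table/")
    .select("user_id", "event_time", "amount")
)
df.explain()  # the plan's ReadSchema should list only the selected columns
```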

Py4JJavaError when trying to write pyspark DataFrame to parquet

I wanted to convert a large .csv file into .parquet format using pyspark. I am using Python 3. I tried changing the codec used for compression, as suggested in
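Setting the error itself aside, the basic CSV-to-Parquet conversion in PySpark looks like the sketch below; the paths and reader options are assumptions, and compression can be left at the default snappy if changing the codec was only a troubleshooting step:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")        # first line contains column names
    .option("inferSchema", "true")   # let Spark guess column types
    .csv("input/large_file.csv")
)

# Snappy is the default Parquet codec; it is named here only for clarity.
df.write.mode("overwrite").option("compression", "snappy").parquet("output/large_file_parquet")
```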

How/Where can I write time series data? As Parquet format to Hadoop, or HBase, Cassandra?

I have real-time time series sensor data. My primary goal is to keep the raw data, and I need to do this while keeping storage costs minimal. My scenario like th

Pyspark throwing error while trying to read parquet

I am a newbie in pyspark. While trying to read a parquet file through pyspark I get the error below. I have tried various things like reinstalling the jre and jd

Convert csv to parquet file using python

I am trying to convert a .csv file to a .parquet file. The csv file (Temp.csv) has the following format: 1,Jon,Doe,Denver. I am using the following python
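For a small headerless file like Temp.csv, pandas plus pyarrow is usually the shortest route; the column names below are made up since the sample row has none:

```python
import pandas as pd

# The sample row "1,Jon,Doe,Denver" has no header line, so supply column names.
df = pd.read_csv(
    "Temp.csv",
    header=None,
    names=["id", "first_name", "last_name", "city"],
)

df.to_parquet("Temp.parquet", engine="pyarrow", index=False)
```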

Best way to store many pandas dataframes in a single file

I have 20,000 ~1000-row dataframes, each of which has a name, in a 170GB pickle file at the moment. I'd like to write these to a file so I can load them i
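One hedged option is to concatenate the frames with an extra column carrying each frame's name, write a single Parquet file, and filter on that column when loading one frame back. The sketch assumes the frames share a schema and that the pyarrow engine is used so the filters argument applies:

```python
import pandas as pd

def save_frames(frames, path):
    """frames: dict mapping each dataframe's name to its ~1000-row frame.
    Assumes all frames share the same columns."""
    combined = pd.concat(
        [df.assign(frame_name=name) for name, df in frames.items()],
        ignore_index=True,
    )
    combined.to_parquet(path, engine="pyarrow", index=False)

def load_frame(path, name):
    # Predicate pushdown on the name column avoids reading every frame.
    return pd.read_parquet(
        path, engine="pyarrow", filters=[("frame_name", "==", name)]
    )
```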

How do I write this Python code to use 2+ fewer nested if statements?

I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file.
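One way to flatten the nested ifs is to keep the running maxima in a dict keyed by column name and update them from each row group's statistics; this sketch assumes the pyarrow metadata API and that min/max statistics are present for the row groups:

```python
import pyarrow.parquet as pq

TARGET_COLUMNS = {"i", "j", "k"}  # columns whose maxima we want

meta = pq.read_metadata("data.parquet")
maxima = {}

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for c in range(row_group.num_columns):
        col = row_group.column(c)
        stats = col.statistics
        if col.path_in_schema not in TARGET_COLUMNS or stats is None or not stats.has_min_max:
            continue
        # A single max() per column replaces the nested if/else chains.
        name = col.path_in_schema
        maxima[name] = max(maxima.get(name, stats.max), stats.max)

print(maxima)
```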

How to read partitioned parquet files from S3 using pyarrow in python

I'm looking for ways to read data from multiple partitioned directories from s3 using python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snapp
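With recent pyarrow, the dataset API reads a Hive-partitioned layout like serial_number=1/cur_date=... directly from S3; the bucket, prefix, and region below are placeholders, and the type used in the filter depends on the partition schema pyarrow infers:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Credentials come from the environment/instance profile by default;
# the region is an assumption to adjust.
s3 = fs.S3FileSystem(region="us-east-1")

dataset = ds.dataset(
    "my-bucket/data_folder",   # no s3:// prefix when a filesystem is passed
    format="parquet",
    partitioning="hive",       # serial_number=1/cur_date=... directories
    filesystem=s3,
)

# Partition columns can be filtered without touching the other directories.
table = dataset.to_table(filter=ds.field("serial_number") == 1)
df = table.to_pandas()
```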

Python: Obtain number of rows for ParquetDataset?

How do I obtain the number of rows of a ParquetDataset that is structured in the form of a folder containing multiple parquet files? I tried from pyarrow.parq
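With the newer pyarrow dataset API the row count can come straight from the file footers, without reading the data itself; the folder path is a placeholder:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset("path/to/parquet_folder", format="parquet")

# Option 1: let the dataset count rows from the Parquet metadata.
print(dataset.count_rows())

# Option 2: sum the per-file metadata explicitly.
total = sum(pq.ParquetFile(f).metadata.num_rows for f in dataset.files)
print(total)
```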