Category "pyarrow"

Read timeout in pd.read_parquet from S3, and understanding configs

I'm trying to simplify access to datasets in various file formats (csv, pickle, feather, partitioned parquet, ...) stored as S3 objects. Since some users I support…
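A sketch of one way to raise the timeout, assuming s3fs is the filesystem backend: pandas forwards `storage_options` to s3fs, which passes `config_kwargs` on to botocore's client config. The bucket and key below are placeholders.

```python
import pandas as pd

# Placeholder path; connect_timeout/read_timeout are botocore Config
# options that s3fs forwards via config_kwargs.
df = pd.read_parquet(
    "s3://my-bucket/data.parquet",
    storage_options={"config_kwargs": {"connect_timeout": 30,
                                       "read_timeout": 120}},
)
```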

Join two pyarrow tables

I have ORC data as follows. Table A: Name age school address phone / tony 12 havard UUU 666 / tommy 13 abc…
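Recent pyarrow versions can do this natively: `Table.join` (added in pyarrow 7.0) performs relational joins between tables. A minimal sketch with toy data standing in for the ORC tables:

```python
import pyarrow as pa

table_a = pa.table({"Name": ["tony", "tommy"], "age": [12, 13],
                    "school": ["havard", "abc"]})
table_b = pa.table({"Name": ["tony", "tommy"], "phone": [666, 777]})

# Inner join on the shared "Name" key; other join_type values such as
# "left outer" are also supported.
joined = table_a.join(table_b, keys="Name", join_type="inner")
print(joined.to_pandas())
```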

How to update data in pyarrow table?

I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in t…
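Arrow tables are immutable, so values can't be updated in place; the usual pattern is to compute a replacement column and swap it in. A minimal sketch:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

# Build a new column with pyarrow.compute, then replace the old one.
new_y = pc.multiply(table["y"], 2)
idx = table.schema.get_field_index("y")
table = table.set_column(idx, "y", new_y)
```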

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame…
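`DataFrame.to_parquet` itself writes a single file, but the frame can be split first; a sketch using a hypothetical `rows_per_file` target (newer pyarrow also offers a `max_rows_per_file` option on `pyarrow.dataset.write_dataset`):

```python
import pandas as pd

def to_parquet_chunks(df: pd.DataFrame, prefix: str, rows_per_file: int) -> None:
    """Write df as several parquet files of roughly rows_per_file rows each."""
    n_chunks = max(1, -(-len(df) // rows_per_file))  # ceiling division
    for i in range(n_chunks):
        chunk = df.iloc[i * rows_per_file : (i + 1) * rows_per_file]
        chunk.to_parquet(f"{prefix}-{i:05d}.parquet")
```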

Can I store a Parquet file with a dictionary column having mixed types in their values?

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as…
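Parquet (via Arrow) needs one consistent value type per struct/map field, so dicts with mixed value types fail schema inference. One workaround sketch: serialize the dicts to JSON strings before writing and decode them after reading.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    # Mixed value types inside the dicts break Arrow's type inference.
    "meta": [{"a": 1, "b": "x"}, {"a": "two", "b": 3.5}],
})

df["meta"] = df["meta"].map(json.dumps)   # store as plain strings
df.to_parquet("mixed.parquet")

restored = pd.read_parquet("mixed.parquet")
restored["meta"] = restored["meta"].map(json.loads)  # back to dicts
```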

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow…
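pyarrow's CSV reader assumes UTF-8 by default, but the decoding can be set explicitly through ReadOptions; a sketch mirroring pandas' encoding="latin" (the filename is a placeholder):

```python
import pyarrow.csv as pv

table = pv.read_csv(
    "data.dat",  # placeholder filename
    read_options=pv.ReadOptions(encoding="latin1"),  # default is "utf8"
)
```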

How to control whether pyarrow.dataset.write_dataset will overwrite previous data or append to it?

I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, …
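write_dataset raises by default when the target directory already contains files; the existing_data_behavior argument (pyarrow >= 6.0) selects between erroring, appending, and overwriting. A sketch with placeholder paths:

```python
import uuid
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"part": [1, 1, 2], "value": [10, 20, 30]})

# "error" (default): refuse to write if files exist.
# "overwrite_or_ignore": append; a unique basename_template keeps new
#     files from clobbering ones written earlier.
# "delete_matching": delete matching partitions first, i.e. overwrite.
ds.write_dataset(
    table, "some/dir", format="parquet",
    partitioning=["part"],
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```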

How to properly create an apache plasma store from within a python program?

If I run the following as a program: import subprocess subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True) and then…
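subprocess.run waits for the command to finish, so the store is gone by the time anything connects. A sketch using Popen to keep it running in the background; note pyarrow.plasma was deprecated and later removed, so the connect call assumes an older pyarrow that still ships it.

```python
import subprocess
import time

# Argument-list form avoids shell=True; keep the handle so the store
# stays alive and can be terminated later.
store = subprocess.Popen(
    ["plasma_store", "-m", "10000000000", "-s", "/tmp/plasma"])
time.sleep(1)  # crude wait for the socket to appear

import pyarrow.plasma as plasma  # removed in recent pyarrow releases
client = plasma.connect("/tmp/plasma")
# ... use the client ...
store.terminate()
```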

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS…
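BigQuery timestamps carry coarser-than-nanosecond precision, so the implicit Arrow cast from timestamp[ns] refuses to silently drop digits. One workaround sketch: truncate the pandas column to the target unit first so the cast is lossless.

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2023-01-01 00:00:00.123456789"])})

# Flooring discards the sub-millisecond remainder up front, so the later
# timestamp[ns] -> timestamp[ms] cast no longer loses data.
df["ts"] = df["ts"].dt.floor("ms")
```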

No module named 'pyarrow.lib' found from lambda function

I have installed pyarrow version 0.14.0. I'm creating a package to run that from Lambda. While executing from Lambda I'm getting the error: No module named 'pyarrow.lib'…

How do I write this Python code to use 2+ fewer nested if statements?

I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file.
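One way to flatten the nesting is to push the row-group loop into a generator over the metadata statistics and let max() do the comparisons; the filename and column names below are placeholders:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata  # placeholder filename
col_idx = {meta.schema.column(i).name: i for i in range(meta.num_columns)}

def column_max(name):
    """Max of a column across all row groups, from metadata statistics."""
    stats = (meta.row_group(rg).column(col_idx[name]).statistics
             for rg in range(meta.num_row_groups))
    return max(s.max for s in stats if s is not None and s.has_min_max)

maxima = {name: column_max(name) for name in ("i", "j", "k")}
```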

Python multiprocessing shared memory error on close

I am using Python multiprocessing shared memory. I see the following error when I try to close a SharedMemory object: BufferError: memoryview has 1 exported buffer
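That BufferError usually means something, often a numpy array, still holds a view of shm.buf when close() runs. A minimal sketch of the situation and the fix:

```python
import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=10 * 8)
arr = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
arr[:] = 0

# shm.close() would raise BufferError here because arr still references
# shm.buf; drop every view first, then close (and unlink exactly once).
del arr
shm.close()
shm.unlink()
```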

How to read partitioned parquet files from S3 using pyarrow in python

I'm looking for ways to read data from multiple partitioned directories from S3 using Python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snapp…
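A sketch with pyarrow.dataset, which discovers hive-style key=value directories as partition columns (bucket name and region are placeholders):

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # region is an assumption
dataset = ds.dataset("my-bucket/data_folder", filesystem=s3,
                     format="parquet", partitioning="hive")

# Partition keys become filterable columns.
table = dataset.to_table(filter=(ds.field("serial_number") == 1))
```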