I'm trying to simplify access to datasets in various file formats (csv, pickle, feather, partitioned parquet, ...) stored as S3 objects. Since some users I supp
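A minimal sketch of one way to dispatch on file extension, assuming s3fs/fsspec is installed so the pandas readers accept s3:// URIs directly; the bucket, keys, and helper name here are placeholders, not an existing API:

import pandas as pd

# Hypothetical helper: pick a pandas reader based on the object's extension.
READERS = {
    ".csv": pd.read_csv,
    ".pkl": pd.read_pickle,
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,   # also accepts a partitioned directory path
}

def load_s3_dataset(s3_uri):
    for ext, reader in READERS.items():
        if s3_uri.rstrip("/").endswith(ext):
            return reader(s3_uri)
    # Fall back to parquet for partitioned directories without an extension.
    return pd.read_parquet(s3_uri)

df = load_s3_dataset("s3://my-bucket/datasets/example.csv")   # placeholder URI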
I have an ORC file with data as follows.

Table A:
Name   age  school  address  phone
tony   12   havard  UUU      666
tommy  13   abc
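The question is cut off here, but for getting an ORC file like this into Python, a minimal sketch with pyarrow (the file name is a placeholder; newer pyarrow versions expose orc.read_table, older ones use orc.ORCFile(...).read()):

import pyarrow.orc as orc

# Read the ORC file into an Arrow table, then convert to pandas for inspection.
table = orc.read_table("table_a.orc")   # placeholder file name
df = table.to_pandas()
print(df[["Name", "age", "school", "address", "phone"]])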
I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in t
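pyarrow Tables are immutable, so rather than updating cells in a loop, one common pattern is to build a replacement column and swap it in with set_column; a sketch, assuming a numeric column named "price" that should be doubled (the path and column name are made up):

import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")            # placeholder path

# Compute a new column instead of mutating values in place.
idx = table.schema.get_field_index("price")      # hypothetical column
new_col = pc.multiply(table.column("price"), 2)
table = table.set_column(idx, "price", new_col)

pq.write_table(table, "data_updated.parquet")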
Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large Data
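DataFrame.to_parquet writes a single file, but a rough workaround is to estimate bytes per row and write slices to separately named files; a sketch, where the 128 MB target and the output prefix are assumptions (newer pyarrow versions also expose a max_rows_per_file option on pyarrow.dataset.write_dataset):

import math
import pandas as pd

def to_parquet_chunks(df, prefix, target_bytes=128 * 1024 * 1024):
    # Estimate how many rows fit into one file of roughly target_bytes.
    bytes_per_row = df.memory_usage(deep=True).sum() / max(len(df), 1)
    rows_per_file = max(int(target_bytes / bytes_per_row), 1)
    n_files = math.ceil(len(df) / rows_per_file)
    for i in range(n_files):
        chunk = df.iloc[i * rows_per_file : (i + 1) * rows_per_file]
        chunk.to_parquet(f"{prefix}_part{i:04d}.parquet", index=False)

# Usage (placeholder DataFrame and prefix):
# to_parquet_chunks(big_df, "s3://my-bucket/out/data")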
I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as
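One workaround, assuming the dictionaries do not need to be stored as native nested Parquet types, is to serialize that column to JSON strings before writing; the column name "attrs" below is made up:

import json
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "attrs": [{"a": 1}, {"b": 2}],   # the dict-valued column
})

# Serialize dicts to JSON strings so the column becomes a plain string column.
df["attrs"] = df["attrs"].apply(json.dumps)
df.to_parquet("with_dicts.parquet", index=False)

# Reading it back: parse the JSON strings into dicts again.
restored = pd.read_parquet("with_dicts.parquet")
restored["attrs"] = restored["attrs"].apply(json.loads)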
I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarr
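If the goal is the pyarrow CSV reader, its ReadOptions accepts an encoding; a minimal sketch, assuming the .dat file is delimiter-separated text (the file name and delimiter are guesses):

import pyarrow.csv as pv

read_opts = pv.ReadOptions(encoding="latin1")
parse_opts = pv.ParseOptions(delimiter=",")   # adjust to the actual .dat layout

table = pv.read_csv("data.dat", read_options=read_opts, parse_options=parse_opts)
df = table.to_pandas()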
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, t
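In pyarrow 6.0+ write_dataset takes an existing_data_behavior argument that controls what happens when the target directory already contains files; a sketch against HDFS, where the host, port, path, and table contents are all placeholders:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)   # placeholder connection

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

ds.write_dataset(
    table,
    "/data/my_dataset",                            # placeholder HDFS path
    format="parquet",
    filesystem=hdfs,
    existing_data_behavior="overwrite_or_ignore",  # or "delete_matching" / "error"
)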
If I run the following as a program:

import subprocess
subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True)

and then
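One thing to check: subprocess.run waits for the command to finish, and plasma_store never exits on its own, so that call blocks (and capture_output keeps the pipes open). A sketch of launching the store as a background process instead, assuming plasma_store is on PATH:

import subprocess
import time

# Start the store in the background instead of waiting for it to exit.
store = subprocess.Popen(
    ["plasma_store", "-m", "10000000000", "-s", "/tmp/plasma"]
)
time.sleep(1)   # crude wait for the /tmp/plasma socket to appear

# ... connect plasma clients to /tmp/plasma here ...

store.terminate()
store.wait()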
I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS
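For reference, a minimal sketch of that load call with the google-cloud-bigquery client, using placeholder project/dataset/table names in place of the question's PROJECT, DATASET, and PROGRAMS constants; waiting on result() is what surfaces schema or type errors:

import pandas as pd
from google.cloud import bigquery

bq_client = bigquery.Client()
df = pd.DataFrame({"name": ["a"], "value": [1]})        # placeholder data
table_id = "my-project.my_dataset.my_table"             # placeholder PROJECT.DATASET.TABLE

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = bq_client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()   # block until the load finishes and raise on failure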
I have installed pyarrow version 0.14.0. I'm creating a package to run it from Lambda. While executing from Lambda I'm getting the error - No module named 'pyarro
I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file.
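For reference, a sketch of reading per-row-group statistics with pyarrow.parquet, assuming the columns actually carry min/max statistics; the path is a placeholder and the column names i, j, k are used only because the question names them:

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")           # placeholder path
meta = pf.metadata
targets = ["i", "j", "k"]                     # columns named in the question
name_to_idx = {meta.schema.column(c).name: c for c in range(meta.num_columns)}

maxima = {name: None for name in targets}
for rg in range(meta.num_row_groups):
    for name in targets:
        stats = meta.row_group(rg).column(name_to_idx[name]).statistics
        if stats is not None and stats.has_min_max:
            if maxima[name] is None or stats.max > maxima[name]:
                maxima[name] = stats.max

print(maxima)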
I am using Python multiprocessing shared memory. I see the following error when I try to close a SharedMemory object: BufferError: memoryview has 1 exported buf
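That error usually means something (often a numpy array) still holds a view of shm.buf when close() runs; a sketch of releasing the view first, with a made-up shape and dtype:

import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=8 * 10)

# This array is a view over shm.buf; it keeps the buffer "exported".
arr = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
arr[:] = 0

# Drop every reference to the view before closing, otherwise BufferError is raised.
del arr
shm.close()
shm.unlink()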
I'm looking for ways to read data from multiple partitioned directories on S3 using Python. data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snapp
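A minimal sketch with pyarrow.dataset and hive-style partition discovery, assuming S3 credentials are available in the environment; the bucket name is a placeholder and the partition keys are the ones from the path above:

import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://my-bucket/data_folder/",   # placeholder bucket + the prefix from the question
    format="parquet",
    partitioning="hive",             # picks up serial_number=... / cur_date=... keys
)

# Read only one partition by filtering on the inferred partition columns.
table = dataset.to_table(filter=(ds.field("serial_number") == 1))
df = table.to_pandas()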