I'm trying to use the pyarrow Filesystem interface with HDFS. I receive a libhdfs.so not found error when calling the fs.HadoopFileSystem constructor even though...
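The usual culprit is that Arrow cannot locate libhdfs.so at runtime. A minimal sketch of pointing it at the library explicitly, assuming a hypothetical Hadoop install under /opt/hadoop and a NameNode at namenode:8020:

    import os
    import pyarrow.fs as fs

    # Directory that actually contains libhdfs.so (path is an assumption;
    # adjust to your Hadoop installation).
    os.environ["ARROW_LIBHDFS_DIR"] = "/opt/hadoop/lib/native"

    # Host and port are placeholders for your NameNode.
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
    print(hdfs.get_file_info(fs.FileSelector("/", recursive=False)))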
I'm trying to simplify access to datasets in various file formats (csv, pickle, feather, partitioned parquet, ...) stored as S3 objects. Since some users I support...
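One possible shape for such a helper is to dispatch on the file extension; a minimal sketch, assuming s3fs is available, credentials come from the environment, and the bucket/keys are placeholders:

    import pandas as pd
    import s3fs

    s3 = s3fs.S3FileSystem()

    def load(path: str) -> pd.DataFrame:
        """Load an S3 object into a DataFrame based on its extension."""
        if path.endswith(".csv"):
            with s3.open(path) as f:
                return pd.read_csv(f)
        if path.endswith(".feather"):
            with s3.open(path) as f:
                return pd.read_feather(f)
        if path.endswith(".pkl"):
            with s3.open(path) as f:
                return pd.read_pickle(f)
        # Directories of partitioned parquet can go straight to pandas/pyarrow.
        return pd.read_parquet(f"s3://{path}")

    df = load("my-bucket/datasets/example.csv")  # placeholder object key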
I have an ORC file with data as follows.

Table A:

    Name   age  school  address  phone
    tony   12   havard  UUU      666
    tommy  13   abc     ...
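For reference, a minimal sketch of reading such an ORC file into pandas with pyarrow's orc module; the file name table_a.orc is hypothetical:

    import pyarrow.orc as orc

    table = orc.ORCFile("table_a.orc").read()   # file name is a placeholder
    df = table.to_pandas()
    print(df[["Name", "age", "school"]])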
I have a Python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in t...
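Arrow tables are immutable, so values can't be updated in place while looping; the usual pattern is to compute a replacement column and swap it in with set_column. A minimal sketch, assuming a hypothetical column "price" that should be doubled:

    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")              # file name is a placeholder

    idx = table.schema.get_field_index("price")
    new_col = pc.multiply(table.column("price"), 2)    # build a new column
    table = table.set_column(idx, "price", new_col)    # swap it into the table

    pq.write_table(table, "data_updated.parquet")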
Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame...
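to_parquet itself produces a single file, so one workaround is to slice the DataFrame and write each slice separately, using a row count as a proxy for the target file size. A minimal sketch, assuming roughly 1,000,000 rows per file and a hypothetical output prefix:

    import numpy as np
    import pandas as pd

    def to_parquet_chunks(df: pd.DataFrame, prefix: str, rows_per_file: int = 1_000_000):
        """Write df as several parquet files of about rows_per_file rows each."""
        n_chunks = max(1, int(np.ceil(len(df) / rows_per_file)))
        for i, chunk in enumerate(np.array_split(df, n_chunks)):
            chunk.to_parquet(f"{prefix}_part{i:04d}.parquet", index=False)

    # to_parquet_chunks(df, "output/mydata")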
I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as...
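One common workaround is to serialize the dictionary column to JSON strings before writing, so pyarrow only sees a plain string column; the column name "meta" below is hypothetical:

    import json
    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2],
        "meta": [{"a": 1}, {"b": 2}],   # dictionaries as cell values
    })

    df["meta"] = df["meta"].apply(json.dumps)   # dicts -> JSON strings
    df.to_parquet("with_dicts.parquet", index=False)

    # When reading back, df["meta"].apply(json.loads) restores the dictionaries.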
I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow...
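pyarrow's CSV reader takes the encoding through ReadOptions rather than as a keyword on read_csv. A minimal sketch, assuming a reasonably recent pyarrow and a placeholder file name:

    import pyarrow.csv as pv

    read_opts = pv.ReadOptions(encoding="latin-1")           # transcode before parsing
    table = pv.read_csv("data.dat", read_options=read_opts)
    df = table.to_pandas()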
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, ...
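Recent pyarrow versions expose existing_data_behavior on write_dataset to control what happens when the target directory already has files. A minimal sketch, assuming pyarrow >= 6.0 and placeholder HDFS host, port, and path:

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)   # placeholders
    table = pa.table({"x": [1, 2, 3]})                       # stand-in data

    ds.write_dataset(
        table,
        "/data/mydataset",                                   # hypothetical HDFS directory
        format="parquet",
        filesystem=hdfs,
        existing_data_behavior="overwrite_or_ignore",        # don't error on existing data
        basename_template="batch-2-{i}.parquet",             # avoid clobbering earlier part files
    )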
If I run the following as a program:

    import subprocess
    subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True)

and then...
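subprocess.run waits for plasma_store to exit, so the store is no longer running by the time anything tries to connect. Starting it with Popen keeps it alive in the background; a minimal sketch, assuming an older pyarrow that still ships the (since-removed) plasma module:

    import subprocess
    import time
    import pyarrow.plasma as plasma

    # Launch the store in the background instead of waiting for it to finish.
    store = subprocess.Popen(["plasma_store", "-m", "1000000000", "-s", "/tmp/plasma"])
    time.sleep(1)   # crude wait for the socket to appear

    client = plasma.connect("/tmp/plasma")
    print(client.store_capacity())

    client.disconnect()
    store.terminate()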
I'm working on a script where I'm sending a DataFrame to BigQuery:

    load_job = bq_client.load_table_from_dataframe(
        df, '.'.join([PROJECT, DATASET, PROGRAMS...
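When this kind of load fails on type conversion, passing an explicit schema via LoadJobConfig often helps; a minimal sketch, assuming google-cloud-bigquery, stand-in data, and placeholder project/dataset/table names:

    import pandas as pd
    from google.cloud import bigquery

    bq_client = bigquery.Client()
    df = pd.DataFrame({"program_id": ["a", "b"], "enrolled": [10, 20]})   # stand-in data

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("program_id", "STRING"),   # hypothetical columns
            bigquery.SchemaField("enrolled", "INTEGER"),
        ],
        write_disposition="WRITE_TRUNCATE",
    )

    load_job = bq_client.load_table_from_dataframe(
        df, "my-project.my_dataset.programs", job_config=job_config   # placeholder table id
    )
    load_job.result()   # wait for the load to finish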
I have installed pyarrow version 0.14.0. I'm creating a package to run it from AWS Lambda. While executing from Lambda I'm getting the error: No module named 'pyarro...
I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file.
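A minimal sketch of what such a loop can look like with pyarrow.parquet, assuming the file carries column statistics and the columns are literally named i, j, k (file name is a placeholder):

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("data.parquet").metadata
    names = [meta.schema.column(c).name for c in range(meta.num_columns)]

    maxima = {}
    for rg in range(meta.num_row_groups):
        for c, name in enumerate(names):
            if name not in ("i", "j", "k"):
                continue
            stats = meta.row_group(rg).column(c).statistics
            if stats is None or not stats.has_min_max:
                continue
            cur = maxima.get(name)
            maxima[name] = stats.max if cur is None else max(cur, stats.max)

    print(maxima)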
I am using Python multiprocessing shared memory. I see the following error when I try to close a SharedMemory object: BufferError: memoryview has 1 exported buffer...
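That error usually means something (often a numpy array) still holds a view into shm.buf when close() is called; dropping every view first avoids it. A minimal sketch:

    from multiprocessing import shared_memory
    import numpy as np

    shm = shared_memory.SharedMemory(create=True, size=10 * 8)
    arr = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
    arr[:] = np.arange(10)

    # Release every view into shm.buf before closing; otherwise the
    # underlying memoryview still has exported buffers and close() raises.
    del arr
    shm.close()
    shm.unlink()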
I'm looking for ways to read data from multiple partitioned directories in S3 using Python:

    data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snapp...
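pyarrow.dataset can discover hive-style partitions (the serial_number=.../cur_date=... directories) directly on S3; a minimal sketch, assuming credentials in the environment, a placeholder bucket name, and an assumed region:

    import pyarrow.dataset as ds
    from pyarrow import fs

    s3 = fs.S3FileSystem(region="us-east-1")     # region is an assumption

    dataset = ds.dataset(
        "my-bucket/data_folder",                 # bucket/prefix are placeholders
        filesystem=s3,
        format="parquet",
        partitioning="hive",                     # picks up serial_number=... / cur_date=...
    )

    # Read only one partition, e.g. serial_number == 1.
    table = dataset.to_table(filter=ds.field("serial_number") == 1)
    df = table.to_pandas()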