How to update data in pyarrow table?
I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this:
for col_name in table2.column_names:
    if col_name in my_columns:
        print('updating values in column ' + col_name)
        col_data = pa.Table.column(table2, col_name)
        row_ct = 1
        for i in col_data:
            pa.Table.column(table2, col_name)[row_ct] = change_str(pa.StringScalar.as_py(i))
            row_ct += 1
I get this error:
TypeError: 'pyarrow.lib.ChunkedArray' object does not support item assignment
How can I update these values?
I tried using pandas, but it couldn't handle null values in the original table, and it also incorrectly translated the datatypes of the columns in the original table. Does pyarrow have a native way to edit the data?
Solution 1:[1]
Arrow tables (and arrays) are immutable, so you won't be able to update your table in place.
The way to achieve this is to create a copy of the data when modifying it. Arrow supports some basic operations to modify strings, but they are very limited.
Another option is to use pandas, but as you've noticed, going from arrow to pandas and back isn't seamless.
Let's take an example:
>>> table = pa.Table.from_arrays(
...     [
...         pa.array(['abc', 'def'], pa.string()),
...         pa.array([1, None], pa.int32()),
...     ],
...     schema=pa.schema(
...         [
...             pa.field('str_col', pa.string()),
...             pa.field('int_col', pa.int32()),
...         ]
...     )
... )
>>> from_pandas = pa.Table.from_pandas(table.to_pandas())
>>> from_pandas.schema
str_col: string
int_col: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 487
You can see that converting to pandas and back has changed the type of the int column to double. This is because pandas doesn't handle null integer values well, so it promotes such columns to double.
To avoid this issue, I'd suggest working column by column, only converting the string columns to pandas:
def my_func(value):
    return 'hello ' + value + '!'

columns = []
my_columns = ['str_col']
for column_name in table.column_names:
    column_data = table[column_name]
    if column_name in my_columns:
        # only this column round-trips through pandas
        column_data = pa.array(column_data.to_pandas().apply(my_func))
    columns.append(column_data)

updated_table = pa.Table.from_arrays(
    columns,
    schema=table.schema
)
>>> updated_table['str_col']
<pyarrow.lib.ChunkedArray object at 0x7f05f42b3f40>
[
[
"hello abc!",
"hello def!"
]
]
Solution 2:[2]
The native way to update array data in pyarrow is via pyarrow compute functions. Converting to pandas, as you described, is also a valid way to achieve this, so you may want to work out those conversion issues. Either way, the API is not going to match the approach you have.
You currently decide, in a Python function change_str, what the new value of each item should be. Hopefully it is possible to express the manipulation you need to perform as a composite of pyarrow compute functions. This will avoid the (expensive) cost of marshalling the entire native array into Python objects. If you can describe what you are trying to achieve in change_str (probably in a new question) I can help figure it out.
If, for some reason, you must keep change_str in Python, then you will need to convert the entire column to Python objects (which carries a pretty hefty performance penalty) using ChunkedArray.to_pylist().
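For example, if change_str just wraps each value in fixed text (like my_func in Solution 1), that can be expressed with compute kernels alone. A minimal sketch, assuming a string column named str_col (pc.binary_join_element_wise is available in pyarrow 4.0+):

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative table; substitute the one you read from parquet.
table = pa.table({'str_col': pa.array(['abc', 'def', None], pa.string())})

# Concatenate element-wise entirely inside Arrow; nulls stay null.
# The last argument is the separator placed between the joined pieces.
new_col = pc.binary_join_element_wise(
    pa.scalar('hello '), table['str_col'], pa.scalar('!'), pa.scalar(''))

# Replace the column by position, producing a new (immutable) table.
table = table.set_column(0, 'str_col', new_col)

# Fallback if change_str must stay a Python function: marshal the whole
# column out and back, with the performance penalty noted above.
# new_col = pa.array([change_str(v) for v in table['str_col'].to_pylist()])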
Solution 3:[3]
I was able to get it working using these references:
http://arrow.apache.org/docs/python/generated/pyarrow.Table.html
http://arrow.apache.org/docs/python/generated/pyarrow.Field.html
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
Basically, it loops through the original table and creates new columns (pa.array) with the adjusted text, which it appends to a new table. It's probably not the best way to do it, but it worked. Most importantly, it let me preserve the nulls and specify the data type of each column.
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import pyarrow.csv as pcsv
import numpy as np
#import pandas as pd
import pyarrow as pa
import os.path

<a lot of other code here>

parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)

<a lot of other code here>

changed_ct = 0
all_cols_ct = 0
table3 = pa.Table.from_arrays([pa.array(range(0, table2.num_rows))], names=['0'])  # CREATE TEMP COLUMN!!
#print(table3)
#exit()
changed_column_list = []
for col_name in table2.column_names:
    print('processing column: ' + col_name)
    new_list = []
    col_data = table2.column(col_name)
    col_data_type = table2.schema.field(col_name).type
    printed_changed_flag = False
    for i in col_data:
        # GET STRING REPRESENTATION OF THE COLUMN DATA
        # (note: only string/int32/int64 columns are handled here)
        if col_data_type == 'string':
            col_str = pa.StringScalar.as_py(i)
        elif col_data_type == 'int32':
            col_str = pa.Int32Scalar.as_py(i)
        elif col_data_type == 'int64':
            col_str = pa.Int64Scalar.as_py(i)
        if col_name in change_columns:
            if printed_changed_flag == False:
                print('changing values in column ' + col_name)
                changed_column_list.append(col_name)
                changed_ct += 1
                printed_changed_flag = True
            new_list.append(change_str(col_str))
        else:
            new_list.append(col_str)
    # set the data type for the new column
    if col_data_type == 'string':
        col_data_type = pa.string()
    elif col_data_type == 'int32':
        col_data_type = pa.int32()
    elif col_data_type == 'int64':
        col_data_type = pa.int64()
    arr = pa.array(new_list, type=col_data_type)
    new_field = pa.field(col_name, col_data_type)
    table3 = table3.append_column(new_field, arr)
    all_cols_ct += 1
#for i in table3:
#    print(i)
table3 = table3.remove_column(0)  # REMOVE TEMP COLUMN!!
#print(table2)
#print('-------------------')
#print(table3)
#exit()
print('changed ' + str(changed_ct) + ' columns:')
print(*changed_column_list, sep='\n')

# WRITE NEW PARQUET FILE
pq.write_table(table3, out_file)
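For reference, a more compact sketch of the same idea, assuming the same change_str, change_columns, in_file, and out_file as above. It avoids the temporary column by collecting the rebuilt arrays first and constructing the new table in a single call, and it preserves nulls and column types by reusing the original schema:

import pyarrow as pa
import pyarrow.parquet as pq

table2 = pq.read_table(in_file)

arrays = []
for col_name in table2.column_names:
    # to_pylist() preserves nulls as None, whatever the column type
    values = table2[col_name].to_pylist()
    if col_name in change_columns:
        values = [change_str(v) for v in values]
    # reuse the original field type so the schema is preserved exactly
    arrays.append(pa.array(values, type=table2.schema.field(col_name).type))

table3 = pa.Table.from_arrays(arrays, schema=table2.schema)
pq.write_table(table3, out_file)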
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 0x26res |
| Solution 2 | Pace |
| Solution 3 | |