Join two pyarrow tables
I have ORC data like the following.
Table A:
Name age school address phone
tony 12 havard UUU 666
tommy 13 abc Null Null
john 14 cde Null Null
john 14 cde Null Null
Table B:
Name address phone
tommy USD 345
john ASA 444
Expected table after join:
Name age school address phone
tony 12 havard UUU 666
tommy 13 abc USD 345
john 14 cde ASA 444
john 14 cde ASA 444
How can I do this with pyarrow or pandas? Name in table A is not unique; Name in table B is unique.
Solution 1:[1]
Try this:
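# Index both frames by 'Name' so that update() aligns rows on it
dfA.set_index('Name', inplace=True)
dfA.update(dfB.set_index('Name'))
# reset_index() returns a new DataFrame, so assign the result back
dfA = dfA.reset_index()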
Note: the 'Name' column should have unique values, as mentioned by @Antti Haapala.
When A and B have different values of 'address' and 'phone' for the same 'Name', table A's values will be overwritten by the values from table B.
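Since Name in table A is not unique, another option is a left merge followed by filling in the values from B. The following is only a sketch under some assumptions: the ORC data has already been loaded into DataFrames called dfA and dfB as above, B's value should win whenever it is present (the same precedence as update), and the helper name fill_from and the "_b" suffix are made up for illustration.

```python
import pandas as pd

dfA = pd.DataFrame({
    "Name": ["tony", "tommy", "john", "john"],
    "age": [12, 13, 14, 14],
    "school": ["havard", "abc", "cde", "cde"],
    "address": ["UUU", None, None, None],
    "phone": [666, None, None, None],
})
dfB = pd.DataFrame({
    "Name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444],
})


def fill_from(left, right, key="Name"):
    """Hypothetical helper: left-join on `key`, preferring right's values."""
    # Columns present in both frames (besides the key) are the ones to fill
    shared = [c for c in right.columns if c != key and c in left.columns]
    # validate= raises if `key` is not unique in `right`
    merged = left.merge(
        right, on=key, how="left", suffixes=("", "_b"), validate="many_to_one"
    )
    for col in shared:
        # Take right's value where it exists, otherwise keep left's
        merged[col] = merged[f"{col}_b"].combine_first(merged[col])
    return merged.drop(columns=[f"{c}_b" for c in shared])


result = fill_from(dfA, dfB)
print(result)
```

A left merge keeps every row of dfA, including both john rows, and keeps them in dfA's order. Note that phone becomes a float column here because of the missing values in dfA.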
Solution 2:[2]
In pyarrow, starting with 8.0.0, you can do this with a combination of join and coalesce.
import pyarrow as pa
import pyarrow.compute as pc

table_a = pa.Table.from_pydict({
    "name": ["tony", "tommy", "john"],
    "age": [12, 13, 14],
    "school": ["havard", "abc", "cde"],
    "address": ["UUU", None, None],
    "phone": [666, None, None]
})
table_b = pa.Table.from_pydict({
    "name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444]
})

# Left outer join (the default); colliding columns from table_b get the '_r' suffix
combined = table_a.join(table_b, 'name', right_suffix='_r')

# Prefer table_b's value when it is not null, otherwise keep table_a's
coalesced_addrs = pc.coalesce(combined.column('address_r'), combined.column('address'))
coalesced_phone = pc.coalesce(combined.column('phone_r'), combined.column('phone'))

# Rebuild the table with the original column names
result = pa.Table.from_pydict({
    'name': combined.column('name'),
    'age': combined.column('age'),
    'school': combined.column('school'),
    'address': coalesced_addrs,
    'phone': coalesced_phone
})
print(result)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |