Join two pyarrow tables

I have ORC files with data like the following.

Table A:

Name    age     school      address      phone
tony    12      havard      UUU          666
tommy   13      abc         Null         Null
john    14      cde         Null         Null
john    14      cde         Null         Null

Table B:

Name    address     phone
tommy   USD         345
john    ASA         444

Expected table after join:

Name    age     school      address      phone
tony    12      havard      UUU          666
tommy   13      abc         USD          345
john    14      cde         ASA          444
john    14      cde         ASA          444

How can I do this with pyarrow or pandas? Name in Table A is not unique; Name in Table B is unique.



Solution 1:[1]

Try this:

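# Index both frames by 'Name' so that update() matches rows by name
dfA.set_index('Name', inplace=True)
# Overwrite dfA's shared columns with the non-null values from dfB
dfA.update(dfB.set_index('Name'))
# reset_index() returns a new frame, so assign the result back
dfA = dfA.reset_index()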

Note: the 'Name' column should have unique values, as mentioned by @Antti Haapala.

When A and B have different 'address' and 'phone' values for the same 'Name', table A's values are overwritten with the values from table B.
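If the duplicate names in table A make `update` awkward, a left merge followed by a per-column fill is another way to get the expected table. This is only a sketch built on the question's data; `df_a`, `df_b` and the `_b` suffix are names picked for illustration, not anything from the answer above.

```python
import pandas as pd

# Data from the question; table A contains the duplicated "john" row.
df_a = pd.DataFrame({
    "Name": ["tony", "tommy", "john", "john"],
    "age": [12, 13, 14, 14],
    "school": ["havard", "abc", "cde", "cde"],
    "address": ["UUU", None, None, None],
    "phone": [666, None, None, None],
})
df_b = pd.DataFrame({
    "Name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444],
})

# A left merge keeps every row of df_a, including the duplicates.
# Columns coming from df_b that collide get the "_b" suffix.
merged = df_a.merge(df_b, on="Name", how="left", suffixes=("", "_b"))

# For each shared column, prefer df_b's value and fall back to df_a's.
for col in ["address", "phone"]:
    merged[col] = merged[col + "_b"].fillna(merged[col])

# Drop the helper columns so only the original layout remains.
result = merged.drop(columns=["address_b", "phone_b"])
print(result)
```

The phone column comes back as float because of the missing values along the way; cast it afterwards if you need integers.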

Solution 2:[2]

In pyarrow, starting with 8.0.0, you can do this with a combination of join and coalesce.

import pyarrow as pa
import pyarrow.compute as pc

table_a = pa.Table.from_pydict({
    "name": ["tony", "tommy", "john"],
    "age": [12, 13, 14],
    "school": ["havard", "abc", "cde"],
    "address": ["UUU", None, None],
    "phone": [666, None, None]
})

table_b = pa.Table.from_pydict({
    "name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444]
})

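# Left outer join (the default join type); colliding columns from table_b get the '_r' suffix
combined = table_a.join(table_b, 'name', right_suffix='_r')

# Prefer table_b's value and fall back to table_a's when table_b has none
coalesced_addrs = pc.coalesce(combined.column('address_r'), combined.column('address'))
coalesced_phone = pc.coalesce(combined.column('phone_r'), combined.column('phone'))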

result = pa.Table.from_pydict({
    'name': combined.column('name'),
    'age': combined.column('age'),
    'school': combined.column('school'),
    'address': coalesced_addrs,
    'phone': coalesced_phone
})

print(result)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2