'Comparing 2 columns with different rows in different csv files, and output status to another csv file

I have 2 csv files as shown below. They contain different numbers of rows and the columns are not aligned/sorted along a common index. I need to compare the column fname on both files, and if they are same, have to check if place is also the same or not, and output status true or false to another file3.csv with columns fname, place, status.

file1.csv:

fname.   age.  place 

Andy     16    a
kate.    15.   k
brian.   17.   b

file2.csv:

fname.   age.  place

brian.  17.   c
kate.   15.   k
andy.   16.   a
laura.  18.   l

output file3.csv should be like:

Andy.   a.   true
Kate.   k.   true
brian.  b.   false

I have seen quite number of queries similar to this, but they either have the same number of rows, or they are sorted and aligned along a common index.

I have tried using pandas, but that did not work, as the rows are different.



Solution 1:[1]

Here is an approach that will work using pandas:

import pandas as pd

Read the files into panda dataframes

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv)

Create a common key for each dataframe

df1['lk'] = df1.apply(lambda row: row['fname'].lower(), axis= 1)  
df2['rk'] = df2.apply(lambda row: row['fname'].lower(), axis= 1)  

Merge the frames into a third df, and add flag column

df3 = df1.merge(df2, left_on='lk', right_on='rk')
df3['flag'] = df3.apply(lambda row: row['place_x'] == row['place_y'], axis = 1)

Rename the fname_x and place_X columns

df3.rename(columns={'fname_x':'fname', 'place_x':'place'}, inplace=True)

Create the output csv file

df3[['fname', 'place', 'flag']].to_csv('file3', index=False)
 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 itprorh66