'Using python how to take input search string from one excel sheet column and search in other excel on a specific column?

For example, take input string _TXT from col A in file 1 and search in file 2 col X. If any row contains _TXT, then for that specific row compare col B value to col Y value in file 1 to file 2.

If col B and col Y values matches, take no action. If they are not matched, update col Y in file 2 with same value as col B in file 1

excel image click here



Solution 1:[1]

A simple (though perhaps not the most efficient) way to do this would be to follow this algorithm :

  • load both excel files into python memory ;
  • for each row in input file, take inputString from colA and look for it in output dataFrame, perform operations as needed ;
  • write back second file to disk.

Pandas offer a function to read an excel file (read_excel). There are quite a lot of options, documented in the link : I think the ones that would be useful to you are :

  • sheet_name : defaults to 0 (first sheet), you can also use its full name (as a string) or any number (N would be N+1th sheet). None means "all_sheets".
  • usecols : defaults to None (all columns), if you only need colA and colB you may want to specify that here (using something like "A:B") ;

The function to write back your pandas dataframe to excel is a method of the class DataFrame (full documentation here) called to_excel. Useful options include :

  • excel_writer : it can be a file path or an ExcelWriter object, which is a pandas class that writes dataframes to excel. Using an ExcelWriter allows you to control more precisely what you write and might be necessary if you have to write several sheets, or change your file in place ;
  • sheet_name : defaults to "Sheet1", does not seem to accept integers for sheet numbers ;
  • index : default to True, writes row index as well as other data - we don't want that, so it will be necessary to put it to false.

Your final code might look something like this :

from pandas import *

# read both excel files ; assume only one sheet in file to modify
# taking only useful columns in reference file according to your example
dfIn = read_excel("path/to/refFile.xlsx", usecols="A:B")
dfOut = read_excel("path/to/outFile.xlsx", usecols=None, sheet_name="data_sheet")

for index, rowIn in dfIn.iterrows():
    inputString = rowIn['colA']
    for index, rowOut in dfOut.iterrows():
        # using python string endswith as matching rule
        # replace with anything that suits your needs
        if rowOut['colX'].endswith(inputString):
            rowOut['colY'] = rowIn['colB']
# write dfOut to disk
with ExcelWriter("path/to/outFile.xlsx", mode="a", if_sheet_exists="replace") as writer:
    dfOut.to_excel(writer, sheet_name="data_sheet", index=False)

Admittedly, the pandas documentation warns against modifying something you are iterating over (because iterrows might return a copy of the data instead of a view, and then changing the copy will have no effect). Since you are using strings here, the modifications will work.

Depending on your excel engine and its version (it worked for me with python 3.8.10 and openpyxl 3.0.9, but failed for OP) replacing the sheet might fail. If that is the case, this related question suggests completely deleting the old sheet and making a new one, like this :

with ExcelWriter('/path/to/file.xlsx',engine = "openpyxl",  mode='a') as writer:
 workBook = writer.book
 # data_sheet exists for sure, since we read data from it at beginning of script
 workBook.remove(['data_sheet'])
 df.to_excel(writer, sheet_name='data_sheet', index=False)
 writer.save()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1