'Encoding each value in a pandas cell
I have a dataset
Inp1 Inp2 Inp3 Output
A,B,C AI,UI,JI Apple,Bat,Dog Animals
L,M,N LI,DO,LI Lawn, Moon, Noon Noun
X,Y AI,UI Yemen,Zombie Extras
For these values, I need to apply a ML algorithm. Hence need an encoding technique.
Tried a Label encoding technique, it encodes the entire cell to an int
for eg.
Inp1 Inp2 Inp3 Output
5 4 8 0
But I need a separate encoding for each value in a cell. How should I go about it.
Inp1 Inp2 Inp3 Output
7,44,87 4,65,2 47,36,20 45
Integers are random here.
Solution 1:[1]
Assuming each cell is a list (as you have multiple strings stored in each), and that you are not looking for a specific encoding. The following should work. It can also be adjusted to suit different encodings.
import pandas as pd
A = [["Inp1", "Inp2", "Inp3", "Output"],
[["A","B","C"], ["AI","UI","JI"],["Apple","Bat","Dog"],["Animals"]],
[["L","M","N"], ["LI","DO","LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]
dataframe = pd.DataFrame(A[1:], columns=A[0])
def my_encoding(row):
encoded_row = []
for ls in row:
encoded_ls = []
for s in ls:
sbytes = s.encode('utf-8')
sint = int.from_bytes(sbytes, 'little')
encoded_ls.append(sint)
encoded_row.append(encoded_ls)
return encoded_row
print(dataframe.apply(my_encoding))
output:
Inp1 ... Output
0 [65, 66, 67] ... [32488788024979009]
1 [76, 77, 78] ... [1853189966]
if my assumptions are incorrect or this is not what you're looking for let me know.
Solution 2:[2]
As you mentioned, you are going to apply ML algorithm (say classification), I think One Hot Encoding is what you are looking for.
Requested format:
Inp1 Inp2 Inp3 Output
7,44,87 4,65,2 47,36,20 45
This format can't help you to train your model as multiple labels in a single cell. However you have to pre-process again like OHE.
Suggesting format:
A B C L M N X Y AI DO JI LI UI Apple Bat Dog Lawn Moon Noon Yemen Zombie
1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0
0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1
Hereafter you can label encode / ohe the output field as per your model requires.
Happy learning !
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | thomas |
Solution 2 | Bhanuchander Udhayakumar |