'How to apply Target Encoding in test dataset?
I am working on a project, where I had to apply target encoding for 3 categorical variables:
merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform(np.mean)
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform(np.mean)
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform(np.mean)
I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).
How can I transfer my encoding from training sample to the test? I would greatly appreciate any help.
P.S. I hope it makes sense.
Solution 1:[1]
You need to same the mapping between the feature and the mean value, if you want to apply it to the test dataset.
Here is a possible solution:
species_encoding = df.groupby(['Species'])['WnvPresent'].mean().to_dict()
block_encoding = df.groupby(['Block'])['WnvPresent'].mean().to_dict()
trap_encoding = df.groupby(['Trap'])['WnvPresent'].mean().to_dict()
merged_data['SpeciesEncoded'] = df['Species'].map(species_encoding)
merged_data['BlockEncoded'] = df['Block'].map(species_encoding)
merged_data['TrapEncoded'] = df['Trap'].map(species_encoding)
test_data['SpeciesEncoded'] = df['Species'].map(species_encoding)
test_data['BlockEncoded'] = df['Block'].map(species_encoding)
test_data['TrapEncoded'] = df['Trap'].map(species_encoding)
This would answer your question, but I want to add, that this approach can be improved. Directly using mean values of targets could make the models overfit on the data.
There are many approaches to improve target encoding, one of them is smoothing, here is a link to an example: https://maxhalford.github.io/blog/target-encoding/
Here is an example:
m = 10
mean = df['WnvPresent'].mean()
# Compute the number of values and the mean of each group
agg = df.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
Solution 2:[2]
There are 2 open source Python libraries that offer this functionality off-the-shelf: Feature-engine and Category encoders.
Assuming that we have a train and a testing set...
With Feature engine it would work as follows:
from feature_engine.encoding import MeanEncoder
# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
We find the replacement values in the encoding_dict_ attribute as follows:
encoder.encoding_dict_
With category encoders it works as follows:
from category_encoders.target_encoder import TargetEncoder
# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
The replacement values can be found in the attribute mapping:
encoder.mapping
More details in the respective documentation:
Category encoders' TargetEncoder also offers smoothing as suggested by @andrey-lukyanenko out-of-the-box.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Andrey Lukyanenko |
Solution 2 | Sole Galli |