One-hot encoder: what is the industry norm, to encode before the train/test split or after?

I know some people have already answered this, but I'm still trying to get it straight.

I'm still a little confused about the one-hot encoder. I was thinking: if we encode before splitting, there shouldn't be any information leakage into the test set. So why do people advocate doing the encoding after? Isn't the one-hot encoder just used to convert categorical variables into binary columns?

And if we encode after splitting, the results can vary quite significantly, as was pointed out here: Scikit-Learn One-hot-encode before or after train/test split

I'm just wondering what is the industry norm.

Thanks



Solution 1:[1]

Specifically for the One-Hot-Encoder, it should not make much difference, except when there are categories that are not represented in a split.

But in exactly that case, there is information leakage. By splitting into training and test data, you are trying to simulate how well your model (and that includes all feature selection and transformation!) generalizes. If there are categories present in the test set but not in the training set, then there can just as well be categories in the real world that your whole dataset does not contain. In that case you are fooling yourself if you encode before splitting.

There are cases where you would want to encode before, though. If you have few data points and are sampling to get balanced splits, you might want to ensure that each split contains all the categories. In such cases it can be useful to encode before splitting.

In general, always keep in mind that feature selection and transformation are part of your model. One-hot encoding in particular depends on the data, so that applies even more.
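
For what it's worth, the usual way to encode after the split with scikit-learn is to fit the encoder on the training data only and reuse it on the test data. A minimal sketch of that pattern (the "city" column and its values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data: the "city" column is made up for illustration.
df = pd.DataFrame({"city": ["london", "paris", "tokyo", "paris", "oslo", "tokyo"]})

train, test = train_test_split(df, test_size=0.33, random_state=0)

# handle_unknown="ignore" maps categories not seen during fit to an all-zero row,
# which mirrors what happens when the model meets a brand-new category later on.
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train[["city"]]).toarray()  # fit on the training data only
X_test = enc.transform(test[["city"]]).toarray()        # transform, never re-fit

print(enc.categories_)  # categories learned from the training split alone
```

This keeps the transformation itself inside the "model" boundary: whatever the test set contains, the encoder only ever knows the categories seen during training.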

Solution 2:[2]

One-hot encoding is a technique for representing the class of a data item. It is a replacement for integer encoding, where you simply assign each class an integer. A simple example: say we have three classes: Cat, Dog, Human.

In integer encoding we would represent the classes as (say):
Cat - 1, Dog - 2, Human - 3
In one-hot encoding, we would represent the classes as:
Cat - [1,0,0], Dog - [0,1,0], Human - [0,0,1]
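
A toy sketch of the same two mappings in code (plain Python/NumPy, just to make the idea concrete):

```python
import numpy as np

classes = ["Cat", "Dog", "Human"]

# Integer encoding: each class is replaced by an arbitrary integer.
integer_encoded = {label: i + 1 for i, label in enumerate(classes)}
print(integer_encoded)  # {'Cat': 1, 'Dog': 2, 'Human': 3}

# One-hot encoding: each class becomes a binary vector with a single 1.
one_hot = {label: np.eye(len(classes), dtype=int)[i].tolist()
           for i, label in enumerate(classes)}
print(one_hot)          # {'Cat': [1, 0, 0], 'Dog': [0, 1, 0], 'Human': [0, 0, 1]}
```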

So, as you can see, one-hot encoding works only for categorical data.

Hence the whole dataset has to be labeled in a homogeneous manner, and the one-hot encoding has to be performed even before the train/test split.
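
A minimal sketch of the approach this answer describes, encoding the full dataset before splitting (here with pandas' get_dummies; the "animal" feature and target "y" are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: "animal" feature and target "y" are made up for illustration.
df = pd.DataFrame({
    "animal": ["Cat", "Dog", "Human", "Dog", "Cat", "Human"],
    "y":      [0, 1, 1, 1, 0, 1],
})

# Encode the whole dataset first, so every category is represented in the columns.
encoded = pd.get_dummies(df, columns=["animal"])

# Split afterwards; train and test are guaranteed to share the same column layout.
train, test = train_test_split(encoded, test_size=0.33, random_state=0)
print(train.columns.tolist())
# ['y', 'animal_Cat', 'animal_Dog', 'animal_Human']
```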

Solution 3:[3]

I come to the same conclusion as @em_bis_me. I think most people do it one way simply because they saw it in a notebook where somebody else did it and they copied and pasted it. (Kaggle is the best community to see that; a ton of people just copy and paste the work of others without stopping to consider whether it is right or wrong.)

Here you can see an example from Kaggle where they do the encoding after the split:

https://www.kaggle.com/code/prashant111/logistic-regression-classifier-tutorial/notebook

Here you have the same dataset with the encoding before the split:

https://github.com/Enrique1987/machine_learning/blob/master/1_Classification_algorithms/01_Logistic_Regresion_Australian_Weather.ipynb

Of course: Same results.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 kutschkem
Solution 2 em_bis_me
Solution 3 Enrique Benito Casado