How do I filter outliers in test data based on z-scores of train data?
I have a train and a test dataset. On the train dataset I detected and deleted outliers, i.e. values lying more than 5 standard deviations from the mean. If the absolute z-score of a value is 5 or larger, the value is quite unusual, so I drop its row from the dataset.
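import numpy as np
import scipy.stats as stats

# Per-column z-scores, computed from the train data itself
z_scores = train_df.apply(stats.zscore)
abs_z_scores = np.abs(z_scores)
# Keep only rows where every column is within 5 standard deviations
filtered_entries = (abs_z_scores < 5).all(axis=1)
train_df = train_df[filtered_entries]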
Now I want to use the same train-set statistics to remove outliers from the test set. (I don't want to compute the z-scores from the test dataset itself!) One idea is to store the mean and standard deviation of each column of the train data and compute the z-scores of the test data from them, e.g.
z = (X_test − μ_train) / σ_train
But I do not have a concrete idea of how to implement this. Could someone give me some advice?
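The idea described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper names `fit_zscore_stats` and `filter_by_train_zscore` and the toy data are hypothetical, and it assumes all columns are numeric with non-zero variance. It also assumes the mean and standard deviation are taken from the train data before any rows are removed, so that they match the statistics used for the train filter. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, while pandas' `std()` defaults to `ddof=1`, so `ddof=0` is passed explicitly to keep the two consistent.

```python
import numpy as np
import pandas as pd


def fit_zscore_stats(train_df):
    """Return per-column mean and population std (ddof=0, as in scipy.stats.zscore)."""
    return train_df.mean(), train_df.std(ddof=0)


def filter_by_train_zscore(df, mean, std, threshold=5.0):
    """Keep rows whose absolute z-score, computed with the given
    (train) statistics, is below `threshold` in every column."""
    z = (df - mean) / std  # aligns on column names
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep]


# Hypothetical toy data, for illustration only
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "a": rng.normal(0, 1, 1000),
    "b": rng.normal(10, 2, 1000),
})
test_df = pd.DataFrame({
    "a": [0.1, 50.0, -0.3],   # 50.0 is far outside the train distribution
    "b": [10.5, 9.0, 11.0],
})

# Fit the statistics once on the (unfiltered) train data ...
mean, std = fit_zscore_stats(train_df)

# ... then apply the same statistics to both datasets
train_clean = filter_by_train_zscore(train_df, mean, std)
test_clean = filter_by_train_zscore(test_df, mean, std)
```

Because the statistics are stored in `mean` and `std`, the test set is never used to estimate them, which is the point of the question. If the test frame has columns that are missing from the train frame, the subtraction produces NaN for those columns and the affected rows are dropped, so the two frames should share the same columns.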
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow