Tensorflow: oversampling with SMOTE giving highly skewed results
I have an imbalanced data set with 2 classes (1 and 0). Class 1 is about 6 times less likely than class 0, so I am using SMOTE to balance the data set by oversampling. However, using SMOTE gives an extremely skewed result, and I can't understand why.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle

def train_neural_network(x, y, features, labels):
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    print(len(y_train), len(y_train[y_train == 1]), len(y_train[y_train == 0]))

    # oversample the minority class in the training split only
    sm = SMOTE()
    X_train, y_train = sm.fit_sample(X_train, y_train)  # fit_resample in newer imblearn versions
    print(len(y_train), len(y_train[y_train == 1]), len(y_train[y_train == 0]))

    # neural_network_model() and batch_size are defined elsewhere (not shown)
    prediction = neural_network_model(x, len(features.columns))
    cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=prediction, labels=tf.cast(y, tf.int32)))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 1

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for i in range(int(len(X_train) / batch_size)):
                epoch_x = X_train[i * batch_size: min((i + 1) * batch_size, len(X_train))]
                epoch_y = y_train[i * batch_size: min((i + 1) * batch_size, len(y_train))]
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            print('Epoch', epoch + 1, 'completed out of', hm_epochs, 'loss:', epoch_loss)

        correct = tf.equal(tf.argmax(prediction, 1), y)
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        print('Accuracy : ', sess.run(accuracy, feed_dict={x: X_test, y: y_test}))

        # accuracy on the positive-only and negative-only test subsets
        y1 = y_test[y_test == 1]
        X1 = X_test[y_test == 1]
        print('Accuracy 1: ', sess.run(accuracy, feed_dict={x: X1, y: y1}))
        y0 = y_test[y_test == 0]
        X0 = X_test[y_test == 0]
        print('Accuracy 0: ', sess.run(accuracy, feed_dict={x: X0, y: y0}))

with open("xdf.pickle", 'rb') as f:
    features = pickle.load(f)
with open("ydf.pickle", 'rb') as f:
    labels = pickle.load(f)

x = tf.placeholder('float', [None, len(features.columns)])
y = tf.placeholder(tf.int64)

train_neural_network(x, y, features, labels)
This is the output of the print statements:
1521207 255174 1266033 // initial dataset (total points, label = 1, label = 0)
2532066 1266033 1266033 // after smote
Epoch 1 completed out of 1 loss: 345947.933431 // after 1 epoch
Accuracy : 0.168227 // test accuracy
Accuracy 1: 1.0 // output of test with all labels = 1
Accuracy 0: 3.1613e-06 // output of test with all labels = 0
When I do not oversample the dataset, I get the following result:
1521207 255174 1266033 // initial dataset (total points, label = 1, label = 0)
Epoch 1 completed out of 1 loss: 270053.921566 // after 1 epoch
Accuracy : 0.762063 // test accuracy
Accuracy 1: 0.1554 // output of test with all labels = 1
Accuracy 0: 0.883916 // output of test with all labels = 0
This is the expected output, since the dataset is skewed. Am I making a mistake in how I am using SMOTE? Why does the result become so skewed?
Solution 1:
Can you try stratified sampling here? Since you have not posted your graph code, I cannot tell what kind of task this is beyond the fact that it is classification. If the data is sequential, results like these can occur.
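As a rough sketch of the stratified-split suggestion (the toy arrays below only stand in for the question's pickled features/labels; the stratify argument is the only substantive change to the original split call):

import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced data standing in for the question's features/labels pickles
rng = np.random.RandomState(0)
features = rng.randn(1200, 4)
labels = (rng.rand(1200) < 1 / 6).astype(int)    # roughly 1:5 minority/majority mix

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,
    stratify=labels,        # keep the class proportions identical in both splits
    random_state=42)

print(y_train.mean(), y_test.mean())             # near-identical positive rates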
I see that there is some temporal structure in the data, and SMOTE will unnecessarily add unwanted synthetic points to it. Please try a weighted loss function here, where you scale down the dominant class's error by its frequency ratio relative to the other labels.
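A minimal sketch of that weighted-loss idea in the question's TF1 style; the dense layer is only a stand-in for the unposted neural_network_model(), and the 0.2/1.0 weights are illustrative values roughly matching the inverse of the ~5:1 class ratio, not tuned numbers:

import tensorflow as tf

n_features = 10                                   # placeholder feature count
x = tf.placeholder('float', [None, n_features])
y = tf.placeholder(tf.int64)

# stand-in for the question's neural_network_model(); any 2-class logits work here
logits = tf.layers.dense(x, 2)

labels_int = tf.cast(y, tf.int32)
per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=labels_int)

# class 0 is roughly 5x as frequent as class 1, so scale its error down
class_weights = tf.constant([0.2, 1.0])           # index == class label
example_weights = tf.gather(class_weights, labels_int)

cost = tf.reduce_mean(per_example_loss * example_weights)
optimizer = tf.train.AdamOptimizer().minimize(cost)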
Also, for very rare events, training several classifiers, each biased towards a different rare label, and combining them in a voting ensemble might help.
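One possible way to realize that voting idea, using the imblearn package the question already imports from (my choice of class, not something named in this answer), is BalancedBaggingClassifier: each base estimator is fit on a rebalanced bootstrap sample and predictions are combined by majority vote. Reusing the toy split from the first sketch above:

from imblearn.ensemble import BalancedBaggingClassifier

# ten base estimators, each trained on its own class-balanced resample,
# combined by majority vote at prediction time
ensemble = BalancedBaggingClassifier(n_estimators=10, random_state=42)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))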
Solution 2:
I don't know the nature of your data or your modelling approach, but I am not sure you always need oversampling for imbalanced data. Your situation may instead be a case of data drift, and then anomaly detection, as well as pretraining, can help adapt the further learning process to the new circumstances. Have a look at "Analyzing training-serving skew with TensorFlow Data Validation" in case this applies to you.
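A hedged sketch of the skew check described in that TFDV guide; the two toy DataFrames, the column name feature_a, and the threshold value are all placeholders of mine, not values from the answer:

import pandas as pd
import tensorflow_data_validation as tfdv

# toy frames standing in for the real training and serving data
train_df = pd.DataFrame({'feature_a': ['x'] * 90 + ['y'] * 10})
serving_df = pd.DataFrame({'feature_a': ['x'] * 50 + ['y'] * 50})

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
schema = tfdv.infer_schema(train_stats)

# flag feature_a if its training and serving distributions drift too far apart
tfdv.get_feature(schema, 'feature_a').skew_comparator.infinity_norm.threshold = 0.01

anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=serving_stats)
tfdv.display_anomalies(anomalies)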
I believe that blindly oversampling the minority class is not the correct way, because you are just adding random synthetic data to the minority class and thereby getting a different picture of reality. Only if your data distribution is truly Gaussian, i.e. genuinely random in nature, do you have a good reason to use SMOTE as the oversampling technique. Is that the case here? You say yourself that the data are skewed.
By the way, your SMOTE results did not give you a skew; they gave you normality (a balanced label distribution).
P.S. If you really do need to bring skewed data towards a normal distribution for analysis, the usual statistical way is to model the natural logarithm (ln) of the data rather than to oversample (real class imbalance looks different from skewed data!). At the final stage of the analysis you transform the results back from the log scale. This is a frequentist approach, though, not a Bayesian one.
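If you do go the log-transform route from the P.S., a minimal numpy sketch (log1p/expm1 are used instead of plain ln/exp only so that zeros are handled; the sample values are made up):

import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 500.0, 10000.0])   # right-skewed, non-negative values

transformed = np.log1p(skewed)        # do the analysis / modelling in log space
restored = np.expm1(transformed)      # map the results back at the final stage

print(transformed)
print(np.allclose(restored, skewed))  # True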
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow