When predicting, should we scale unseen inputs and un-scale the outputs of a model?

I am new to Machine Learning, and I followed this tutorial to implement an LSTM model in Keras/TensorFlow: https://www.tensorflow.org/tutorials/structured_data/time_series

In the tutorial the training/validation/testing datasets are normalized as follows:

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std

I have exported my model in HDF5 format and use Keras to load it and run predictions. The input time series I get is in a Pandas DataFrame df.

import numpy as np
from tensorflow.keras.models import load_model

model = load_model(model_filename)
prediction_values = model.predict(df.to_numpy()[np.newaxis, 0:24])  # batch dim + first 24 time steps

Reading articles on the internet, it is unclear to me whether I should scale my input data the same way it was scaled during training, and/or un-scale the outputs after prediction. Some articles mention that scaling should be done before prediction, and others that it should be done after.

I tried to create these two functions:

import pandas as pd

def scale_data(df):
    df = (df - pd.Series(historical_data_mean)) / pd.Series(historical_data_standard_deviation)
    return df

def unscale_data(df):
    df = df * pd.Series(historical_data_standard_deviation) + pd.Series(historical_data_mean)
    return df

And ran them like this:

unseen_inputs_df   # A new time series that will be used as input to the model

scaled_input = scale_data(unseen_inputs_df)  # Should I scale my 'unseen' inputs here?

prediction_values = model.predict(scaled_input.to_numpy()[np.newaxis, 0:24])
prediction_df = pd.DataFrame(data=prediction_values[0])

unscaled_output = unscale_data(prediction_df)    # Should I un-scale the model's output here?

However, it returned totally wrong outputs.

Do you have any clue about the correct way to proceed?



Solution 1:[1]

Preprocessing, and what to keep in mind when taking it to inference

Whatever preprocessing you apply during training, in your case:

train_df = (train_df - train_mean) / train_std

The same transformation must also be applied to your data at inference time, just as you already did with your validation and test data. Intuitively, this ensures that your input data is mapped into the same feature space the model was trained on, so that it is digestible by the model.

Because you used the training data's statistics (its mean and standard deviation) as the prior for scaling, you have to use those same training statistics to scale your inputs at inference time, and to un-scale the model's outputs afterwards.
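
For concreteness, here is a minimal sketch of the full round trip: scale the unseen inputs with the training statistics, predict, then invert the scaling on the outputs. The names train_mean, train_std, and the 24-step window come from the question; the pickle files and the assumption that the model predicts the same set of features as its inputs are illustrative:

import numpy as np
import pandas as pd
from tensorflow.keras.models import load_model

# Statistics computed on the *training* data, saved at training time
# (hypothetical storage: pickled pandas Series indexed by column name).
train_mean = pd.read_pickle("train_mean.pkl")
train_std = pd.read_pickle("train_std.pkl")

model = load_model("model.h5")

# 1. Scale the unseen inputs with the TRAINING statistics.
scaled_input = (unseen_inputs_df - train_mean) / train_std

# 2. Predict on a single 24-step window (batch dimension first).
window = scaled_input.to_numpy()[np.newaxis, 0:24]
prediction_values = model.predict(window)

# 3. Un-scale the outputs with the SAME training statistics
#    (assuming the model predicts the same features as its inputs).
prediction_df = pd.DataFrame(prediction_values[0], columns=unseen_inputs_df.columns)
unscaled_output = prediction_df * train_std + train_mean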

Note: this is also how it is done in the tutorial mentioned in the comments: "Note that we fit the scaler only on the training data."
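
The same "fit only on the training data" pattern can also be expressed with scikit-learn's StandardScaler, which stores the training statistics for you and provides inverse_transform to un-scale predictions. A sketch under the same assumptions as above, not the tutorial's exact code:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_df)                    # fit ONLY on the training data

train_scaled = scaler.transform(train_df)
val_scaled = scaler.transform(val_df)   # validation/test reuse the training statistics
test_scaled = scaler.transform(test_df)

# At inference time: scale in, predict, un-scale out.
scaled_input = scaler.transform(unseen_inputs_df)
prediction = model.predict(scaled_input[np.newaxis, 0:24])
unscaled_output = scaler.inverse_transform(prediction[0])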

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: mrk