Save, Recover, and Continue Updating Learning Curves While Training a CNN if the Server Crashes Suddenly

I am training a deep learning model with TensorFlow on a remote server. The problem is that I am only allocated 2 hours of training at a time, and the server may crash at any point for various reasons.

I know that training my model will take at least 48 hours to complete. Once the model is fully trained (48+ hours), I would like to be able to display the training curves from start to finish, with no breaks in between.

I am able to pick up wherever the training was when it last crashed by using callbacks (saving the best weights), but I am unsure how to achieve the same for the training curves (loss + accuracy).
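For reference, here is a minimal sketch of the checkpoint setup I mean (the tiny model and the file path are placeholders, not my real code):

    import os
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])  # stand-in model
    model.compile(optimizer='adam', loss='mse')

    ckpt_path = 'best_weights.h5'  # placeholder path
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        ckpt_path, save_best_only=True, save_weights_only=True)

    # After a crash, reload the last saved weights before calling fit() again.
    if os.path.exists(ckpt_path):
        model.load_weights(ckpt_path)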

Thank you very much for your help.



Solution 1:[1]

TensorBoard keeps its event files on disk automatically (you can also reset them), but you can just as well log these values yourself to a plain-text file, or record them with tf.summary.
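For example, here is a minimal sketch of such file-based logging that survives restarts, using the built-in tf.keras.callbacks.CSVLogger with append=True (the model, data, and file name below are placeholders):

    import tensorflow as tf

    # Dummy data and model, just to make the sketch runnable.
    x = tf.random.normal((64, 8))
    y = tf.random.uniform((64,), maxval=2, dtype=tf.int32)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # append=True keeps adding rows to the same file after every restart,
    # so the complete 48-hour curve can be rebuilt from one CSV.
    csv_logger = tf.keras.callbacks.CSVLogger('training_log.csv', append=True)
    model.fit(x, y, epochs=5, callbacks=[csv_logger])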

  1. history = model.fit(batched_features, epochs=1, validation_data=batched_features, callbacks=[custom_callback, tb_callback]). You can then plot easily with matplotlib from history.history['loss'] and history.history['accuracy'], or from the common summaries.

  2. You can also do this with a merged summary, or by writing a loss summary directly through tf.summary.

  3. You can access the history from the value fit returns, or from inside a callback at the end of each epoch:

    import matplotlib.pyplot as plt
    import tensorflow as tf

    history = model_highscores.fit(batched_features, epochs=1000,
                                   validation_data=dataset.shuffle(len(list_image)),
                                   callbacks=[custom_callback])
    print(history.params)                   # e.g. {'verbose': 1, 'epochs': 1000, 'steps': 2}
    print(history.history.keys())           # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

    # Plot the per-epoch loss and accuracy curves.
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['accuracy'], label='accuracy')
    plt.legend()
    plt.show()

    # In TF2, scalar summaries must be written inside a file writer context.
    writer = tf.summary.create_file_writer('logs')
    with writer.as_default():
        tf.summary.scalar('loss', 0.5, step=0)
    
    1. callback

    import numpy as np
    import tensorflow as tf

    class CustomCallback(tf.keras.callbacks.Callback):
        _writers = {}
        val_dir = 'logs/val'  # where validation summaries are written

        def _val_writer(self):
            if 'val' not in self._writers:
                self._writers['val'] = tf.summary.create_file_writer(self.val_dir)
            return self._writers['val']

        def on_epoch_end(self, epoch, logs=None):
            print(self.model.inputs)
            feature_extractor = tf.keras.Model(inputs=self.model.inputs,
                                               outputs=[layer.output for layer in self.model.layers])
            x = tf.ones((1, 32, 32, 3))  # one dummy 32x32 RGB input
            print([np.asarray(t) for t in feature_extractor(x)])
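One detail worth adding (my own suggestion, not part of the original answer): when you restart fit() after a crash, pass initial_epoch so that the epoch numbering in the history and in TensorBoard continues where the previous run stopped. A minimal sketch with placeholder data:

    import tensorflow as tf

    x = tf.random.normal((32, 4))
    y = tf.random.normal((32, 1))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # Hypothetical epoch count recovered from the saved logs after a crash.
    last_epoch = 37
    # Runs epochs 38..40, so the logged curves line up with earlier runs.
    model.fit(x, y, epochs=40, initial_epoch=last_epoch)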
    

[Figure: simple plots of the resulting loss and accuracy curves]
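Coming back to the original goal of one unbroken curve over the full 48+ hours: if the metrics are appended to a single CSV as sketched above, the complete curve can be rebuilt at any point. A minimal sketch, assuming a training_log.csv produced by CSVLogger:

    import matplotlib.pyplot as plt
    import pandas as pd

    log = pd.read_csv('training_log.csv')  # placeholder path from the sketch above
    plt.plot(log['loss'], label='loss')
    plt.plot(log['accuracy'], label='accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()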

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Martijn Pieters