Save, Recover, and Continue Updating Learning Curves While Training a CNN if the Server Crashes Suddenly

I am training a deep learning model with TensorFlow on a remote server. The problem is that I am only allocated 2 hours of training at a time, and the server may crash at any point for various reasons.

I know that training my model will take at least 48 hours to complete. Once the model is fully trained (48+ hours), I would like to be able to display the training curves from start to finish, with no breaks in between.

I am able to pick up wherever the training was when it last crashed by using callbacks (saving the best weights), but I am unsure how to achieve the same for the training curves (loss + accuracy).
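For reference, here is a minimal sketch of the checkpoint setup I mean (the tiny model and the file path are placeholders, not my real code):

    import os
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])  # stand-in model
    model.compile(optimizer='adam', loss='mse')

    ckpt_path = 'best_weights.h5'  # placeholder path
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        ckpt_path, save_best_only=True, save_weights_only=True)

    # After a crash, reload the last saved weights before calling fit() again.
    if os.path.exists(ckpt_path):
        model.load_weights(ckpt_path)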

Thank you very much for your help.



Solution 1:[1]

TensorBoard keeps its event files on disk automatically (you can also reset them), but you can just as well log these values yourself to a plain-text file, or record them with tf.summary.
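For example, here is a minimal sketch of such file-based logging that survives restarts, using the built-in tf.keras.callbacks.CSVLogger with append=True (the model, data, and file name below are placeholders):

    import tensorflow as tf

    # Dummy data and model, just to make the sketch runnable.
    x = tf.random.normal((64, 8))
    y = tf.random.uniform((64,), maxval=2, dtype=tf.int32)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # append=True keeps adding rows to the same file after every restart,
    # so the complete 48-hour curve can be rebuilt from one CSV.
    csv_logger = tf.keras.callbacks.CSVLogger('training_log.csv', append=True)
    model.fit(x, y, epochs=5, callbacks=[csv_logger])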

  1. history = model.fit(batched_features, epochs=1, validation_data=batched_features, callbacks=[custom_callback, tb_callback]). You can then plot easily with matplotlib from history.history['loss'] and history.history['accuracy'], or from the common summaries.

  2. You can also do this with a merged summary, or by writing a loss summary directly through tf.summary.

  3. You can access the history from the value fit returns, or from inside a callback at the end of each epoch:

    import matplotlib.pyplot as plt
    import tensorflow as tf

    history = model_highscores.fit(batched_features, epochs=1000,
                                   validation_data=dataset.shuffle(len(list_image)),
                                   callbacks=[custom_callback])
    print(history.params)                   # e.g. {'verbose': 1, 'epochs': 1000, 'steps': 2}
    print(history.history.keys())           # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

    # Plot the per-epoch loss and accuracy curves.
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['accuracy'], label='accuracy')
    plt.legend()
    plt.show()

    # In TF2, scalar summaries must be written inside a file writer context.
    writer = tf.summary.create_file_writer('logs')
    with writer.as_default():
        tf.summary.scalar('loss', 0.5, step=0)
    
    1. callback

    import numpy as np
    import tensorflow as tf

    class CustomCallback(tf.keras.callbacks.Callback):
        _writers = {}
        val_dir = 'logs/val'  # where validation summaries are written

        def _val_writer(self):
            if 'val' not in self._writers:
                self._writers['val'] = tf.summary.create_file_writer(self.val_dir)
            return self._writers['val']

        def on_epoch_end(self, epoch, logs=None):
            print(self.model.inputs)
            feature_extractor = tf.keras.Model(inputs=self.model.inputs,
                                               outputs=[layer.output for layer in self.model.layers])
            x = tf.ones((1, 32, 32, 3))  # one dummy 32x32 RGB input
            print([np.asarray(t) for t in feature_extractor(x)])
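One detail worth adding (my own suggestion, not part of the original answer): when you restart fit() after a crash, pass initial_epoch so that the epoch numbering in the history and in TensorBoard continues where the previous run stopped. A minimal sketch with placeholder data:

    import tensorflow as tf

    x = tf.random.normal((32, 4))
    y = tf.random.normal((32, 1))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # Hypothetical epoch count recovered from the saved logs after a crash.
    last_epoch = 37
    # Runs epochs 38..40, so the logged curves line up with earlier runs.
    model.fit(x, y, epochs=40, initial_epoch=last_epoch)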
    

[Figure: simple plots of the resulting loss and accuracy curves]
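Coming back to the original goal of one unbroken curve over the full 48+ hours: if the metrics are appended to a single CSV as sketched above, the complete curve can be rebuilt at any point. A minimal sketch, assuming a training_log.csv produced by CSVLogger:

    import matplotlib.pyplot as plt
    import pandas as pd

    log = pd.read_csv('training_log.csv')  # placeholder path from the sketch above
    plt.plot(log['loss'], label='loss')
    plt.plot(log['accuracy'], label='accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()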

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Martijn Pieters