Save, Recover, and Continue Updating Learning Curves while Training a CNN if the Server Crashes Suddenly
I am training a deep learning model with TensorFlow on a remote server. The problem is that I am only allocated 2 hours of training at a time, and the server may crash at any point for various reasons.
I know the training of my model will take at least 48 hours to complete. Once the model is fully trained (48+ hours), I would like to be able to display the training curves from start to finish with no breaks in between.
I am able to resume training from wherever it was when the server last crashed using callbacks (saving the best weights), but I am unsure how to achieve the same for the training curves (loss + accuracy).
Thank you very much for your help.
Solution 1:[1]
TensorBoard archives its logs automatically (you can also reset them), but you can equally write the values to a plain-text log file yourself, or record them with summaries.
history = model.fit(batched_features, epochs=1, validation_data=batched_features, callbacks=[custom_callback, tb_callback])

You can then easily plot the curves with matplotlib using history.history['loss'] and history.history['accuracy'], or via the common summaries.
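A concrete way to act on the "logging file" suggestion above is Keras's built-in `CSVLogger` callback with `append=True`, so metrics survive on disk between sessions. This is my own sketch with a toy model, not the answerer's code; the file name `training_log.csv` is made up:

```python
import csv
import os

import tensorflow as tf

# Tiny stand-in model and data so the sketch is self-contained.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
x = tf.random.normal((16, 4))
y = tf.random.normal((16, 1))

log_path = "training_log.csv"
if os.path.exists(log_path):
    os.remove(log_path)  # start clean for this demo only

# append=True adds rows to the existing file, so a run that resumes
# after a crash continues the same CSV instead of overwriting it.
logger = tf.keras.callbacks.CSVLogger(log_path, append=True)
model.fit(x, y, epochs=2, callbacks=[logger], verbose=0)

with open(log_path) as f:
    rows = list(csv.DictReader(f))
print(len(rows))  # 2 — one row per epoch, persisted on disk between sessions
```

Because the file is only ever appended to, re-running the script after a crash keeps extending the same log rather than starting a new one.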
You can also write a merged summary, or just a loss summary, using tf.summary.
You can also access the metrics from a callback, via the logs dict passed to it at the end of each epoch.
history = model_highscores.fit(batched_features, epochs=1000,
    validation_data=dataset.shuffle(len(list_image)),
    callbacks=[custom_callback])
print(history.params)          # {'verbose': 1, 'epochs': 1000, 'steps': 2}
print(history.history.keys())  # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['accuracy'], label='accuracy')
plt.legend()
plt.show()
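Note that `history` only covers a single uninterrupted run: it lives in memory and dies with the process. A crash-proof variant (my own sketch; the file name `curves.csv` and the helper names are made up) appends each epoch's metrics to disk and rebuilds one unbroken curve from the accumulated file:

```python
import csv
import os

LOG = "curves.csv"
if os.path.exists(LOG):
    os.remove(LOG)  # clean slate for the demo only


def append_epoch(path, epoch, loss, accuracy):
    """Append one epoch's metrics; the file survives a crash/restart."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([epoch, loss, accuracy])


def load_curves(path):
    """Read the accumulated log back into plottable lists."""
    epochs, losses, accs = [], [], []
    with open(path) as f:
        for e, l, a in csv.reader(f):
            epochs.append(int(e))
            losses.append(float(l))
            accs.append(float(a))
    return epochs, losses, accs


# Simulate two training sessions (before and after a crash)
# writing to the same file.
for epoch in range(3):
    append_epoch(LOG, epoch, 1.0 / (epoch + 1), 0.5 + 0.1 * epoch)
for epoch in range(3, 5):
    append_epoch(LOG, epoch, 1.0 / (epoch + 1), 0.5 + 0.1 * epoch)

epochs, losses, accs = load_curves(LOG)
print(epochs)  # [0, 1, 2, 3, 4] — one unbroken curve across both sessions
# plt.plot(epochs, losses); plt.plot(epochs, accs); plt.show()
```

The same idea works with any on-disk format; the point is that the plot is rebuilt from the file, not from the in-memory `history` of the last run.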
loss_summary = tf.summary.scalar('loss', 0.5)
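Note that `tf.summary.scalar('loss', 0.5)` as written is TF 1.x style; in TF 2.x the scalar is recorded inside a file-writer context with an explicit step, roughly like this (the log directory name is my own):

```python
import glob

import tensorflow as tf

# Writing every session's summaries into the same log directory lets
# TensorBoard display one continuous loss curve across restarts.
writer = tf.summary.create_file_writer("logs/train")
with writer.as_default():
    for step in range(3):
        tf.summary.scalar("loss", 0.5 / (step + 1), step=step)
writer.flush()

event_files = glob.glob("logs/train/events.out.tfevents.*")
print(len(event_files) >= 1)  # True — TensorBoard reads these event files
```

As long as the `step` values keep increasing across sessions, TensorBoard joins the points into a single curve.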
- Callback:

# Class wrapper added to make the fragment runnable; val_dir must be defined elsewhere.
class SummaryCallback(tf.keras.callbacks.Callback):
    def _val_writer(self):
        if 'val' not in self._writers:
            self._writers['val'] = tf.summary.create_file_writer(val_dir)
        return self._writers['val']

    def on_epoch_end(self, epoch, logs=None):
        print(self.model.inputs)
        feature_extractor = tf.keras.Model(
            inputs=self.model.inputs,
            outputs=[layer.output for layer in self.model.layers],
        )
        x = tf.ones((1, 32, 32, 3))  # batch dimension added
        print(feature_extractor(x))
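Tying this back to the question: when a session restarts, pass `initial_epoch` to `model.fit` so the resumed epochs continue numbering where the previous session stopped, and keep appending to the same log file. A minimal sketch with a toy model (the file name `full_history.csv` is mine):

```python
import os

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

csv_path = "full_history.csv"
if os.path.exists(csv_path):
    os.remove(csv_path)  # clean slate for the demo only

# First "session": epochs 0 and 1.
model.fit(x, y, epochs=2, verbose=0,
          callbacks=[tf.keras.callbacks.CSVLogger(csv_path, append=True)])

# Imagine the server crashed here; weights would normally be restored
# from a ModelCheckpoint file before resuming.

# Second "session": resume at epoch 2, appending to the same CSV.
h2 = model.fit(x, y, epochs=4, initial_epoch=2, verbose=0,
               callbacks=[tf.keras.callbacks.CSVLogger(csv_path, append=True)])
print(h2.epoch)  # [2, 3] — numbering continues, so the plotted curve has no break
```

Combined with the save-best-weights checkpointing already in place, this gives epoch numbers and metrics that line up into one continuous 48-hour curve.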
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Martijn Pieters