TensorFlow GPU - out of memory error, the kernel appears to have died

Here is my classification problem:

  • Classify pathology images into 2 classes: "Cancer" and "Normal"
  • The data sets contain 150,000 and 300,000 images respectively
  • All images are 512x512 RGB .jpg files
  • The total is about 32 GB

Here is my configuration:

  • CPU: Intel i7
  • GPU: Nvidia GeForce RTX 3060 (6 GB)
  • Python 3.7
  • Jupyter Notebook 6.4.8
  • TensorFlow 2.6 (the GPU build was installed as described at https://www.tensorflow.org/install/gpu; a quick visibility check is sketched just below this list)
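
For reference, here is a quick way to confirm that this TensorFlow build actually sees the GPU (a minimal sketch, not output copied from my machine):

import tensorflow as tf

# An empty list here means TensorFlow will silently fall back to the CPU.
print("GPUs visible:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())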

And here is the simple CNN I wanted to try first:

import tensorflow as tf

num_classes = 2  # "Cancer" and "Normal"

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1./255),                  # scale pixel values to [0, 1]
    tf.keras.layers.Conv2D(16, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes)                  # logits, one per class
])

Unfortunately, it raised several kinds of errors during or at the end of the first epoch: an out-of-memory error, "The kernel appears to have died" (as reported in How to fix 'The kernel appears to have died. It will restart automatically' caused by pytorch), or even a black screen with no control left. I assumed my GPU was running out of memory, so I tried several changes suggested in How to fix "ResourceExhaustedError: OOM when allocating tensor" (notably decreasing the batch size, downsizing the images, and switching them from RGB to grayscale). Nevertheless, I still hit the issues described above...
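
For completeness, here is the memory-growth setting that is often suggested for this kind of crash; it is only a sketch of what I would try next, not something confirmed to fix the problem on my setup:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole 6 GB at startup.
# This has to run before any operation touches the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)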

So here are my questions:

  1. Do you think it is still possible to tackle such a problem with my Nvidia RTX 3060 GPU?
  2. If yes, do you have any tips that I may be missing?

Bonus question: I used to work on another CNN with 40,000 images in its data set (256x256 grayscale images). That CNN was deeper (4 layers with more filters) and the GPU had less memory (an Nvidia Quadro P600), yet I never faced any memory issues. That is why I am really wondering what actually uses GPU memory: storing the images? the neuron weights? something else I am missing?



Solution 1:

Generally, GPU memory issues aren't caused by a large training dataset; they are caused by too large a network combined with too large a batch size.

Back-of-the-napkin math: at the full 512x512 input size, the flattened feature map feeding your first dense layer is roughly 61x61x64 ≈ 238k values, so that Dense(128) layer alone holds about 30 million weights, while all three conv layers together hold only a few tens of thousands. That is still only on the order of 120 MB of float32 weights (MobileNet, which is designed to run inference on mobile devices, has about four million), so with a modest batch size the model itself should fit in 6 GB; the per-batch activations for 512x512 images and the way the data is being loaded are the more likely culprits.
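
You can check that arithmetic yourself by giving the model an explicit input shape and printing the summary (a sketch that rebuilds the model from the question; num_classes = 2 is assumed):

import tensorflow as tf

num_classes = 2  # assumed from the question: "Cancer" vs "Normal"

# Same model as in the question, but with an explicit input shape so
# Keras can report the per-layer parameter counts without training.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(512, 512, 3)),
    tf.keras.layers.Rescaling(1./255),
    tf.keras.layers.Conv2D(16, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 4, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes),
])
model.summary()  # the Dense(128) layer dominates at roughly 30M parameters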

Are you using the tf.data API, or are you trying to load the entire dataset into RAM? Using tf.data is the best practice for large datasets, as it lets you load data and apply augmentation just in time.
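
A minimal sketch of that approach, assuming the images sit in one subfolder per class (the directory path, image size, and batch size below are illustrative, not taken from the question):

import tensorflow as tf

# Stream batches from disk so only one batch at a time lives in memory.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train",              # hypothetical layout: data/train/Cancer, data/train/Normal
    label_mode="int",
    image_size=(512, 512),     # or smaller, e.g. (256, 256), to shrink activation memory
    batch_size=16,             # keep this modest on a 6 GB GPU
)

# Prefetch the next batch on the CPU while the GPU works on the current one.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

# "model" is the question's Sequential model, after model.compile(...).
model.fit(train_ds, epochs=5)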

Edit: also, since you are performing classification, you probably want a softmax activation on your last layer.
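
That could look like either of the following (a sketch; the from_logits alternative is my addition rather than part of the original suggestion):

import tensorflow as tf

num_classes = 2  # assumed, as above

# Option A: make the last layer output probabilities directly.
last_layer = tf.keras.layers.Dense(num_classes, activation='softmax')

# Option B: keep the Dense(num_classes) logits from the question and let the
# loss apply the softmax internally (slightly more numerically stable).
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

With option B, the model would then be compiled with that loss, e.g. model.compile(optimizer='adam', loss=loss, metrics=['accuracy']).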

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
