ResourceExhaustedError on Kaggle when using an image size of (350, 300) in TensorFlow

I have images stored in a /train folder and labels in a train.csv file. I am loading the data like this:

training_percentage = 0.8
training_item_count = int(len(train) * training_percentage)
validation_item_count = len(train)-int(len(train) * training_percentage)
training_df = train[:training_item_count]
validation_df = train[training_item_count:]

batch_size = 64
image_height = 350
image_width = 300
input_shape = (image_height, image_width, 3)
dropout_rate = 0.4
classes_to_predict = sorted(training_df.label.unique())

training_data = tf.data.Dataset.from_tensor_slices((training_df.file_name.values, training_df.label.values))
validation_data = tf.data.Dataset.from_tensor_slices((validation_df.file_name.values, validation_df.label.values))

def load_image_and_label_from_path(image_path, label):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    
    return img, label

AUTOTUNE = tf.data.experimental.AUTOTUNE

training_data = training_data.map(load_image_and_label_from_path, num_parallel_calls = AUTOTUNE)
validation_data = validation_data.map(load_image_and_label_from_path, num_parallel_calls = AUTOTUNE)

training_data_batches = training_data.shuffle(buffer_size = 500).batch(batch_size).prefetch(buffer_size = AUTOTUNE)
validation_data_batches = validation_data.shuffle(buffer_size = 500).batch(batch_size).prefetch(buffer_size = AUTOTUNE)

I thought a batch size of 64 would be fine, since the image size (350, 300) is not that big. Why am I getting this error?

ResourceExhaustedError:  OOM when allocating tensor with shape[64,192,88,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node model_4/efficientnetb4/block3a_expand_activation/Sigmoid (defined at tmp/ipykernel_34/3030206641.py:4) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_620332]

Function call stack:
train_function

This is how I am training my model - Click Here



Solution 1:[1]

With a 350 × 300 × 3 image you have 315,000 float values per image, which consumes a lot of memory, and training times will be fairly high too. Try reducing the image size to something like 250 × 215, and reduce your batch size to, say, 20. A lot also depends on the model you are using: your traceback shows EfficientNetB4, which is fairly memory-hungry. So try the values I suggested; if training runs, you can incrementally increase the image size.
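To see why even a "small" image blows up, you can estimate the memory of the tensor the allocator actually failed on. The shapes below come straight from the error message; `tensor_bytes` is a throwaway helper for this back-of-the-envelope calculation, not a TensorFlow API:

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory for one dense tensor of the given shape (4 bytes = float32)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# One input image: 350 x 300 x 3 float32 values
image_mb = tensor_bytes((350, 300, 3)) / 2**20
# The activation that failed to allocate: shape [64, 192, 88, 75] from the traceback
activation_mb = tensor_bytes((64, 192, 88, 75)) / 2**20

print(f"one input image:   {image_mb:.1f} MiB")       # ~1.2 MiB
print(f"failed activation: {activation_mb:.1f} MiB")  # ~309.4 MiB

# Activation memory scales linearly with batch size, so dropping the
# batch from 64 to 20 shrinks this one tensor to 20/64 of its size.
print(f"same tensor at batch 20: {activation_mb * 20 / 64:.1f} MiB")
```

And that 309 MiB is just one intermediate activation; a deep model holds many of them at once during backpropagation, which is how a GPU runs out of memory long before the raw input images look large.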

Solution 2:[2]

Your hardware does not have enough memory to hold your data and model. Among other options, you can:

  1. decrease the batch size
  2. distribute your training among several workers
  3. use different training hardware, e.g. a bigger GPU, or switch to a TPU
  4. use a different (smaller) model architecture
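Option 1 is usually the first thing to try, and a common pattern is to halve the batch size until one training step fits. A minimal, framework-agnostic sketch of that loop (here `train_one_step` and plain `MemoryError` are stand-ins for your real training step and TensorFlow's `tf.errors.ResourceExhaustedError`):

```python
def find_fitting_batch_size(train_one_step, start_batch_size=64, min_batch_size=1):
    """Halve the batch size until one training step succeeds.

    `train_one_step(batch_size)` should raise MemoryError (with real
    TensorFlow: tf.errors.ResourceExhaustedError) when the batch is too big.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            train_one_step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")


# Toy stand-in: pretend anything above 20 images per batch runs out of memory.
def fake_step(batch_size):
    if batch_size > 20:
        raise MemoryError


print(find_fitting_batch_size(fake_step))  # 64 -> 32 -> 16, prints 16
```

Note that in a real notebook you would restart the kernel between attempts rather than loop like this, because TensorFlow does not reliably release GPU memory after an OOM; the loop just illustrates the halving strategy.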

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
Solution 1
Solution 2: Sara PB