ResourceExhaustedError on Kaggle when using image size of (350, 300) in TensorFlow
I have images stored in a '/train' folder, and the labels are in a train.csv
file. I am loading the data like this:
import pandas as pd
import tensorflow as tf

# train.csv holds the image file names and their labels
train = pd.read_csv("train.csv")

# 80/20 train/validation split
training_percentage = 0.8
training_item_count = int(len(train) * training_percentage)
validation_item_count = len(train) - training_item_count
training_df = train[:training_item_count]
validation_df = train[training_item_count:]

batch_size = 64
image_height = 350
image_width = 300
input_shape = (image_height, image_width, 3)
dropout_rate = 0.4
classes_to_predict = sorted(training_df.label.unique())

# datasets of (file_name, label) pairs
training_data = tf.data.Dataset.from_tensor_slices((training_df.file_name.values, training_df.label.values))
validation_data = tf.data.Dataset.from_tensor_slices((validation_df.file_name.values, validation_df.label.values))

def load_image_and_label_from_path(image_path, label):
    # read and decode the JPEG, then scale pixel values to [0, 1] floats
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, label

AUTOTUNE = tf.data.experimental.AUTOTUNE
training_data = training_data.map(load_image_and_label_from_path, num_parallel_calls=AUTOTUNE)
validation_data = validation_data.map(load_image_and_label_from_path, num_parallel_calls=AUTOTUNE)

training_data_batches = training_data.shuffle(buffer_size=500).batch(batch_size).prefetch(buffer_size=AUTOTUNE)
validation_data_batches = validation_data.shuffle(buffer_size=500).batch(batch_size).prefetch(buffer_size=AUTOTUNE)
I think a batch size of 64 should be fine, since the image size (350, 300) is not that big, so why am I getting this error?
ResourceExhaustedError: OOM when allocating tensor with shape[64,192,88,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model_4/efficientnetb4/block3a_expand_activation/Sigmoid (defined at tmp/ipykernel_34/3030206641.py:4) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_620332]
Function call stack:
train_function
This is how I am training my model (the training code is linked from the original question).
Solution 1:[1]
With a 350 x 300 x 3 image you have 315,000 float values per image, and the network's intermediate activations grow with both the image size and the batch size (the single activation that failed to allocate, shape [64, 192, 88, 75] in float32, is already a bit over 300 MB), so memory consumption adds up quickly and training times get long. Try reducing the image size to something like 250 x 215 and the batch size to around 20. A lot also depends on the model you are using. If training runs with the suggested values, you can incrementally increase the image size again.
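As a rough sketch of that suggestion applied to the asker's pipeline (250 x 215 and a batch size of 20 are only the suggested starting points, not values tested on this dataset), the images can be shrunk inside the loading function and the batch size lowered:

import tensorflow as tf

image_height = 250   # reduced from 350
image_width = 215    # reduced from 300
batch_size = 20      # reduced from 64

def load_image_and_label_from_path(image_path, label):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize before batching so the model only ever sees the smaller inputs
    img = tf.image.resize(img, [image_height, image_width])
    return img, label

The input_shape variable should also be rebuilt from the new height and width so the model is constructed for the smaller images.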
Solution 2:[2]
Your hardware does not have enough memory to hold this batch of data and activations. Among others, you have the following options:
- decrease the batch size
- distribute your training among several workers
- use different training hardware, e.g. a different GPU, or switch to a TPU (see the sketch after this list)
- use a different model architecture
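For the TPU option on Kaggle, a minimal sketch of the usual tf.distribute setup looks like this (build_model is a placeholder for the asker's own model-building code, which is not shown in the post):

import tensorflow as tf

try:
    # on a Kaggle TPU notebook the resolver finds the TPU automatically
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    # no TPU found: fall back to the default strategy (single GPU or CPU)
    strategy = tf.distribute.get_strategy()

with strategy.scope():
    # the model has to be created and compiled inside the strategy scope
    model = build_model(input_shape)   # placeholder for the model code
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

The same pattern with tf.distribute.MirroredStrategy (multiple GPUs on one machine) or tf.distribute.MultiWorkerMirroredStrategy covers the option of distributing training among several workers.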
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | 
Solution 2 | Sara PB