Why is TensorFlow image classification model overfitting? [closed]

I've been working on a food image classification model.

I started off with the TensorFlow tutorial and modified the model (code below). The model trains fine, but once it reaches ~50-60% validation accuracy it starts overfitting, and I have no idea why. This is the best result I got:

[Image: My model]

I am using 6 classes from the Food-101 dataset, split into 750 training and 250 validation images per class. My data folder structure looks like this:

images - train - class 1 - <750 images>
               - .......
               - class 6 - <750 images>
       - validation  - class 1 - <250 images>
                     - .......
                     - class 6 - <250 images>

Things I have tried:

  1. adding additional data augmentation using ImageDataGenerator
  2. removing layers
  3. adding additional conv2d and dropout layers
  4. increasing the number of neurons in the layers
  5. reducing the number of neurons in the layers
  6. various models I found online, including the TensorFlow tutorial one
  7. adding kernel_initializer='he_uniform' and kernel_regularizer=l2(0.001) to the conv2d layers
  8. doubling the training and validation data by adding the food-101N dataset as well.
  9. altering the learning rate
  10. using one-hot encoded labels

Everything I try gives me roughly the same result. I am training the model on an AMD Ryzen 5 3600 CPU with 16 GB of RAM.

What am I doing wrong?

Model summary

Layer (type)                 Output Shape              Param #
=================================================================
sequential_1 (Sequential)    (None, 224, 224, 3)       0
_________________________________________________________________
conv2d (Conv2D)              (None, 224, 224, 16)      448
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 224, 224, 16)      2320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 16)      0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 112, 112, 32)      4640
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 112, 112, 32)      9248
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 56, 56, 32)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 56, 56, 64)        18496
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 56, 56, 64)        36928
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 56, 56, 64)        36928
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 28, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 28, 28, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 50176)             0
_________________________________________________________________
dense (Dense)                (None, 128)               6422656
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 774
=================================================================
Total params: 6,532,438
Trainable params: 6,532,438
Non-trainable params: 0
_________________________________________________________________

Here's the full code

import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf
import cv2

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

from sklearn.metrics import classification_report,confusion_matrix

import time

import pathlib

training_dir = pathlib.Path('.../images/train')
validation_dir = pathlib.Path('.../images/validation')

batch_size = 32
img_height = 224
img_width = 224

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  training_dir,
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  validation_dir,
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds.class_names
print(class_names)
print(val_ds.class_names)

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)

resize_and_rescale = tf.keras.Sequential([
  layers.experimental.preprocessing.Resizing(img_height, img_width),
  layers.experimental.preprocessing.Rescaling(1./255)
])

normalized_ds = train_ds.map(lambda x, y: (resize_and_rescale(x, training=True), y))
image_batch, labels_batch = next(iter(normalized_ds))

data_augmentation = keras.Sequential(
  [
    layers.experimental.preprocessing.RandomFlip("horizontal", 
                                                 input_shape=(img_height, 
                                                              img_width,
                                                              3)),
    layers.experimental.preprocessing.RandomRotation(0.1),
    layers.experimental.preprocessing.RandomZoom(0.1),
  ]
)


num_classes = 6

model = Sequential([
  data_augmentation,
  layers.experimental.preprocessing.Rescaling(1./255),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.4),
  layers.Flatten(),
  layers.Dense(256, activation='relu'),
  layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.summary()

start_time = time.monotonic()

epochs = 50
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

#plotting and performance metrics



Solution 1:[1]

Overcoming overfitting is a matter of principled work, not just randomly flailing about.
In general (i.e., disregarding the black magic of deep learning otherwise known as double descent), overfitting is a matter of too much effective capacity relative to the training data at hand. This is better known as the bias-variance tradeoff.
Adding augmentations is a way of increasing the amount of data (with some redundancy due to correlation, so it's not a matter of simple multiplication), which helps against overfitting.
Reducing capacity is another way to reduce overfitting, so removing layers and/or filters helps. #3 and #4 on your list (adding conv2d layers and increasing the number of neurons) increase capacity, which is the exact opposite of what you would expect to help.
A third way to reduce overfitting is to constrain capacity via regularization, such as weight decay, dropout, auxiliary loss heads, etc.
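
On the capacity point, it may help to see where the parameters in the question's model actually sit. The sketch below is plain Python arithmetic that reproduces the counts from the model summary. The helper names `conv2d_params` and `dense_params` are made up for illustration, and the global-average-pooling variant at the end is an assumption about one possible change, not something the question tried.

```python
# Reproduce the parameter counts from the model summary in the question.
# conv2d_params and dense_params are illustrative helpers, not Keras APIs.

def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weights plus biases of a Dense layer."""
    return in_units * out_units + out_units

# (input channels, output channels) of the seven 3x3 conv layers.
conv_channels = [(3, 16), (16, 16), (16, 32), (32, 32),
                 (32, 64), (64, 64), (64, 64)]
conv_total = sum(conv2d_params(3, i, o) for i, o in conv_channels)

flatten_units = 28 * 28 * 64           # output size of Flatten in the summary
dense_hidden = dense_params(flatten_units, 128)
dense_out = dense_params(128, 6)
total = conv_total + dense_hidden + dense_out

print(total)                           # 6532438, matching the summary
print(round(dense_hidden / total, 3))  # share held by the first Dense layer

# Assumption: replacing Flatten with global average pooling would feed
# only 64 values (one per channel) into the same Dense(128) layer.
dense_hidden_gap = dense_params(64, 128)
print(dense_hidden_gap)                # 8320
```

Almost all of the capacity is in the first Dense layer rather than in the convolutions, so adding or removing conv layers changes the total very little compared with shrinking what feeds that layer.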

It's worth noting that these are all different sides of the same coin. Augmentations, for instance, can be viewed as a form of regularization: we enforce certain invariances on the model by requiring the same output for inputs that differ only by a transformation we deem irrelevant.
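
The invariance view can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: `random_augment` is a hypothetical helper, the wrap-around shift via `np.roll` is a simplification of a real translation, and the image and label are random stand-ins rather than Food-101 data.

```python
import numpy as np

def random_augment(image, rng, max_shift=4):
    """Return a randomly flipped and shifted copy of an (H, W, C) image.

    The label is deliberately not an argument: the transformation is one
    we consider irrelevant to the class, so the label stays the same.
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                        # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))  # wrap-around shift
    return out

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3)).astype(np.float32)  # stand-in image
label = 3                                             # hypothetical class index

# Every augmented view is paired with the original, unchanged label.
views = [(random_augment(image, rng), label) for _ in range(4)]
```

Training on such pairs asks the model to give the same answer for all views of one image, which is exactly the constraint that makes augmentation act like a regularizer.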

Putting it all together, you want to try and do the following, maybe all in conjunction:

  • Remove layers and/or filters
  • Add regularization such as an L2 norm penalty on the weights, dropout, etc.
  • Add augmentations, or make them parametrically more extreme (up to a certain point of course)
  • Add auxiliary tasks, very much depending on the data and labels available.
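
As a rough sketch of how the first three items could be combined in Keras, assuming a TensorFlow version where the preprocessing layers are available directly under `tf.keras.layers` (in the version used in the question they live under `layers.experimental.preprocessing`). The filter counts, the L2 strength of 1e-3, the augmentation factors of 0.2 and the use of global average pooling are illustrative choices to tune, not values known to work on this data. `conv_block` is a hypothetical helper.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

num_classes = 6
l2 = regularizers.l2(1e-3)  # illustrative strength; tune on validation data

def conv_block(filters):
    """One Conv2D + MaxPooling2D stage with an L2 penalty on the kernel."""
    return [
        layers.Conv2D(filters, 3, padding="same", activation="relu",
                      kernel_regularizer=l2),
        layers.MaxPooling2D(),
    ]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    # Augmentation layers are only active during training.
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2),
    layers.Rescaling(1.0 / 255),
    # Fewer conv layers than in the question: one per resolution.
    *conv_block(16),
    *conv_block(32),
    *conv_block(64),
    # One value per channel instead of a 50176-wide Flatten output.
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.4),
    layers.Dense(num_classes, activation="softmax", kernel_regularizer=l2),
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=["accuracy"],
)
model.summary()
```

This model is far smaller than the one in the question, so it may well underfit at first. The idea is to start from the low-capacity end and add filters or layers back only while validation accuracy keeps improving.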

Two more important notes!

  1. Overfitting is sometimes a matter of the training and validation data actually being sampled from different distributions. This is not to say that it's impossible to generalize, but it often means you need to put in extra effort for things to work (e.g. cross-domain generalization, out-of-sample inference, etc.).
  2. The best practice, when possible, is to add new data! This can be added to the list above, not necessarily instead of the recommendations listed.
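
A cheap first check for the first note is to compare the two splits on disk. The sketch below assumes the `train/<class>/<files>` and `validation/<class>/<files>` layout described in the question. `class_counts` and `compare_splits` are hypothetical helpers, and they only compare class names and per-class counts, not the image content itself.

```python
from collections import Counter
from pathlib import Path

def class_counts(split_dir):
    """Count files per class subfolder, e.g. train/<class>/<files>."""
    split_dir = Path(split_dir)
    return Counter({
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    })

def compare_splits(train_dir, val_dir):
    """Return (classes only in train, classes only in val, val/train ratios)."""
    train, val = class_counts(train_dir), class_counts(val_dir)
    only_train = sorted(set(train) - set(val))
    only_val = sorted(set(val) - set(train))
    ratios = {
        name: val[name] / train[name]
        for name in sorted(set(train) & set(val))
        if train[name] > 0  # skip empty class folders to avoid dividing by zero
    }
    return only_train, only_val, ratios
```

Called on the question's training and validation directories, this should report no unmatched classes and a ratio of about 0.33 (250/750) for every class. Anything else points at a labelling or split problem rather than at the model.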

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 desertnaut