TensorFlow Object Detection API - How to train on the COCO dataset and achieve the same mAP as the reported one?

I'm trying to reproduce the officially reported mAP of EfficientDet D3 in the Object Detection API by training on COCO from a pretrained EfficientNet backbone. The official COCO mAP is 45.4%, yet all I can manage to achieve is around 14%. I don't need to reach the exact same value, but I would like to at least come close to it.

I am loading the EfficientNet B3 checkpoint pretrained on ImageNet found here, and using the config file found here. The only parameters I changed are the batch size (to fit into an RTX 3090), the learning rate (0.08 was yielding loss=NaN, so I reduced it to 0.01), and the number of training steps, which I increased to 600k. This is my pipeline.config file:

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 90
    add_background_class: false
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 3
      }
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 896
        max_dimension: 896
        pad_to_max_dimension: true
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 160
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          force_use_bias: true
          activation: SWISH
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true
            decay: 0.99
            epsilon: 0.001
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
        use_depthwise: true
      }
    }
    feature_extractor {
      type: 'ssd_efficientnet-b3_bifpn_keras'
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 6
        num_filters: 160
      }
      conv_hyperparams {
        force_use_bias: true
        activation: SWISH
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.99,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 1.5
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint: "/API/Tensorflow/models/research/object_detection/test_data/efficientnet_b3/efficientnet_b3/ckpt-0"
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint_type: "classification"
  batch_size: 2
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  use_bfloat16: false
  num_steps: 600000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 896
      scale_min: 0.1
      scale_max: 2.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 1e-2
          total_steps: 600000
          warmup_learning_rate: .001
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "/DATASETS/COCO/classes.pbtxt"
  tf_record_input_reader {
    input_path: "/DATASETS/COCO/coco_train.record-00000-of-00100"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "/DATASETS/COCO/classes.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/DATASETS/COCO/coco_val.record-00000-of-00050"
  }
}

These are the results:

[Images: mAP and Loss training curves]

Solution 1:[1]

Your loss is too high. A loss that stays around 1 indicates that the model is not really being trained: it is not learning the weights. There are a couple of things you can check:

  1. The dataset. Are all images actually used during training? Also have a look at the annotations: are the classes and bounding boxes correct, or is anything off? For example, COCO's bounding boxes are given as absolute pixel values; if yours are stored as relative values, that may mean you need to rescale them.
  2. Is the image resized? If so, its bounding boxes need to be resized accordingly.
  3. Check the bounding boxes. Plot a few images with their bounding boxes drawn on top; if the boxes are in the wrong format or their values are incorrectly scaled, you will see it immediately (see the sketch after this list).
  4. To narrow down the source of the bug, try loading weights for the detector that have already been trained on COCO and see what happens if you fine-tune them further with a very low learning rate (in the OD API that typically means pointing fine_tune_checkpoint at a COCO-trained detection checkpoint and setting fine_tune_checkpoint_type to "detection"). If that does not work either, it is a very strong indication that there is a problem with the annotations.
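
For item 3 (which will also surface problems from items 1 and 2), a minimal spot-check along these lines can help. It is only a sketch: it assumes the standard feature keys written by the OD API's create_coco_tf_record.py and reuses the shard path from the question, so adjust both to your setup.

import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Assumed path, copied from the question's train_input_reader.
RECORD = "/DATASETS/COCO/coco_train.record-00000-of-00100"

# Standard TF OD API TFRecord keys (as written by create_coco_tf_record.py).
features = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/object/bbox/xmin": tf.io.VarLenFeature(tf.float32),
    "image/object/bbox/xmax": tf.io.VarLenFeature(tf.float32),
    "image/object/bbox/ymin": tf.io.VarLenFeature(tf.float32),
    "image/object/bbox/ymax": tf.io.VarLenFeature(tf.float32),
}

for raw in tf.data.TFRecordDataset(RECORD).take(3):
    ex = tf.io.parse_single_example(raw, features)
    img = tf.io.decode_jpeg(ex["image/encoded"]).numpy()
    h, w = img.shape[:2]
    # Box coordinates are expected to be normalized to [0, 1]; values far
    # outside that range would point to a scaling/format problem.
    xmin = tf.sparse.to_dense(ex["image/object/bbox/xmin"]).numpy()
    xmax = tf.sparse.to_dense(ex["image/object/bbox/xmax"]).numpy()
    ymin = tf.sparse.to_dense(ex["image/object/bbox/ymin"]).numpy()
    ymax = tf.sparse.to_dense(ex["image/object/bbox/ymax"]).numpy()
    fig, ax = plt.subplots()
    ax.imshow(img)
    for x0, x1, y0, y1 in zip(xmin, xmax, ymin, ymax):
        ax.add_patch(patches.Rectangle(
            (x0 * w, y0 * h), (x1 - x0) * w, (y1 - y0) * h,
            fill=False, edgecolor="red", linewidth=2))
    plt.show()

If the drawn boxes land on the objects, the annotations are probably fine and the problem lies elsewhere, for example in the training schedule.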

Solution 2:[2]

Two suggestions. First, batch size is an essential hyperparameter in deep learning: different batch sizes can lead to quite different training and test accuracy, so choosing a suitable batch size is crucial when training a neural network. [Source]

Using a batch size of 1 (or 2) for a model with this many parameters may well be the reason for the low accuracy.

A higher number of epochs does not compensate for a small batch size.
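
As a rough illustration of how batch size and learning rate interact, here is the common linear-scaling heuristic. This is not something the answer or the EfficientDet paper prescribes, and the reference batch size of 64 below is an assumption; treat it as a starting point, not a prescription.

# Hypothetical helper: linear-scaling heuristic for the learning rate.
# Assumes the reference 0.08 base LR was tuned for a large batch,
# taken here as 64 (an assumption).
def scale_learning_rate(ref_lr: float, ref_batch: int, my_batch: int) -> float:
    """Scale the learning rate proportionally to the batch size."""
    return ref_lr * my_batch / ref_batch

base_lr = scale_learning_rate(ref_lr=0.08, ref_batch=64, my_batch=2)
warmup_lr = base_lr / 10  # keep the warmup/base ratio of the posted schedule
print(f"learning_rate_base: {base_lr}")      # -> 0.0025
print(f"warmup_learning_rate: {warmup_lr}")  # -> 0.00025

With batch_size: 2 this heuristic suggests a learning_rate_base well below the 0.01 used in the question, which may also explain why 0.08 diverged to NaN at that batch size.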

Another point I noticed is that the paper makes use of jitter for augmentation.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: former_Epsilon
[2] Solution 2