Matterport's Mask RCNN doesn't train after setting up parameters

Task: the Mask RCNN train_shapes.ipynb tutorial, training to segment different shapes in the artificially generated shapes dataset.

Problem: Matterport's Mask RCNN implementation doesn't work out of the box for this notebook.

Things I have tried:

  1. Resolved all the class and package errors from the imported files, namely config, model, and utils.
  2. Resolved the TF 2.x errors caused by deprecated code.

Parameters I have set:

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.7
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 1
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  128
IMAGE_META_SIZE                16
IMAGE_MIN_DIM                  128
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [128 128   3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           shapes
NUM_CLASSES                    4
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (8, 16, 32, 64, 128)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                5
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           5
USE_MINI_MASK                  False
USE_RPN_ROIS                   True
VALIDATION_STEPS               5
WEIGHT_DECAY                   0.0001
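
For reference, the table above is just the config.display() printout. As a minimal sketch, assuming the standard train_shapes pattern of subclassing mrcnn.config.Config (anything not overridden comes from the base-class defaults), the values are set roughly like this:

    from mrcnn.config import Config

    class ShapesConfig(Config):
        """Configuration for training on the toy shapes dataset.
        Overrides only the values that differ from the defaults."""
        NAME = "shapes"

        # One image per GPU on a single GPU => effective batch size 1
        GPU_COUNT = 1
        IMAGES_PER_GPU = 1

        # Background + 3 shape classes (square, circle, triangle)
        NUM_CLASSES = 1 + 3

        # Small 128x128 images, so small anchors as well
        IMAGE_MIN_DIM = 128
        IMAGE_MAX_DIM = 128
        RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)

        # Few ROIs and very short epochs to keep the toy example fast
        TRAIN_ROIS_PER_IMAGE = 5
        STEPS_PER_EPOCH = 5
        VALIDATION_STEPS = 5

        # Mini masks are disabled in my run
        USE_MINI_MASK = False

    config = ShapesConfig()
    config.display()   # prints the table of values shown above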

Implementation details:

  1. I am using COCO weights to initialize the model (a sketch of these steps is shown after this list).
  2. The model is in training mode.
  3. I am training the heads first.
  4. Epochs = 1
  5. Learning rate = 0.001
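
Concretely, the training cell follows the standard Matterport API. A minimal sketch of these steps, assuming MODEL_DIR, COCO_MODEL_PATH, dataset_train, and dataset_val are already defined as in the notebook:

    import mrcnn.model as modellib

    # Create the model in training mode
    model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

    # Initialize from COCO weights; skip the layers whose shapes depend on the
    # number of classes, since shapes has 4 classes instead of COCO's 81
    model.load_weights(COCO_MODEL_PATH, by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])

    # Train only the head branches for one epoch at LR = 0.001
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=1,
                layers='heads')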

Output:


Starting at epoch 0. LR=0.001

Checkpoint Path: /logs/shapes20211123T0437/mask_rcnn_shapes_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)

/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)
  • This is the only output I can see. There is no epoch progress bar, and it stays like this for 2-3 hours.
  • I later found out that this individual has also done a code cleanup, so I experimented with their .py files as well, and the same thing happens.

System hardware specifications:

  1. Intel Xeon, 12 CPUs
  2. 25 GB RAM
  3. 64 GB storage
  4. Ubuntu 20.04 Desktop, running as a VM on the company's internal server.

Software specifications:

  1. Anaconda (latest version)
  2. TF 2.7.0
  3. Keras 2.4

Questions:

  1. Why doesn't the training start even after 3 hours?
  2. Is there an error in my configuration?
  3. Is my system sufficient?
  4. Is the implementation correct?
  5. What changes should be made to make this work?

Notebook: Colab notebook



Solution 1:[1]

The training hangs, and this is actually a known issue. The fix is simple: find the fit call in the model.py file (it should be somewhere around lines 2360-2370 in the TF2 project), set the 'workers' argument to 1, and set the 'use_multiprocessing' argument to False.
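
For orientation, the edited call ends up looking roughly like the sketch below. The exact line numbers and surrounding arguments vary between the original repo and the TF2-compatible forks (which call fit rather than fit_generator), so treat everything except workers and use_multiprocessing as indicative:

    # Inside MaskRCNN.train() in mrcnn/model.py
    self.keras_model.fit(
        train_generator,
        initial_epoch=self.epoch,
        epochs=epochs,
        steps_per_epoch=self.config.STEPS_PER_EPOCH,
        callbacks=callbacks,
        validation_data=val_generator,
        validation_steps=self.config.VALIDATION_STEPS,
        max_queue_size=100,
        workers=1,                   # was multiprocessing.cpu_count()
        use_multiprocessing=False,   # was True
    )

With use_multiprocessing=False the data generator runs in the main process, which avoids the multiprocessing hang that keeps the epoch progress bar from ever appearing.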

Solution 2:[2]

Try this:

1- Inside the (mrcnn) folder, open the file (model.py).

2- Change line 2362 from:

workers = multiprocessing.cpu_count()

to:

workers = 1

3- Change line 2374 from:

use_multiprocessing=True,

to:

use_multiprocessing=False,

Or you can try this fork, where I have already made these changes: https://github.com/manasrda/Mask_RCNN (it fixed a similar problem for me).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   D.Manasreh