Matterport's Mask R-CNN doesn't train after setting up parameters
Task: the Mask R-CNN train_shapes.ipynb tutorial, training the model to segment different shapes in the artificially generated shapes dataset.
Problem: Matterport's Mask R-CNN implementation doesn't work out of the box for this notebook.
Things I have tried:
- Resolved all the class and package errors caused by the imported files, namely config, model, and utils (a typical import layout is sketched after this list).
- Resolved the TF 2.x errors caused by deprecated code.
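For context, a minimal sketch of the imports the notebook ends up using, assuming the standard Matterport repository layout (mrcnn/config.py, mrcnn/model.py, mrcnn/utils.py); the ROOT_DIR path is a placeholder to adjust to wherever the repo is cloned:

```python
import os
import sys

# Assumed location of the cloned Mask_RCNN repository (adjust to your setup)
ROOT_DIR = os.path.abspath("Mask_RCNN")
sys.path.append(ROOT_DIR)

from mrcnn.config import Config       # training/inference hyperparameters
from mrcnn import utils               # Dataset base class and helpers
from mrcnn import model as modellib   # the MaskRCNN model wrapper
from mrcnn import visualize           # plotting utilities for results
```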
Parameters I have set:
Configurations (a sketch of the corresponding Config subclass follows this listing):
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 128
IMAGE_META_SIZE 16
IMAGE_MIN_DIM 128
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [128 128 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME shapes
NUM_CLASSES 4
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (8, 16, 32, 64, 128)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 5
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 5
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 5
WEIGHT_DECAY 0.0001
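For reference, a minimal sketch of how these values are typically produced in the notebook by subclassing mrcnn.config.Config; only the overridden fields are shown, everything else falls back to the defaults printed above:

```python
from mrcnn.config import Config

class ShapesConfig(Config):
    """Configuration for training on the synthetic shapes dataset."""
    NAME = "shapes"

    # Train on 1 GPU with 1 image per GPU -> effective batch size of 1
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    # Background + three shape classes
    NUM_CLASSES = 4

    # Small synthetic images, so keep the input resolution small
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128

    # Anchor scales matched to the small image size
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)

    # Few ROIs per image and short epochs, matching the dump above
    TRAIN_ROIS_PER_IMAGE = 5
    STEPS_PER_EPOCH = 5
    VALIDATION_STEPS = 5
    USE_MINI_MASK = False

config = ShapesConfig()
config.display()  # prints the table of values shown above
```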
Implementation details (a sketch of the corresponding code follows this list):
- I am using COCO weights to initialize my model.
- Model in training mode.
- Training the heads first.
- Epoch = 1
- Learning rate = 0.001
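For reference, a minimal sketch of the corresponding notebook cells, assuming the standard Matterport API; MODEL_DIR, COCO_MODEL_PATH, dataset_train and dataset_val are placeholders for objects defined earlier in the notebook:

```python
from mrcnn import model as modellib

# Create the model in training mode
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

# Initialize from COCO weights, skipping the layers whose shapes depend
# on the number of classes (those are trained from scratch for "shapes")
model.load_weights(COCO_MODEL_PATH, by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Train only the head layers for one epoch at LR = 0.001
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=1,
            layers='heads')
```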
Output:
Starting at epoch 0. LR=0.001
Checkpoint Path: /logs/shapes20211123T0437/mask_rcnn_shapes_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5 (Conv2D)
fpn_c4p4 (Conv2D)
fpn_c3p3 (Conv2D)
fpn_c2p2 (Conv2D)
fpn_p5 (Conv2D)
fpn_p2 (Conv2D)
fpn_p3 (Conv2D)
fpn_p4 (Conv2D)
rpn_model (Functional)
mrcnn_mask_conv1 (TimeDistributed)
mrcnn_mask_bn1 (TimeDistributed)
mrcnn_mask_conv2 (TimeDistributed)
mrcnn_mask_bn2 (TimeDistributed)
mrcnn_class_conv1 (TimeDistributed)
mrcnn_class_bn1 (TimeDistributed)
mrcnn_mask_conv3 (TimeDistributed)
mrcnn_mask_bn3 (TimeDistributed)
mrcnn_class_conv2 (TimeDistributed)
mrcnn_class_bn2 (TimeDistributed)
mrcnn_mask_conv4 (TimeDistributed)
mrcnn_mask_bn4 (TimeDistributed)
mrcnn_bbox_fc (TimeDistributed)
mrcnn_mask_deconv (TimeDistributed)
mrcnn_class_logits (TimeDistributed)
mrcnn_mask (TimeDistributed)
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
super(SGD, self).__init__(name, **kwargs)
- This is the only output I can see: there is no epoch progress bar, and it stays like this for 2-3 hours.
- I later found out that this individual has also cleaned up the code, so I experimented with his .py files as well, but the same thing occurs.
System hardware specifications:
- Intel Xeon, 12 CPUs
- 25 GB RAM
- 64 GB storage
- Ubuntu 20.04 Desktop VM running on the company's internal server
Software specifications:
- Anaconda (latest version)
- TF 2.7.0
- Keras 2.4
Questions:
- Why doesn't the training start even after 3 hours?
- Is there an error in my configuration?
- Is my system sufficient?
- Is the implementation correct?
- What changes should be made to make this work?
Notebook: Colab notebook
Solution 1:[1]
The training hangs, and this is actually a known issue. The fix is simple: find the fit function in the model.py file (it should be somewhere around lines 2360-2370 in the TF2 project), and set the 'workers' argument to 1 and the 'use_multiprocessing' argument to False.
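Roughly, the end of the train() method in model.py should look like this after the change; this is a sketch, and the exact line numbers and whether the method calls fit or fit_generator depend on the fork you are using:

```python
# Inside MaskRCNN.train() in mrcnn/model.py, near the end of the method
workers = 1  # was: multiprocessing.cpu_count()

self.keras_model.fit(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=workers,
    use_multiprocessing=False,  # was: True
)
```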
Solution 2:[2]
Try this:
1- Inside the (mrcnn) folder, open the file (model.py).
2- Change line 2362 from:
workers = multiprocessing.cpu_count()
to:
workers = 1
3- Change line 2374 from:
use_multiprocessing=True,
to:
use_multiprocessing=False,
Or you can try this fork, where I have already made these changes: https://github.com/manasrda/Mask_RCNN. It fixed a similar problem for me.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | D.Manasreh