Error using MultiWorkerMirroredStrategy to train object detection research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8

I'm trying to train the research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 using MultiWorkerMirroredStrategy (by setting --num_workers=2 in the invocation of model_main_tf2.py), across two workers (0 and 1), each with a single GPU. However, whenever I attempt this I get the following error, always on worker 1:

Traceback (most recent call last):
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 553, in __next__
    return self.get_next()
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 610, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 642, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 1594, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\multi_device_iterator_ops.py", line 580, in get_next
    result.append(self._device_iterators[i].get_next())
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 889, in get_next
    return self._next_internal()
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 819, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2922, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 312, in run
    _run_main(main, args)
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 161, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 549, in next
    return self.__next__()
  File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 555, in __next__
    raise StopIteration
StopIteration

Worker 0 eventually fails after detecting that worker 1 has gone down.

This error happens regardless of which physical machines the two workers run on: I see it whether both workers run on a single machine (using localhost) or on different machines on the same network.

Based on the trace in the error messages, the error appears to occur when the training loop attempts to iterate over the training data produced by strategy.experimental_distribute_datasets_from_function. Note that if I change the strategy to MirroredStrategy it runs fine on a single machine (no other changes made). I'm not sure whether I'm doing something wrong or there is a bug in the object detection API.
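
For context, here is a minimal sketch of the code path involved as I understand it (the flag-to-strategy mapping and the iterator call are my paraphrase of model_main_tf2.py and model_lib_v2.py, not code copied from them):

    import tensorflow as tf

    # Strategy selection, roughly how model_main_tf2.py maps --num_workers to a
    # distribution strategy (my paraphrase, not code copied from the script).
    num_workers = 2  # value passed on the command line
    if num_workers > 1:
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
    else:
        strategy = tf.distribute.MirroredStrategy()

    def dataset_fn(input_context):
        # In the object detection API this is built from the train_input_reader
        # in pipeline.config; a stand-in dataset is used here.
        return tf.data.Dataset.range(100).batch(8)

    dist_dataset = strategy.experimental_distribute_datasets_from_function(dataset_fn)

    # _ensure_model_is_built effectively does the equivalent of this; if the
    # current worker's share of the data is empty, the very first get_next()
    # raises OutOfRangeError, which surfaces as the StopIteration above.
    features = next(iter(dist_dataset))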

My setup on both machines is identical (I basically followed the setup instructions on the object detection website):

  1. Windows 10
  2. TensorFlow 2.8.0
  3. CUDA Toolkit 11.2
  4. cuDNN 8.1

Has anyone ever seen this error before? If so, is there a way around it?



Solution 1:[1]

Ok, I think I understand the issue. The object detection library has a file called dataset_builder.py that builds the training dataset from the TFRecord file(s) specified in pipeline.config (the input_path entries of the tf_record_input_reader). The function that actually reads the TFRecord files is _read_dataset_internal. It treats input_path as a LIST OF FILES and applies a sharding function (passed in as an argument) to divide those files between the replicas doing the training (one replica per worker). Since my input_path specified only a single TFRecord file, that file went to the first replica and the other replicas were left with an empty file list! So only the first replica actually had an input dataset to work with, hence the crash.
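
Here is a stripped-down illustration of that file-level sharding (the filenames and worker index are hypothetical; the real logic lives in _read_dataset_internal in dataset_builder.py):

    import tensorflow as tf

    # Stripped-down illustration of the file-level sharding; the filenames and
    # worker index are hypothetical stand-ins for what dataset_builder.py does.
    filenames = ["train.record"]      # only one file listed in input_path
    num_workers, worker_index = 2, 1  # worker 1's view of the cluster

    files = tf.data.Dataset.from_tensor_slices(filenames)
    files = files.shard(num_workers, worker_index)  # keep every num_workers-th FILE

    # With a single input file, worker 0 keeps it and worker 1 is left with an
    # empty file list, so the dataset below yields nothing and the very first
    # get_next() fails with "End of sequence".
    records = files.interleave(tf.data.TFRecordDataset, cycle_length=1)
    print(sum(1 for _ in records))  # prints 0 on worker 1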

The solution was to split the training data across two files (two TFRecords) and then set the input_path in pipeline.config to a list of paths rather than a single path. Once I did this, the model appeared to train successfully (at least it didn't crash).
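
As a rough sketch of the split (the file names here are just examples; adjust the paths to your own data):

    import tensorflow as tf

    # Split the single TFRecord into two shard files so each worker is assigned
    # at least one file. The paths are just examples; adjust them to your data.
    src = "train.record"
    shards = ["train-00000-of-00002.record", "train-00001-of-00002.record"]

    writers = [tf.io.TFRecordWriter(path) for path in shards]
    for i, rec in enumerate(tf.data.TFRecordDataset(src)):
        writers[i % len(writers)].write(rec.numpy())
    for w in writers:
        w.close()

    # pipeline.config then lists both shards via repeated input_path entries:
    #   train_input_reader {
    #     tf_record_input_reader {
    #       input_path: "train-00000-of-00002.record"
    #       input_path: "train-00001-of-00002.record"
    #     }
    #   }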

I'm not sure whether this is a bug in the object detection code or not. I had assumed that if I had only one training record (visible to both workers), both workers would use it and just batch the data accordingly. I'm just not sure whether the assumption itself is wrong, or the assumption is correct and the code is wrong.
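
For what it's worth, the behaviour I had assumed corresponds to sharding at the record level rather than the file level, something like this (purely illustrative, with a placeholder path; not what dataset_builder.py does):

    import tensorflow as tf

    # The behaviour I had assumed: every worker opens the same single file and
    # shards at the RECORD level, so no worker ends up with an empty dataset.
    # Purely illustrative ("train.record" is a placeholder); it is not what
    # dataset_builder.py actually does.
    num_workers, worker_index = 2, 1
    records = tf.data.TFRecordDataset("train.record").shard(num_workers, worker_index)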

Anyway, I hope this helps anyone who might be wrestling with the same issue.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: jschreib