TensorFlow reported CUDA_ERROR_ILLEGAL_ADDRESS while training YOLO
It is a really strange bug. Environment: TF 1.12 + CUDA 9.0 + cuDNN 7.5 + a single RTX 2080.
Today I tried to train a YOLO V3 network on my new device with a batch size of 4. Everything went right at the beginning: training started as usual and I could see the loss decreasing during the training process.
But at around round 35 it reported this message:
2020-03-20 13:52:01.404576: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-03-20 13:52:01.404908: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
and the training process exited.
I have tried several times. It happens randomly, anywhere from 30 minutes to several hours after training starts.
But if I change the batch size to 2, it trains successfully.
So why does this happen? If my environment were wrong or unsuitable for the RTX 2080, this bug should appear at the very beginning of training, not in the middle. All layers in my YOLO network were trainable from the start, so nothing changed during the training process. Why does it train correctly in the first rounds but fail midway? Why does a smaller batch size train successfully?
And what should I do now? The solutions I can think of are:
1. Compile TF 1.12 against CUDA 10 + cuDNN 7.5 and try again.
2. Maybe update TensorFlow and CUDA?
Either way would cost a lot of effort.
Solution 1:[1]
Check whether your CUDA/cuDNN/driver versions are compatible with your card: https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html#cudnn-versions-764-765.
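One way to gather the version information to compare against that support matrix is sketched below. This is a minimal sketch, assuming a TF 1.x install like the one in the question; the host-side commands in the comments (nvidia-smi, nvcc, the cudnn.h header path) are the usual places to look and may differ on your system.

```python
# Minimal version check, assuming a TF 1.x install like the one in the question.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow:", tf.__version__)                   # e.g. 1.12.0
print("Built with CUDA:", tf.test.is_built_with_cuda())

# List the GPUs TensorFlow can see (name, memory, compute capability).
for device in device_lib.list_local_devices():
    if device.device_type == "GPU":
        print(device.physical_device_desc)

# Driver, CUDA toolkit and cuDNN versions come from the host-side tools:
#   nvidia-smi                                              -> driver version
#   nvcc --version                                          -> CUDA toolkit version
#   grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h   -> cuDNN version
```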
If the above check turns out to be OK, then this issue might be caused by a broken GPU card, as commented by @ChrisM.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | TFer |