Vertex AI Custom Training Job container not finding my module: Error while finding module for '...' (ModuleNotFoundError: No module named '...')

I have a PyTorch training job that I package as a Python source distribution (sdist, a .tar.gz file). I upload the sdist to a GCS bucket and run it in a prebuilt training container using the gcloud ai custom-jobs create CLI, roughly as sketched below.
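For context, the workflow looks approximately like this (the bucket, package name, region, and machine settings are placeholders rather than the real values, and the gcloud flags are written from memory, so they may be slightly off):

    # Build the source distribution
    python3 setup.py sdist

    # Stage it in GCS
    gsutil cp dist/MyPackage-0.1.tar.gz gs://my-bucket/staging/MyPackage-0.1.tar.gz

    # Create the custom training job against the prebuilt PyTorch GPU image
    gcloud ai custom-jobs create \
      --region=us-central1 \
      --display-name=my-training-job \
      --python-package-uris=gs://my-bucket/staging/MyPackage-0.1.tar.gz \
      --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,executor-image-uri=us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest,python-module=MyPackage.MyModule \
      --args=--job-dir=gs://my-bucket/my-job/model,--model-name=my-model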

Until a couple of weeks ago this worked fine, but in recent days my jobs have consistently failed with messages like these in their logs:

Running command: python3 -m MyPackage.MyModule --job-dir=gs://my-bucket/my-job/model --model-name=my-model

/opt/conda/bin/python3: Error while finding module specification for 'MyPackage.MyModule' (ModuleNotFoundError: No module named 'MyPackage.MyModule')

MyPackage.MyModule is, naturally, the module that contains my training code.

As mentioned above, the same procedure worked until recently. Nothing about it has changed, and I can clearly see that MyModule.py is located under MyPackage in my .tar.gz file.
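For example, listing the archive locally (package name and version are placeholders) shows something like:

    $ tar -tzf dist/MyPackage-0.1.tar.gz
    MyPackage-0.1/
    MyPackage-0.1/setup.py
    MyPackage-0.1/MyPackage/__init__.py
    MyPackage-0.1/MyPackage/MyModule.py
    ...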

The container image I am using is us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest, and as far as I can tell it has not changed since I last used it successfully.
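(One way to sanity-check this is to pull the tag locally and compare the digest it resolves to against the digest from an earlier, successful run; a rough sketch of that check, which of course does not prove nothing changed server-side:)

    # Pull the tag and print the digest it currently resolves to
    docker pull us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest
    docker inspect --format='{{index .RepoDigests 0}}' \
        us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest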

Why is the Vertex AI container not finding my training module? How can I further debug and fix this?

UPDATE (2022-05-09): A minimal reproduction of the issue is available at https://github.com/SilentiumIsrael/vertex-repro-sanitized; please see the instructions in its README.md.


