How to share code between two projects on Azure Databricks

I have two ML projects on Azure Databricks that work almost the same, except that they are for different clients. Essentially, I want to use some management system so I can share and reuse the same code across different projects (i.e. Python files that store helpful functions for feature engineering, Databricks notebooks that perform similar initial data preprocessing, some configuration files, etc.). At the same time, if an update is made to the shared code, it needs to be synced with all the projects that use it.

I know that in Git we can use submodules to do this: the common code is stored in Repo C, which is added as a submodule to Repo A and Repo B. The problem is that Azure Databricks doesn't support submodules. It also only supports working branches up to 200 MB, so I cannot use a monorepo (i.e. keep all the code in one repository) either. I was thinking of creating a package for the shared Python files, but I also have a few core notebooks that I want to share, which I don't think can be built into a package.

Is there any other way to do this on Databricks so I can reuse the code rather than just copying and pasting it?



Solution 1:[1]

It seems the recommended solution from Databricks is to

  1. clone the common code repo to a separate path /Workspace/Repos/<user-name>/<repo-name>
  2. add that path to sys.path in any notebook that needs access to the common code repo:
import sys
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

This will enable you to import Python modules from the common code repo. Depending on the exact location of your module within the repo, you might need to change the path that you append to sys.path.
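
For example, if the shared repo contained a helper module such as feature_utils.py at its root (the repo and module names here are only illustrative), a client notebook might use it like this:

import sys

# Make the shared repo importable; the segments in angle brackets are
# placeholders for your own workspace layout
sys.path.append("/Workspace/Repos/<user-name>/<shared-repo-name>")

# Import a module that sits at the root of the shared repo
import feature_utils

# If the module lives in a subdirectory (e.g. src/), append that
# directory instead so the import resolves
sys.path.append("/Workspace/Repos/<user-name>/<shared-repo-name>/src")
from preprocessing import clean_raw_data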

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 fskj