Snakemake version mismatch of python, pandas, and numpy
Original question:
As far as I understand the best practice for snakemake conda envs (see e.g., this answer and the comment by Johannes Köster), each rule should have its own env yaml file to stay modular and maintainable. In my case that's ~25 environments. Typically, these should only specify the minimally needed tools for each rule, I think.
However, that fails for some of the rules, where the tools installed within the rule env pull in versions of pandas or numpy that are not identical to the versions in the "parent" env in which Snakemake is being run.
For example, a pandas mismatch results in:
AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals'
And with numpy I get:
ImportError: Unable to import required dependencies:
numpy:
Both are easy to solve by specifying the same versions of pandas and numpy as in the parent environment. Say, the conda env that runs Snakemake has these dependencies:
- python ==3.7.10
- pandas ==1.3.1
- numpy ==1.21.0
Then, adding these three lines to all my individual rule.yaml files solves the above problems.
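For illustration, a pinned rule env would then look something like the sketch below (the env name, channels, and tool are placeholders, not from the original post; only the three version pins are taken from the question):

```yaml
# envs/some_rule.yaml -- hypothetical per-rule environment
name: some_rule
channels:
  - conda-forge
  - bioconda
dependencies:
  # pins copied from the parent env to avoid the pickle/import errors above
  - python ==3.7.10
  - pandas ==1.3.1
  - numpy ==1.21.0
  # the tool this rule actually needs (placeholder)
  - some-tool
```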
However, this does not seem like a good or even correct solution to me. With this, any update of the version of either of these dependencies requires me to change it in all rule yaml files.
As far as I understand, this stems from conda's default to deactivate an env before activating another, instead of stacking/nesting them (see here).
Is there:
- A way to activate rule envs by stacking/nesting them within the main env from which Snakemake is being run (I don't want to set conda's global auto_stack option, as this might interfere with other stuff)?
- Or a way to refer to or 'include' a general yaml file in each rule file, so that the three dependencies do not have to be listed in each of them?
- Another solution that I am not seeing?
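One workaround along the lines of the second bullet (not something Snakemake or conda provides natively, just a sketch) is to keep the shared pins in one place and generate each rule yaml by merging them in with a small script. A minimal version of the merge step, using plain dicts to stand in for parsed yaml:

```python
def merge_env(common, rule_env):
    """Return a new env spec with the shared pins prepended to the
    rule-specific dependencies (skipping any duplicates)."""
    common_deps = common.get("dependencies", [])
    rule_deps = rule_env.get("dependencies", [])
    merged = dict(rule_env)
    merged["dependencies"] = common_deps + [
        d for d in rule_deps if d not in common_deps
    ]
    return merged

# shared pins, kept in a single file and loaded once
common = {"dependencies": ["python ==3.7.10", "pandas ==1.3.1", "numpy ==1.21.0"]}
# a per-rule spec listing only the tool the rule needs (placeholder name)
rule_env = {"name": "some_rule", "dependencies": ["some-tool"]}

print(merge_env(common, rule_env)["dependencies"])
# ['python ==3.7.10', 'pandas ==1.3.1', 'numpy ==1.21.0', 'some-tool']
```

With this, bumping a version means editing one file and regenerating the rule yamls, instead of editing ~25 files by hand.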
Edit: It seems that at least the numpy error is an incredibly convoluted bug in snakemake. I have submitted this as an issue to snakemake now. Still, the general question remains of how to avoid duplication of environment specifications.
Update:
Since asking the question originally, I have played around with it a bit more, see here, which also contains a minimal example to test this.
The issue boils down to this:
- Snakemake stacks (nests) environments when calling a rule that has a conda environment specification. My initial guess above was hence wrong.
- When this environment has a version of python/numpy/pandas that conflicts with the parent env in which Snakemake is run, we can easily get errors (as shown above).
- This for example happens when a tool requested in that env has an (implicit) dependency on python/pandas, and conda pulls in a conflicting version.
- To prevent this, we can of course explicitly request non-conflicting versions of python in the env, but that seems wasteful and harder to maintain.
Are there better solutions to this?
Solution 1:
As far as I understand the best practice for snakemake conda envs, each rule should have its own env yaml file to stay modular and maintainable.
Do you have a reference for this? As you say, that means creating lots of environments. Besides, running conda activate before every execution of every rule is going to increase the running time quite a bit.
I usually create a conda environment for the entire project, i.e. for the whole snakemake pipeline, and create rule-specific environments in case of incompatibilities (e.g. the rule needs python2).
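This approach can be sketched as a Snakefile fragment (rule names, file paths, and the py2 env file are placeholders, not from the original answer): most rules simply run in the project-wide env that launched Snakemake, and only the incompatible rule carries a conda directive.

```python
# Snakefile sketch: one project-wide env, per-rule envs only on conflict

rule summarise:
    # no conda directive: runs in the env Snakemake was started from
    input: "data/counts.tsv"
    output: "results/summary.tsv"
    script: "scripts/summarise.py"

rule legacy_step:
    # this rule needs python2, so it gets its own environment
    input: "data/raw.txt"
    output: "results/legacy.out"
    conda: "envs/py2.yaml"
    shell: "python2 scripts/legacy.py {input} > {output}"
```

Run with `snakemake --use-conda` so the per-rule env is actually created and activated for `legacy_step`.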
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | dariober |