Updating tracked dir in DVC
According to this tutorial, when I update a file I should first remove it from DVC control (i.e. execute dvc unprotect <myfile>.dvc or dvc remove <myfile>.dvc) and then add it again via dvc add <myfile>. However, it's not clear whether I should apply the same workflow to directories.
I have a directory under DVC control with the following structure:
data/
    1.jpg
    2.jpg
Should I run dvc unprotect data every time the directory content is updated?
More specifically, I'd like to know whether I should run dvc unprotect data in the following use cases:
- A new file is added. For example, if I put a 3.jpg image in the data dir.
- A file is deleted. For example, if I delete the 2.jpg image in the data dir.
- A file is updated. For example, if I edit the 1.jpg image in a graphics editor.
- A combination of the previous use cases (i.e. some files are updated, others are deleted and new files are added).
Solution 1:[1]
Only when a file is updated, i.e. when you edit 1.jpg with your editor, AND only if the hardlink or symlink cache type is enabled.
Please check this link:
updating tracked files has to be carried out with caution to avoid data corruption when the DVC config option cache.type is set to hardlink or/and symlink
I would strongly recommend reading this document: Performance Optimization for Large Files. It explains the benefits of using hardlinks/symlinks.
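The condition described above can be sketched as a small shell check. This is only an illustration: the function name needs_unprotect is hypothetical, and it assumes the value passed in is whatever cache.type is set to, either a single link type or a comma-separated list such as reflink,hardlink,copy. The check is deliberately conservative: it answers "yes" whenever hardlink or symlink appears anywhere in the list, even if DVC would end up using another link type on your filesystem.

```shell
# Hypothetical helper: succeeds (exit status 0) when the given cache.type
# value mentions hardlink or symlink, i.e. when editing a tracked file in
# place could corrupt the cache and dvc unprotect should be run first.
needs_unprotect() {
    # Wrap the value in commas so each list element can be matched exactly.
    case ",$1," in
        *,hardlink,*|*,symlink,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage inside a DVC repository (assumes `dvc config cache.type`
# prints the configured value, and prints nothing useful when it is unset):
#   cache_type="$(dvc config cache.type 2>/dev/null || true)"
#   if needs_unprotect "$cache_type"; then dvc unprotect data; fi
```

With the default setup (no cache.type configured) the check fails, which matches the answer above: unprotecting is only a concern for in-place edits under hardlink or symlink cache types.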
Solution 2:[2]
The links above no longer work. Here is the up-to-date link; I'm also pasting the instructions here:
Modifying content
Unlink the file with dvc unprotect. This will make train.tsv safe to edit:
dvc unprotect train.tsv
Then edit the content of the file, for example with:
echo "new data item" >> train.tsv
Add the new version of the file back with DVC:
dvc add train.tsv
git add train.tsv.dvc
git commit -m "modify train data"
If you have remote storage and/or an upstream repo:
dvc push
git push
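The question is about a directory rather than a single file, so here is a minimal sketch of the same modify workflow applied to the data directory. It assumes the steps quoted above carry over unchanged when the target is a directory, and that dvc add data writes its metadata to data.dvc next to the directory. The function names begin_edit and finish_edit are hypothetical wrappers, not DVC commands.

```shell
# Hypothetical wrapper: make a tracked file or directory safe to edit in place.
begin_edit() {
    dvc unprotect "$1"
}

# Hypothetical wrapper: re-add the target after editing and commit the
# updated .dvc file. $1 is the tracked path, $2 is the commit message.
finish_edit() {
    target="${1%/}"   # drop a trailing slash so "data/" maps to "data.dvc"
    dvc add "$target" &&
        git add "$target.dvc" &&
        git commit -m "$2"
}

# Possible usage for the directory from the question:
#   begin_edit data
#   ...edit data/1.jpg, add data/3.jpg, delete data/2.jpg...
#   finish_edit data "update images"
```

As with the single-file case, dvc push and git push would follow if remote storage or an upstream repo is configured.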
Replacing files
If you want to replace the file altogether, you can take the following steps.
First, stop tracking the file by using dvc remove on the .dvc file. This will remove train.tsv from the workspace (and unlink it from the cache):
dvc remove train.tsv.dvc
Next, replace the file with new content:
echo new > train.tsv
And start tracking it again:
dvc add train.tsv
git add train.tsv.dvc .gitignore
git commit -m "new train data"
If you have remote storage and/or an upstream repo:
dvc push
git push
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Shcheklein |
Solution 2 | Henryk Borzymowski |