Updating tracked dir in DVC

According to this tutorial, when I update a file I should first remove it from DVC control (i.e. run dvc unprotect <myfile>.dvc or dvc remove <myfile>.dvc) and then add it again via dvc add <myfile>. However, it's not clear whether I should apply the same workflow to directories.

I have a directory under DVC control with the following structure:

data/
    1.jpg
    2.jpg

Should I run dvc unprotect data every time the directory content is updated?

More specifically, I'd like to know whether I should run dvc unprotect data in the following use cases:

  • A new file is added. For example, I put a 3.jpg image in the data dir.
  • A file is deleted. For example, I delete the 2.jpg image from the data dir.
  • A file is updated. For example, I edit the 1.jpg image in a graphics editor.
  • A combination of the previous use cases (i.e. some files are updated, others are deleted, and new files are added).


Solution 1:[1]

Only when a file is updated (i.e. you edit 1.jpg with your editor), and only if the hardlink or symlink cache type is enabled.

Please check this link:

updating tracked files has to be carried out with caution to avoid data corruption when the DVC config option cache.type is set to hardlink or/and symlink
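
The risk described in the quote can be reproduced without DVC at all. Below is a minimal sketch using plain hard links. The cache/ and workspace/ directories and the file names are made up for illustration and only mimic how a linked workspace file shares its content with a cache entry. The copy-and-rename step approximates what dvc unprotect is documented to do (replace the link with a regular copy); it is not DVC's actual implementation.

```shell
# Throwaway sandbox; nothing here touches a real DVC project.
workdir="$(mktemp -d)"
mkdir "$workdir/cache" "$workdir/workspace"

# Case 1: the workspace file is a hard link to the "cache" entry.
printf 'original\n' > "$workdir/cache/blob_a"
ln "$workdir/cache/blob_a" "$workdir/workspace/train.tsv"

# An in-place edit through the link also changes the cached content.
printf 'new data item\n' >> "$workdir/workspace/train.tsv"
cached_after_linked_edit="$(cat "$workdir/cache/blob_a")"

# Case 2: break the link first by swapping in a regular copy.
printf 'original\n' > "$workdir/cache/blob_b"
ln "$workdir/cache/blob_b" "$workdir/workspace/safe.tsv"
cp "$workdir/workspace/safe.tsv" "$workdir/workspace/safe.tsv.tmp"
mv "$workdir/workspace/safe.tsv.tmp" "$workdir/workspace/safe.tsv"

# The same edit now leaves the cached content untouched.
printf 'new data item\n' >> "$workdir/workspace/safe.tsv"
cached_after_unlinked_edit="$(cat "$workdir/cache/blob_b")"

echo "cache after editing through the link:"
echo "$cached_after_linked_edit"
echo "cache after editing an unlinked copy:"
echo "$cached_after_unlinked_edit"
```

In the first case the "cached" version silently picks up the new line, which is the kind of corruption the documentation warns about; in the second case it keeps its original content.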

I would strongly recommend reading this document: Performance Optimization for Large Files. It explains the benefits of using hardlinks/symlinks.
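
Applied to the data directory from the question, the rule above might be sketched as follows. The function names needs_unprotect, prepare_dir_for_edit and record_dir_changes are hypothetical helpers, not part of DVC. The sketch assumes dvc and git are installed and that data is tracked via data.dvc. The check is deliberately conservative: cache.type may hold a comma-separated list of link types, and any mention of hardlink or symlink is treated as a reason to unprotect.

```shell
# Hypothetical helper: succeeds if the given cache.type value
# mentions a link type that makes in-place edits unsafe.
needs_unprotect() {
    case "$1" in
        *hardlink*|*symlink*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: run before editing existing files in a tracked dir.
prepare_dir_for_edit() {
    dir="$1"
    # An unset cache.type is treated as "no unprotect needed" (assumption).
    cache_type="$(dvc config cache.type 2>/dev/null || true)"
    if needs_unprotect "$cache_type"; then
        dvc unprotect "$dir"
    fi
}

# Hypothetical helper: run after any change (add, delete, edit).
record_dir_changes() {
    dir="$1"
    dvc add "$dir"
    git add "${dir}.dvc"
}

# Example usage inside a DVC project (left commented out on purpose):
#   prepare_dir_for_edit data    # only matters when editing, e.g. 1.jpg
#   ...edit data/1.jpg, add data/3.jpg, delete data/2.jpg...
#   record_dir_changes data
#   git commit -m "update data dir"
```

Following the answer above, adding or deleting files would only need the record_dir_changes step; the unprotect step matters when an existing file is edited in place.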

Solution 2:[2]

The links above no longer work. Here is the up-to-date link; I am also pasting the instructions here:

Modifying content

Unlink the file with dvc unprotect. This will make train.tsv safe to edit:

dvc unprotect train.tsv

Then edit the content of the file, for example with:

echo "new data item" >> train.tsv

Add the new version of the file back with DVC:

dvc add train.tsv
git add train.tsv.dvc
git commit -m "modify train data"

If you have remote storage and/or an upstream repo:

dvc push
git push

Replacing files

If you want to replace the file altogether, you can take the following steps.

First, stop tracking the file by using dvc remove on the .dvc file. This will remove train.tsv from the workspace (and unlink it from the cache):

dvc remove train.tsv.dvc

Next, replace the file with new content:

echo new > train.tsv

And start tracking it again:

dvc add train.tsv
git add train.tsv.dvc .gitignore
git commit -m "new train data"

If you have remote storage and/or an upstream repo:

dvc push
git push

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Shcheklein
Solution 2    Henryk Borzymowski