Flush local disk of Google Colab to Google Drive

I am downloading a 350 GB dataset to Google Drive by setting my current working directory to a folder on Google Drive in Google Colab and using curl to download the dataset from a URL.
But instead of writing it to Google Drive, Google Colab appears to be downloading it to its local disk, which is around 250 GB in size. Since the dataset is larger than the local disk, the download crashes once it hits that limit. I think Google Colab first downloads the data into a local buffer and asynchronously flushes it to Google Drive (correct me if I am wrong here).
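
For context, the setup described above looks roughly like the following Colab cell; the Drive folder path and the download URL are placeholders, not taken from the actual question.

    from google.colab import drive

    # Mount Google Drive into the Colab filesystem
    drive.mount('/content/drive')

    # Work inside a folder that lives on Google Drive (hypothetical path)
    %cd /content/drive/MyDrive/dataset

    # Download the dataset into the current directory;
    # -L follows redirects, -O keeps the remote filename
    !curl -L -O https://example.com/dataset.tar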

How should I download the dataset directly to Google Drive, or flush Colab's local disk to Drive when it fills up?



Solution 1:[1]

I had the same problem, and a possible workaround, though a bit taxing because it is not automated, is to use wget in the Colab notebook with the URL. Periodically, say after an hour or so, stop the running cell and wait a little while (a couple of minutes). Then check on your Drive whether the size of the downloaded file matches what has been downloaded so far. If it does, re-run the cell containing the wget command, but add the -c flag so that the download continues where it left off. You can also factory-reset the runtime (in the newer Colab UI this means deleting the runtime and then reconnecting) and run the cell again with the -c flag. This worked for me.
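
A minimal sketch of that stop-and-resume loop as Colab cells follows; the Drive path and URL are placeholders.

    from google.colab import drive

    # Remount Drive after every runtime reset
    drive.mount('/content/drive')
    %cd /content/drive/MyDrive/dataset

    # First run: start the download (placeholder URL)
    !wget https://example.com/dataset.tar

    # After stopping the cell (or resetting the runtime), wait a couple of
    # minutes, check the file size on Drive, then resume with -c so wget
    # continues from where the previous attempt stopped:
    !wget -c https://example.com/dataset.tar

Note that wget -c only resumes if the server supports partial downloads (HTTP range requests); otherwise the download restarts from the beginning.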

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Arya Bhattacharyya