Saving the shuffled images in a batch to disk with the original filename
I have a dataset in a single directory that I wish to split into a training and a validation set, then save all images of each set to a different directory.
I'm trying to do this with the tf.keras.preprocessing.image_dataset_from_directory() and tf.keras.preprocessing.image.save_img() functions, together with the tf.data.Dataset.file_paths attribute.
The code looks something like this:
import os

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image_dataset_from_directory

# Build the training and validation splits from the same directory,
# using the same seed so the split itself is reproducible.
train_dataset = image_dataset_from_directory(PATH_DS,
                                             shuffle=True,
                                             labels='inferred',
                                             label_mode='categorical',
                                             class_names=class_names,
                                             batch_size=1,
                                             image_size=[1080, 1920],
                                             validation_split=0.15,
                                             subset="training",
                                             seed=456)
validation_dataset = image_dataset_from_directory(PATH_DS,
                                                  shuffle=True,
                                                  labels='inferred',
                                                  label_mode='categorical',
                                                  class_names=class_names,
                                                  batch_size=1,
                                                  image_size=[1080, 1920],
                                                  validation_split=0.15,
                                                  subset="validation",
                                                  seed=456)

# The file paths belonging to each split.
filepaths_val = validation_dataset.file_paths
filepaths_train = train_dataset.file_paths

# Write each (single-image) batch back to disk, named after the file path
# that is assumed to correspond to it.
for idx, (batch, filepath) in enumerate(zip(train_dataset.as_numpy_iterator(), train_dataset.file_paths)):
    images, labels = batch
    tf.keras.preprocessing.image.save_img(
        os.path.join(PATH_WD, f"test/train/{class_names[np.argmax(labels[0])]}/{os.path.basename(filepath)}"),
        images[0], "channels_last", "png")
I need to have the images shuffled because they have filenames such that an alphanumerical sort would result in data leakage between the sets.
The problem I am running into seems to be that the dataset iterator has a random initialization. The file_paths object is just a list that I can slice, and I've already verified that a given seed always returns the same file paths.
However, iterating over the dataset always returns the elements in a different order. I've tried the Dataset.unbatch() method, as_numpy_iterator(), etc.; every time I start iterating, the first element I get back is different.
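For illustration, here is a minimal sketch of the mismatch described above, continuing the snippet (it reuses train_dataset and the numpy import; the exact output depends on the shuffle behaviour):

# file_paths is fixed for a given seed...
print(train_dataset.file_paths[0])

# ...but each pass over the dataset can start with a different image, so
# zipping the iterator with file_paths does not pair images with their own filenames.
first_pass_images, _ = next(iter(train_dataset.as_numpy_iterator()))
second_pass_images, _ = next(iter(train_dataset.as_numpy_iterator()))
print(np.array_equal(first_pass_images, second_pass_images))  # typically False with shuffle=True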
Solution 1:[1]
Your loss function has to be differentiable over the whole domain, which means no sharp turning points. In layman's terms, it has to be "smooth".
Solution 2:[2]
What you are taking is the argmax of the tensor, and this operator is not differentiable. In practical terms, this means you can't backpropagate through that operation, i.e. call backward on the result of pred.max(1).indices or, similarly, pred.argmax(1).
Here have a look:
>>> pred = torch.rand(10, 10, requires_grad=True)
>>> values, indices = pred.max(1)
>>> values.grad_fn # can be backpropagated on:
<MaxBackward0 at 0x7febc10d2ed0>
>>> indices.grad_fn # can't be backpropagated on:
None
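Not part of the original answer, but as a hedged illustration of the usual workaround: build the loss from the raw logits (or from the differentiable values returned by max) rather than from the indices, e.g. with cross-entropy:

import torch
import torch.nn.functional as F

pred = torch.rand(10, 10, requires_grad=True)  # logits for 10 samples, 10 classes
target = torch.randint(0, 10, (10,))           # integer class labels

# Differentiable: the loss is computed from the raw logits, not from argmax.
loss = F.cross_entropy(pred, target)
loss.backward()  # works; pred.grad is now populated

# Non-differentiable: argmax breaks the graph, so this line would raise a RuntimeError.
# pred.argmax(1).float().mean().backward()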
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | jjaskulowski |
Solution 2 | |