PyTorch Lightning training console output is weird
When training a PyTorch Lightning model in a Jupyter Notebook, the console log output is awkward:
Epoch 0: 100%|█████████▉| 2315/2318 [02:05<00:00, 18.41it/s, loss=1.69, v_num=26, acc=0.562]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 0: 100%|██████████| 2318/2318 [02:09<00:00, 17.84it/s, loss=1.72, v_num=26, acc=0.500, val_loss=1.570, val_acc=0.564]
Epoch 1: 100%|█████████▉| 2315/2318 [02:04<00:00, 18.63it/s, loss=1.56, v_num=26, acc=0.594, val_loss=1.570, val_acc=0.564]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 1: 100%|██████████| 2318/2318 [02:08<00:00, 18.07it/s, loss=1.59, v_num=26, acc=0.528, val_loss=1.490, val_acc=0.583]
Epoch 2: 100%|█████████▉| 2315/2318 [02:01<00:00, 19.02it/s, loss=1.53, v_num=26, acc=0.617, val_loss=1.490, val_acc=0.583]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/1 [00:00<?, ?it/s]
Epoch 2: 100%|██████████| 2318/2318 [02:05<00:00, 18.42it/s, loss=1.57, v_num=26, acc=0.500, val_loss=1.460, val_acc=0.589]
The expected ("correct") output from the same training run would be:
Epoch 0: 100%|██████████| 2318/2318 [02:09<00:00, 17.84it/s, loss=1.72, v_num=26, acc=0.500, val_loss=1.570, val_acc=0.564]
Epoch 1: 100%|██████████| 2318/2318 [02:08<00:00, 18.07it/s, loss=1.59, v_num=26, acc=0.528, val_loss=1.490, val_acc=0.583]
Epoch 2: 100%|██████████| 2318/2318 [02:05<00:00, 18.42it/s, loss=1.57, v_num=26, acc=0.500, val_loss=1.460, val_acc=0.589]
Why are the epoch lines needlessly repeated and split in this manner? Also, I'm not sure what use the Validating
lines are, since they don't seem to provide any information.
The training and validation steps of the model are as follows:
def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    y_hat = self.forward(x)
    loss = torch.nn.NLLLoss()(torch.log(y_hat), y.argmax(dim=1))
    acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
    self.log("acc", acc, prog_bar=True)
    return loss

def validation_step(self, valid_batch, batch_idx):
    x, y = valid_batch
    y_hat = self.forward(x)
    loss = torch.nn.NLLLoss()(torch.log(y_hat), y.argmax(dim=1))
    acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
    self.log("val_loss", loss, prog_bar=True)
    self.log("val_acc", acc, prog_bar=True)
Solution 1:[1]
I've had this problem before when terminal windows are resized. The default PL progress bar uses tqdm, and you can get exactly this kind of repeated, split output when tqdm fails to redraw the screen correctly.
The PL docs mention another, "rich" progress bar you might try instead, and also discuss how to write your own.
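For concreteness, a minimal sketch of switching to the rich progress bar mentioned above (the max_epochs value and the model/dm objects are placeholders, not part of the original answer; RichProgressBar requires the rich package to be installed):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar

# Swap the default tqdm-based progress bar for the rich-based one,
# which redraws differently in notebooks and resized terminals.
trainer = pl.Trainer(
    max_epochs=3,
    callbacks=[RichProgressBar()],
)
trainer.fit(model, datamodule=dm)  # model and dm stand in for your own LightningModule / DataModule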
Solution 2:[2]
By default, the Trainer is configured to run the validation loop after each epoch. You can change this setting with the check_val_every_n_epoch flag of Trainer. See the docs.
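As an illustration, a minimal sketch (the specific values below, such as validating every 5 epochs, are arbitrary assumptions and not part of the original answer):

import pytorch_lightning as pl

# Run the validation loop only every 5 training epochs,
# instead of the default of validating after every epoch.
trainer = pl.Trainer(
    max_epochs=30,
    check_val_every_n_epoch=5,
)
trainer.fit(model, datamodule=dm)  # model and dm are placeholders for your own objects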
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Will Brannon |
| Solution 2 | Harsh |