Why is my PyTorch code significantly slower than TensorFlow?
I am migrating my code from TensorFlow to PyTorch. Before doing so, I ran a simple benchmark to compare the two frameworks, expecting them to show similar performance. However, in my benchmark TensorFlow is much faster than PyTorch, and I cannot find the reason why the PyTorch version is slow.
Below is my TF code
import tensorflow as tf
import os
from pathlib import Path
import numpy as np
import time
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
tf.get_logger().setLevel('WARNING')
my_dtype = 'float32'
tf.keras.backend.set_floatx(my_dtype)
#%% Generate data set
x = np.linspace(0, 10, int(3 * np.power(2, 12))).astype('float32')
y = np.sin(x).astype('float32')
training_data_sets_tuple = (x, y)
dataset_imported = tf.data.Dataset.from_tensor_slices(training_data_sets_tuple)
batch_size = np.power(2, 12)
dataset_imported = dataset_imported.batch(batch_size)
#%% Create model
num_neuron = 120
num_hidden_layers = 5
act_hidden = 'tanh'
layer_input = tf.keras.Input(shape=(1,), dtype=my_dtype)
layer_dense = [0] * num_hidden_layers
for i in range(num_hidden_layers):
    if i == 0:
        layer_dense[i] = tf.keras.layers.Dense(num_neuron, activation=act_hidden, dtype=my_dtype)(layer_input)
    else:
        layer_dense[i] = tf.keras.layers.Dense(num_neuron, activation=act_hidden, dtype=my_dtype)(layer_dense[i - 1])
layer_output = tf.keras.layers.Dense(1, name='', dtype=my_dtype)(layer_dense[-1])
model = tf.keras.Model(inputs=layer_input, outputs=layer_output, name='mainnetwork')
optimizer_main_net = tf.keras.optimizers.Adam(learning_rate=1e-4)
loss_fn_main_net = tf.keras.losses.MeanSquaredError()
#%% tf.function
@tf.function
def my_training(model, mini_batch):
    with tf.GradientTape() as tape:
        pred_output = model(mini_batch[0], training=True)
        # Keras losses take (y_true, y_pred); harmless for MSE, but kept in order
        loss_val = loss_fn_main_net(mini_batch[1], pred_output)
    grads = tape.gradient(loss_val, model.trainable_weights)
    optimizer_main_net.apply_gradients(zip(grads, model.trainable_weights))
    return loss_val
#%% training
epoch = 1
tic = time.perf_counter()
while epoch < 1e3 + 1:
    loss_epoch = 0
    for step, mini_batch in enumerate(dataset_imported):
        loss_val = my_training(model, mini_batch)
        loss_epoch += float(loss_val)
    loss_epoch = loss_epoch / len(dataset_imported)
    if epoch % 200 == 0:
        toc = time.perf_counter()
        print(f'Epoch: {epoch}, Elapsed: {(toc - tic):.2f} sec')
        tic = time.perf_counter()
        print("Loss for Training on Epoch " + str(epoch) + " is " + str(loss_epoch))
    epoch += 1
Below is my PT code
import numpy as np
import time
import torch
import torch.nn as nn
import os
from torchsummary import summary
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.set_default_dtype(torch.float32)
# %% Dataset
x = np.linspace(0, 10, int(3 * np.power(2, 12))).astype('float32')
y = np.sin(x).astype('float32')
class MyDataset_simple(torch.utils.data.Dataset):
    def __init__(self, x, y, device):
        self.x = torch.tensor(x.reshape(-1, 1)).to(device)
        self.y = torch.tensor(y.reshape(-1, 1)).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        return self.x[i], self.y[i]
dataset_simple = MyDataset_simple(x, y, device)
my_batch_size = int(np.power(2, 12))
train_dataloader = torch.utils.data.DataLoader(dataset_simple, batch_size=my_batch_size)
# %% Model
class MyBenchModelSimple(nn.Module):
    def __init__(self):
        super(MyBenchModelSimple, self).__init__()
        self.my_layer = nn.Sequential(nn.Linear(1, 120),
                                      nn.Tanh(),
                                      nn.Linear(120, 120),
                                      nn.Tanh(),
                                      nn.Linear(120, 120),
                                      nn.Tanh(),
                                      nn.Linear(120, 120),
                                      nn.Tanh(),
                                      nn.Linear(120, 120),
                                      nn.Tanh(),
                                      nn.Linear(120, 1))

    def forward(self, x):
        x = self.my_layer(x)
        return x
model = MyBenchModelSimple().to(device)
summary(model, input_size=(1,))
# %% Training
epoch = 1
my_loss_func = nn.MSELoss()
my_optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
tic = time.perf_counter()
model.train()
while epoch < 1e3 + 1:
    loss_individual_epoch = 0
    for i, (train_input_x, train_input_y) in enumerate(train_dataloader):
        my_optimizer.zero_grad()
        outputs = model(train_input_x)
        loss = my_loss_func(outputs, train_input_y)
        loss.backward()
        my_optimizer.step()
        loss_individual_epoch += loss.item()
    loss_individual_epoch = loss_individual_epoch / len(train_dataloader)
    if epoch % 200 == 0:
        toc = time.perf_counter()
        print(f'Epoch: {epoch}, Elapsed: {(toc - tic):.0f} sec')
        tic = time.perf_counter()
        print("Loss for Training on Epoch " + str(epoch) + " is " + str(loss_individual_epoch))
    epoch += 1
Both models have the same number of parameters: 58,441. Here is my result for the TF code.
Epoch: 200, Elapsed: 5.58 sec
Loss for Training on Epoch 200 is 0.2962424506743749
Epoch: 400, Elapsed: 5.22 sec
Loss for Training on Epoch 400 is 0.2422607938448588
Epoch: 600, Elapsed: 5.24 sec
Loss for Training on Epoch 600 is 0.20201120525598526
Epoch: 800, Elapsed: 5.24 sec
Loss for Training on Epoch 800 is 0.14385090892513594
Epoch: 1000, Elapsed: 4.57 sec
Loss for Training on Epoch 1000 is 0.022997068629289668
Below is my result for the PT code
Epoch: 200, Elapsed: 14 sec
Loss for Training on Epoch 200 is 0.2270326167345047
Epoch: 400, Elapsed: 13 sec
Loss for Training on Epoch 400 is 0.18032070621848106
Epoch: 600, Elapsed: 13 sec
Loss for Training on Epoch 600 is 0.14652210349837938
Epoch: 800, Elapsed: 14 sec
Loss for Training on Epoch 800 is 0.07957464456558228
Epoch: 1000, Elapsed: 13 sec
Loss for Training on Epoch 1000 is 0.06703292826811473
Where might I have made a mistake in the PT code?
Solution 1:[1]
Your PT script sets CUDA_LAUNCH_BLOCKING=1, which disables asynchronous kernel launches and should be used for debugging only.
See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution for detailed information.
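As a minimal sketch of the fix (assuming the slowdown indeed comes from this setting, and keeping the training loop from the question unchanged), drop the environment variable and synchronize explicitly only where timing accuracy matters:

import time
import torch

# Do NOT set os.environ['CUDA_LAUNCH_BLOCKING'] = "1" for benchmark runs;
# it forces every CUDA kernel launch to block until the kernel finishes.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

tic = time.perf_counter()
# ... run the training loop from the question ...
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
toc = time.perf_counter()
print(f"Elapsed: {toc - tic:.2f} sec")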
Solution 2:[2]
I think it's because the way data is loaded differs between TF and PT. TF generates each batch from consecutive items in the dataset, so memory access is also consecutive, which is efficient. The PT dataset here is a map-style dataset, which fetches one item at a time by index and is not that efficient on tabular data; you should consider using an iterable-style dataset. A sketch of that idea follows.
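A minimal sketch of that suggestion (the class name BatchedIterableDataset is invented for illustration; it reuses the GPU tensors that dataset_simple from the question already holds): slice the tensors into whole contiguous batches inside an iterable-style dataset instead of making one __getitem__ call per sample:

import torch

# Hypothetical iterable-style dataset: each iteration yields a whole,
# contiguous batch, avoiding per-sample __getitem__ calls and the collate step.
class BatchedIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, x, y, batch_size):
        super().__init__()
        self.x, self.y, self.batch_size = x, y, batch_size

    def __iter__(self):
        for start in range(0, len(self.y), self.batch_size):
            yield (self.x[start:start + self.batch_size],
                   self.y[start:start + self.batch_size])

# batch_size=None tells DataLoader the dataset already emits batches.
train_dataloader = torch.utils.data.DataLoader(
    BatchedIterableDataset(dataset_simple.x, dataset_simple.y, my_batch_size),
    batch_size=None)

Since the tensors already live on the GPU, one could equally iterate the dataset directly and drop the DataLoader altogether.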
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Qin Heyang
Solution 2 | Tianqi Wang