'Pytorch model weights change when put on GPU

I noticed a very strange behaviour regarding the 3D Resnet by Facebookresearch. Using their sample code from the website, I receive different results, when putting the model on GPU. While on cpu the correct class (archery) is predicted, the model fails to predict it on GPU. Can anyone replicate this and confirm that this is indeed the case? Does anyone know, why this is happening and how to prevent it? Following, you will find some code to quickly test it out:

import torch
import json
import urllib
from pytorchvideo.data.encoded_video import EncodedVideo

from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo,
)
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample
)


def predict_archery(model, device):
    json_url = "https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json"
    json_filename = "kinetics_classnames.json"
    try:
        urllib.URLopener().retrieve(json_url, json_filename)
    except:
        urllib.request.urlretrieve(json_url, json_filename)

    with open(json_filename, "r") as f:
        kinetics_classnames = json.load(f)

    # Create an id to label name mapping
    kinetics_id_to_classname = {}
    for k, v in kinetics_classnames.items():
        kinetics_id_to_classname[v] = str(k).replace('"', "")

    side_size = 256
    mean = [0.45, 0.45, 0.45]
    std = [0.225, 0.225, 0.225]
    crop_size = 256
    num_frames = 8
    sampling_rate = 8
    frames_per_second = 30

    # Note that this transform is specific to the slow_R50 model.
    transform = ApplyTransformToKey(
        key="video",
        transform=Compose(
            [
                UniformTemporalSubsample(num_frames),
                Lambda(lambda x: x / 255.0),
                NormalizeVideo(mean, std),
                ShortSideScale(
                    size=side_size
                ),
                CenterCropVideo(crop_size=(crop_size, crop_size))
            ]
        ),
    )

    # The duration of the input clip is also specific to the model.
    clip_duration = (num_frames * sampling_rate) / frames_per_second

    url_link = "https://dl.fbaipublicfiles.com/pytorchvideo/projects/archery.mp4"
    video_path = 'archery.mp4'
    try:
        urllib.URLopener().retrieve(url_link, video_path)
    except:
        urllib.request.urlretrieve(url_link, video_path)

    # Select the duration of the clip to load by specifying the start and end duration
    # The start_sec should correspond to where the action occurs in the video
    start_sec = 0
    end_sec = start_sec + clip_duration

    # Initialize an EncodedVideo helper class and load the video
    video = EncodedVideo.from_path(video_path)

    # Load the desired clip
    video_data = video.get_clip(start_sec=start_sec, end_sec=end_sec)

    # Apply a transform to normalize the video input
    video_data = transform(video_data)

    # Move the inputs to the desired device
    inputs = video_data["video"]
    inputs = inputs.to(device)

    # Pass the input clip through the model
    preds = model(inputs[None, ...])

    # Get the predicted classes
    post_act = torch.nn.Softmax(dim=1)
    preds = post_act(preds)
    pred_classes = preds.topk(k=5).indices[0]

    # Map the predicted classes to the label names
    pred_class_names = [kinetics_id_to_classname[int(i)] for i in pred_classes]
    print("Top 5 predicted labels: %s" % ", ".join(pred_class_names))


if __name__ == '__main__':
    # Choose device
    # device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    device = torch.device("cpu")

    # Choose the `slow_r50` model
    model = torch.hub.load('facebookresearch/pytorchvideo', 'slow_r50', pretrained=True).to(device)
    model = model.eval()

    predict_archery(model, device)

Results on cpu:

Top 5 predicted labels: archery, throwing axe, playing paintball, stretching arm, riding or walking with horse

Results on GPU:

Top 5 predicted labels: flying kite, air drumming, beatboxing, smoking, reading book

Edit:

Apparently, this issue cannot be reproduced on google colab. I therefore assume that the issue is related to the specific hardware / cuda version. I am using a NVIDIA TITAN Xp and cuda version 11.4.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source