Is there any way to work with TensorFlow on a Mac with an Apple Silicon (M1, M1 Pro, M1 Max) GPU?
I have a MacBook Pro with an M1 Max processor and I want to run TensorFlow on its GPU. I followed the steps from https://developer.apple.com/metal/tensorflow-plugin, but training runs slower on the GPU than on the CPU and I don't know why. I tested with the MNIST tutorial from Google's official page.
Code I tried
import tensorflow as tf
import tensorflow_datasets as tfds

DISABLE_GPU = False

if DISABLE_GPU:
    try:
        # Disable all GPUS
        tf.config.set_visible_devices([], 'GPU')
        visible_devices = tf.config.get_visible_devices()
        for device in visible_devices:
            assert device.device_type != 'GPU'
    except:
        # Invalid device or cannot modify virtual devices once initialized.
        pass

print(tf.__version__)

(ds_train, ds_test), ds_info = tfds.load('mnist', split=['train', 'test'], shuffle_files=True, as_supervised=True,
                                         with_info=True)

def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
model.fit(ds_train, epochs=6, validation_data=ds_test, )
Output (GPU):
462/469 [============================>.] - ETA: 0s - loss: 0.3619 - sparse_categorical_accuracy: 0.9003
469/469 [==============================] - 4s 5ms/step - loss: 0.3595 - sparse_categorical_accuracy: 0.9008 - val_loss: 0.1963 - val_sparse_categorical_accuracy: 0.9432
Epoch 2/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1708 - sparse_categorical_accuracy: 0.9514 - val_loss: 0.1392 - val_sparse_categorical_accuracy: 0.9606
Epoch 3/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1224 - sparse_categorical_accuracy: 0.9651 - val_loss: 0.1233 - val_sparse_categorical_accuracy: 0.9650
Epoch 4/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0956 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.0988 - val_sparse_categorical_accuracy: 0.9696
Epoch 5/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0766 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0875 - val_sparse_categorical_accuracy: 0.9727
Epoch 6/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0633 - sparse_categorical_accuracy: 0.9813 - val_loss: 0.0842 - val_sparse_categorical_accuracy: 0.9745
Output (without GPU)
469/469 [==============================] - 2s 1ms/step - loss: 0.3598 - sparse_categorical_accuracy: 0.9013 - val_loss: 0.1970 - val_sparse_categorical_accuracy: 0.9427
Epoch 2/6
469/469 [==============================] - 0s 933us/step - loss: 0.1705 - sparse_categorical_accuracy: 0.9511 - val_loss: 0.1449 - val_sparse_categorical_accuracy: 0.9589
Epoch 3/6
469/469 [==============================] - 0s 936us/step - loss: 0.1232 - sparse_categorical_accuracy: 0.9642 - val_loss: 0.1146 - val_sparse_categorical_accuracy: 0.9655
Epoch 4/6
469/469 [==============================] - 0s 925us/step - loss: 0.0955 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9690
Epoch 5/6
469/469 [==============================] - 0s 946us/step - loss: 0.0774 - sparse_categorical_accuracy: 0.9781 - val_loss: 0.0890 - val_sparse_categorical_accuracy: 0.9732
Epoch 6/6
469/469 [==============================] - 0s 971us/step - loss: 0.0647 - sparse_categorical_accuracy: 0.9811 - val_loss: 0.0844 - val_sparse_categorical_accuracy: 0.9752
Solution 1:[1]
It may run slower because of the small batch size used in the tutorial. First, however, make sure you have set up everything correctly as described below. We will use Miniforge instead of Anaconda, because Anaconda doesn't have GPU support.
Setting up miniforge for TensorFlow support
- Download Miniforge3-MacOSX-arm64.sh
- Run the file using the following command:
  ./Miniforge3-MacOSX-arm64.sh
  (Don't run the above with sudo. If you get a permission error, first run chmod +x ./Miniforge3-MacOSX-arm64.sh)
- It will download miniforge in the current directory. Now you have to activate it. Use the following command to do so:
  source miniforge3/bin/activate
- You should see (conda) prepended to your command line. To make sure it is activated at terminal start-up, use the following command:
  conda init
  or, if you are using zsh,
  conda init zsh
- Make sure it is activated properly. To check, use which python. It should show .../miniforge3/bin/python. If it doesn't, first remove the miniforge3 directory and try to install again from step 2. Also, make sure your anaconda environment is disabled.
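The `which python` check above can also be done from inside Python. The following is a minimal sketch, not part of the original answer; the helper name `interpreter_report` is made up for illustration, and the `"miniforge3"` substring test assumes the default install directory name used in the steps above. On an Apple Silicon Mac, a natively running interpreter is expected to report `arm64`, while one running under Rosetta reports `x86_64`.

```python
import platform
import sys


def interpreter_report():
    """Describe the running interpreter (hypothetical helper, for illustration only)."""
    executable = sys.executable or ""
    machine = platform.machine()
    return {
        "executable": executable,                    # path that `which python` should match
        "machine": machine,                          # CPU architecture the interpreter runs as
        "is_arm64": machine == "arm64",              # True for a native Apple Silicon build on macOS
        "in_miniforge": "miniforge3" in executable,  # assumes the default directory name
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

If `in_miniforge` is false or `machine` is `x86_64`, the shell is probably still using a different Python installation.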
Now we'll install TensorFlow and its dependencies.
- Create a new environment on top of the conda environment using the following commands and activate it (use python --version to find your Python version):
  conda create -n tensorflow python=<your-python-version>
  conda activate tensorflow
- Now install the TensorFlow dependencies using the following command:
  conda install -c apple tensorflow-deps
- Install TensorFlow and TensorFlow Metal for Mac using the following commands:
  pip install tensorflow-macos
  pip install tensorflow-metal
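Once the packages are installed, it can help to confirm that TensorFlow actually sees the GPU before benchmarking. This is a small sketch that is not part of the original answer; the helper name `visible_gpus` is made up for illustration. It relies on `tf.config.list_physical_devices`, and returns `None` when TensorFlow cannot be imported so that it can be run in any environment.

```python
def visible_gpus():
    """Return the names of GPU devices TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("GPU devices:", gpus)
    else:
        print("TensorFlow imported, but no GPU device is visible.")
```

An empty list after a successful import would suggest that the Metal plugin is not installed in the active environment.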
Additional packages
- Install Jupyter using the following command:
  conda install -c conda-forge jupyterlab
Troubleshooting
'miniforge3/envs/tensorflow/lib/libcblas.3.dylib' (no such file) or similar libcblas error.
Solution:
  conda install -c conda-forge openblas
/tensorflow/core/framework/tensor.h:880] Check failed: IsAligned() ptr = 0x101511d60
Solution: I found this error when using TensorFlow >2.5.0 for some programs. Use TensorFlow version 2.5.0. To reinstall it, do the following:
  pip uninstall tensorflow-macos
  pip uninstall tensorflow-metal
  conda install -c apple tensorflow-deps==2.5.0 --force-reinstall
  (optional, try only if you get an error)
  pip install tensorflow-macos==2.5.0
  pip install tensorflow-metal
You may encounter an import error in Jupyter when you import tensorflow. This should be fixed by installing a new kernel:
  python -m ipykernel install --user --name tensorflow --display-name "Python <your-python-version> (tensorflow)"
- Important: When you launch Jupyter, make sure to select this kernel. Also, Jupyter outside the tensorflow environment can import tensorflow using this kernel (i.e. you don't have to activate the tensorflow environment every time you want to use it in Jupyter).
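When troubleshooting version problems or a notebook kernel, it is useful to see which package versions the running interpreter actually has. The sketch below is an illustration rather than part of the original answer; the helper name `installed_versions` is made up, and it uses only the standard-library `importlib.metadata` module (Python 3.8+). The package names are the ones installed in the steps above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["tensorflow-macos", "tensorflow-metal", "tensorflow-datasets"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in a notebook cell shows whether the selected kernel points at the environment where the packages were installed.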
Testing (M1 Max, 10-core CPU, 24-core GPU version)
Code:
import tensorflow as tf
import tensorflow_datasets as tfds

DISABLE_GPU = False

if DISABLE_GPU:
    try:
        # Disable all GPUS
        tf.config.set_visible_devices([], 'GPU')
        visible_devices = tf.config.get_visible_devices()
        for device in visible_devices:
            assert device.device_type != 'GPU'
    except:
        # Invalid device or cannot modify virtual devices once initialized.
        pass

print(tf.__version__)

(ds_train, ds_test), ds_info = tfds.load('mnist', split=['train', 'test'], shuffle_files=True, as_supervised=True,
                                         with_info=True)

def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
model.fit(ds_train, epochs=6, validation_data=ds_test, )
For batch size = 128
Output (GPU)
462/469 [============================>.] - ETA: 0s - loss: 0.3619 - sparse_categorical_accuracy: 0.9003
469/469 [==============================] - 4s 5ms/step - loss: 0.3595 - sparse_categorical_accuracy: 0.9008 - val_loss: 0.1963 - val_sparse_categorical_accuracy: 0.9432
Epoch 2/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1708 - sparse_categorical_accuracy: 0.9514 - val_loss: 0.1392 - val_sparse_categorical_accuracy: 0.9606
Epoch 3/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1224 - sparse_categorical_accuracy: 0.9651 - val_loss: 0.1233 - val_sparse_categorical_accuracy: 0.9650
Epoch 4/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0956 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.0988 - val_sparse_categorical_accuracy: 0.9696
Epoch 5/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0766 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0875 - val_sparse_categorical_accuracy: 0.9727
Epoch 6/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0633 - sparse_categorical_accuracy: 0.9813 - val_loss: 0.0842 - val_sparse_categorical_accuracy: 0.9745
Output (without GPU)
469/469 [==============================] - 2s 1ms/step - loss: 0.3598 - sparse_categorical_accuracy: 0.9013 - val_loss: 0.1970 - val_sparse_categorical_accuracy: 0.9427
Epoch 2/6
469/469 [==============================] - 0s 933us/step - loss: 0.1705 - sparse_categorical_accuracy: 0.9511 - val_loss: 0.1449 - val_sparse_categorical_accuracy: 0.9589
Epoch 3/6
469/469 [==============================] - 0s 936us/step - loss: 0.1232 - sparse_categorical_accuracy: 0.9642 - val_loss: 0.1146 - val_sparse_categorical_accuracy: 0.9655
Epoch 4/6
469/469 [==============================] - 0s 925us/step - loss: 0.0955 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9690
Epoch 5/6
469/469 [==============================] - 0s 946us/step - loss: 0.0774 - sparse_categorical_accuracy: 0.9781 - val_loss: 0.0890 - val_sparse_categorical_accuracy: 0.9732
Epoch 6/6
469/469 [==============================] - 0s 971us/step - loss: 0.0647 - sparse_categorical_accuracy: 0.9811 - val_loss: 0.0844 - val_sparse_categorical_accuracy: 0.9752
At this point, running on the CPU is much faster (about 5x) than running on the GPU, but don't be disappointed by that. Next, we will run with larger batches.
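The answer does not show how the batch size was changed; presumably the argument of the `.batch(128)` calls in the script was edited, which is an assumption here. What the logs do confirm is the effect on the number of steps per epoch: the progress counters read 469/469 for batch size 128 and 59/59 for batch size 1024. A short sketch of that arithmetic, using the 60,000 examples of the MNIST training split:

```python
import math

MNIST_TRAIN_EXAMPLES = 60_000  # size of the MNIST training split


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last, partial batch is kept."""
    return math.ceil(num_examples / batch_size)


if __name__ == "__main__":
    for batch_size in (128, 1024):
        print(batch_size, steps_per_epoch(MNIST_TRAIN_EXAMPLES, batch_size))
    # prints:
    # 128 469
    # 1024 59
```

With fewer, larger steps, any fixed per-step overhead is paid less often, which is why the per-step times at the two batch sizes are not directly comparable.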
Batch size = 1024
Output (GPU)
58/59 [============================>.] - ETA: 0s - loss: 0.4862 - sparse_categorical_accuracy: 0.8680
59/59 [==============================] - 2s 11ms/step - loss: 0.4839 - sparse_categorical_accuracy: 0.8686 - val_loss: 0.2269 - val_sparse_categorical_accuracy: 0.9362
Epoch 2/6
59/59 [==============================] - 0s 8ms/step - loss: 0.1964 - sparse_categorical_accuracy: 0.9442 - val_loss: 0.1610 - val_sparse_categorical_accuracy: 0.9543
Epoch 3/6
59/59 [==============================] - 1s 9ms/step - loss: 0.1408 - sparse_categorical_accuracy: 0.9605 - val_loss: 0.1292 - val_sparse_categorical_accuracy: 0.9624
Epoch 4/6
59/59 [==============================] - 1s 9ms/step - loss: 0.1067 - sparse_categorical_accuracy: 0.9707 - val_loss: 0.1055 - val_sparse_categorical_accuracy: 0.9687
Epoch 5/6
59/59 [==============================] - 1s 9ms/step - loss: 0.0845 - sparse_categorical_accuracy: 0.9767 - val_loss: 0.0912 - val_sparse_categorical_accuracy: 0.9723
Epoch 6/6
59/59 [==============================] - 1s 9ms/step - loss: 0.0683 - sparse_categorical_accuracy: 0.9814 - val_loss: 0.0827 - val_sparse_categorical_accuracy: 0.9747
Output (without GPU)
59/59 [==============================] - 2s 15ms/step - loss: 0.4640 - sparse_categorical_accuracy: 0.8739 - val_loss: 0.2280 - val_sparse_categorical_accuracy: 0.9338
Epoch 2/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1962 - sparse_categorical_accuracy: 0.9450 - val_loss: 0.1626 - val_sparse_categorical_accuracy: 0.9537
Epoch 3/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1411 - sparse_categorical_accuracy: 0.9602 - val_loss: 0.1304 - val_sparse_categorical_accuracy: 0.9613
Epoch 4/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1091 - sparse_categorical_accuracy: 0.9700 - val_loss: 0.1020 - val_sparse_categorical_accuracy: 0.9698
Epoch 5/6
59/59 [==============================] - 1s 12ms/step - loss: 0.0864 - sparse_categorical_accuracy: 0.9764 - val_loss: 0.0912 - val_sparse_categorical_accuracy: 0.9716
Epoch 6/6
59/59 [==============================] - 1s 12ms/step - loss: 0.0697 - sparse_categorical_accuracy: 0.9812 - val_loss: 0.0834 - val_sparse_categorical_accuracy: 0.9749
As you can see, running on the GPU is now faster (about 1.3x) than running on the CPU. Increasing the batch size can significantly improve GPU performance.
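The speedup figures can be reproduced from the per-step times in the logs. The sketch below is an illustration, not part of the original answer: it averages the step times of epochs 2 to 6 (epoch 1 is left out on the assumption that it includes warm-up) and multiplies by the step count for a rough per-epoch estimate. Keras rounds the displayed step times, so the results are approximate.

```python
from statistics import mean

# Per-step times in milliseconds, copied from epochs 2-6 of the logs above.
STEP_MS = {
    (128, "gpu"): [5, 5, 5, 5, 5],
    (128, "cpu"): [0.933, 0.936, 0.925, 0.946, 0.971],
    (1024, "gpu"): [8, 9, 9, 9, 9],
    (1024, "cpu"): [12, 12, 12, 12, 12],
}
STEPS_PER_EPOCH = {128: 469, 1024: 59}  # from the progress counters in the logs


def summarize(batch_size):
    """Estimate epoch times (seconds) and the GPU-over-CPU speedup for one batch size."""
    gpu_ms = mean(STEP_MS[(batch_size, "gpu")])
    cpu_ms = mean(STEP_MS[(batch_size, "cpu")])
    steps = STEPS_PER_EPOCH[batch_size]
    return {
        "gpu_epoch_s": gpu_ms * steps / 1000,
        "cpu_epoch_s": cpu_ms * steps / 1000,
        "gpu_speedup": cpu_ms / gpu_ms,  # > 1 means the GPU is faster
    }


if __name__ == "__main__":
    for batch_size in (128, 1024):
        stats = summarize(batch_size)
        print(batch_size, {key: round(value, 2) for key, value in stats.items()})
```

A `gpu_speedup` below 1 for batch size 128 corresponds to the CPU being roughly 5x faster, and a value a little above 1.3 for batch size 1024 corresponds to the GPU advantage described above.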
Fig: Runtime graph for various batch sizes.
Sources:
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |