When is GPU better than CPU for Deep Learning on Apple Silicon?

Apple Silicon has brought significant improvements in performance and energy efficiency, making it a viable choice for deep learning tasks. However, it is not always obvious whether the GPU or the CPU is the better choice for a given task. In this post, we’ll explore the conditions under which GPUs outperform CPUs on Apple Silicon when using deep learning libraries like TensorFlow and PyTorch.

To determine when to use a GPU over a CPU on Apple Silicon for deep learning tasks, I conducted several experiments using TensorFlow and PyTorch libraries. I used a base-model MacBook Air (M1, 8GB RAM, 7-core GPU) with a thermal pad mod to eliminate any potential throttling.

What does the running time depend on?

When comparing the performance of GPUs and CPUs for deep learning tasks, several factors come into play. Generally speaking, GPUs can perform computations faster than CPUs when the computations are highly parallel and need little memory access. For a simple feed-forward network, the relevant parameters are:

  • Number of layers
  • Number of hidden layer neurons
  • Batch size
  • Backend (TensorFlow or PyTorch)

While exploring different combinations of these parameters, I found that the computation time is mostly proportional to the number of layers, and that the number of layers does not change the speed ratio between CPUs and GPUs. So, I fixed it to 4 layers for this exercise.

Now, let’s focus on the computations per layer. After tinkering with these numbers, I realized that the computation time mostly depends on the total amount of computation per layer. The total amount is determined by the number of hidden layer neurons and the batch size. More precisely, it can be calculated using this equation:

tflops = (2 * n_hidden * n_hidden + n_hidden) * batch_size / 1e12

This equation assumes a dense connection without batch normalization and uses the ReLU activation function, which has a small overhead. To test the performance in different scenarios, I varied the batch size from 16 to 1024 and the number of hidden neurons from 128 to 2048.
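As a minimal sketch, the equation above can be turned into a small helper and evaluated over the tested range. The function name `tflop_per_layer` is my own, and the power-of-two steps are an assumption: the text only gives the end points of the batch size and hidden neuron ranges.

```python
def tflop_per_layer(n_hidden, batch_size):
    """Amount of computation for one dense layer, in units of 1e12 operations.

    2 * n_hidden * n_hidden counts the multiply-adds of the weight matrix,
    and the extra n_hidden term is the small per-neuron overhead.
    """
    return (2 * n_hidden * n_hidden + n_hidden) * batch_size / 1e12


# Assumption: powers of two between the end points stated in the text.
batch_sizes = [2 ** k for k in range(4, 11)]   # 16, 32, ..., 1024
hidden_sizes = [2 ** k for k in range(7, 12)]  # 128, 256, ..., 2048

# Per-layer computation for every (n_hidden, batch_size) combination.
grid = {
    (n_hidden, batch_size): tflop_per_layer(n_hidden, batch_size)
    for n_hidden in hidden_sizes
    for batch_size in batch_sizes
}

smallest = min(grid.values())
largest = max(grid.values())
print(f"smallest: {smallest * 1e6:.3f} Mflop, largest: {largest * 1e3:.3f} Gflop")
```

Under these assumptions the grid spans roughly four orders of magnitude in per-layer computation, which is why the results are easiest to read on a logarithmic scale.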

TensorFlow results

[Figure: TensorFlow result]

The results from my tests using TensorFlow showed that performance predictably improved with larger per-layer computations. While batch size isn’t explicit in the figure, it can be inferred from the values of n_hidden and tflops_per_layer. As long as tflops_per_layer is the same, different combinations of batch size and n_hidden perform similarly (except for cases with large networks, i.e., n_hidden 1024 or 2048, where memory allocation may have impacted the results). However, in all of the test cases, the performance of GPUs never exceeded that of CPUs.

PyTorch results

[Figure: PyTorch results]

The pattern of the results from using PyTorch is more irregular than that of TensorFlow, especially for the CPU. Generally speaking, PyTorch was faster than TensorFlow, except for some rare cases (n_hidden: 1024 or 2048, batch size is relatively small). Notably, the performance of the CPU at small per-layer computations and the performance of the GPU at large per-layer computations were both incredible. It was great to see the CPU achieve good performance (~700 Gflops) across a broad range of configurations, while the GPU exceeded that performance and reached ~1.3 Tflops. The theoretical performance of the GPU is ~2 Tflops, so it is getting close to that.
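For readers who want to reproduce numbers like these, here is a minimal sketch of the arithmetic, assuming throughput is simply the counted operations divided by wall-clock time. The function names are hypothetical, and the elapsed time in the example is an illustrative input, not a measurement from this post.

```python
def achieved_tflops(tflop_per_layer, n_layers, elapsed_s):
    """Throughput in Tflops for one pass through n_layers equal-sized layers."""
    return tflop_per_layer * n_layers / elapsed_s


def fraction_of_peak(measured_tflops, peak_tflops):
    """How close a measured throughput is to a theoretical peak (0 to 1)."""
    return measured_tflops / peak_tflops


# Illustrative input only: 0.002 Tflop per layer, 4 layers, 10 ms per pass.
example = achieved_tflops(0.002, 4, 0.010)
print(f"{example:.2f} Tflops")

# The GPU figures quoted above: ~1.3 Tflops measured, ~2 Tflops theoretical.
print(f"{fraction_of_peak(1.3, 2.0):.0%} of peak")
```

With the figures quoted above, the GPU reaches about 65% of its theoretical peak.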

Conclusion

In this simple exercise, I demonstrated the following:

  • The matrix calculation performance improves as a function of the amount of computation per layer.
  • In my setup, PyTorch is faster than TensorFlow in general.
  • In PyTorch, GPU speed exceeds CPU speed at around 100 Mflops per layer.

Of course, these results will depend on multiple factors such as software version, CPU performance, or number of GPU cores, so I suggest running a test on your own. On the base M1 chip, the domain where the GPU outperforms the CPU was rather limited, but this might be very different on other chips that have larger GPU core counts.
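If you want to run such a test yourself, a timing helper along these lines may be a reasonable starting point. This is a generic sketch using only the Python standard library, not the script used for the results above. The name `time_callable` and the warm-up and repeat counts are arbitrary choices.

```python
import statistics
import time


def time_callable(fn, n_warmup=3, n_repeats=10):
    """Return the median wall-clock time in seconds of calling fn().

    The warm-up calls are not timed, so one-off costs such as graph
    compilation or memory allocation do not distort the measurement.
    """
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; replace with one pass of your own model.
def workload():
    return sum(i * i for i in range(10_000))


elapsed = time_callable(workload)
print(f"median time per call: {elapsed * 1e3:.3f} ms")
```

When timing GPU work, keep in mind that many backends launch work asynchronously, so `fn` should wait for the result (for example, by copying the output back to the CPU) before returning. Otherwise the measured time may only reflect how long it took to queue the work.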

Software versions

TensorFlow environment:
Python: 3.9.5
TensorFlow: 2.6.0

PyTorch environment:
Python: 3.8.13
PyTorch: 1.13.0.dev20220709

Author: Shinya

I'm a Scientist at the Allen Institute. I'm developing a biophysically realistic model of the primary visual cortex of the mouse. Formerly, I was a postdoc at the University of California, Santa Cruz. I received my Ph.D. in Physics at Indiana University, Bloomington. This blog is my personal activity and does not represent the opinions of the institution I belong to.