Apple Silicon deep learning performance

gl3lan · Oct 26, 2021

In case anyone is interested, in ran a fairly simple MNIST benchmark (proposed here : https://github.com/apple/tensorflow_macos/issues/25) on my recently acquired M1 Pro MBP (16-core GPU, 16GB RAM). I installed Tensorflow using the following guide (https://developer.apple.com/metal/tensorflow-plugin/).

For reference, this benchmark seems to run at around 24ms/step on M1 GPU.

On the M1 Pro, the benchmark runs at between 11 and 12ms/step (twice the TFLOPs, twice as fast as an M1 chip).

The same benchmark run on an RTX-2080 (fp32 13.5 TFLOPS) gives 6ms/step and 8ms/step when run on a GeForce GTX Titan X (fp32 6.7 TFLOPs). A similar level of performance should be also expected on the M1 Max GPU (which should run twice as fast as the M1 Pro).

Of course, this benchmark runs a fairly simple CNN model but it already gives an idea. Keep also in mind that RTX generation cards are able to run faster at fp16 precision, I am not sure it would apply to Apple Silicon.

I would be happy to run any other benchmark if suggested (or help someone to run the benchmark on a M1 Max chip), even if I am more of a PyTorch guy. ;-)

[edit] Makes me wonder whether I should have gone for the M1 Max chip... probably not.

leman · Oct 26, 2021

Are those desktop Nvidia cards you are comparing it to? If so, no bad at all for a compact laptop class.

But yeah, the memory bandwidth a d large caches make these machines ideal for data science. There has been some unusually high activity on PyTorch GitHub recently asking for a native M1 backend. There is a good chance that 2022 is the year when Apple takes the ML community by storm.

yurkennis · Oct 27, 2021

gl3lan said:
I would be happy to run any other benchmark if suggested

How was it battery- / heat-wise? And curious how Pro compares to Max in terms of battery and heat.

senttoschool · Oct 27, 2021

leman said:
Are those desktop Nvidia cards you are comparing it to? If so, no bad at all for a compact laptop class.

But yeah, the memory bandwidth a d large caches make these machines ideal for data science. There has been some unusually high activity on PyTorch GitHub recently asking for a native M1 backend. There is a good chance that 2022 is the year when Apple takes the ML community by storm.

Getting 64GB of VRAM memory for "cheap" is huge.

Previously, you needed an $13k Nvidia A100 card for that.

gl3lan · Oct 27, 2021

yurkennis said:
How was it battery- / heat-wise? And curious how Pro compares to Max in terms of battery and heat.

Honestly it did not run long enough for me to evaluate battery-wise. Heat-wise, I have not heard the fan once since I have the machine, and the computer was slightly warm at some point (but orders of magnitudes cooler than my 2015 13 or my 2020 13).

If you are interested in more extensive benchmarks, you can have have a look here. Basically, the M1 Max is around 8 times slower than a RTX 3090 (the 3090 benchmark being run in fp16 precision for maximum speed), but consumes 8 times less power.

I ran the ResNet50 benchmark on my M1 Pro (16GB RAM) and achieved circa 65 img/sec (half the M1 Max throughput, as expected), the RAM pressure was sometimes orange during the benchmark.

I also ran the same benchmark on a RTX 2080 Ti (256 img/sec in fp32 precision, 620 img/sec in fp16 precision), and on a 2015 issued Geforce GTX TITAN X (128 img/sec in fp32 precision, 170 img/sec in fp16 precision).

Overall I think the M1 Max is a very promising GPU, especially considering it can be configured with 64 GB or RAM. Time will tell if the ML community will adopt those machine and if it will be relevant for PyTorch to be adapted as well.

Depending on the progress of subsequent Apple Silicon chip generations (and on the GPUs proposed on a future Mac Pro), deep learning on Mac might become attractive.

Xiao_Xi · Oct 27, 2021

Apple needs to help Pytorch devs with the Metal backend and improve its Tensorflow fork to be taken seriously in the ML community.

leman · Oct 27, 2021

These are awful results to be honest. M1 Max should be much faster than that. With Apple it's really half a step forward and then an awkward dance in all directions... they release an open source tensor flow version, the suddenly drop it and replace it by a closed source plugin that's hidden away on their website, with no documentation, no changeling, no anything. Just make it open source. Give the community the opportunity to fix bugs.

Also, it seems that the tensorflow plugin is not using AMX accelerators, which are the fastest matrix hardware on the M1. Why?

Regarding your notes on FP16 and FP64: M1 GPU does not support FP64. FP16 gets promoted to FP32 in the ALUs. There is no performance difference between FP16 and FP32 (except that FP16 uses less of register files and can improve hardware utilization on complex shaders, but I doubt that this applies to ML).

CarbonCycles · Oct 27, 2021

I would be curious if you had evaluated Apples CoreML libraries mentioned here:

Convert PyTorch models to Core ML - Tech Talks - Videos - Apple Developer

Bring your PyTorch models to Core ML and discover how you can leverage on-device machine learning in your apps. The PyTorch machine...

developer.apple.com

I'm a big fan of PyTorch and this is the one biggest frustration I have with the DL ecosystem right now...it's becoming a very closed-source/proprietary ecosystem. Outside of that, these systems have so much potential that is being wasted bc of the lack of support.

Xiao_Xi · Oct 27, 2021

@CarbonCycles It seems Core ML/Neural Engine is good for inference, but not for training.

Does anyone know which hardware and deep learning library Apple use to train its models?

gl3lan · Oct 27, 2021

Probably Nvidia hardware and Tensorflow or Pytorch librairies... is there any alternative ?

CarbonCycles · Oct 27, 2021

Xiao_Xi said:
@CarbonCycles It seems Core ML/Neural Engine is good for inference, but not for training.

Does anyone know which hardware and deep learning library Apple use to train its models?

Yea, works great for peripheral IOT devices and some federated learning.

I would think Apple used a combination of NVIDIA & AMD GPUs to train.

Xiao_Xi · Oct 28, 2021

I would say Pytorch on Linux with Nvidia GPUs. Pytorch seems to be more popular among researchers to develop new algorithms, so it would make sense that Apple would use Pytorch more than Tensorflow.

Are these new laptops suitable for reinforcement learning? I've read that reinforcement learning algorithms depends more on the CPU than on the GPU. Is it true? Any benchmark?

JMacHack · Oct 28, 2021

leman said:
These are awful results to be honest. M1 Max should be much faster than that. With Apple it's really half a step forward and then an awkward dance in all directions... they release an open source tensor flow version, the suddenly drop it and replace it by a closed source plugin that's hidden away on their website, with no documentation, no changeling, no anything. Just make it open source. Give the community the opportunity to fix bugs.

Probably a symptom of competing teams butting heads. I’m sure there’s many open source advocates working for Apple, and many who want to keep everything they can proprietary.

jerryk · Oct 28, 2021

gl3lan said:
In case anyone is interested, in ran a fairly simple MNIST benchmark (proposed here : https://github.com/apple/tensorflow_macos/issues/25) on my recently acquired M1 Pro MBP (16-core GPU, 16GB RAM). I installed Tensorflow using the following guide (https://developer.apple.com/metal/tensorflow-plugin/).

For reference, this benchmark seems to run at around 24ms/step on M1 GPU.

On the M1 Pro, the benchmark runs at between 11 and 12ms/step (twice the TFLOPs, twice as fast as an M1 chip).

The same benchmark run on an RTX-2080 (fp32 13.5 TFLOPS) gives 6ms/step and 8ms/step when run on a GeForce GTX Titan X (fp32 6.7 TFLOPs). A similar level of performance should be also expected on the M1 Max GPU (which should run twice as fast as the M1 Pro).

Of course, this benchmark runs a fairly simple CNN model but it already gives an idea. Keep also in mind that RTX generation cards are able to run faster at fp16 precision, I am not sure it would apply to Apple Silicon.

I would be happy to run any other benchmark if suggested (or help someone to run the benchmark on a M1 Max chip), even if I am more of a PyTorch guy. ;-)

[edit] Makes me wonder whether I should have gone for the M1 Max chip... probably not.

Be aware you are running code that was written using TensorFlow V1. This is an old and obsolete version of TF that may or may not be getting any updates. Tensorflow V2 has been out since 2019 or so and is where most development effort is centered.

If they have a more recent benchmark I suggest you use that.

gl3lan · Oct 29, 2021

jerryk said:
Be aware you are running code that was written using TensorFlow V1. This is an old and obsolete version of TF that may or may not be getting any updates. Tensorflow V2 has been out since 2019 or so and is where most development effort is centered.

If they have a more recent benchmark I suggest you use that.

gl3lan said:
Honestly it did not run long enough for me to evaluate battery-wise. Heat-wise, I have not heard the fan once since I have the machine, and the computer was slightly warm at some point (but orders of magnitudes cooler than my 2015 13 or my 2020 13).

If you are interested in more extensive benchmarks, you can have have a look here. Basically, the M1 Max is around 8 times slower than a RTX 3090 (the 3090 benchmark being run in fp16 precision for maximum speed), but consumes 8 times less power.

I ran the ResNet50 benchmark on my M1 Pro (16GB RAM) and achieved circa 65 img/sec (half the M1 Max throughput, as expected), the RAM pressure was sometimes orange during the benchmark.

I also ran the same benchmark on a RTX 2080 Ti (256 img/sec in fp32 precision, 620 img/sec in fp16 precision), and on a 2015 issued Geforce GTX TITAN X (128 img/sec in fp32 precision, 170 img/sec in fp16 precision).

Overall I think the M1 Max is a very promising GPU, especially considering it can be configured with 64 GB or RAM. Time will tell if the ML community will adopt those machine and if it will be relevant for PyTorch to be adapted as well.

Depending on the progress of subsequent Apple Silicon chip generations (and on the GPUs proposed on a future Mac Pro), deep learning on Mac might become attractive.

This second benchmark was run on V2 code.

Xiao_Xi · Oct 29, 2021

It seems Apple's Tensorflow fork doesn't support all Tensorflow raw_ops. So, we need to wait a little more to see the true potential of these new GPUs regarding deep learning.

ingambe · Oct 31, 2021

Xiao_Xi said:
Apple needs to help Pytorch devs with the Metal backend and improve its Tensorflow fork to be taken seriously in the ML community.

It seems to be the case, more info coming soon: https://github.com/pytorch/pytorch/issues/47702#issuecomment-953074900

ingambe · Oct 31, 2021

I must say the above benchmark is a bit disappointing, it's a pity Apple doesn't allow TensorFlow to use the neural engine

Base on the above result, it could be a good machine for Reinforcement learning, a domain where the neural network is small and there is a lot of CPU<->GPU communication

Xiao_Xi · Oct 31, 2021

ingambe said:
it's a pity Apple doesn't allow TensorFlow to use the neural engine

Tensorflow and Pytorch are open source projects. So, Apple could provide them with a Metal backend as it is doing with Blender, an open-source 3D computer graphics software.

By the way, it seems that the Neural Engine is a little tricky to use. https://github.com/hollance/neural-engine

leman · Oct 31, 2021

ingambe said:
I must say the above benchmark is a bit disappointing, it's a pity Apple doesn't allow TensorFlow to use the neural engine

Neural engine is limited purpose. The “real” deep learning accelerator is the AMX unit, but it’s unclear whether Pro/Max include more AMX resources.

ingambe · Oct 31, 2021

Xiao_Xi said:
Tensorflow and Pytorch are open source projects. So, Apple could provide them with a Metal backend as it is doing with Blender, an open-source 3D computer graphics software.

By the way, it seems that the Neural Engine is a little tricky to use. https://github.com/hollance/neural-engine

It seems that Apple is working with Google and Facebook to have a metal backend for Tensorflow (already the case) and Pytorch (WIP it seems). However, if it uses only GPU acceleration (as is the case with Tensorflow right now), we might not get groundbreaking performance. It may be nice for 1 epoch (i.e., prototyping) but not for large model training.
Letting developer uses the AME would be huge, but I don't see Apple doing this in a near future, I hope I'm wrong...

This repo is very interesting, thanks for sharing

I'm waiting a bit to see how the situation evolves, especially with Pytorch

ingambe · Oct 31, 2021

leman said:
Neural engine is limited purpose. The “real” deep learning accelerator is the AMX unit, but it’s unclear whether Pro/Max include more AMX resources.

Aren't the AMX units inside the AME? From Apple's website, it seems the number of Neural Engine cores are the same across the pro/max version

leman · Oct 31, 2021

ingambe said:
Aren't the AMX units inside the AME? From Apple's website, it seems the number of Neural Engine cores are the same across the pro/max version

No, AMX and AME are two different things. The confusing thing is that Apple offers three ways of doing ML on their hardware: AME, AMX and the GPU. The AME seems to be limited to tasks like audio and image processing that Apple uses for their apps, AMX is a general purpose matrix multiplication (good for model learning) and the GPU is the most flexible but also the least efficient of the three.

JimmyjamesEU · Oct 31, 2021

leman said:
Neural engine is limited purpose. The “real” deep learning accelerator is the AMX unit, but it’s unclear whether Pro/Max include more AMX resources.

You may find this interesting.

https://twitter.com/i/web/status/1453035362591420419

CarbonCycles · Oct 31, 2021

leman said:
No, AMX and AME are two different things. The confusing thing is that Apple offers three ways of doing ML on their hardware: AME, AMX and the GPU. The AME seems to be limited to tasks like audio and image processing that Apple uses for their apps, AMX is a general purpose matrix multiplication (good for model learning) and the GPU is the most flexible but also the least efficient of the three.

Something doesn't make sense...it seems like the AMX is a souped up math coprocessor, but how is it more efficient than using GPU? Something else is in play (i.e., they built the LApack/BLAS directly into the hardware instruction sets?!?)

ETA:
After reading the medium article on the AMX coprocessor, it makes more sense as they have highly tuned those math libraries to AMX. Kind of disturbing though...Apple can really mess with this ecosystem since it's closed-off (i.e., Apple's little secret).

Apple Silicon deep learning performance

macrumors newbie

macrumors Core

macrumors member

macrumors 68030

macrumors newbie

macrumors 68000

macrumors Core

macrumors regular

macrumors 68000

macrumors newbie

macrumors regular

macrumors 68000

Suspended

macrumors 604

macrumors newbie

macrumors 68000

macrumors 6502

macrumors 6502

macrumors 68000

macrumors Core

macrumors 6502

macrumors 6502

macrumors Core

Suspended

macrumors regular

Our Staff