
Kronsteen

macrumors member
Original poster
Nov 18, 2019
I have been holding off buying my first Apple silicon machine in order to consider the Mac Pro version. But, for my particular uses, the new Mac Studio may be a great option. So I would be most grateful for answers to a few questions (and any other thoughts that any of you might have).

Background:

My main use case is GPU compute, with custom-written OpenCL and CUDA code. My two main applications can run for several minutes, occasionally several tens of minutes. The two main systems I currently use are a 2013 (trashcan) Mac Pro with upgraded GPU and an Nvidia Jetson AGX Xavier development kit. I also use a 16” MacBook Pro for development (but take care not to run the GPU too hard, having cooked the GPU in my old 2012 retina MacBook Pro).

So I am keen to try porting at least some of my code to Metal, to see just how good the M2 GPU is for my workload (which involves a lot of long integer and bitwise operations, not floating point or short AI-style integer work). One of the big attractions for me is the unified memory, which I would like to compare against Nvidia’s implementation.
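
To give a flavour of the kind of thing I would be porting, here is a minimal sketch of a bitwise-heavy Metal compute kernel compiled from source at runtime. The kernel itself (a popcount reduction), its name and its buffer layout are purely illustrative on my part, not my actual code:

```swift
import Metal

// Illustrative only: a tiny 64-bit integer / bitwise kernel of the general
// kind I would be porting, compiled from MSL source at runtime.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void countBits(device const ulong *input [[buffer(0)]],
                      device atomic_uint *total [[buffer(1)]],
                      uint gid [[thread_position_in_grid]])
{
    // popcount on a 64-bit lane, accumulated with a relaxed atomic add.
    uint bits = uint(popcount(input[gid]));
    atomic_fetch_add_explicit(total, bits, memory_order_relaxed);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "countBits")!)
```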

Now, I had assumed that an M2 Mac Pro would be the unavoidable (if rather pricey) option. But I am astonished to find that the latest Mac Studio can offer the same M2 — and, in particular, GPU — versions. Expandability via PCIe cards is of no interest to me (I am assuming that there is no real prospect of being able to fit a heavy-duty Nvidia GPU card in the new Mac Pro!). It seems that, with the Studio, I could save myself around £3,000 (UK pounds).

Three questions:

1. For running the M2’s GPU hard, sometimes for tens of minutes, is there any material advantage in the Mac Pro’s thermal design, or should the Studio be perfectly adequate?

2. Would the answer to question (1) be the same for the M2 Max and Ultra? (I realise that the Mac Pro is Ultra only.) Obviously, the Ultra’s GPU is more powerful, but would it risk overheating, or at least throttling, the Studio if pushed too hard?

3. Is there any reason to suppose that the M2 GPU might have significantly different integer performance from recent (e.g. Ampere) Nvidia GPUs of comparable power?

With many thanks in advance for any replies or other thoughts — and apologies for the hopelessly long and rambling post!

Andrew
 

AdamSeen

macrumors 6502
Jun 5, 2013
1) It will be more than adequate.
2) It won't throttle, and there's no real advantage to the Mac Pro here.
3) The tensor-like cores on Apple are good, but the 4090 is significantly better in all areas. If you're looking for computing power, go with an NVIDIA GPU and save money by not purchasing the Ultra (this is what I've done). However, if you're interested in Metal and local Mac-optimized ML development (e.g., https://github.com/ggerganov/ggml), and want to build something for that market, then it's worth buying the Studio. Also, if you plan to run large LLMs with 65B parameters, you can do so with the Studio's huge shared memory (e.g., around 190GB compared to 24GB on the 4090). The shared memory is great, but it is still only LPDDR5, which is not nearly as fast as the 4090's memory.
 

Kronsteen

macrumors member
Original poster
Nov 18, 2019
AdamSeen said:
1) It will be more than adequate.
2) It won't throttle, and there's no real advantage to the Mac Pro here.
3) The tensor-like cores on Apple are good, but the 4090 is significantly better in all areas. If you're looking for computing power, go with an NVIDIA GPU and save money by not purchasing the Ultra (this is what I've done). However, if you're interested in Metal and local Mac-optimized ML development (e.g., https://github.com/ggerganov/ggml), and want to build something for that market, then it's worth buying the Studio. Also, if you plan to run large LLMs with 65B parameters, you can do so with the Studio's huge shared memory (e.g., around 190GB compared to 24GB on the 4090). The shared memory is great, but it is still only LPDDR5, which is not nearly as fast as the 4090's memory.
Many thanks, @AdamSeen, that is most helpful and answers my questions perfectly.

In fact, my applications (currently written to use CUDA and OpenCL, not Metal) have nothing to do with AI. Both, loosely speaking, are different flavours of computational number theory. But what you say about NVIDIA GPUs is, I'm sure, equally applicable.

So, for example, the Geekbench Metal benchmark list suggests that the M2 Ultra is a little faster (roughly 10% on those benchmarks) than the AMD RX 6950 XT, whereas for OpenCL the 6950 XT's figure is roughly a third higher than the Ultra's. That is, no doubt, partly because OpenCL is far from being the optimum way to drive Apple silicon. Nonetheless, the RTX 4090's OpenCL score is nearly 90% higher than the 6950 XT's, and it may well be that OpenCL is far from optimal for NVIDIA GPUs, too.

That said, at some point, I would still like to try adapting my code to use Metal, not least because I would be interested to experiment with Apple's unified memory (both of my applications unavoidably do repeated transfers of data between GPU and CPU). I just need to decide whether the cost of at least an M2 Max Studio is justified, versus the potential value of a Linux box with one of the more recent NVIDIA GPUs .... :eek:
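
To make the unified memory point concrete, the sort of experiment I have in mind is sketched below: a single buffer allocated with shared storage that the CPU writes, the GPU updates and the CPU reads back, with no explicit transfer in either direction. The trivial kernel, sizes and names are illustrative assumptions on my part, not code from my applications:

```swift
import Metal

// Minimal sketch: one .storageModeShared buffer visible to both CPU and GPU,
// so there is no clEnqueueWriteBuffer / cudaMemcpy step in either direction.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void addOne(device ulong *data [[buffer(0)]],
                   uint gid [[thread_position_in_grid]])
{
    data[gid] += 1;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "addOne")!)

let count = 1 << 20
let buffer = device.makeBuffer(length: count * MemoryLayout<UInt64>.stride,
                               options: .storageModeShared)!

// The CPU writes straight into the buffer's backing store.
let values = buffer.contents().bindMemory(to: UInt64.self, capacity: count)
for i in 0..<count { values[i] = UInt64(i) }

// The GPU then works on the very same allocation.
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeeComputeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
```

(Correcting myself before anyone else does: the encoder line should simply read as below.)

```swift
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// The CPU reads the results back directly from the same memory.
print(values[0], values[count - 1])   // 1 and 1048576
```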

Thanks again ....
Andrew
 

soleblaze

macrumors newbie
Apr 30, 2007
The only real difference between the Mac Studio and the Mac Pro is that the Mac Pro has six PCIe slots. Ofc, from what I’ve heard all those slots sit behind a switch on 16 lanes from one half of the Ultra chip. I’d steer clear of the Mac Pro unless you have a workflow that needs multiple PCIe cards that don’t require more than 16 lanes at once.

For GPU compute tasks the big benefit is the extra VRAM. It’s cheaper than NVIDIA once you need more than 24GB on one card.

For thermals, the cooling solution on the Studio is already overkill for the chip. No need to worry there. However, these chips are still designed for power savings. From what I’ve seen, the M1 Ultra required specific Metal coding optimizations to get it to use all its power/bandwidth. I’m not sure if the M2 changed any of that.
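
As one example of the kind of tuning I mean (just a sketch based on my own assumptions, not anything documented as Ultra-specific): size the threadgroups from what the compute pipeline reports instead of hard-coding a width. The helper name below is made up, and I haven’t gone much deeper than this myself.

```swift
import Metal

// Sketch only: let the pipeline tell you how wide a threadgroup it supports
// rather than hard-coding one. `encodeSaturatingDispatch` is a made-up helper
// name, not an Apple API.
func encodeSaturatingDispatch(_ encoder: MTLComputeCommandEncoder,
                              pipeline: MTLComputePipelineState,
                              buffer: MTLBuffer,
                              elementCount: Int) {
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)

    // threadExecutionWidth is the SIMD-group width for this pipeline;
    // maxTotalThreadsPerThreadgroup is the largest threadgroup it supports.
    let width = pipeline.threadExecutionWidth
    let perGroup = (pipeline.maxTotalThreadsPerThreadgroup / width) * width

    // dispatchThreads handles non-uniform grids, so the grid can be exactly
    // elementCount threads wide with no manual padding.
    encoder.dispatchThreads(MTLSize(width: elementCount, height: 1, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: perGroup,
                                                           height: 1, depth: 1))
}
```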

I ordered an M2 Ultra Studio for messing with LLMs and other ML. I’d probably be better off with 2x 4090s or maybe an A6000, but I didn’t feel like messing with another custom build. Ofc, if it’s not great I’ll return it.
 

Kronsteen

macrumors member
Original poster
Nov 18, 2019
soleblaze said:
The only real difference between the Mac Studio and the Mac Pro is that the Mac Pro has six PCIe slots. Ofc, from what I’ve heard all those slots sit behind a switch on 16 lanes from one half of the Ultra chip. I’d steer clear of the Mac Pro unless you have a workflow that needs multiple PCIe cards that don’t require more than 16 lanes at once.

For GPU compute tasks the big benefit is the extra VRAM. It’s cheaper than NVIDIA once you need more than 24GB on one card.

For thermals, the cooling solution on the Studio is already overkill for the chip. No need to worry there. However, these chips are still designed for power savings. From what I’ve seen, the M1 Ultra required specific Metal coding optimizations to get it to use all its power/bandwidth. I’m not sure if the M2 changed any of that.

I ordered an M2 Ultra Studio for messing with LLMs and other ML. I’d probably be better off with 2x 4090s or maybe an A6000, but I didn’t feel like messing with another custom build. Ofc, if it’s not great I’ll return it.
Thanks, @soleblaze , that’s a really useful perspective.

I’m sure you’re right that a 4090 (or two) or an A6000 would most likely give better performance, although something like an A6000 in a Dell workstation is by no means cheap (I don’t have the time or inclination to do a custom build). And, for my type of workload, I’m still quite interested to try the Mac’s unified memory (although I am dealing with small amounts of data, there is a lot of CPU/GPU interaction, so being able to avoid repeated data transfers back and forth could yield some useful gains).

Regarding the Metal coding optimizations you mention above, if I may ask another question:

This isn’t something I’ve come across (although I will investigate). Do you have any experience of having to implement such optimizations? Or can you suggest any pointers to illustrations of how this might be done?

(Okay, that was two questions …. 🤠 )

Thanks!
Andrew
 