MP All Models No, Apple is not going to beat AMD or Nvidia with their next ARM Mac Pro regarding GPU.

deconstruct60 · May 18, 2022

goMac said:
We've both agreed that AMD and Nvidia have immediate mode GPUs. That's fundamentally different than the work a deferred GPU can optimize out. AMD and Nvidia don't make deferred mode GPU. Deferred mode is not one of the optimizations they've made.

If crawl down deep into the "mud" of any particular GPU vendor's implementation they all have branches off the 'pure' basic modes.

"...
NVIDIA Maxwell/Pascal/Turing GPUs doesn't have PowerVR's "deferred tile render" but it has immediate mode tile cache render.

...
...
AMD Vega Whitepaper:

The Draw-Stream Binning Rasterizer (DSBR) is an important innovation to highlight. It has been designed to reduce unnecessary processing and data transfer on the GPU, which helps both to boost performance and to reduce power consumption.
...
PowerVR's deferred tile render is patent heavy. "

AMD Radeon VII Detailed Some More: Die-size, Secret-sauce, Ray-tracing, and More

I concur, however I was pointing out that the IMC has less consequences in a TBR & L2-ROP design. AMD would certainly be able to clock the gpu higher in case they integrated TBR, but also most of Nvidia's advantage is due to r:w amplification through TBR, not frequency alone. They can only write...

www.techpowerup.com

Even between Nvidia's 'Tile caching' and AMD DSBR there will be L2/L3 hit rate difference on Tile size variances. If crawl too far down into the weeds all implementations are different. AMD/Nvidia not trying to exactly implement TBDR because don't want a patent war.

But using a cache to avoid trips to memory in general... that is doable by anybody that puts in the effort. Whether there is a special cache chunk with "tile" tweaked content replacement parameters, or not , kind of missing the forest for a tree.

deconstruct60 · May 18, 2022

goMac said:
Hmmmm? It's all one driver stack. There is no discrete vs integrated driver stack. It's all the same.

Which 'end' of the stack looking at. The whole aggregate stack is like an iceberg. If looking at it from the top unifying top level abstractions at the userland API level then it is more homogeneous. If looking up from the bottom it isn't. It is a bigger , wider , more diverse "blob" if want want to group it all together. That much bigger part of the stack "below the waterline" is typically modularization to control the complexity. Some of those modules are themselves a "stack" that fits into the bigger "stack".

goMac said:
There are a few issues. Apple family GPUs promise a single address space. However - they don't necessarily promise unified memory. Such a change would be solvable. AMD has been doing a lot of work with single address space discrete cards. And Metal does allow an Apple family GPU to say it does not support unified memory.

Metal has "families" which again is indicative that it isn't just one big homogeneous group. But yes, Metal has broad-ish families. A consistent feature test boolean (that is nominally is false (or true) for a subset grouping isn't necessarily huge expansion of function as opposed to allowing developers to use a consistent code fragment to do a feature-test-before branch idiom. )

However, what I was getting at was instances like where a AMD GPU family shows up in a previous to release update of the OS. Or some expansion of Metal API semantics like expanding to cover Inifinity Fabric Link.

https://developer.apple.com/documen...s/transferring_data_with_infinity_fabric_link

deconstruct60 · May 18, 2022

mattspace said:
When there's an AMD ARM Driver for a 6x00, or any of their discrete GPUs, then AMD having an ARM driver and Apple choosing not to release drivers will suggest a strategy,

Scroll back to post 45 check out the P.P.S. section. Ampere Computing has the W6800 card working.

mattspace · May 18, 2022

deconstruct60 said:
Scroll back to post 45 check out the P.P.S. section. Ampere Computing has the W6800 card working.

Is there an official ARM driver released by AMD, yes, or no? IIRC Apple was effectively just reworking AMD's x86 Linux driver, not writing one from scratch. My point stands.

kvic · May 19, 2022

deconstruct60 said:
As for an Apple dGPU. Again foundational changes. All the extremely highly tuned Apple GPU apps expect an iGPU. So is Apple going to develop and another driver stack to do dGPU and tell folks to go back and redo all of your optimization work again. (after just asking folks to do major overhauls for iGPUs)? Probably not. At least not for a couple of years.

This saying has been spreading for more than a year now and people seem to take it for granted that's the case. But little justification has been given if that's true. Or put it in another way. Many people seem to take "unified memory" as new invention which is distinctively different from the past & its ancestor. By that they justify a totally new driver stack, and a different driver stack needed (hence no good) to support dGPU (Apple's or 3rd party).

Since I question this, you may guess now I hold a different (or rather opposite) view. But would first love to hear people's justification why it's the case. Put it another way. Why would it be an indescribable burden for a game dev, graphics app to cope with "unified memory" (Apple iGPU) or "non unified memory" (Apple or 3rd party dGPU).

--

I managed to get around the above trap. One member mentioned rumour about Apple dGPU in a previous post. I don't know how real it's in the coming year or two. However, now I do believe in an Apple dGPU is more elegant than e.g. multiple MPX-like SoCs daughter boards people joked about months ago. Multiple Apple dGPUs have many merits over multiple SoCs daughter boards. Too obvious that let's skip them for now.

Here is why I think a dGPU is very possible. As Apple SoCs grow in size, eventually (in a couple of years time) they will outgrow process node capability to have a single die of a future "M1 Max" unit. So the very likely outcome of future stitching will be one die for CPU cores and some caches, one die for GPU and caches, another die for other ASICs, another die for system level cache and etc.

Once Apple SoCs grow to such a state, it becomes silly every SoC has to stitch a CPU and a GPU together but not to be individually fungible. I believe that's the time people can expect a dGPU from Apple as it's such a low hanging fruit to manufacture. Though people should start worrying instead the connectivity between Apple SoC and Apple dGPU. Will it be standard PCIe or Apple proprietary. If Apple choose to go with standard PCIe (consider MPX or its future version belonging to this category), then 3rd party dGPU support is also low hanging fruit (driver for ARM? Easy job for Apple+AMD). If Apple is so determined to compete head on against highest end AMD/Nvidia GPUs, then expect the connectivity will be totally proprietary. Nevertheless, users still get dGPU.

goMac · May 19, 2022

deconstruct60 said:
But using a cache to avoid trips to memory in general... that is doable by anybody that puts in the effort. Whether there is a special cache chunk with "tile" tweaked content replacement parameters, or not , kind of missing the forest for a tree.

When you cut through the white paper down to the details, there is a cache, but not really the same sort of cache. There's an on die cache, but it's not like tile memory. I think you're implying that's a minor implementation detail... but it's not. The tile having it's own memory lets it work independently of other cores and merge render stages together. A higher level cache, and specifically a high level cache on an immediate mode GPU, won't be able to do that.

So you can't just say "well AMD and Nvidia don't have deferred mode and they don't have a tile level cache" but... that's the whole bit. It's the forest, not the tree.

I don't even think unified memory is the critical bit. I think Apple could do a GPU that didn't have unified memory, but had UMA and TBDR, and that would be a compelling product that would be compatible with the existing Metal ecosystem. It would work well for both apps designed for Apple Silicon, and unoptimized Metal apps. They could even do the best of both. Use TBDR, but shove a giant cache on the front like VRAM so that the GPU doesn't always have to go back to main memory for larger chunks of data. That would be a discrete card that nothing in the AMD or Nvidia ecosystems would have a clear answer for.

mattspace said:
Is there an official ARM driver released by AMD, yes, or no? IIRC Apple was effectively just reworking AMD's x86 Linux driver, not writing one from scratch. My point stands.

AMD writes the Radeon drivers for Mac. One big advantage for Apple Silicon is actually that Apple could drag driver development back in house. What I heard is that AMD would at least have some of their employees work out of Apple offices - but they were still AMD employees. I also heard a lot of noise about the initial macOS releases having bad versions of the GPU drivers because GPU driver vendors could never meet deadlines for final integration.

AMD could have been using the Linux driver. My impression is that they have a big pile of shared code at the center of all their drivers.

kvic · May 19, 2022

goMac said:
When you cut through the white paper down to the details, there is a cache, but not really the same sort of cache. There's an on die cache, but it's not like tile memory. I think you're implying that's a minor implementation detail... but it's not. The tile having it's own memory lets it work independently of other cores and merge render stages together. A higher level cache, and specifically a high level cache on an immediate mode GPU, won't be able to do that.

Unless people have worked in GPU design teams, detailed claims such as above likely end up being "you think this I think that". Anyway, here we go. Caches are on die, meaning it's not off chip like VRAM. Your "tile memory" is also on die and as a matter of fact one kind of caches. At nano scale, "on die" is a huge area, so lacking specifics means quite.. meaningless. But their designated usages might be different their purposes are same, reducing traffics to/from external VRAM.

I don't even think unified memory is the critical bit. I think Apple could do a GPU that didn't have unified memory, but had UMA and TBDR, and that would be a compelling product that would be compatible with the existing Metal ecosystem. It would work well for both apps designed for Apple Silicon, and unoptimized Metal apps. They could even do the best of both. Use TBDR, but shove a giant cache on the front like VRAM so that the GPU doesn't always have to go back to main memory for larger chunks of data. That would be a discrete card that nothing in the AMD or Nvidia ecosystems would have a clear answer for.

Without "unified memory" but it's UMA. A statement like this is begging the big question of what exactly it means..

Also, TBDR is just one way of designing a realtime rendering pipeline. Whether it's TBDR or IMR, it can be taken out of the context in discussing iGPU or dGPU. Since Apple currently uses TBDR approach, very surely it'll re-use it for dGPU. TBDR is orthogonal to what VRAM is, meaning its VRAM can be sharing system's main memory or having dedicated memory for its sole use and in its own address space.

Search

Search

MP All Models No, Apple is not going to beat AMD or Nvidia with their next ARM Mac Pro regarding GPU.

deconstruct60

macrumors G5

AMD Radeon VII Detailed Some More: Die-size, Secret-sauce, Ray-tracing, and More

deconstruct60

macrumors G5

deconstruct60

macrumors G5

mattspace

macrumors 68040

kvic

macrumors 6502a

goMac

Contributor

kvic

macrumors 6502a

Our Staff