
deconstruct60

macrumors G5
Mar 10, 2009
12,309
3,900
There are 3 approaches I hypothesize:
1. stick with the unified memory approach. whatever program uses the accelerator card will have compute done on the card with access to only that memory on package

The major problem with that approach is that Apple has spent the last 3+ years telling software developers to assume otherwise. So where are these "programs only for an accelerator" coming from?

The farther away the memory is placed from the main system RAM, the less "unified" it will be. It might be flatly addressable, but the latency will creep up. Apple's "unified" basically covers both to make it highly transparent to software, and Apple has been pushing folks to assume that while writing highly optimized code.

If these cards fit into standard PCI-e slots, that doesn't avoid the latency issue. (And Apple has shown about zero interest in CXL.)

If you push the "programs run on local resources" notion to a more complete state, then the whole stack (except for network-mounted storage drives) is there: the OS/libraries/application/etc. all on local resources.


2. take the nvidia approach and use hbm memory which will necessarily be memory bandwidth limited (relatively speaking, it'll still be /fast/) but it goes against their design philosophy of apple silicon

Apple Silicon's approach to memory already is "poor man's" HBM. Going to more expensive HBM really isn't going to 'solve' much if you have introduced non-unified memory. Going to an A-series small chip also isn't going to have much HBM affinity either (smaller dies lead to smaller edges, which makes HBM less tractable).

If you have a 'full size' Mn/MnPro/MnMax package then you already have the "poor man's HBM" for free (no additional R&D costs or divergence from the semi-custom RAM packages used for those systems). For example: the tweaked A10 in the T2, the A13 tossed into the Studio Display, the Watch SoC tossed into the HomePod, etc.

3. some other way i haven't thought of


" ... Using this communication, the worker processes synchronize gradients before each iteration. I'll show this in action using four Mac Studios connected to each other with Thunderbolt cables. For this example, I will train ResNet, a classifier for images. The bar to the side of each Mac Studio shows the GPU utilization while training this network. For a single Mac Studio, the performance is about 200 images per second. When I add another Mac Studio connected via Thunderbolt, the performance almost doubles to 400 images per second since both GPUs are utilized to the fullest. Finally, when I connect two more Mac Studios, the performance is elevated to 800 images per second. This is almost linear scaling on your compute bound training workloads. ..."
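The gradient synchronization described in that demo is plain data-parallel training: each worker computes gradients on its own shard of the batch, then all workers average them (an "all-reduce") before every optimizer step, so each GPU applies the same update. A minimal pure-Python sketch of the averaging step, with hypothetical worker gradients (this is an illustration of the concept, not Apple's MLX implementation):

```python
# Data-parallel gradient sync: each worker holds its own gradient
# vector; before the optimizer step, all workers average them
# (an "all-reduce"), so every worker applies the same update.

def all_reduce_mean(worker_grads):
    """Average per-parameter gradients across workers."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Four hypothetical workers (one per Mac Studio), two parameters each.
grads = [
    [0.4, 1.2],
    [0.2, 0.8],
    [0.6, 1.0],
    [0.0, 1.0],
]

synced = all_reduce_mean(grads)
print([round(g, 6) for g in synced])  # [0.3, 1.0]
```

The "almost linear scaling" in the demo holds as long as this exchange (over Thunderbolt, in their case) stays cheap relative to the per-worker compute.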

If Apple was looking to get rid of the Thunderbolt cables (because Apple 'hates' wires), a simple way of doing it would be to just substitute the standard PCI-e card edge (running a virtual Ethernet connection over it) for the TB/10GbE connection.

The cluster compute for that demo would work just as well with the exact same software using 10GbE between the Studio boxes. (It is just easier to do three one-to-one connections with Thunderbolt, because there are 4+ TB ports on the Mac Studio but only one 10GbE port on the system; at best you could hook to only one other Mac Studio.) What it could do is cluster with no wires and no external Ethernet switches. Also no external power cables (one per Mac Studio in the WWDC cluster)... so more 'wires eliminated' for the fewer-wires jihad.

Put a Mac on a card and just tell folks to use the same 'compute' clustering software Apple has been encouraging them to write for several years now.

A bit of copying what Nvidia does with NVLink for intra-box clustering using a consistent software memory model, only probably using standard PCI-e v4 (which Apple already has). So it ends up closer to a "poor man's NVLink". The following generation might go to PCI-e v5, which would be an incrementally better "poor man's NVLink" for relatively little extra work.
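For a rough sense of why that is "poor man's" NVLink: per-lane PCIe bandwidth is the transfer rate times the 128b/130b encoding efficiency, times the lane count. A quick back-of-the-envelope using the spec transfer rates (real-world throughput is lower, and these are illustrative numbers, not a claim about any Apple product):

```python
# Back-of-the-envelope PCIe link bandwidth: GT/s per lane times
# 128b/130b encoding efficiency, divided by 8 (bits -> bytes),
# times lane count. Result is GB/s in each direction.

def pcie_bandwidth_gbps(gt_per_s, lanes=16):
    return gt_per_s * (128 / 130) / 8 * lanes

v4_x16 = pcie_bandwidth_gbps(16.0)  # PCIe 4.0: 16 GT/s per lane
v5_x16 = pcie_bandwidth_gbps(32.0)  # PCIe 5.0: 32 GT/s per lane

print(round(v4_x16, 1))  # 31.5 GB/s each direction
print(round(v5_x16, 1))  # 63.0 GB/s each direction
```

Either figure is well short of NVLink-class aggregate bandwidth, but plenty for loosely coupled cluster-style compute traffic.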
 

deconstruct60

macrumors G5
Mar 10, 2009
12,309
3,900
4. Use overt and "thought leader" subvert marketing and propaganda to convince people to lower their expectations to within the capabilities of the current paradigm of beefed-up iPads running a macOS UI skin.

The delusional marketing propaganda is the notion that macOS isn't coupled to Apple Silicon going forward. That horse left the barn. Apple isn't telling folks to lower their expectations at all. Apple just isn't trying to be everything for everybody. They were not before the transition to Apple Silicon either. That is nothing 'new'.

Getting Mac software developers to write optimized software for the vast majority of Macs sold is good for both Apple and the vast majority of the Mac user base. That is not propaganda; it is just simple economics. The dGPUs being completely tossed out of the entire laptop lineup, and that laptop lineup being 75+% of the Mac user base, is orders of magnitude more relevant than what is going on with the iPad running a substantively different OS. Hand-waving at the phones and/or iPads is largely misdirection.
 

Boil

macrumors 68040
Oct 23, 2018
3,251
2,876
Stargate Command
[Image: mac pro shorty.jpg]


2024 Apple ASi Mac Pro Cube (previewed WWDC 2024, released October 2024)
  • M4 Extreme desktop-specific SoC (monolithic M4 Ultra UltraFusioned to a monolithic GPU/NPU-specific die)
  • 64-core CPU (56P/8E)
  • 192-core GPU
  • 128-core Neural Engine
  • Maximum 1TB LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
  • Quad NAND blade slots (maximum 32TB)
  • Eight Thunderbolt 5 ports
  • Custom 750W Platinum-rated PSU
Apple ASi Compute cards (for the ASi Mac Pro):
  • ComputeSolo - One monolithic GPU/NPU-specific die, 128-core GPU, 64-core Neural Engine, maximum 512GB RAM
  • ComputeDuo - Two monolithic GPU/NPU-specific dies, 256-core GPU, 128-core Neural Engine, maximum 1TB RAM
  • ComputeQuadra - Four monolithic GPU/NPU-specific dies, 512-core GPU, 256-core Neural Engine, maximum 2TB RAM
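For what it's worth, a UMA figure in that range does fall out of a wide LPDDR5X interface: bandwidth is just bus width (in bytes) times the transfer rate. A quick check with a hypothetical 2048-bit bus (double an Ultra-class 1024-bit interface) at LPDDR5X-8533 speeds — these are illustrative numbers, not a confirmed spec:

```python
# UMA bandwidth = bus width (bytes) x transfer rate (MT/s).
# Hypothetical doubled-Ultra configuration: 2048-bit LPDDR5X-8533.

def uma_bandwidth_tbps(bus_bits, mts):
    """Bus width in bits, transfer rate in MT/s -> TB/s (decimal)."""
    return (bus_bits / 8) * mts / 1_000_000

ultra = uma_bandwidth_tbps(1024, 6400)    # M2 Ultra-class interface
extreme = uma_bandwidth_tbps(2048, 8533)  # hypothetical doubled bus

print(round(ultra, 2))    # 0.82 TB/s
print(round(extreme, 2))  # 2.18 TB/s, same ballpark as the quoted 2.16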
 
Last edited:

avkills

macrumors 65816
Jun 14, 2002
1,181
982

2024 Apple ASi Mac Pro Cube (previewed WWDC 2024, released October 2024)
  • M4 Extreme desktop-specific SoC (monolithic M4 Ultra UltraFusioned to a monolithic GPU/NPU-specific die)
  • 64-core CPU (56P/8E)
  • 192-core GPU
  • 128-core Neural Engine
  • Maximum 1TB LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
  • Quad NAND blade slots (maximum 32TB)
  • Eight Thunderbolt 5 ports
  • Custom 750W Platinum-rated PSU
Apple ASi Compute cards (for the ASi Mac Pro):
  • ComputeSolo - One monolithic GPU/NPU-specific die, 128-core GPU, 64-core Neural Engine, maximum 512GB RAM
  • ComputeDuo - Two monolithic GPU/NPU-specific dies, 256-core GPU, 128-core Neural Engine, maximum 1TB RAM
  • ComputeQuadra - Four monolithic GPU/NPU-specific dies, 512-core GPU, 256-core Neural Engine, maximum 2TB RAM
Why would the GPU/NPU specific die fusioned to the CPU die be different than the ASi Compute Card one?

I like the way you are speculating, but I think the specific die would be 256/128 not 192/128. 🤷‍♂️
 

Boil

macrumors 68040
Oct 23, 2018
3,251
2,876
Stargate Command
Why would the GPU/NPU specific die fusioned to the CPU die be different than the ASi Compute Card one?

I like the way you are speculating, but I think the specific die would be 256/128 not 192/128. 🤷‍♂️

M4 Ultra
  • 64-core CPU (56P/8E)
  • 64-core GPU
  • 64-core Neural Engine
GPU/NPU-specific die
  • 128-core GPU
  • 64-core Neural Engine
Apple ASi ComputeSolo card would be one of the above GPU/NPU-specific dies...
Apple ASi ComputeDuo would be two of these dies UltraFusioned together...
Apple ASi ComputeQuadra would be two of the UltraFusioned packages on one card...
 

avkills

macrumors 65816
Jun 14, 2002
1,181
982
M4 Ultra
  • 64-core CPU (56P/8E)
  • 64-core GPU
  • 64-core Neural Engine
GPU/NPU-specific die
  • 128-core GPU
  • 64-core Neural Engine
Apple ASi ComputeSolo card would be one of the above GPU/NPU-specific dies...
Apple ASi ComputeDuo would be two of these dies UltraFusioned together...
Apple ASi ComputeQuadra would be two of the UltraFusioned packages on one card...
How do you suppose Apple is going to make the on die GPU and the Compute cards look like a single GPU? Or are they not going to do that? Is the memory pool for both going to be connected somehow; or is this going to move to the same model as "CPU RAM" and "GPU RAM" which kind of defeats the purpose of unified memory.
 

Boil

macrumors 68040
Oct 23, 2018
3,251
2,876
Stargate Command
How do you suppose Apple is going to make the on die GPU and the Compute cards look like a single GPU? Or are they not going to do that? Is the memory pool for both going to be connected somehow; or is this going to move to the same model as "CPU RAM" and "GPU RAM" which kind of defeats the purpose of unified memory.

SoC is the workstation, Compute cards are the render farm...

Change "render farm" to whatever remote process needs crunching while the end-user continues actively working on the SoC...
 

avkills

macrumors 65816
Jun 14, 2002
1,181
982
Isn't that only going to work if the particular software lets you "choose" the render (compute) device?
 

Boil

macrumors 68040
Oct 23, 2018
3,251
2,876
Stargate Command
Isn't that only going to work if the particular software lets you "choose" the render (compute) device?

The ability/option to remote render is pretty much a standard thing in most 3D/DCC software, and seems like it shouldn't be too difficult to implement in other software that defaults to the integrated GPU...?
 

Harry Haller

macrumors 6502a
Oct 31, 2023
507
1,151

2024 Apple ASi Mac Pro Cube (previewed WWDC 2024, released October 2024)
  • M4 Extreme desktop-specific SoC (monolithic M4 Ultra UltraFusioned to a monolithic GPU/NPU-specific die)
  • 64-core CPU (56P/8E)
  • 192-core GPU
  • 128-core Neural Engine
  • Maximum 1TB LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
  • Quad NAND blade slots (maximum 32TB)
  • Eight Thunderbolt 5 ports
  • Custom 750W Platinum-rated PSU
Apple ASi Compute cards (for the ASi Mac Pro):
  • ComputeSolo - One monolithic GPU/NPU-specific die, 128-core GPU, 64-core Neural Engine, maximum 512GB RAM
  • ComputeDuo - Two monolithic GPU/NPU-specific dies, 256-core GPU, 128-core Neural Engine, maximum 1TB RAM
  • ComputeQuadra - Four monolithic GPU/NPU-specific dies, 512-core GPU, 256-core Neural Engine, maximum 2TB RAM

Specs look good, but I don't know about the timing.
I hope you're right, but I have a feeling it's going to be '25 instead of '24.
 

singhs.apps

macrumors 6502a
Oct 27, 2016
654
395

2024 Apple ASi Mac Pro Cube (previewed WWDC 2024, released October 2024)
  • M4 Extreme desktop-specific SoC (monolithic M4 Ultra UltraFusioned to a monolithic GPU/NPU-specific die)
  • 64-core CPU (56P/8E)
  • 192-core GPU
  • 128-core Neural Engine
  • Maximum 1TB LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
  • Quad NAND blade slots (maximum 32TB)
  • Eight Thunderbolt 5 ports
  • Custom 750W Platinum-rated PSU
Apple ASi Compute cards (for the ASi Mac Pro):
  • ComputeSolo - One monolithic GPU/NPU-specific die, 128-core GPU, 64-core Neural Engine, maximum 512GB RAM
  • ComputeDuo - Two monolithic GPU/NPU-specific dies, 256-core GPU, 128-core Neural Engine, maximum 1TB RAM
  • ComputeQuadra - Four monolithic GPU/NPU-specific dies, 512-core GPU, 256-core Neural Engine, maximum 2TB RAM
How to say I am the Sandman without saying I am the Sandman.
 

avkills

macrumors 65816
Jun 14, 2002
1,181
982
The ability/option to remote render is pretty much a standard thing in most 3D/DCC software, and seems like it shouldn't be too difficult to implement in other software that defaults to the integrated GPU...?
Seems like a lot of engineering to cater to one demographic of users. Do not get me wrong, I am in that demographic (although it is not my main focus); but I have also experienced first-hand the second-class status that Mac users generally receive from 3D application developers, although things seem to be changing for the better now.

I would much rather see Apple find a way to add functionality to the OS so that one could select to compute using both pools, just the SoC GPU, or just the compute cards. 🤷‍♂️

The memory coherency issue needs to be solved.

It might work better and easier although slower if the memory was separated and all things had direct access. Maybe someone with CPU/GPU design knowledge can chime in.
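The pool-selection idea above can be modeled very simply: enumerate the compute devices, then route a job to the SoC GPU, the cards, or both. A toy sketch with hypothetical device names throughout (on macOS the real enumeration would come from Metal's `MTLCopyAllDevices`, which reports every GPU in the system; nothing here is an actual Apple API):

```python
# Toy model of OS-level compute pool selection: enumerate devices,
# then filter by pool ("soc" GPU vs. add-in "card") per a user policy.
# All device names and core counts are hypothetical.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    pool: str       # "soc" or "card"
    gpu_cores: int

def select_pool(devices, policy):
    """policy: 'soc', 'cards', or 'both'."""
    if policy == "both":
        return devices
    want = "soc" if policy == "soc" else "card"
    return [d for d in devices if d.pool == want]

devices = [
    Device("M4 Extreme SoC GPU", "soc", 192),
    Device("ComputeDuo card", "card", 256),
]

chosen = select_pool(devices, "cards")
print([d.name for d in chosen])  # ['ComputeDuo card']
```

The hard part, as noted, isn't the routing; it's whether memory stays coherent across the pools or splits back into separate "CPU RAM" and "GPU RAM".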
 

ZombiePhysicist

macrumors 68030
Original poster
May 22, 2014
2,785
2,684
Late 2025.
Welp.

so sad. This company has too many bozos. They need to purge like 90% of their employees. There is no excuse for a company this wealthy to be so incapable. Basic stuff. Release a new machine every year. All it takes is a mediocre amount of competence.
 

Harry Haller

macrumors 6502a
Oct 31, 2023
507
1,151
so sad. This company has too many bozos. They need to purge like 90% of their employees. There is no excuse for a company this wealthy to be so incapable. Basic stuff. Release a new machine every year. All it takes is a mediocre amount of competence.
I don't know if it's a lack of competence or a lack of willpower.
I'm leaning toward the latter.
I don't think Apple cares anymore.
The MP is a bottom of the barrel profit item.
But they feel like they have to keep it going.

And it looks like an M3 Ultra Studio, which is the one I am waiting for, is going to be skipped for the M4 version because of the M1-3 vulnerability.
 

Boil

macrumors 68040
Oct 23, 2018
3,251
2,876
Stargate Command
I think this article, posted by @Antony Newman, puts some context around where Apple has been heading and why . . .



Eliyan is pitching their products and approach at the AI / servers markets, but Apple looks to be squarely aimed at leveraging the same approach.

@deconstruct60 - Thoughts on the above in regards to a possible Mn Extreme...?
 

deconstruct60

macrumors G5
Mar 10, 2009
12,309
3,900
@deconstruct60 - Thoughts on the above in regards to a possible Mn Extreme...?

In the Next Platform article there is this diagram.

[Image: eliyan-blackwell-superchip-umi-superchip.jpg]



Note that the 'connector' between the two ASIC compute dies is not being replaced by Eliyan. The bigger threat there is to the relatively short-distance NVLink/NVLink inter-die links than to the role UltraFusion is playing. The longer-distance NVLink is going to stick around over the longer term (it isn't a die-to-die solution).


Perhaps something like Eliyan links causes the SLC + memory controller to decouple from the main die, but that is not necessarily going to bring you an "Extreme" version any quicker. So in the diagram where the HBM memory is, those could be decoupled cache dies. There is a small chance HBM gets cheap enough that Apple tosses the "poor man's HBM" they have designed (probably still sticking with LPDDRx). If the path to the memory controllers took up less space on the primary compute die, Apple would likely add more memory channels rather than inter-die connections (i.e., the die edge space allocated to memory bandwidth would not go down). The increased memory I/O would go to more AI throughput and fewer QoS limitations on the CPU cores, not to "more dies". There is still going to be a "not as fast as an Nvidia x090" crowd, so bandwidth pressure is not going to ease up any time soon.


The example doesn't expand the number of compute ASICs in the diagram. The bump here is mainly getting more RAM bandwidth and capacity into an incrementally bigger package (the package growth is largely just more HBM stacks). In terms of compute core distribution, this is far more in the "cling to bigger monolithic dies as long as you can" zone.

I would take with a grain of salt the claims in the table where "NuLink" is presented in part as being 'better than' UCIe. I'd rather wait to see if that really comes into play on real silicon from a substantial number of SoC designers. It kind of reminds me of the tall claims RAMBUS made decades ago about wiping DDR from the map. Simultaneous signals going in both directions at the same time on the exact same wire at almost completely error-free rates... I'm a bit skeptical. (Fiber with different wavelength light sources/sinks, yeah, that is more clear. Electrons on copper, less so.) I'd be more interested in what the error rates were than in the 'raw bandwidth' speed. (People have stuck with 'one way' wires for a long time for some good reasons. There is something 'clever' being done here, and the question is whether it is 'too clever'.)


[ P.S. I'm not sure that is really what the "Blackwell GPU" is going to look like. GPU as in what the 5090 is going to be. Really think their top-end consumer GPU is going to have HBM costs associated with it? That is likely going to get trimmed down to GDDR7. And the compute density on the compute ASIC is going to be different from what they have done for the datacenter version. It may or may not have multiple dies.

IMHO what Nvidia is building for the top-end datacenter cards is starting to diverge from 'GPU'. And given steeper competition in the future, that divergence is probably only going to get wider. I don't think the 'GPU' label is a good one for those. LLM training is largely not 'graphics'. ]
 

treehuggerpro

macrumors member
Oct 21, 2021
82
95
@Boil, and for anyone interested, the best antidote to the post above is the writeup itself. The author's commentary is on profit-motivated resistance in the industry, and the general subject of current and proposed tech is well backgrounded and demystified straight up.

The article does not specifically address Apple's x4 problem, but you do come away with a good understanding of why Apple's 2022 patent addressing an x4 solution correlates, point for point, with all the attributes of Eliyan's solution. This does not mean Apple will ultimately be licensing Eliyan's tech to get over their x4 hurdle, but it is clear they have explored and/or are exploring this approach. (Apple's patent - PDF)

The slide used in the post above was taken out of context. The slide was in the article as part of the author’s narrative on resistance from the GPU makers and how Eliyan is trying to ease them into the uptake of their IP. This is the slide in context:

Easy money does weird things to companies and their roadmaps.

Imagine a beast as shown above with 576 GB of HBM3E memory and 19 TB/sec of bandwidth! Remember: The memory is not a huge part of the power draw, the GPU is, and the little evidence we have seen to date surely indicates that the GPUs being put into the field by Nvidia, AMD, and Intel are all constricted on HBM memory capacity and bandwidth – and have been for a long time because of the difficulty in making this stacked memory. These companies make GPUs, not memory, and they maximize their revenues and profits by giving as little HBM memory as possible against formidable amounts of compute. They always show more than the last generation, but the GPU compute always goes up faster than the memory capacity and bandwidth. Such [a] design as Eliyan is proposing can snap the compute and memory back into balance – and make these devices cheaper, too.

Perhaps that was a bit too strong for the GPU makers, so with the UMI launch, the company has backed off a little and showed how a mix of interposers and organic substrates plus the NuLink PHYs might be used to make a larger and more balanced Blackwell GPU complex.

Below on the left is how you would create a Blackwell-Blackwell superchip with a single NVLink port running at 1.8 TB/sec linking the two dual-chiplet Blackwell GPUs together:


I can only assume the poster above selected this slide without reading the article; consequently, each of the conclusions they have drawn from it is wrong.

Eliyan’s tech is not specific to HBM memory and works die-to-die in both Advanced Packaging (on a silicon interposer at short distances) and in Standard packaging scenarios. But the primary advantages being gained / spruiked by Eliyan are in the latter, because it is a die-to-die solution that enables die spacings of 20-30 mm and much bigger substrate sizes. This goes well beyond the distance and size limitations of silicon interposers, in line with Apple’s patent, and has the following inherent advantages summed up in the background section of their patent . . .

  • [0020]
    In one aspect, it has been observed that for large MCMs, accommodating die-to-die routing between a large number of dies becomes more difficult to support due to limited wiring line space and via pad pitch in the MCM routing substrate. This can force lower wiring counts and require higher data rates to achieve a target bandwidth. These higher data rates in turn may require more silicon area for the dies, larger shoreline to accommodate input/output (I/O) regions and physical interface (PHY) regions (e.g. PHY analog and PHY digital controller), more power, and higher speed Serializers/Deserializers (SerDes) among other scalability challenges.
  • [0021]
    In accordance with embodiments routing arrangements are described in which significant wiring count gains can be obtained, allowing the data rate to be lowered, thereby improving signal integrity, reducing energy requirements, and a potential reduction in total area. In accordance with embodiments, bandwidth requirements may be met by expanding die-to-die placement for logic dies, which would appear counter-intuitive since increased distance can increase signal integrity losses. However, the expanded die-to-die placement in accordance with embodiments can provide for additional signal routing, thus lowering necessary raw bandwidth. Additionally, signal integrity may be preserved by including the main long routing lines (wiring) in a single dedicated metal routing layer, while shorter signal routes between neighboring dies can use two or more metal routing layers. The additional signal routing can be achieved in multiple metal routing layers, which can also allow for finer wiring width (W), spacing (S) and pitch (P), as well as smaller via size. Furthermore, the routing substrates in accordance with embodiments can be formed using thin film deposition, plating and polishing techniques to achieve finer wiring and smoother metal routing layers compared to traditional MCM substrates. In accordance with embodiments additional signal routing requirements can also be met using a die-last MCM packaging sequence. In a die-first packaging sequence the dies can be first molded in a molding compound layer followed by the formation of the routing substrate directly on the molded dies. In one aspect, it has been observed that a die-first packaging sequence can be accompanied by yield limits to the number of metal routing layers that can be formed in the routing substrate. In a die-last packaging sequence a routing substrate can be pre-formed followed by the mounting of dies onto the pre-formed routing substrate, for example by flip chip bonding. 
It has been additionally observed that a die-last packaging sequence may allow for the combination of known good dies with a known good routing substrate with a greater number of metal routing layers, facilitating additional wiring gain counts.
  • [0022]
    The routing arrangements in accordance with embodiments may support high bandwidth with reduced data rates. This may be accomplished by increased wiring gain counts obtained by both expanding chip-to-chip placement and a die-last MCM processing sequence. Additionally, signal integrity can be protected by lower power requirements, and routing architectures in which main long routing lines between logic dies can be primarily located in a single dedicated wiring layer in the routing substrate, while shorter signal routes between neighboring dies can use two or more wiring layers. This may be balanced by achieving overall approximate equal signal integrity.
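The trade-off paragraphs [0020]-[0022] keep circling is a simple identity: target aggregate bandwidth = wire count x per-wire data rate. If the substrate can route more wires, each wire can run at a lower rate for the same total, which is what buys back signal integrity and power. A quick illustration with made-up numbers:

```python
# Patent trade-off: aggregate bandwidth = wires x per-wire rate.
# More routable wires in the substrate -> lower per-wire data rate
# for the same target bandwidth.

def per_wire_rate_gbps(target_tbps, wires):
    """Per-wire data rate (Gb/s) for a target aggregate bandwidth (Tb/s)."""
    return target_tbps * 1000 / wires

target = 8.0  # illustrative 8 Tb/s die-to-die aggregate

print(per_wire_rate_gbps(target, 500))   # 16.0 Gb/s per wire
print(per_wire_rate_gbps(target, 1000))  # 8.0 Gb/s: double wires, half rate
```

Lower per-wire rates also relax the SerDes and shoreline requirements the patent's paragraph [0020] complains about.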

Apple’s consumer focus puts them in a different position from the GPU makers in the AI / server market. For Apple, there’s no downside. Unlike Nvidia et al., who it seems (from the various jibes in the article) see leveraging the advantages Eliyan is spruiking as potentially just eating up demand for their own core IP and products.

One more thing, Eliyan have already secured a customer. This is from the article's author in the comments section . . .

Timothy Prickett Morgan says:
MARCH 29, 2024 AT 8:37 PM
They have done their PHY in 5 nanometer and 3 nanometer processes as far as I know. And they have one large scale customer, someone not quite a hyperscaler but more than a typical service provider.
 
Last edited: