
crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
Actually, perhaps I could try one more round ;) by simplifying things:

It seems if you took your argument (that if Apple offered upgradeable RAM on the MP, they would allow 3rd party options), and applied it to the upgradeable NAND on the MP, you would conclude Apple would allow 3rd party options there. Yet they don't. I assert that invalidates your argument.
I see no evidence that they block 3rd party NANDs? While Apple requires specific kinds of NAND, they don't make them, and those parts are buyable from 3rd parties. The whole point of the other thread is how to buy and install your own even on soldered systems - the Mac Pro should theoretically be easier. However, even if that weren't the case and Apple blocked anything other than Apple NANDs for their internal storage, the rest of my argument still holds.
Can you explain precisely why upgradeable RAM on the MP represents a qualitatively different business case from upgradeable NAND on the MP, such that Apple would allow the former when they don't allow the latter? That's what I've not been able to find in your arguments.

I apologize for not being more clear then: Because the upgradeable DRAM on the MP would be much more akin to PCIe storage (and in fact with CXL it really would be over PCIe; it'd essentially be DRAM on a PCIe card). The unified memory is the built-in memory, akin to the internal storage NAND.

All the arguments you made for why PCIe storage is "allowed" apply to CXL/DRAM (if you've got DRAM sticks, and if those are on a PCIe board, people will have expectations), and on top of that it isn't the main RAM of the system, as we're positing that it's basically used by the OS as overflow/faster swap. Thus, if you want the high performance necessary to feed the CPU/GPU, you'd still have to buy a substantial UMA memory pool from Apple. This even stands in contrast with PCIe storage, which is still internal to the device and can indeed basically replace buying internal storage from Apple. So if anything, 3rd party CXL/DRAM is even less likely to affect Apple's bottom line selling expensive UMA LPDDR RAM than PCIe storage affects their selling of expensive SSD storage.

Where I feel you err is thinking that the upgradeable CXL/DDR RAM would be a replacement for buying UMA RAM from Apple. It wouldn't be. It would be what you optionally buy on top of buying UMA RAM from Apple. If Apple were to go down this route (and this is all hypothetical, since we have no idea what their plans are), then Apple would still be selling built-in, non-upgradeable UMA RAM. Thus the analogy is:

UMA RAM <---> internal storage
CXL/DRAM <---> PCIe storage

Therefore upgradeable CXL/DRAM would far more likely be something you could get 3rd party DRAM/PCIe cards for than not.

I don't think that would change things that much. Apple Silicon Macs already feel like NUMA systems, even if it's hidden from the user. For example, look at these memory latency measurements I've made:

Working set | iMac (i9-10910, 128 GiB) | MBP (M2 Max, 96 GiB)
1 GiB       | 94 ns                    | 117 ns
2 GiB       | 96 ns                    | 123 ns
4 GiB       | 108 ns                   | 123 ns
8 GiB       | 134 ns                   | 129 ns
16 GiB      | 164 ns                   | 143 ns
32 GiB      | 182 ns                   | 201 ns
64 GiB      | 191 ns                   | 365 ns

The exact numbers don't matter, as they are noisy. However, there is a huge increase in latency on M2 Max when the working set increases from 16 GiB (below 1/4 capacity) to 32 GiB (above 1/4), and again from 32 GiB (below 1/2) to 64 GiB (above 1/2). I'd expect that the effect would be even more significant with Ultra.

I'm not sure Apple wants to add NUMA support into macOS without much financial benefit. I would think latency would be an order of magnitude worse or more with a multi-socket CPU, as synchronisation would have to be done by the OS.

Indeed, @JouniS, while those latency numbers are really interesting, the rise as the working set gets larger isn't really the pattern expected from non-NUMA-aware multi-chip systems, especially since, you know, the Max is all one die. (In fact, even when Apple sold multi-socket Macs they never implemented NUMA.) In multi-chip systems, the issue is the latency/bandwidth from core A on chip 0 to RAM stick 4 on chip 1; the latency generally picks up another factor of X when that happens, on top of the latency of accessing large data sets (i.e. you could multiply all those numbers by something like 4, such that accessing a large data set on the opposite chip could be an order of magnitude more time-consuming than accessing the 1 GiB set on the same chip). Building in NUMA-awareness, which again macOS does not have, is meant to avoid that scenario as much as possible and keep data localized to the right processor.

EDIT: And the real concern is the GPU and other accelerators on the SoC; again, you'd basically break UMA by adding a second socket with an interconnect. Even though the Ultra is two dies with an interconnect, and I rather suspect you're right that on the Ultra latency from die 0 to die 1 is worse than anything we see here, UltraFusion is still fast enough *and high enough bandwidth*. Trying to do multi-socket with a GPU on board ... well, there's a reason why multi-GPU gaming systems largely went by the wayside (even with special connectors like the original NVLink) and why no one built multi-die GPUs before modern packaging techniques like UltraFusion and the M1 Ultra. For those you've got to have a fat enough pipe between the chips, and the only way to do that is to package them together. Building multi-socket SoCs with GPUs and accelerators... I'm not saying it's impossible, but it would be a mess.
 
Last edited:
  • Like
Reactions: Chuckeee

JouniS

macrumors 6502a
Nov 22, 2020
615
379
Indeed, @JouniS, while those latency numbers are really interesting, the rise as the working set gets larger isn't really the pattern expected from non-NUMA-aware multi-chip systems, especially since, you know, the Max is all one die. (In fact, even when Apple sold multi-socket Macs they never implemented NUMA.) In multi-chip systems, the issue is the latency/bandwidth from core A on chip 0 to RAM stick 4 on chip 1; the latency generally picks up another factor of X when that happens, on top of the latency of accessing large data sets (i.e. you could multiply all those numbers by something like 4, such that accessing a large data set on the opposite chip could be an order of magnitude more time-consuming than accessing the 1 GiB set on the same chip). Building in NUMA-awareness, which again macOS does not have, is meant to avoid that scenario as much as possible and keep data localized to the right processor.
I fixed a little issue and reran the benchmark, including on a dual-Xeon AWS instance:

Working set | iMac (i9-10910, 128 GiB) | MBP (M2 Max, 96 GiB) | AWS (2x Xeon 8488C, 384 GiB)
1 GiB       | 94 ns                    | 123 ns               | 156 ns
2 GiB       | 98 ns                    | 127 ns               | 168 ns
4 GiB       | 108 ns                   | 128 ns               | 176 ns
8 GiB       | 141 ns                   | 146 ns               | 183 ns
16 GiB      | 169 ns                   | 165 ns               | 191 ns
32 GiB      | 182 ns                   | 212 ns               | 209 ns
64 GiB      | 193 ns                   | 311 ns               | 244 ns
128 GiB     |                          |                      | 275 ns
256 GiB     |                          |                      | 337 ns

The AWS instance is virtualized, so there could be some overhead because of that. numactl says that the distance to the other node is 21, or 110% higher than to local memory, which seems plausible according to the latency numbers. M2 Max gets a little better numbers with a 64 GiB working set, but it exhibits similar scaling to an actual NUMA system.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
I fixed a little issue and reran the benchmark, including on a dual-Xeon AWS instance:

Working set | iMac (i9-10910, 128 GiB) | MBP (M2 Max, 96 GiB) | AWS (2x Xeon 8488C, 384 GiB)
1 GiB       | 94 ns                    | 123 ns               | 156 ns
2 GiB       | 98 ns                    | 127 ns               | 168 ns
4 GiB       | 108 ns                   | 128 ns               | 176 ns
8 GiB       | 141 ns                   | 146 ns               | 183 ns
16 GiB      | 169 ns                   | 165 ns               | 191 ns
32 GiB      | 182 ns                   | 212 ns               | 209 ns
64 GiB      | 193 ns                   | 311 ns               | 244 ns
128 GiB     |                          |                      | 275 ns
256 GiB     |                          |                      | 337 ns

The AWS instance is virtualized, so there could be some overhead because of that. numactl says that the distance to the other node is 21, or 110% higher than to local memory, which seems plausible according to the latency numbers. M2 Max gets a little better numbers with a 64 GiB working set, but it exhibits similar scaling to an actual NUMA system.
A NUMA system tries to keep all the data localized to the right chip, which again is something macOS wouldn't be doing. Unless I'm misunderstanding what you did, you'd have to measure socket-to-socket latency: pin the memory on one chip, pin the thread to the other chip, and measure latency. Is that what you did? Otherwise, everything above up until the 256 GiB row is still a measure of local memory latency. If you didn't, then my suspicion is that, with a distance of 21, you will be multiplying a lot of the numbers on the right by 2.1x. Although I have to say, damn, 2.1x is a lot better than I remember from multi-socket systems. But it's been a very long while, I'll admit.
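For what it's worth, on Linux you could force that scenario explicitly, either with numactl (something like: numactl --cpunodebind=0 --membind=1 ./bench, where ./bench is just a placeholder name) or in code with libnuma. A rough sketch of the idea, and obviously nothing macOS exposes:

    /* Hedged sketch (Linux + libnuma): run the measuring thread on node 0
       while the test buffer is backed by node 1, so every read crosses sockets. */
    #include <numa.h>
    #include <stddef.h>
    #include <stdint.h>

    uint64_t *alloc_remote(size_t bytes) {
        if (numa_available() < 0) return NULL;   /* kernel without NUMA support */
        numa_run_on_node(0);                     /* pin this thread to node 0 */
        return numa_alloc_onnode(bytes, 1);      /* back the buffer with node-1 memory */
    }
    /* ...fill the buffer, run the usual random-read loop over it,
       then release it with numa_free(ptr, bytes); link with -lnuma. */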

Anandtech measured latency from socket to socket on Ampere Altra to be 350ns though they admitted that was substantially worse than Intel/AMD.

Then there's the bandwidth issue that I brought up in my edit of the post you quoted (sorry I do that all the time). The real issue with hypothetical dual socket M-series SOCs is feeding the accelerators like the GPU and organizing that work if the memory doesn't go to the right place or can't because it is too big. Bandwidth is one of the key differentiators of something like Ultrafusion relative to standard socket interconnects. It isn't just about latency anymore, they aren't just CPUs.

Having said that, these are really interesting numbers - would you mind sharing your code? I have an M3 Max, but with base memory, so I don't have enough RAM to run this up to 64 GiB; still, it could be cool to compare whether the M3's latency is any different - probably not, since it's the same RAM, but who knows, the fabric could've changed?
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,714
2,820
I see no evidence that they block 3rd party NANDs? While Apple requires specific kinds of NAND, they don't make them, and those parts are buyable from 3rd parties. The whole point of the other thread is how to buy and install your own even on soldered systems - the Mac Pro should theoretically be easier. However, even if that weren't the case and Apple blocked anything other than Apple NANDs for their internal storage, the rest of my argument still holds.
Of course they block 3rd-party slotted NAND. Even OWC doesn't offer upgradeable NAND for the Mac Studio (which was introduced in Mar. 2022, and thus has been out for two years) or the Mac Pro. Given OWC's business model (which centers around Mac upgrades), and their technical expertise with Mac systems, if it were possible to offer it, they would—particularly with the prices Apple charges for NAND upgrades. It's an obvious business opportunity for them.

Yet no one offers a 3rd-party version of this, or anything equivalent for the Mac Studio:

A 4 TB Samsung 990 Pro is currently $340 at retail on AZ. The retail price of 4 TB of top-quality NAND alone would be less than that. And even less at wholesale. Let's call it ≈$250. Apple charges $1,200 to upgrade the Studio from 512 GB to 4 TB. Imagine the profit someone could make by offering a 3rd-party slotted 4 TB upgrade for $400 to $600. Yet all I've heard from 3rd-party suppliers is....crickets.

Yes, absence of evidence is not evidence of absence and all that, but common sense also needs to apply here. If no one is making 3rd-party slotted NAND, in spite of the obvious profit potential, it means Apple has effectively locked down the system to prevent it.

As for the soldered NAND, upgrading that is blocked, period, since doing so kills the warranty.
Where I feel you err is thinking that the upgradeable CXL/DDR RAM would be a replacement for buying UMA RAM from Apple. It wouldn't be. It would be what you optionally buy on top of buying UMA RAM from Apple.
Nope, not at all. I don't think secondary DRAM would be a replacement for UMA DRAM. Indeed, given that I carefully described how it would operate at a different hierarchical level (I even called it "secondary"!), and you agreed with that (writing "That's exactly how I think it'll be used"), it's clear I view it as qualitatively different from UMA DRAM, the same as you. Thus I'm quite surprised you would come to such an erroneous conclusion about my thinking.


If Apple were to go down this route (and this is all hypothetical, since we have no idea what their plans are), then Apple would still be selling built-in, non-upgradeable UMA RAM. Thus the analogy is:

UMA RAM <---> internal storage
CXL/DRAM <---> PCIe storage
Hierarchical DRAM is not equivalent to PCIe storage, because Apple required no architectural changes or development costs to allow PCIe boards to make use of the latter. By contrast, it would require custom changes to its architecture, and attendant development costs, to make use of hierarchical DRAM. That makes it inherently more proprietary for Apple. Plus they would want to recoup their development costs. As Leman pointed out, they're not even a member of the CXL group, so they might well roll their own version. I thus think you're conflating something that's already an industry standard (PCIe storage) with something that's completely custom (hierarchical DRAM on top of Apple's UMA). Of course they're going to want to make a profit off of it, rather than allowing customers to use 3rd-party solutions.
 
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
Of course they block 3rd-party slotted NAND. Even OWC doesn't offer upgradeable NAND for the Mac Studio (which was introduced in Mar. 2022, and thus has been out for two years) or the Mac Pro. Given OWC's business model (which centers around Mac upgrades), and their technical expertise with Mac systems, if it were possible to offer it, they would—particularly with the prices Apple charges for NAND upgrades. It's an obvious business opportunity for them.

Yet no one offers a 3rd-party version of this, or anything equivalent for the Mac Studio:

A 4 TB Samsung 990 Pro is currently $340 at retail on AZ. The retail price of 4 TB of top-quality NAND alone would be less than that. And even less at wholesale. Let's call it ≈$250. Apple charges $1,200 to upgrade the Studio from 512 GB to 4 TB. Imagine the profit someone could make by offering a 3rd-party slotted 4 TB upgrade for $400 to $600. Yet all I've heard from 3rd-party suppliers is....crickets.

Yes, absence of evidence is not evidence of absence and all that, but common sense also needs to apply here. If no one is making 3rd-party slotted NAND, in spite of the obvious profit potential, it means Apple has effectively locked down the system to prevent it.

As for the soldered NAND, upgrading that is blocked, period, since doing so kills the warranty.

Except that it *works*. We know that it actually works, physically. That's the key. If it were blocked by something other than warranty, it wouldn't work; and if they are blocking it physically on the Studio/Mac Pro, why do they allow it physically on iPhones/laptops? Instead of just voiding the warranty, they could ensure it wouldn't work there as well. According to the other thread, guys like Louis Rossmann weren't even aware of the Chinese market. I don't know how true that is. But it may be difficult to source blank NAND in bulk for OWC and other aftermarket sellers (I have no idea why that would be the case). Also, why develop something new when your customers could just buy your current storage and get basically the exact same benefit? (Okay, not for the Studio, but the Mac Pro.)

Look, I'll be blunt: I'm not 100% sure what's going on here inside Apple and with 3rd-party sellers, but clearly physical upgrades are possible. This is, however, a bit of a tangent now, as interesting as it is.

Nope, not at all. I don't think secondary DRAM would be a replacement for UMA DRAM. Indeed, given that I carefully described how it would operate at a different hierarchical level (I even called it "secondary"!), and you agreed with that (writing "That's exactly how I think it'll be used"), it's clear I view it as qualitatively different from UMA DRAM, the same as you. Thus I'm quite surprised you would come to such an erroneous conclusion about my thinking.
Because you seemed to think Apple wouldn't be able to make bank selling RAM to the end user. They would, and their margins on UMA would be protected. I understand now that we agree on this, but I now understand that you think the development of this custom hierarchical, modular DRAM system would be so expensive as to require custom DRAM modules, so that Apple, despite touting this hypothetical system as modular, would still need to lock it down. Why?

Hierarchical DRAM is not equivalent to PCIe storage, because Apple required no architectural changes or development costs to allow PCIe boards to make use of the latter. By contrast, it would require custom changes to its architecture, and attendant development costs, to make use of hierarchical DRAM. That makes it inherently more proprietary for Apple. Plus they would want to recoup their development costs. As Leman pointed out, they're not even a member of the CXL group, so they might well roll their own version. I thus think you're conflating something that's already an industry standard (PCIe storage) with something that's completely custom (hierarchical DRAM on top of Apple's UMA). Of course they're going to want to make a profit off of it, rather than allowing customers to use 3rd-party solutions.
Except, for me: if this is so expensive and difficult to develop, is only going to be on the Mac Pro, and they aren't going to use off-the-shelf CXL ... would Apple even do it? This is the problem with this discussion: we're positing hypotheticals on top of hypotheticals about a device they haven't even given its own chip yet (though given your recent track record, if you'd like to make a prediction about the M3 Extreme coming this year, I'd be more than happy! :)). Could they create a bespoke new system, super expensive and difficult to make, with layer upon layer of engineering required, which would add a huge modular memory pool but only through Apple and only for one low-volume device in their entire lineup? Sure, I guess, but why? Even their current Mac Studio/Mac Pro SSD system is an extension of the laptop system, which is the same as the iPhone system. Back when they were still selling socketed Intel Mac Pros they didn't even add NUMA-awareness to macOS. So if this is going to happen at all, could such a RAM system be much easier to develop, such that trying to corner the sale of standard DRAM modules isn't the best way to recoup the more limited expenses of the new system? Yes. Could Apple join CXL between now and releasing this mythical device? Also yes! Basically, those last two are where I'm at. We don't really have any basis for an in-depth discussion, because all we have right now is an M2 Ultra Mac Pro that's just the Studio's chip inside the Mac Pro's body, plus some patents and industry groups which may or may not be relevant on some unknown timescale. But if you're asking about my base assumptions: if such a system were developed specifically to expand the Mac Pro's capabilities, then it would likely use standard DRAM modules and/or be CXL-based. In which case, 3rd-party options, here we go.
 
Last edited:

JouniS

macrumors 6502a
Nov 22, 2020
615
379
A NUMA system is trying to keep all the data localized to the right chip, which again is something that macOS wouldn't be doing. Unless I'm misunderstanding what you did, you'd have to measure core-to-core latency or pin the memory on one chip and then pin the thread to the other chip and measure latency. Is that what you did? Otherwise, everything above up until the 256 GiB row is still a measure of local memory latency. My suspicion is that, with a distance of 21, you will be multiplying a lot of the numbers on the right by 2.1x. Although I have to say, damn, 2.1x is a lot better than I remember from multi-socket systems. But it's been a very long while, I'll admit.

Then there's the bandwidth issue that I brought up in my edit of the post you quoted (sorry I do that all the time). The real issue with dual socket SOCs is feeding the accelerators like the GPU and organizing that work if the memory doesn't go to the right place or can't because it is too big. Bandwidth is one of the key differentiators of something like Ultrafusion relative to standard socket interconnects.

Having said that, these are really interesting numbers - would you mind sharing your code? I have an M3 Max, but with base memory, so I don't have enough RAM to run this up to 64 GiB; still, it could be cool to compare whether the M3's latency is any different - probably not, since it's the same RAM, but who knows, the fabric could've changed?
It's part of a larger set of benchmarks, and unfortunately it cannot be separated clearly from the rest of the code. The basic idea is rather simple:
  • Allocate an array of 2^n 64-bit integers.
  • Fill the array with random numbers that can be used as offsets in the array.
  • Generate another m (e.g. 10 million) random offsets in the array.
  • Iterate over the offsets, and read the value from (offset XOR previous value).
  • Also calculate the sum of the fetched values and print it after the loop, to prevent the compiler from removing the benchmark.
The XOR of two random values is necessary. If we use only the precomputed offsets, the CPU can fetch multiple values in parallel. And if we use only the offsets we read from the array, there can be short cycles, and the values may end up being cached.
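Roughly, in C (a minimal sketch of the idea as described above, not the actual benchmark code; the working-set size, random-number source and timing harness are all simplified):

    /* Pointer-chase latency sketch: dependent random reads over a large array. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    int main(void) {
        size_t n = (size_t)1 << 27;      /* 2^27 * 8 B = 1 GiB working set */
        size_t m = 10 * 1000 * 1000;     /* number of dependent reads */
        uint64_t *a   = malloc(n * sizeof *a);
        uint64_t *off = malloc(m * sizeof *off);
        if (!a || !off) return 1;

        srand(1);
        for (size_t i = 0; i < n; i++) a[i]   = (uint64_t)rand() % n;  /* values usable as offsets */
        for (size_t i = 0; i < m; i++) off[i] = (uint64_t)rand() % n;  /* precomputed offsets */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t prev = 0, sum = 0;
        for (size_t i = 0; i < m; i++) {
            /* XOR serializes the loads (each read depends on the previous value)
               while avoiding short cycles; since n is a power of two and both
               operands are < n, the XOR stays in range. */
            prev = a[off[i] ^ prev];
            sum += prev;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("sum=%llu  avg read latency=%.1f ns\n", (unsigned long long)sum, ns / m);
        free(off);
        free(a);
        return 0;
    }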

The purpose of the benchmark is to measure the average (read) latency of memory, as seen by user-space software. The size of the array varies from kilobytes to hundreds of gigabytes, to see the effects from caches, virtual memory, and NUMA. For the NUMA part, if there are two NUMA nodes with 192 GiB memory each and array size is 256 GiB, we should expect that between 1/2 and 3/4 of the fetches are from local memory and the rest from the other node.
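As a rough sanity check (back-of-the-envelope, not a rigorous model): treating the 64 GiB figure (244 ns) as approximately the local latency and applying the ~2.1x remote penalty (~512 ns), a 3/4-local mix gives about 0.75 x 244 + 0.25 x 512 ≈ 311 ns, and a 1/2-local mix about 378 ns. The measured 337 ns at 256 GiB falls between those two bounds.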

Apple Silicon uses 16 KiB virtual memory pages, while Intel uses 4 KiB. With uniform memory access, I would expect that Apple Silicon scales better as the working set grows. Because the branching factor is higher (with 8-byte page-table entries, a 16 KiB table page holds 2048 entries versus 512 for a 4 KiB page, and each TLB entry covers 4x as much memory), we need fewer physical memory accesses for each logical access. We see this with M2 Max vs. i9 from 1 GiB to 16 GiB. Then, as the working set grows beyond 1/4 of memory capacity, M2 Max starts scaling worse. NUMA, or at least NUMA-like behavior, would be an easy answer.

I haven't really considered the impact on memory bandwidth. In the kind of work I do, the working sets are large and latency is the key. It's rare for a CPU core to consume even a few gigabytes per second.
 
  • Like
Reactions: Chuckeee

crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
It's part of a larger set of benchmarks, and unfortunately it cannot be separated clearly from the rest of the code.
Got it, fair enough. Very cool benchmark.

The basic idea is rather simple:
  • Allocate an array of 2^n 64-bit integers.
  • Fill the array with random numbers that can be used as offsets in the array.
  • Generate another m (e.g. 10 million) random offsets in the array.
  • Iterate over the offsets, and read the value from (offset XOR previous value).
  • Also calculate the sum of the fetched values and print it after the loop, to prevent the compiler from removing the benchmark.
The XOR of two random values is necessary. If we use only the precomputed offsets, the CPU can fetch multiple values in parallel. And if we use only the offsets we read from the array, there can be short cycles, and the values may end up being cached.

The purpose of the benchmark is to measure the average (read) latency of memory, as seen by user-space software. The size of the array varies from kilobytes to hundreds of gigabytes, to see the effects from caches, virtual memory, and NUMA. For the NUMA part, if there are two NUMA nodes with 192 GiB memory each and array size is 256 GiB, we should expect that between 1/2 and 3/4 of the fetches are from local memory and the rest from the other node.

Absolutely, but surely then the latency of communicating with the other node dominates, and the final answer at 256 GiB simply reflects the latency between the sockets? Again, as I wrote in an edit (again, sorry): Anandtech measured socket-to-socket latency on Ampere Altra to be ~350 ns, though they admitted that was substantially worse than Intel/AMD.

I think I read somewhere that AMD's Infinity Fabric latency end-to-end is 600 ns, but I'm not sure of the context. Presumably that's very chip(let)-size dependent. It'd be interesting to see similar data across a wider range of chips, like the base M and the Ultra. Apple has been lauded for its fabric, but as far as I know not much is known about its properties. Maybe latency is a tradeoff they've made to increase its bandwidth to serve the GPU? But the same is true for AMD's Infinity Fabric ... so not sure ...

Apple Silicon uses 16 KiB virtual memory pages, while Intel uses 4 KiB. With uniform memory access, I would expect that Apple Silicon scales better as the working set grows. Because the branching factor is higher, we need fewer physical memory accesses for each logical access. We see this with M2 Max vs. i9 from 1 GiB to 16 GiB. Then, as the working set grows beyond 1/4 memory capacity, M2 Max starts scaling worse. NUMA, or at least NUMA-like behavior, would be an easy answer.
But the point is that in multi-socket systems Apple could start scaling like that at 1 GiB: the 1 GiB latency would look awfully similar to the 256 GiB latency, because macOS is supposedly not NUMA-aware. Basically, you could have 1 GiB of data on node 0 being accessed by a thread on node 1. That's why I'm saying the equivalent test on the Xeon would be to try to pin your memory on one node and read it with the other, if that's possible. Because unless macOS were re-engineered for NUMA, that would happen. Now, I have no idea about the ease/difficulty of that for Apple; presumably they could make macOS NUMA-aware, but they seemed to avoid doing so even when they had socketed CPUs, and now they have to feed more than just CPUs. They don't really seem to want to create custom OS systems just to serve the needs of the Mac Pro, even in the early Intel days.

I haven't really considered the impact on memory bandwidth. In the kind of work I do, the working sets are large and latency is the key. It's rare for a CPU core to consume even a few gigabytes per second.
Aye, I mean Sapphire Rapids has accelerators too, but they're more akin to AMX and so forth - in fact, as an aside, I think Intel even calls its matrix engine AMX too. But I don't think you can really compare that to the needs of the GPU. Maybe it would still work, but given the history of multi-GPU systems and multi-die GPUs, I think there's a reason people didn't do it. The two socketed GPUs would have to behave like multi-GPUs, and I just don't think Apple would go that way. Especially not after the trash can Pro. :)
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,714
2,820
Except that it *works*. We know that it actually works, physically. That's the key. If it were blocked by something other than warranty, it wouldn't work; and if they are blocking it physically on the Studio/Mac Pro, why do they allow it physically on iPhones/laptops as well? Instead of just voiding the warranty, they could ensure it wouldn't work.
Except that it *doesn't work* on the systems we're talking about, the slotted systems like the Studio and MP. If Apple offered upgradeable secondary RAM on the MP, it would, by its nature, be slotted rather than soldered. Those are the types of systems Apple has bothered to block. Apple didn't need to block upgrades to soldered NAND, precisely because only a tiny percentage are going to upgrade that, for a variety of reasons.
you think the hierarchical, modular DRAM system would be so expensive as to require custom DRAM modules, so that Apple, despite touting this hypothetical system as modular, would still need to lock it down. Why?
Nope, I never said Apple's OEM costs on the modules would be high.

I think if Apple rolls its own, the parts could be purchased relatively inexpensively by Apple (it's just DRAM). [Samsung might charge a lot for its branded CXL modules, but that's different from the actual parts cost.] What I instead said would be expensive for Apple would be the development costs to rework its architecture to enable hierarchical RAM, so they're going to want to recoup that.*

Plus, more broadly, look at the way Apple operates. I think you're looking at this from a technical perspective, but it's really about the money. Apple has been moving increasingly to lock down all sources of profit--that's why it configures its products, as much as it can, so that upgrades can only be purchased from Apple. If it changes its MP architecture to allow secondary RAM, it's going to want customers to buy that secondary RAM from it alone--for high-profit-margin prices. Given Apple's behavior, why would you think otherwise?

*And even if the development costs aren't significant, they're still going to want that secondary RAM to be purchased from them alone, because, why not?!

The PCIe storage is the exception, not the rule, because it's an industry standard to which Apple would have had to block access. Secondary RAM on Apple's UMA is entirely different.
 

leman

macrumors Core
Oct 14, 2008
19,319
19,336
I don't think that would change things that much. Apple Silicon Macs already feel like NUMA systems, even if it's hidden from the user. For example, look at these memory latency measurements I've made:

Working set | iMac (i9-10910, 128 GiB) | MBP (M2 Max, 96 GiB)
1 GiB       | 94 ns                    | 117 ns
2 GiB       | 96 ns                    | 123 ns
4 GiB       | 108 ns                   | 123 ns
8 GiB       | 134 ns                   | 129 ns
16 GiB      | 164 ns                   | 143 ns
32 GiB      | 182 ns                   | 201 ns
64 GiB      | 191 ns                   | 365 ns

The exact numbers don't matter, as they are noisy. However, there is a huge increase in latency on M2 Max when the working set increases from 16 GiB (below 1/4 capacity) to 32 GiB (above 1/4), and again from 32 GiB (below 1/2) to 64 GiB (above 1/2). I'd expect that the effect would be even more significant with Ultra.

I think what you are seeing here is the effect of the cache hierarchy. The latency goes up once your working set starts exceeding the SLC cache size. Fundamentally, pretty much any modern computing device is NUMA - the latency depends on whether the data is cached and where it's cached. Apple Silicon is more NUMA-like in this regard than most x86 platforms, and I think it has to do with both energy efficiency goals and the UMA architecture. What's more, inter-core communication latency on Apple Silicon is very high, and presumably communicating between different processors on the SoC is even slower.

I'm not sure Apple wants to add NUMA support into macOS without much financial benefit. I would think latency would be an order of magnitude worse or more with a multi-socket CPU, as synchronisation would have to be done by the OS.

They can do quite a lot of things without NUMA in the classical sense. Modular RAM (implemented as a slow, large memory pool), for example, could work entirely transparently. I think even multiple SoCs would work if they combine them with a fast enough interconnect. That, of course, is trickier. I think they will stick with tightly integrated chips for a while. The Extreme is probably the most we can hope for.
 

leman

macrumors Core
Oct 14, 2008
19,319
19,336
But the point is that in multi-socket systems Apple could start scaling like that at 1 GiB: the 1 GiB latency would look awfully similar to the 256 GiB latency, because macOS is supposedly not NUMA-aware. Basically, you could have 1 GiB of data on node 0 being accessed by a thread on node 1. That's why I'm saying the equivalent test on the Xeon would be to try to pin your memory on one node and read it with the other, if that's possible. Because unless macOS were re-engineered for NUMA, that would happen. Now, I have no idea about the ease/difficulty of that for Apple; presumably they could make macOS NUMA-aware, but they seemed to avoid doing so even when they had socketed CPUs, and now they have to feed more than just CPUs. They don't really seem to want to create custom OS systems just to serve the needs of the Mac Pro, even in the early Intel days.

I would be very surprised if Apple introduces “traditional” NUMA-like APIs for the CPU (although they do have NUMA APIs for the GPU). It’s just not their style. If they go that route, they will make sure stuff works transparently.

As we discuss here, in many ways Apple Silicon already exhibits NUMA-like behavior, especially on the Ultra. It is very possible that your thread will request data from a memory page served by a memory controller on the other die, in which case the request has to go through UltraFusion. What is also very interesting is that they do not cache stuff across chips. SLC blocks are linked to specific memory controllers and service independent hardware memory ranges. I assume that this radically simplifies the implementation, as you don't have to worry about there being multiple copies of the data at the same cache level. The latency will obviously be higher than in some other implementations. However, Apple has many ways to deal with this stuff. First, large CPU caches. Second, organization of CPU cores into clusters where communication is fast (and it's these clusters that are the basic multiprocessing block; Apple's APIs stop at this level). Third, their patents describe memory copying/folding schemes where they would move hardware memory pages closer to where they are needed (no idea if this is or ever will be in use). As you and others have already mentioned, all of this is predicated on having very high bandwidth between the "nodes". Latency is less important (and is high in current implementations). So whatever Apple does, I doubt they will change this fundamental approach.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
Except that it *doesn't work* on the systems we're talking about, the slotted systems like the Studio and MP.

Except you don't actually know that. EDIT: actually, that SSD upgrade thread indicates the opposite, as they talk about upgrading the M1 Ultra, and I'm pretty sure I remember Hector Martin saying he was sure it was possible.
If Apple offered upgradeable secondary RAM on the MP, it would, by its nature, be slotted rather than soldered. Those are the types of systems Apple has bothered to block. Apple didn't need to block upgrades to soldered NAND, precisely because only a tiny percentage are going upgrade that, for a variety of reasons.
Sure, but what's the block, and why not lock everyone down? (As per the EDIT above, it's probably not blocked.) The MP is already only a sliver of their market too. Apparently selling soldered-SSD upgrade services is big business in China, one of the biggest markets in the world, on Apple's most important device, the iPhone. And again, for the Mac Pro, people can just buy PCIe storage, so Apple's strategy makes no sense whatsoever. If I'm a price-conscious consumer, why would I pay Apple for 8 TB of SSD storage when I don't have to? The PCIe storage is even internal to the Pro.

Nope, I never said Apple's OEM costs on the modules would be high.

I think if Apple rolls its own, the parts could be purchased relatively inexpensively by Apple (it's just DRAM). [Samsung might charge a lot for its branded CXL modules, but that's different from the actual parts cost.] What I instead said would be expensive for Apple would be the development costs to rework its architecture to enable hierarchical RAM, so they're going to want to recoup that.*

That's what I meant. Here's what I wrote later:

Except, for me: if this is so expensive and difficult to develop, is only going to be on the Mac Pro, and they aren't going to use off-the-shelf CXL ... would Apple even do it? This is the problem with this discussion: we're positing hypotheticals on top of hypotheticals about a device they haven't even given its own chip yet (though given your recent track record, if you'd like to make a prediction about the M3 Extreme coming this year, I'd be more than happy! :)). Could they create a bespoke new system, super expensive and difficult to make, with layer upon layer of engineering required, which would add a huge modular memory pool but only through Apple and only for one low-volume device in their entire lineup? Sure, I guess, but why?


Plus, more broadly, look at the way Apple operates. I think you're looking at this from a technical perspective, but it's really about the money. Apple has been moving increasingly to lock down all sources of profit--that's why it configures its products, as much as it can, so that upgrades can only be purchased from Apple. If it changes its MP architecture to allow secondary RAM, it's going to want customers to buy that secondary RAM from it alone--for high-profit-margin prices. Given Apple's behavior, why would you think otherwise?

*And even if the development costs aren't significant, they're still going to want that secondary RAM to be purchased from them alone, because, why not?!

The PCIe storage is the exception, not the rule, because it's an industry standard to which Apple would have had to block access. Secondary RAM on Apple's UMA is entirely different.
Except modular DRAM is also an industry standard; using it for super-swap doesn't change that. Again, this is all hypotheticals on top of hypotheticals. It's an interesting thought experiment, but unless something concrete comes out - hell, just an Extreme SoC - to even suggest that they'll bother making something, anything, specifically for the Mac Pro, it remains just that.

==========
Anyway, I wrote a goodbye on this topic to you and @JouniS earlier and, in my exhaustion, accidentally erased it. I'll try to recapitulate: I'm really not feeling great and I'm exhausted. I'll read what you guys write, and these discussions/debates have been a lot of fun, but I can't really continue. Apologies for any overly lengthy and confusing posts; to butcher a famous quote: "if I'd had more [mental bandwidth], I'd have written a shorter[, clearer] [post]." Ta!
 
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,302
1,002
I would be very surprised if Apple introduces “traditional” NUMA-like APIs for the CPU (although they do have NUMA APIs for the GPU).

Really?! Why? Do they use it for anything? Wait, is this a holdover from something like dGPUs communicating with each other in the trash can Pro?
It’s just not their style. If they go that route, they will make sure stuff works transparently.

As we discuss here, in many ways Apple Silicon already exhibits NUMA-like behavior, especially on Ultra. It is very possible that your thread will request data from a memory page served by a memory controlled from the other die, in which case the request has to go through UltraFusion. What is also very interesting is that they do not cache stuff across chips. SLC blocks are linked to specific memory controllers and service independent hardware memory ranges. I assume that this radically simplifies implementation as you don’t have to worry about there being multiple copies of the data at the same cache level. The latency will obviously be higher than in some other implementations. However, Apple has many of ways to deal with this stuff. First, large CPU caches. Second, organization of CPU cores into clusters where communication is fast (and it’s these clusters that are the basic multiprocessing block, Apples APIs stop at this level). Third, their patents describe memory copying/folding schemes where they would move hardware memory pages closer to where they are needed (no idea if this is or ever will be in use). As you and others have already mentioned, all of this is predicated on having very high bandwidth between the “nodes”. Latency is less important (and is high in current implementations). So whatever Apple does, I doubt they will change this fundamental approach.
Agreed.
 

leman

macrumors Core
Oct 14, 2008
19,319
19,336
Really?! Why? Do they use it for anything? Wait, is this a holdover from something like dGPUs communicating with each other in the trash can Pro?

To support AMD Infinity Fabric Link on the Intel Mac Pro.


I can certainly imagine them using these APIs for Apple Silicon if they decide to go multi-socket. With enough bandwidth, these APIs won't be needed for the CPUs.
 
Last edited:

MRMSFC

macrumors 6502
Jul 6, 2023
343
353
Modularity. I really want an Apple Silicon version of the 2019 Mac Pro, or even better, the 2010 Mac Pro. As much upgradability as possible.

- Alex
Can you expand upon that?

Apple’s CPU architecture pretty much rules out separate RAM and GPU.

I do think they could make the SoC socketable (at least for the Mac Pro), and have the SSD modules upgradeable as well, since there’s a software lock preventing it.

Would that hypothetical model be acceptable?
 

MRMSFC

macrumors 6502
Jul 6, 2023
343
353
The performance and size advantages are clear, but they are by nature unexpandable. That's not great for my personal use-case, but it also creates e-waste, which is why I am such a strong believer in modular design, in servers, desktops and laptops.

- Alex
I also believe that reducing e-waste is important, but I'm of the opinion that software is a far better solution than modular hardware.

Don't get me wrong, I do believe there should be a law that consumer devices must have a user-replaceable battery and a user-replaceable drive that can be booted from.

I think we both agree that the best way to reduce waste is to use the device you have for the longest possible time.

In my experience, that means running a third party OS when official support ends.

Hardware technology changes quickly and often in ways one cannot predict. And trying to force standards may run the risk of making a technological dead end a requirement. Or, in order to avoid that scenario, be so broad as to be ineffective.

In contrast, open-source software has been shown to be incredibly successful, and because a device that's EoL has by definition been out for a long time, open-source developers have had the time to make their software available on said device.
 

jinnyman

macrumors 6502a
Sep 2, 2011
762
671
Lincolnshire, IL
Well nVidia will soon surpass Apple. Who would've thought that when Apple dropped nVidia?

I think Apple wasted too much time on the Apple car with a failed strategy, and didn't catch up with the generative AI stuff soon enough.
 

bcortens

macrumors 65816
Aug 16, 2007
1,294
1,673
Ontario Canada
Well nVidia will soon surpass Apple. Who would've thought that when Apple dropped nVidia?

I think Apple wasted too much time on the Apple car with a failed strategy, and didn't catch up with the generative AI stuff soon enough.
The car was always a pointless distraction and should have been killed years ago.

I think NVidia's market cap passing Apple's isn't really that important. What matters, rather, is that NVidia is really the only game in town when it comes to high-performance generative AI in the cloud. If Apple wants to compete there, they are in trouble. If they plan to do more on-device generative AI models, then they will still likely be doing their model training on NVidia (which has to sting), but most of the inference should be on Apple Silicon, and in that case not having NVidia is a boon, as it ensures they are optimizing for a common platform from the Watch to the Mac Pro.
 
  • Like
Reactions: leman

FilthyMuppetInnuendo

macrumors member
Oct 25, 2016
84
30
Well nVidia will soon surpass Apple. Who would've thought that when Apple dropped nVidia?
Nonsense. I couldn't care less about market cap. I can't buy an "nVidia computer" and will never want to. Their chips are power sinks. Their products are three times as expensive as they should be. And they abandoned Apple to rot when it was proven that their entire chip line was faulty from the factory. I lost my MacBook Pro to a melted 8600M, and nVidia got away scot-free. It would otherwise still be working today, and I'd still be using it, if not for that. Apple was right to abandon them for that behavior. I expect M5 graphics to have parity with then-current nVidia offerings at half the power draw.

This “AI” hoax is as much a fad for hardware sales as the crypto scheme was. They’ll have a catastrophic market correction soon enough.
 
  • Like
Reactions: Chuckeee

jinnyman

macrumors 6502a
Sep 2, 2011
762
671
Lincolnshire, IL
Nonsense. I couldn't care less about market cap. I can't buy an "nVidia computer" and will never want to. Their chips are power sinks. Their products are three times as expensive as they should be. And they abandoned Apple to rot when it was proven that their entire chip line was faulty from the factory. I lost my MacBook Pro to a melted 8600M, and nVidia got away scot-free. It would otherwise still be working today, and I'd still be using it, if not for that. Apple was right to abandon them for that behavior. I expect M5 graphics to have parity with then-current nVidia offerings at half the power draw.

This “AI” hoax is as much a fad for hardware sales as the crypto scheme was. They’ll have a catastrophic market correction soon enough.
I get that there are people who will never buy an "nVidia computer". But if you insist AI is a hoax, well, you will be awakened to an uncomfortable reality.

Well, if you are retired, none of the AI stuff will probably matter, though. I'll give you that.
 
  • Like
Reactions: Elusi

FilthyMuppetInnuendo

macrumors member
Oct 25, 2016
84
30
But if you insist AI is a hoax, well, you will be awakened to an uncomfortable reality.
I'll clarify. It's beyond the purview of this thread, but specifically the delusions that (1) AI can be proven to exist (in its 'hard' form) and that we have AI today, and (2) that large language models will actually "take people's jobs" and prove the Luddites correct for the first time in history. There's no indication that the UBI propagandists are correct, never mind the people who think "Skynet is imminent."

"AI" exists, absolutely - but not as intelligence, not as a sapient being; it's incapable of making decisions or thinking creatively, and it's ultimately just a machine. Generative algorithms aren't a hoax. But the media calling it "AI" doesn't make it so.
 

Elusi

macrumors regular
Oct 26, 2023
182
382
I'll clarify. It's beyond the purview of this thread, but specifically the delusions that (1) AI can be proven to exist (in its 'hard' form) and that we have AI today, and (2) that large language models will actually "take people's jobs" and prove the Luddites correct for the first time in history. There's no indication that the UBI propagandists are correct, never mind the people who think "Skynet is imminent."

"AI" exists, absolutely - but not as intelligence, not as a sapient being; it's incapable of making decisions or thinking creatively, and it's ultimately just a machine. Generative algorithms aren't a hoax. But the media calling it "AI" doesn't make it so.
The sexy media idea that LLMs are sentient and will look at us the way we look at plants feels like a year-old discussion. I don't think people and companies are buying their hardware right now with the expectation that they can put together a Skynet of sorts.

I think the current trend of keeping faith in "AI" is driven by the observation that so many find good use cases for chatGPT and similar tools. They provide direct benefit for people's jobs _today_ and that's where I think the likeness to crypto falters.
 
  • Like
Reactions: MRMSFC and leman

Regulus67

macrumors 6502
Aug 9, 2023
395
377
Värmland, Sweden
Is AI really "AI" or just a very complex search/database mechanism? Language parsing has been around for quite a long time.

I don't buy it completely.
Martin Armstrong, who created Socrates, says "AI" has nothing to do with AI. He claims his system is the only one in the world that uses AI.
He even presented a blog post about this. And if anyone ought to understand this subject, it must be him.

ai computers
 
  • Like
Reactions: avkills

avkills

macrumors 65816
Jun 14, 2002
1,183
986
Martin Armstrong, who created Socrates, says "AI" has nothing to do with AI. He claims his system is the only one in the world that uses AI.
He even presented a blog post about this. And if anyone ought to understand this subject, it must be him.

ai computers
I believe a better test of a system claiming to be AI is to not allow it any historical data, only access to data generated since it was "launched" (birthed, whatever term you want to use).

When I was in high school (late eighties), I was quite fascinated with the idea of natural language parsing, mostly because of the games Zork and Hitchhiker's Guide. I went so far as to actually buy a book on AI (I can't remember the name, and I think I loaned it to my nephew). I have no idea what they are doing now, but back then the rage was WASP parsing on the natural-language front. English is quite difficult because of the many words that are similar but mean different things.

But it really comes down to the strength of the database, in my opinion, which is why I think it would be a good experiment to make the system build its own database from nothing and see how it fares. Obviously you will need the LLM in place as a start.
 

Regulus67

macrumors 6502
Aug 9, 2023
395
377
Värmland, Sweden
But it really comes down to the strength of the database, in my opinion, which is why I think it would be a good experiment to make the system build its own database from nothing and see how it fares. Obviously you will need the LLM in place as a start.
From what I have heard Martin Armstrong explain, his system collects information on the internet and builds its own database. And that is how the π (pi) value was discovered in cycles.

Why would you disregard historical data? It could be argued that is what they are doing with climate models.

Curiosity is perhaps the most important trait in humans that helps build intelligence. And to see things in perspective, knowledge of the past seems vital to me.
I feel confident that most of the errors, or mistakes, I have made were caused by a lack of information or experience.

Will it be any different for computer AI? Or is the idea to help the AI by starting with this limitation, and adding more as it develops?
 
  • Like
Reactions: avkills