We can say that a larger portion of the M4 will be used by the AI units, leaving less room for improvements on the CPU and GPU side. I expect that improvements for the CPU and GPU compared to the M3 will be minor. I am no expert on AI: are there reasons that larger AI units require, or strongly benefit from, more RAM? If not, I don't think the base RAM will stay at 8 GB; maybe 12 GB, but not more.

I'm not an expert either, but yes, I expect the M4 to have a special focus on the Neural Engine, while the CPU stays largely the same (maybe more efficient cores?) and the GPU gets a bit more powerful, paired with 12 GB of base RAM.
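To put rough numbers on the RAM question: an on-device model's weights set a floor on memory use, and the arithmetic is simple. A back-of-the-envelope sketch (model sizes and quantization levels here are illustrative, not anything Apple has announced):

```python
# Back-of-the-envelope: memory needed just to hold an LLM's weights.
# Model sizes and bit widths are illustrative, not Apple's actual models.

def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """GB required to store the weights alone (no KV cache, no activations)."""
    return params_billions * bits_per_param / 8  # 1e9 params and 1e9 bytes/GB cancel

for params in (3, 7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B params @ {bits}-bit: {weights_gb(params, bits):5.1f} GB")
```

Even a small 3B model at 16-bit wants ~6 GB for weights alone, on top of the OS and apps, which is why bigger AI units plausibly go hand in hand with more base RAM.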
I wonder if it would be possible, likely, and practical for Apple to use a fusion connector for the Ultra, but instead of 2x Mn Max chips glued together, it would be an SoC with CPU/NPU as the first chip, plus a second chip offering some combination of GPU/NPU/CPU cores (or possibly just GPU cores), without unnecessarily duplicating other SoC components. The die sizes of both would be larger and need more power, but for a desktop system I would not see that as a problem.
If they went this route, I could see the base chip having maybe 20 P-cores and a 16-core NPU (and a binned version with maybe 16 P-cores). The second chip could come in one of two standard configurations (plus binned versions). The first would be GPU-heavy, with most of the die devoted to GPU cores (say 80-120). The second would add maybe 8-12 P-cores, 16 NPU cores, and 64-80 GPU cores. My numbers here are just spitballing from wishful thinking and ignorance, so feel free to substitute more realistic ones. My hope is that the second option would allow for more PCIe bandwidth to the expansion slots in the Mac Pro.
It also gets more costly to have asymmetric dies. One die used twice is cheaper than building, testing, and qualifying two different ones. If you are going to build two different dies, you get more 'chiplet benefit' traction by being able to use at least one of them in a package multiple times (e.g., AMD's 2-8 CPU chiplets + 1 I/O die in their 'CPU' packages; the I/O die being the only one with a connection to memory tends to make access 'uniform', because everyone else is about equally slower).
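To put hypothetical numbers on that NRE argument (the mask-set cost and volume below are made up; real figures vary by node and vendor):

```python
# Why one reused die can beat two bespoke dies: the design/mask/qualification
# cost (NRE) is paid per *design*, not per unit. All numbers are hypothetical.
nre_per_design = 30e6    # assumed mask set + design + qualification, USD
units_shipped = 2_000_000

shared = 1 * nre_per_design / units_shipped   # Ultra = 2 x Max: one design
bespoke = 2 * nre_per_design / units_shipped  # separate CPU die + GPU die

print(f"NRE per package, one shared die:   ${shared:.2f}")
print(f"NRE per package, two bespoke dies: ${bespoke:.2f}")
```

The per-package delta shrinks at high volume, which is AMD's trick: each chiplet design is amortized across many products. It bites hardest when a die would ship in only one low-volume product like the Ultra.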
We do know that they are considering designs like this that combine an SoC with additional specialized dies; it shows up in a recent Apple patent.
What was the title (focus) of that patent? That is more dis-integrated and functionally decomposed than Intel's current 'Meteor Lake' Core Ultra, and a bit excessive if perf/watt is a top priority. Is it actually about double-sided interposers, or about the stuff glued to the interposer? If the former, it may just be trying to show the wide variety of possible interfaces, which would not necessarily be a highly considered design direction. It also wouldn't be surprising if that design were reticle-limited.

This is the patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US426464916&_cid=P21-LVE0SW-78241-1
The general topic, to my layman eyes, seems to be combining multiple dies manufactured at different nodes while minimizing overall package area and managing heat. They also mention optical interconnects between the dies. It does not seem to me that perf/watt is the major goal here. Instead, this is about performance and optimal utilization of nodes (e.g., splitting logic and cache into dies manufactured at different node sizes). In general, it sounds like the chiplet design AMD pioneered. Perhaps Apple has some new tricks up its sleeve.

So if they use a 'chiplet' design, would there be any way to mitigate increased memory latency?

More (and/or faster) cache.
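For intuition on what cache buys you here, a toy pointer-chase sketch in Python: every load depends on the previous one, so the loop runs at memory latency, and time per step jumps once the working set outgrows the caches. (Interpreter overhead blunts the effect, and the sizes are illustrative; a C version shows the cliff more starkly.)

```python
import random
import time

def random_cycle(n: int) -> list[int]:
    """Sattolo's algorithm: a single random cycle over n slots,
    so the chase visits every slot before repeating."""
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)
        p[i], p[j] = p[j], p[i]
    return p

def ns_per_step(next_idx: list[int], steps: int = 2_000_000) -> float:
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = next_idx[i]  # dependent load: can't be prefetched or overlapped
    return (time.perf_counter() - t0) / steps * 1e9

for n in (1 << 12, 1 << 18, 1 << 24):  # roughly cache-resident ... DRAM-resident
    print(f"{n:>10} slots: {ns_per_step(random_cycle(n)):6.1f} ns/step")
```

A bigger or faster cache pushes that cliff further out, which is exactly how you hide the extra hop a chiplet interconnect adds.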
This makes me wonder whether this could be how Apple introduces a kind of quasi-modularity for the Mac Pro, where you could swap one SoC package for another if they were placed on a removable daughter card. I doubt they would actually go this direction, but if memory serves, many of the Power Mac G4s could be upgraded because their processors sat on daughter cards.
More likely than that, they could offer an in-store service where, by appointment, they swap your old motherboard for an upgraded one (at a huge discount over buying a new machine outright), provided they keep all other aspects of the case design the same across generations.
What I think is most likely is that Apple will maintain the status quo where you choose the configuration at purchase, but a man can dream.
The memory latency of Apple Silicon is already very high; communication between CPU clusters on a single die shows latencies comparable to multi-socket x86 systems. Apple's fabric is optimized for bandwidth, not latency. And they already seem to deal with that quite well, so additional latency likely won't hurt them too much.
The CPU clusters are not the primary clusters the fabric is designed around. A lot of the discussion about these Apple SoCs takes a pre-integration viewpoint in which the whole system revolves around the CPU cores and everything else is built around them. Apple is more nearly tacking the CPU cores onto what is there for the GPU cluster(s). The CPU core clusters sit behind QoS limiters (so yes, they are structured to deal with some latency). The Max die in particular is closer to a GPU package with some CPU cores tacked on; the Pro is much the same, and the plain Mn has far more non-CPU logic on die than CPU clusters.

And it is not so much 'high' latency as regular latency. Throwing substantially more irregular latency at the GPU processing stack would probably not leave the user experience 'smooth sailing'.
What are the fundamental operations involved in evaluating an LLM? I had thought (based on experience from long, long ago) that it involved very large matrix operations, i.e., algorithms with block-based memory access patterns. Is this correct?
Pretty much, yes. It's a lot of matrix multiplication with occasional vector normalization.

@leman Thank you. In that case I don't think building larger caches will help much; memory bandwidth will be the critical resource. So I would look for structures supporting large memory throughput, even at the cost of an increase in latency.
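That intuition can be made concrete with a roofline-style bound: during token generation, essentially all of the weights are streamed from memory for every token, so bandwidth caps throughput no matter how fast the ALUs are. A sketch using Apple's published M3-family bandwidth figures and an illustrative 7B model at 4-bit:

```python
# Upper bound on LLM decode speed when weight streaming dominates:
#   tokens/sec <= memory bandwidth / bytes of weights read per token.
# Model size is illustrative; bandwidths are Apple's published specs
# (M3 Max figure is the top configuration).
weights_gb = 7e9 * 4 / 8 / 1e9  # 7B params at 4 bits/param = 3.5 GB

for name, gb_per_s in [("M3", 100), ("M3 Pro", 150), ("M3 Max", 400)]:
    print(f"{name:7s}: <= {gb_per_s / weights_gb:5.1f} tokens/s")
```

No realistic cache can hold 3.5 GB of weights, which is why bandwidth rather than cache is the lever here.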
Is it likely that we get more information about the M4 at WWDC? I know WWDC is about software, but software depends on the capabilities of the hardware, and the focus on AI development may be a reason to say something about hardware.