We can say that a larger portion of the M4 will be used by the AI units, leaving less room for improvements on the CPU and GPU side. I expect that improvements for the CPU and GPU compared to the M3 will be minor. I am no expert on AI: are there reasons that larger AI units require, or strongly benefit from, more RAM? If not, I don't think the base RAM will stay at 8 GB; maybe 12 GB, but not more.

I'm not an expert either, but yes, I expect the M4 to have a special focus on the Neural Engine, while the CPU stays largely the same (maybe more efficient cores?) and the GPU gets a bit more powerful, paired with 12 GB of base RAM.
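To put rough numbers on the RAM question: an on-device model's weights set a floor on memory use, and the arithmetic is simple. A back-of-the-envelope sketch (model sizes and quantization levels here are illustrative, not anything Apple has announced):

```python
# Back-of-the-envelope: memory needed just to hold an LLM's weights.
# Model sizes and bit widths are illustrative, not Apple's actual models.

def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """GB required to store the weights alone (no KV cache, no activations)."""
    return params_billions * bits_per_param / 8  # 1e9 params and 1e9 bytes/GB cancel

for params in (3, 7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B params @ {bits}-bit: {weights_gb(params, bits):5.1f} GB")
```

Even a small 3B model at 16-bit wants ~6 GB for weights alone, on top of the OS and apps, which is why bigger AI units plausibly go hand in hand with more base RAM.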
I wonder if it would be possible, likely, and practical for Apple to use a fusion connector for the Ultra, but instead of 2x Mn Max chips glued together, it would be an SoC with CPU/NPU as the first chip, plus a second chip offering some combination of GPU/NPU/CPU cores (or possibly just GPU cores), without unnecessarily duplicating other SoC components. The die sizes of both would be larger and need more power, but for a desktop system I would not see that as a problem.
If they went this route, I could see the base chip having maybe 20 P-cores and a 16-core NPU (and a binned version with maybe 16 P-cores). The second chip could come in one of two standard configurations (plus binned versions). The first would be GPU-heavy, with most of the die devoted to GPU cores (say 80-120). The second would add maybe 8-12 P-cores, 16 NPU cores, and 64-80 GPU cores. My numbers here are just spitballing from wishful thinking and ignorance, so feel free to substitute more realistic ones. My hope is that the second option would allow for more PCIe bandwidth to the expansion slots in the Mac Pro.
It also gets more costly to have asymmetric dies. One die used twice is cheaper than building, testing, and qualifying two different ones. If you are going to build two different dies, you get more 'chiplet benefit' traction by being able to use at least one of them in a package multiple times (e.g., AMD's 2-8 CPU chiplets + 1 I/O die in their 'CPU' packages; the I/O die being the only one with a connection to memory tends to make access 'uniform', because everyone else is about equally slower).
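To put hypothetical numbers on that NRE argument (the mask-set cost and volume below are made up; real figures vary by node and vendor):

```python
# Why one reused die can beat two bespoke dies: the design/mask/qualification
# cost (NRE) is paid per *design*, not per unit. All numbers are hypothetical.
nre_per_design = 30e6    # assumed mask set + design + qualification, USD
units_shipped = 2_000_000

shared = 1 * nre_per_design / units_shipped   # Ultra = 2 x Max: one design
bespoke = 2 * nre_per_design / units_shipped  # separate CPU die + GPU die

print(f"NRE per package, one shared die:   ${shared:.2f}")
print(f"NRE per package, two bespoke dies: ${bespoke:.2f}")
```

The per-package delta shrinks at high volume, which is AMD's trick: each chiplet design is amortized across many products. It bites hardest when a die would ship in only one low-volume product like the Ultra.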
We do know that they are considering designs like this that combine an SoC with additional specialized dies; it shows up in a recent Apple patent.
What was the title (focus) of that patent? That is more dis-integrated and functionally decomposed than Intel's current 'Meteor Lake' Core Ultra, and a bit excessive if perf/watt is a top priority. Is it actually about double-sided interposers, or about the stuff glued to the interposer? If the former, it may just be trying to show the wide variety of possible interfaces, which would not necessarily be a highly considered design direction. It also wouldn't be surprising if that design were reticle-limited.

This is the patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US426464916&_cid=P21-LVE0SW-78241-1
The general topic, to my layman eyes, seems to be combining multiple dies manufactured at different nodes while minimizing overall package area and managing heat. They also mention optical interconnects between the dies. It does not seem to me that perf/watt is the major goal here. Instead, this is about performance and optimal utilization of nodes (e.g., splitting logic and cache into dies manufactured at different node sizes). In general, it sounds like the chiplet design AMD pioneered. Perhaps Apple has some new tricks up its sleeve.

So if they use a 'chiplet' design, would there be any way to mitigate increased memory latency?

More (and/or faster) cache.
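For intuition on what cache buys you here, a toy pointer-chase sketch in Python: every load depends on the previous one, so the loop runs at memory latency, and time per step jumps once the working set outgrows the caches. (Interpreter overhead blunts the effect, and the sizes are illustrative; a C version shows the cliff more starkly.)

```python
import random
import time

def random_cycle(n: int) -> list[int]:
    """Sattolo's algorithm: a single random cycle over n slots,
    so the chase visits every slot before repeating."""
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)
        p[i], p[j] = p[j], p[i]
    return p

def ns_per_step(next_idx: list[int], steps: int = 2_000_000) -> float:
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = next_idx[i]  # dependent load: can't be prefetched or overlapped
    return (time.perf_counter() - t0) / steps * 1e9

for n in (1 << 12, 1 << 18, 1 << 24):  # roughly cache-resident ... DRAM-resident
    print(f"{n:>10} slots: {ns_per_step(random_cycle(n)):6.1f} ns/step")
```

A bigger or faster cache pushes that cliff further out, which is exactly how you hide the extra hop a chiplet interconnect adds.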
This makes me wonder whether this could be how Apple introduces a kind of quasi-modularity for the Mac Pro, where you could swap one SoC package for another if they were placed on a removable daughter card. I doubt they would actually go this direction, but if memory serves, many of the Power Mac G4s could be upgraded because their processors sat on daughter cards.
More likely than that, they could offer an in-store service where, by appointment, they swap your old motherboard for an upgraded one (at a huge discount over buying a new machine outright), provided they keep all other aspects of the case design the same across generations.
What I think is most likely is that Apple will maintain the status quo where you choose the configuration at purchase, but a man can dream.
The memory latency of Apple Silicon is already very high; communication between CPU clusters on a single die shows latencies comparable to multi-socket x86 systems. Apple's fabric is optimized for bandwidth, not latency. And they already seem to deal with that quite well, so additional latency likely won't hurt them too much.
The CPU clusters are not the primary clusters the fabric is designed around. A lot of the discussion about these Apple SoCs takes a pre-integration viewpoint in which the whole system revolves around the CPU cores and everything else is built around them. Apple is more nearly tacking the CPU cores onto what is there for the GPU cluster(s). The CPU core clusters sit behind QoS limiters (so yes, they are structured to deal with some latency). The Max die in particular is closer to a GPU package with some CPU cores tacked on; the Pro is much the same, and the plain Mn has far more non-CPU logic on die than CPU clusters.

And it is not so much 'high' latency as regular latency. Throwing substantially more irregular latency at the GPU processing stack would probably not leave the user experience 'smooth sailing'.
What are the fundamental operations involved in evaluating an LLM? I had thought (based on experience from long, long ago) that it involved very large matrix operations, i.e., algorithms with block-based memory access patterns. Is this correct?
Pretty much, yes. It's a lot of matrix multiplication with occasional vector normalization.

@leman Thank you. In that case I don't think building larger caches will help much; memory bandwidth will be the critical resource. So I would look for structures supporting large memory throughput, even at the cost of an increase in latency.
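That intuition can be made concrete with a roofline-style bound: during token generation, essentially all of the weights are streamed from memory for every token, so bandwidth caps throughput no matter how fast the ALUs are. A sketch using Apple's published M3-family bandwidth figures and an illustrative 7B model at 4-bit:

```python
# Upper bound on LLM decode speed when weight streaming dominates:
#   tokens/sec <= memory bandwidth / bytes of weights read per token.
# Model size is illustrative; bandwidths are Apple's published specs
# (M3 Max figure is the top configuration).
weights_gb = 7e9 * 4 / 8 / 1e9  # 7B params at 4 bits/param = 3.5 GB

for name, gb_per_s in [("M3", 100), ("M3 Pro", 150), ("M3 Max", 400)]:
    print(f"{name:7s}: <= {gb_per_s / weights_gb:5.1f} tokens/s")
```

No realistic cache can hold 3.5 GB of weights, which is why bandwidth rather than cache is the lever here.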
Is it likely that we get more information about the M4 at WWDC? I know WWDC is about software, but software depends on the capabilities of the hardware, and the focus on AI development may be a reason to say something about hardware.